BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Amihic
Fluorite | Level 6

Hi Everyone!

 

I have data that looks like this:

 

ID     BYR    HGC1979    HGC1980    HGC1981   HGC1982   HGC1983   HGC1984   HGC1986   HGC1988    

01    1975         12                12                12                 12              12                 .                12               12         

02    1982         11                12                  .                  14              15                 .                  .                 16        

03    1978          9                  9                  9                   9                9                 9                 9                  9         

04    1987          7                  8                  9                  10              11               12               13                14        

n      1991          6                  7                  8                    .                 .                 9                 9                  9         

 

(n = ~10,000)

(BYR = birth year of subject)

(HGC## = highest grade of school completed by mother of subject in year ##. Ex. HGC79 is the highest grade completed in year 1979)

 (. indicates years that participates did not participate in the survey, but were still enrolled in the cohort)

(Births years range from 1970 to 2000, however the HCG variable was only measured in years 1979-1988. Thus, for births occurring before 1979, I want to match them with the HCG1979 and for births occurring after 1988, I want to match them with HCG1988.)

 

I want the data to look like this:

 

ID    BYR     HGCB

01   1975        12

02   1982        14

03   1978          9

04   1987        13

n     1991          9                 

 

(HGCB = highest grade completed by mother of subject reported the year the subject was born)

 

I am using SAS 9.4

 

Any help would be much appreciated!

 

Regards,

AMIHIC

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Well, it would be simple enough to add at the end:

 

if cyrb < 1979 and hgcb = . then hgcb = hgc1979;

 

But just to explore a bit, what if hgc1979 has a missing value?  Would you want to use hgc1980 in that case?

 

You could conceivably use:

 

if cyrb < 1979 and hgcb = . then

do _n_=1979 to 1988 until (hgcb > .);

   hgcb = hgc{_n_};

end;

 

 

View solution in original post

16 REPLIES 16
Reeza
Super User

VVALUEX() function can retrieve the value. Use CATS to build a variable name that looks like HGC + Last two digits of the year.

 

x=put(byr, 4.); *convert number to character;
Want = VVALUEX(cats('HGC', substr(x, 3, 2)));

@Amihic wrote:

Hi Everyone!

 

I have data that looks like this:

 

ID     BYR    HGC79    HGC80    HGC81   HGC82   HGC83   HGC84   HGC86   HGC88    HGC90 

01    1975         12            12            12            12          12           .              12            12            12 

02    1982         11            12             .              14          15            .                .           16            16

03    1978          9              9              9              9            9            9               9             9              9

04    1987          7              8              9             10          11          12             13           14            15

n      1991          6              7              8               .           .              9              9              9              9

 

(n = ~10,000)

(BYR = birth year of subject)

(HGC## = highest grade of school completed by mother of subject in year ##. Ex. HGC79 is the highest grade completed in year 1979)

 

and I want it to look like this:

 

ID    BYR     HGCB

01   1975        12

02   1982        14

03   1978          9

04   1987        13

n     1991          9                 

 

(HGCB = highest grade completed by mother of subject reported the year the subject was born)

 

I am using SAS 9.4

 

Any help would be much appreciated!

 

Regards,

AMIHIC

 

 


 

Astounding
PROC Star

Are some years missing from your data on purpose, or is that just an oversight?

 

Assuming that you really have data for all years in the range of 1979 to 1990,

 

Assuming that you do have some missing values and want to use whatever is the most recent value for which you have data,

 

Assuming that speed is not as important as clarity and understanding what the program does .....

 

data want;

set have;

array hgc {79:90} hgc79 - hgc90;

limit = mod(byr, 100);

if (79 <= limit <= 90) then do _n_=79 to limit;

   hgcb = max(hgcb, hgc{limit};

end;

drop limit;

run;

 

Depending on what is really in your data, there are ways to relax some of the assumptions.  For example, this might be a more fitting DO loop:

 

if (79 <= limit) then do _n_ = 79 to min(limit, 90);

 

That would account for the case where BYR=1991, but there is only data going up to 1990.

Amihic
Fluorite | Level 6

 

Hi Astounding!

 

You're code is very helpful! It gets me very close to the solution I need, but I need some more help to tweak it further to address the complexity of the missing values.

 

I adapted your code to work with my data set to this:

 

DATA HGCB3

   set HGCBdata;
   array HGC {1979:1988} HGC1979 - HGC1988;
   limit = mod(CYRB, 10000);
   if (1979 <= limit) then do _n_ = 1979 to min(limit, 1988);
   HGCB = max(hgcb, HGC{limit});
   end;
   drop limit;
RUN;

 

An example of the data set produced by the above code is:

 

ID   CYRB    HGC1979   HGC1980  HGC1981  HGC1982  HGC1983  HGC1984  HGC1986  HGC1988  HGCB  

01   1979           12              12               12              12               12               12            12            12              12

02   1981             9              10               11              12               12               12             12            12             11   

03   1976             8               9                10              10               10               10              .               .                .

04   1984             10             11               12               .                  .                  .               .               12             .

05   1987              .               .                   13             13               13               13            13             13             .

n     1973             .                11                 12             13              14              14              14             14            .

 

However, I would like to address the missing values so that the data set produced looks like this: 

 

ID   CYRB    HGC1979   HGC1980  HGC1981  HGC1982  HGC1983  HGC1984  HGC1986  HGC1988  HGCB  

01   1979           12              12               12              12               12               12            12            12              12

02   1981             9              10               11              12               12               12             12            12             11   

03   1976             8               9                10              10               10               10              .               .                8*

04   1984            10             11               12               .                  .                  .               .               14             12*

05   1987              .               .                 13              13               13               13            13             13             13*

n     1973             .                11               12             13                14              14              14            14            11*

 

As you can see from the HGCB data points labeled with an *, I would like to avoid having missing values and instead,

populate HGCB with the closest previous non-missing HGCyear value. 

 

Also of importance, (I was mistaken about this in my previous request) there are no participants born after 1988, meaning that there are no CYBR values >1988. However there are participants born before 1979, meaning that there are a number of participants with CYRB value <1979. Thus, the IF statement (which is currently if (1979 <= limit) then do _n_ = 1979 to min(limit, 2014);) needs to be updated, and I am not sure how to do that.                       

 

 Thank you again for your help!

 

AMIHIC

Astounding
PROC Star

It looks like you are comfortable modifying slightly to get to this point:

 

if (1979 <= limit) then do _n_ = 1979 to min(limit, 1988);

 

I think you get closer (all the way there??) if you go with:

 

if (limit > .) then do _n_ = max(limit, 1979) to min (limit, 1988);

Amihic
Fluorite | Level 6

Hi Astounding,

 

Thank you for the updated code. Unfortunately when I ran it, it yields exactly the same data set as that created by the previous code you suggested. 

Specifically, HGCB is wrongly populated as . (missing) in situations when I would instead like it be populated with the HGC value from the closest corresponding year.

 

AMIHIC

 

 

 

 

Astounding
PROC Star

Try adding a second loop.  So after the END but before the DROP:

 

if hgbc = . and limit > 1979 then do _n_ = limit - 1 to 1979 by -1 until (hgbc > .);

   hgbc = hgc{limit};

end;

Amihic
Fluorite | Level 6

Hi Astounding,

 

I incorporated the new code you suggested. It now looks like this:

 

DATA want;
  set have;
            array HGC {1979:1988} HGC1979 - HGC1988;
               limit = mod(CYRB, 10000); *mod returns the remainder of the division;
               if (limit > .) then do _n_ = max(limit, 1979) to min (limit, 1988);
               HGCB = max(HGCB, HGC{limit});
             end;
               if HGCB = . and limit > 1979 then do _n_ = limit - 1 to 1979 by -1 until (HGCB > .);
               HGCB = HGC{limit};
             end;
   drop limit;
RUN;

 

Unfortunately it yields the same data set as previously, where HGCB is populated by . (missing) in cases where I would instead want the closest previous HGCyear value. Perhaps the log gives a clue to the issue.

 

When I run it, the log writes:

NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
1134 at 134:11

 

 

I wonder if I need to include some option that tells SAS what to do when it encounters a missing value or when it expects to find an HGCyear variable that doesn't exist, for instance, the var HGC1987.

 

Thanks again for any help,

AMIHIC

Astounding
PROC Star

You added the code correctly.  Show a sample line of data that is not being handled properly (including what the result was, and what you expected it to be).  

Amihic
Fluorite | Level 6

Here are the examples:

 

Example 1: a situation where the data point for the HGC year is missing due to the participant not participating in the survey that year, resulting in a missing HGCB value. I would like the HGCB variable to be populated with a non-missing HGCyear value that most closely precedes the missing year.

 

Sample line of what the data set originally looked like:

ID        BYR      HGC1979    HGC1980    HCG1981   HCG1982  HGC1984  HGC1986   HGC1988   

301     1981          10                  10                .                   .                 11              12               12

 

Sample line of data handled incorrectly:

ID        BYR      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984  HGC1986   HGC1988    HGCB

301     1981          10                  10                .                   .                 11              12               12              .

 

Sample line of what I wanted it to be:

ID        BYR      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984  HGC1986   HGC1988    HGCB

301     1981          10                  10                .                   .                 11              12               12              10

 

Example 2: a situation where there is no HGC value for the BYR year because the survey was not conducted in that year, and therefore the HGC variable for that year does not exist in the data set. I would like the HGCB variable to be populated with a non-missing HGCyear value that most closely precedes the missing year. 

 

Sample line of what the data set originally looked like:

ID        BYR      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984    HGC1986     HGC1988   

403     1987          9                    .                    9                   9                 13             14                 14             

 

Sample line of data handled incorrectly:

ID        BYR      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984    HGC1986     HGC1988   HGCB

403     1987          9                    .                    9                   9                 13             14                 14             .

 

Sample line of what I wanted it to be:

ID        BYR      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984    HGC1986     HGC1988   HGCB

403     1987          9                    .                    9                   9                 13             14                 14             14

 

 

Astounding
PROC Star

Looking at the code, is CYRB the birth year?  Since you switched to four-digit years throughout, we can eliminate some of the complications and see if any problems remain:

 

DATA want;
  set have;
            array HGC {1979:1988} HGC1979 - HGC1988;
               if (CYRB > .) then do _n_ = max(CYRB, 1979) to min (CYRB, 1988);
               HGCB = max(HGCB, HGC{_n_});   **** possible that HGC{limit} should have been HGC{_n_} in 2 places ;
             end;
               if HGCB = . and CYRB > 1979 then do _n_ = CYRB - 1 to 1979 by -1 until (HGCB > .);
               HGCB = HGC{_n_};
             end;
RUN;

Amihic
Fluorite | Level 6

Hi Astounding!

 

The code you wrote corrected the issue! Fantastic! 

 

There is just one final situation where the HGCB variable outputs as missing when it shouldn't.

 

Example: in this case, the birth year variable (CYRB) is prior to 1979 (there are in fact a number of cases where the CYRB is between 1970 and 1979), however there are no HGCyear vars prior to 1979 because the survey only began in 1979. I would like the HGCB variable to be populated with the HGC1979 value. Note: every observation with a birth year (CYRB) <1979 has a non-missing value for HGC1979. 

 

Sample line of what the data set originally looked like:

ID        CYRB      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984    HGC1986     HGC1988   

506     1976          9                    9                    10                   10                 10             10                 10             

 

Sample line of data handled incorrectly:

ID        CYRB      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984    HGC1986     HGC1988   HGCB

506     1976          9                    9                    10                   10                 10             10                 10             .

 

Sample line of what I wanted it to be:

ID        CYRB      HGC1979    HGC1980    HCG1981   HCG1982   HGC1984    HGC1986     HGC1988   HGCB

506     1976              9                  9                 10                10                10                 10             10              9

 

 

AMIHIC

Astounding
PROC Star

Well, it would be simple enough to add at the end:

 

if cyrb < 1979 and hgcb = . then hgcb = hgc1979;

 

But just to explore a bit, what if hgc1979 has a missing value?  Would you want to use hgc1980 in that case?

 

You could conceivably use:

 

if cyrb < 1979 and hgcb = . then

do _n_=1979 to 1988 until (hgcb > .);

   hgcb = hgc{_n_};

end;

 

 

Amihic
Fluorite | Level 6

Hi Astounding,

 

Thank you so much for helping me solve this problem. The solution, in full, is:

 

DATA want;
    set have;
         array HGC {1979:1989} HGC1979 - HGC1989;
                   if (CYRB > .) then do _n_ = max(CYRB, 1979) to min (CYRB, 1989);
                         HGCB = max(HGCB, HGC{_n_}); 
                   end; 
                   if HGCB = . and CYRB > 1979 then do _n_ = CYRB - 1 to 1979 by -1 until (HGCB > .);
                         HGCB = HGC{_n_};
                  end;
          if CYRB < 1979 and HGCB = . then HGCB = HGC1979;
RUN;

 

To respond to your additional exploration, it was unnecessary for me to use the additional code you included at the bottom of your last answer because every participant with a date of birth prior to 1979 has a non-missing HGC1979 value. 1979 was the first survey year, and all participants already born (meaning they have a CYRB <1979) completed the survey in 1979.  

 

Thank you so much for all your help!! Now I can finally get on to the analysis!

 

Best,

Alanna 

ballardw
Super User

If any of your BYR values are 2000 or later and you are using HGC00 to HGC18 welcome to Y2K as you just greatly complicated your code by not using 4 digit years.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 765 views
  • 0 likes
  • 4 in conversation