Hi Everyone!
I have data that looks like this:
ID BYR HGC1979 HGC1980 HGC1981 HGC1982 HGC1983 HGC1984 HGC1986 HGC1988
01 1975 12 12 12 12 12 . 12 12
02 1982 11 12 . 14 15 . . 16
03 1978 9 9 9 9 9 9 9 9
04 1987 7 8 9 10 11 12 13 14
n 1991 6 7 8 . . 9 9 9
(n = ~10,000)
(BYR = birth year of subject)
(HGC## = highest grade of school completed by mother of subject in year ##. Ex. HGC79 is the highest grade completed in year 1979)
(. indicates years that participates did not participate in the survey, but were still enrolled in the cohort)
(Births years range from 1970 to 2000, however the HCG variable was only measured in years 1979-1988. Thus, for births occurring before 1979, I want to match them with the HCG1979 and for births occurring after 1988, I want to match them with HCG1988.)
I want the data to look like this:
ID BYR HGCB
01 1975 12
02 1982 14
03 1978 9
04 1987 13
n 1991 9
(HGCB = highest grade completed by mother of subject reported the year the subject was born)
I am using SAS 9.4
Any help would be much appreciated!
Regards,
AMIHIC
Well, it would be simple enough to add at the end:
if cyrb < 1979 and hgcb = . then hgcb = hgc1979;
But just to explore a bit, what if hgc1979 has a missing value? Would you want to use hgc1980 in that case?
You could conceivably use:
if cyrb < 1979 and hgcb = . then
do _n_=1979 to 1988 until (hgcb > .);
hgcb = hgc{_n_};
end;
VVALUEX() function can retrieve the value. Use CATS to build a variable name that looks like HGC + Last two digits of the year.
x=put(byr, 4.); *convert number to character;
Want = VVALUEX(cats('HGC', substr(x, 3, 2)));
@Amihic wrote:
Hi Everyone!
I have data that looks like this:
ID BYR HGC79 HGC80 HGC81 HGC82 HGC83 HGC84 HGC86 HGC88 HGC90
01 1975 12 12 12 12 12 . 12 12 12
02 1982 11 12 . 14 15 . . 16 16
03 1978 9 9 9 9 9 9 9 9 9
04 1987 7 8 9 10 11 12 13 14 15
n 1991 6 7 8 . . 9 9 9 9
(n = ~10,000)
(BYR = birth year of subject)
(HGC## = highest grade of school completed by mother of subject in year ##. Ex. HGC79 is the highest grade completed in year 1979)
and I want it to look like this:
ID BYR HGCB
01 1975 12
02 1982 14
03 1978 9
04 1987 13
n 1991 9
(HGCB = highest grade completed by mother of subject reported the year the subject was born)
I am using SAS 9.4
Any help would be much appreciated!
Regards,
AMIHIC
Are some years missing from your data on purpose, or is that just an oversight?
Assuming that you really have data for all years in the range of 1979 to 1990,
Assuming that you do have some missing values and want to use whatever is the most recent value for which you have data,
Assuming that speed is not as important as clarity and understanding what the program does .....
data want;
set have;
array hgc {79:90} hgc79 - hgc90;
limit = mod(byr, 100);
if (79 <= limit <= 90) then do _n_=79 to limit;
hgcb = max(hgcb, hgc{limit};
end;
drop limit;
run;
Depending on what is really in your data, there are ways to relax some of the assumptions. For example, this might be a more fitting DO loop:
if (79 <= limit) then do _n_ = 79 to min(limit, 90);
That would account for the case where BYR=1991, but there is only data going up to 1990.
Hi Astounding!
You're code is very helpful! It gets me very close to the solution I need, but I need some more help to tweak it further to address the complexity of the missing values.
I adapted your code to work with my data set to this:
DATA HGCB3
set HGCBdata;
array HGC {1979:1988} HGC1979 - HGC1988;
limit = mod(CYRB, 10000);
if (1979 <= limit) then do _n_ = 1979 to min(limit, 1988);
HGCB = max(hgcb, HGC{limit});
end;
drop limit;
RUN;
An example of the data set produced by the above code is:
ID CYRB HGC1979 HGC1980 HGC1981 HGC1982 HGC1983 HGC1984 HGC1986 HGC1988 HGCB
01 1979 12 12 12 12 12 12 12 12 12
02 1981 9 10 11 12 12 12 12 12 11
03 1976 8 9 10 10 10 10 . . .
04 1984 10 11 12 . . . . 12 .
05 1987 . . 13 13 13 13 13 13 .
n 1973 . 11 12 13 14 14 14 14 .
However, I would like to address the missing values so that the data set produced looks like this:
ID CYRB HGC1979 HGC1980 HGC1981 HGC1982 HGC1983 HGC1984 HGC1986 HGC1988 HGCB
01 1979 12 12 12 12 12 12 12 12 12
02 1981 9 10 11 12 12 12 12 12 11
03 1976 8 9 10 10 10 10 . . 8*
04 1984 10 11 12 . . . . 14 12*
05 1987 . . 13 13 13 13 13 13 13*
n 1973 . 11 12 13 14 14 14 14 11*
As you can see from the HGCB data points labeled with an *, I would like to avoid having missing values and instead,
populate HGCB with the closest previous non-missing HGCyear value.
Also of importance, (I was mistaken about this in my previous request) there are no participants born after 1988, meaning that there are no CYBR values >1988. However there are participants born before 1979, meaning that there are a number of participants with CYRB value <1979. Thus, the IF statement (which is currently if (1979 <= limit) then do _n_ = 1979 to min(limit, 2014);) needs to be updated, and I am not sure how to do that.
Thank you again for your help!
AMIHIC
It looks like you are comfortable modifying slightly to get to this point:
if (1979 <= limit) then do _n_ = 1979 to min(limit, 1988);
I think you get closer (all the way there??) if you go with:
if (limit > .) then do _n_ = max(limit, 1979) to min (limit, 1988);
Hi Astounding,
Thank you for the updated code. Unfortunately when I ran it, it yields exactly the same data set as that created by the previous code you suggested.
Specifically, HGCB is wrongly populated as . (missing) in situations when I would instead like it be populated with the HGC value from the closest corresponding year.
AMIHIC
Try adding a second loop. So after the END but before the DROP:
if hgbc = . and limit > 1979 then do _n_ = limit - 1 to 1979 by -1 until (hgbc > .);
hgbc = hgc{limit};
end;
Hi Astounding,
I incorporated the new code you suggested. It now looks like this:
DATA want;
set have;
array HGC {1979:1988} HGC1979 - HGC1988;
limit = mod(CYRB, 10000); *mod returns the remainder of the division;
if (limit > .) then do _n_ = max(limit, 1979) to min (limit, 1988);
HGCB = max(HGCB, HGC{limit});
end;
if HGCB = . and limit > 1979 then do _n_ = limit - 1 to 1979 by -1 until (HGCB > .);
HGCB = HGC{limit};
end;
drop limit;
RUN;
Unfortunately it yields the same data set as previously, where HGCB is populated by . (missing) in cases where I would instead want the closest previous HGCyear value. Perhaps the log gives a clue to the issue.
When I run it, the log writes:
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
1134 at 134:11
I wonder if I need to include some option that tells SAS what to do when it encounters a missing value or when it expects to find an HGCyear variable that doesn't exist, for instance, the var HGC1987.
Thanks again for any help,
AMIHIC
You added the code correctly. Show a sample line of data that is not being handled properly (including what the result was, and what you expected it to be).
Here are the examples:
Example 1: a situation where the data point for the HGC year is missing due to the participant not participating in the survey that year, resulting in a missing HGCB value. I would like the HGCB variable to be populated with a non-missing HGCyear value that most closely precedes the missing year.
Sample line of what the data set originally looked like:
ID BYR HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988
301 1981 10 10 . . 11 12 12
Sample line of data handled incorrectly:
ID BYR HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988 HGCB
301 1981 10 10 . . 11 12 12 .
Sample line of what I wanted it to be:
ID BYR HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988 HGCB
301 1981 10 10 . . 11 12 12 10
Example 2: a situation where there is no HGC value for the BYR year because the survey was not conducted in that year, and therefore the HGC variable for that year does not exist in the data set. I would like the HGCB variable to be populated with a non-missing HGCyear value that most closely precedes the missing year.
Sample line of what the data set originally looked like:
ID BYR HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988
403 1987 9 . 9 9 13 14 14
Sample line of data handled incorrectly:
ID BYR HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988 HGCB
403 1987 9 . 9 9 13 14 14 .
Sample line of what I wanted it to be:
ID BYR HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988 HGCB
403 1987 9 . 9 9 13 14 14 14
Looking at the code, is CYRB the birth year? Since you switched to four-digit years throughout, we can eliminate some of the complications and see if any problems remain:
DATA want;
set have;
array HGC {1979:1988} HGC1979 - HGC1988;
if (CYRB > .) then do _n_ = max(CYRB, 1979) to min (CYRB, 1988);
HGCB = max(HGCB, HGC{_n_}); **** possible that HGC{limit} should have been HGC{_n_} in 2 places ;
end;
if HGCB = . and CYRB > 1979 then do _n_ = CYRB - 1 to 1979 by -1 until (HGCB > .);
HGCB = HGC{_n_};
end;
RUN;
Hi Astounding!
The code you wrote corrected the issue! Fantastic!
There is just one final situation where the HGCB variable outputs as missing when it shouldn't.
Example: in this case, the birth year variable (CYRB) is prior to 1979 (there are in fact a number of cases where the CYRB is between 1970 and 1979), however there are no HGCyear vars prior to 1979 because the survey only began in 1979. I would like the HGCB variable to be populated with the HGC1979 value. Note: every observation with a birth year (CYRB) <1979 has a non-missing value for HGC1979.
Sample line of what the data set originally looked like:
ID CYRB HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988
506 1976 9 9 10 10 10 10 10
Sample line of data handled incorrectly:
ID CYRB HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988 HGCB
506 1976 9 9 10 10 10 10 10 .
Sample line of what I wanted it to be:
ID CYRB HGC1979 HGC1980 HCG1981 HCG1982 HGC1984 HGC1986 HGC1988 HGCB
506 1976 9 9 10 10 10 10 10 9
AMIHIC
Well, it would be simple enough to add at the end:
if cyrb < 1979 and hgcb = . then hgcb = hgc1979;
But just to explore a bit, what if hgc1979 has a missing value? Would you want to use hgc1980 in that case?
You could conceivably use:
if cyrb < 1979 and hgcb = . then
do _n_=1979 to 1988 until (hgcb > .);
hgcb = hgc{_n_};
end;
Hi Astounding,
Thank you so much for helping me solve this problem. The solution, in full, is:
DATA want;
set have;
array HGC {1979:1989} HGC1979 - HGC1989;
if (CYRB > .) then do _n_ = max(CYRB, 1979) to min (CYRB, 1989);
HGCB = max(HGCB, HGC{_n_});
end;
if HGCB = . and CYRB > 1979 then do _n_ = CYRB - 1 to 1979 by -1 until (HGCB > .);
HGCB = HGC{_n_};
end;
if CYRB < 1979 and HGCB = . then HGCB = HGC1979;
RUN;
To respond to your additional exploration, it was unnecessary for me to use the additional code you included at the bottom of your last answer because every participant with a date of birth prior to 1979 has a non-missing HGC1979 value. 1979 was the first survey year, and all participants already born (meaning they have a CYRB <1979) completed the survey in 1979.
Thank you so much for all your help!! Now I can finally get on to the analysis!
Best,
Alanna
If any of your BYR values are 2000 or later and you are using HGC00 to HGC18 welcome to Y2K as you just greatly complicated your code by not using 4 digit years.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.