I am using SAS 9.4 and have a dataset with multiple observations per ID. For whatever reason, some of these individuals have observations with missing values for variables (e.g., sex, race, ethnicity) that are present in another observation for the same ID.
See below as an example:
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
RUN;
What I am trying to figure out is how to write code that will replace missing values for specific variables (e.g., sex, race, ethnicity) wherever there is a complete value in another observation for the same ID. Of note, some variables may have differing values by observation (see the outcome variable), and I do not want those to be changed.
Thank you for your help!
As I understand it, for each variable (SEX, RACE, ETH) you want to replace instances of missing values with whatever non-missing value is found for that variable and ID. This assumes that no ID has more than one distinct non-missing value for each of those variables.
Pass through each ID twice: first to save the non-missing values, and second to retrieve those saved values:
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
run;
data want (drop=i);
  set sample (in=firstpass) sample (in=secondpass);
  by id;
  array tmp {3} _temporary_;
  array var {3} sex race eth;
  if first.id then call missing(of tmp{*});
  if firstpass then do i=1 to 3;
    tmp{i}=coalesce(tmp{i},var{i});
  end;
  if secondpass;
  do i=1 to 3;
    var{i}=tmp{i};
  end;
run;
Another, more compact method uses a MERGE statement, with one merge dataset argument for each variable that needs this type of update. No other code is needed:
data want2;
  merge sample (drop=sex race eth)
        sample (keep=id sex where=(sex^=.))
        sample (keep=id race where=(race^=.))
        sample (keep=id eth where=(eth^=.));
  by id;
run;
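If many variables need this treatment, the repeated dataset arguments can be generated instead of typed out. Here's a sketch of that idea as a macro (the macro name %FILL_MERGE and its parameters are my own, not part of the original answer):

```sas
%macro fill_merge(data=, out=, id=, vars=);
  %local i v;
  data &out;
    merge &data(drop=&vars)
      %do i=1 %to %sysfunc(countw(&vars));
        %let v = %scan(&vars, &i);
        &data(keep=&id &v where=(&v ^= .))  /* one argument per variable */
      %end;
    ;
    by &id;
  run;
%mend fill_merge;

/* Equivalent to the WANT2 step above: */
%fill_merge(data=sample, out=want2, id=id, vars=sex race eth)
```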
And if you have, say, 20 variables needing this treatment, you can avoid naming the SAMPLE dataset 21 times in the MERGE statement. You can use the UPDATE statement to prepare one observation per ID, yielding the latest non-missing values for each ID (in dataset NEED). Then merge with the original dataset:
data need/view=need;
  update sample (keep=id sex race eth obs=0)
         sample (keep=id sex race eth);
  by id;
run;

data want3;
  merge sample (drop=sex race eth)
        need;
  by id;
run;
Note that NEED is a dataset VIEW, not a dataset FILE. That means it isn't actually executed until NEED is accessed in a later step. As a result, SAMPLE is accessed while the DATA WANT3 step runs, satisfying both steps at once and minimizing the burden on disk retrieval in the case of large datasets.
Thank you for the multiple options! I started with the middle "compact" code, and although I did have to list out the 10+ variables, it worked fine.
Thank you for the quick response.
Use a double DO loop:
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
data want;
  do until (last.id);
    set sample;
    by id;
    _s = coalesce(_s,sex);
    _r = coalesce(_r,race);
    _e = coalesce(_e,eth);
  end;
  do until (last.id);
    set sample;
    by id;
    sex = _s;
    race = _r;
    eth = _e;
    output;
  end;
  drop _:;
run;
But since this makes the three variables redundant within each ID, you should consider putting them in their own dataset with only one observation per ID, and join/merge them in whenever needed.
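A sketch of that suggestion, using the example data (DEMOG and ANALYSIS are hypothetical dataset names):

```sas
/* One observation per ID: UPDATE against an empty (OBS=0) copy of */
/* the dataset keeps the last non-missing value of each variable.  */
data demog;
  update sample(keep=id sex race eth obs=0)
         sample(keep=id sex race eth);
  by id;
run;

/* Join/merge the ID-level variables back in whenever needed. */
data analysis;
  merge sample(drop=sex race eth) demog;
  by id;
run;
```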
If the last non-missing value for each ID is acceptable, then you can use the UPDATE statement to collapse each variable to its last non-missing value. To reload the detail values of the other variables, re-read those observations. You can use a pair of DO loops to do it in one data step.
First let's change the sample data to have a missing value in one of the variables you want excluded from the process, let's pick the second OUTCOME value for ID=2.
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 .
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
The first loop uses the UPDATE statement to gather the last non-missing values. Basically it treats an empty version of the dataset (OBS=0) as the master dataset to which transactions are applied, and the whole dataset as the set of transactions. Then in the second DO loop, use the SET statement to re-read the other variables and the OUTPUT statement to generate every observation for this value of ID.
You will need to list either the variables that need updating or those that don't.
For example, you could use a DROP= dataset option to exclude the variables we want to "fix" from being reloaded by the SET statement.
data want;
  do until(last.id);
    update sample(obs=0) sample;
    by id;
  end;
  do until(last.id);
    set sample(drop=sex race eth);
    by id;
    output;
  end;
run;
Results:
Obs id sex race eth outcome
1 1 0 2 1 0
2 1 0 2 1 1
3 2 1 1 0 1
4 2 1 1 0 .
5 3 0 0 1 0
6 3 0 0 1 0
7 3 0 0 1 1
If the list of variables to exclude is larger than the list to include, you might change the SET statement to use KEEP= instead. Make sure to keep the ID variable so the BY statement will work.
set sample(keep=id outcome);
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
A PROC SQL approach: with GROUP BY id, MAX() returns the non-missing value within each ID, which SAS remerges onto every row; COALESCE then fills in only the cells that are missing:

proc sql;
  create table want as
  select id,
         coalesce(sex,max(sex)) as sex,
         coalesce(race,max(race)) as race,
         coalesce(eth,max(eth)) as eth,
         outcome
  from sample
  group by id;
quit;