BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
sasgorilla
Pyrite | Level 9

I am using SAS 9.4 and have  a dataset that has multiple observations for each ID. For whatever reason, some of these individuals have observations which have missing values for variables (i.e., sex, race, ethnicity) that are present in another observation for the same ID. 

See below as an example: 

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
RUN;

What I am trying to figure out is how to write code that will replace missing values for specific variables (i.e. sex, race, ethnicity) wherever there is a complete value for another observation for the same ID.  Of note, there are some variables that may have differing values by observation (see outcome variable) that I do not want to be changed.

 

Thank you for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

As I understand it, for each variable (SEX, RACE, ETH) you want to replace instances of missing values with whatever non-missing value is found for that variable and ID.  This assumes that no ID has more than one distinct non-missing value for each of those variables.

 

Pass through each ID twice, first to save non-missing values, and the second to retrieved those saved values:

 

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
RUN;

data want (drop=i);
  set sample (in=firstpass)  sample (in=secondpass);
  by id;
  array tmp {3} _temporary_;
  array var {3} sex race eth;
  if first.id then call missing(of tmp{*});
  if firstpass then do i=1 to 3;
    tmp{i}=coalesce(tmp{i},var{i});
  end;
  if secondpass;
  do i=1 to 3;
    var{i}=tmp{i};
  end;
run;

Another, more compact mode, is through use of a MERGE statement, with a merge dataset argument for each variable that needs this type of update.  No other code is needed:

 

data want2;
  merge sample (drop=sex race eth)
        sample (keep=id sex  where=(sex^=.))
        sample (keep=id race where=(race^=.))
        sample (keep=id eth  where=(eth^=.));
  by id;
run;

 

And if you have, say 20 variables needing this treatment, you can avoid naming the SAMPLE dataset 21 times in the merge statement.  You can use the UPDATE statement to prepare one obs per ID, yielding the latest non-missing values for each ID (in dataset NEED).  Then merge with the original dataset:

data need/view=need;
  update sample (keep=id sex race eth  obs=0)
         sample (keep=id sex race eth);
  by id;
run;
data want3;
  merge sample (drop=sex race eth)
        need;
  by id;
run;

Note that NEED is a dataset VIEW, not a dataset FILE.  That means it isn't actually executed until NEED is accessed in a later step.  As a result, SAMPLE is simultaneously accessed to satisfy the DATA NEED step and the DATA WANT step, minimizing the burden on disk retrieval in the case of large datasets.  

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

5 REPLIES 5
mkeintz
PROC Star

As I understand it, for each variable (SEX, RACE, ETH) you want to replace instances of missing values with whatever non-missing value is found for that variable and ID.  This assumes that no ID has more than one distinct non-missing value for each of those variables.

 

Pass through each ID twice, first to save non-missing values, and the second to retrieved those saved values:

 

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
RUN;

data want (drop=i);
  set sample (in=firstpass)  sample (in=secondpass);
  by id;
  array tmp {3} _temporary_;
  array var {3} sex race eth;
  if first.id then call missing(of tmp{*});
  if firstpass then do i=1 to 3;
    tmp{i}=coalesce(tmp{i},var{i});
  end;
  if secondpass;
  do i=1 to 3;
    var{i}=tmp{i};
  end;
run;

Another, more compact mode, is through use of a MERGE statement, with a merge dataset argument for each variable that needs this type of update.  No other code is needed:

 

data want2;
  merge sample (drop=sex race eth)
        sample (keep=id sex  where=(sex^=.))
        sample (keep=id race where=(race^=.))
        sample (keep=id eth  where=(eth^=.));
  by id;
run;

 

And if you have, say 20 variables needing this treatment, you can avoid naming the SAMPLE dataset 21 times in the merge statement.  You can use the UPDATE statement to prepare one obs per ID, yielding the latest non-missing values for each ID (in dataset NEED).  Then merge with the original dataset:

data need/view=need;
  update sample (keep=id sex race eth  obs=0)
         sample (keep=id sex race eth);
  by id;
run;
data want3;
  merge sample (drop=sex race eth)
        need;
  by id;
run;

Note that NEED is a dataset VIEW, not a dataset FILE.  That means it isn't actually executed until NEED is accessed in a later step.  As a result, SAMPLE is simultaneously accessed to satisfy the DATA NEED step and the DATA WANT step, minimizing the burden on disk retrieval in the case of large datasets.  

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sasgorilla
Pyrite | Level 9

Thank you for the multiple options! I started with the middle "compact" code and although I did list out the 10+ variables that worked fine. 

 

Thank you for the quick response. 

Kurt_Bremser
Super User

Use a double DO loop:

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;

data want;
do until (last.id);
  set sample;
  by id;
  _s = coalesce(_s,sex);
  _r = coalesce(_r,race);
  _e = coalesce(_e,eth);
end;
do until (last.id);
  set sample;
  by id;
  sex = _s;
  race = _r;
  eth = _e;
  output;
end;
drop _:;
run;

But since this makes the three variables redundant, you should consider to put them in their own dataset with only one obs per id. Join/merge them in whenever needed.

Tom
Super User Tom
Super User

If the last non-missing value for each ID is acceptable then you can use the UPDATE statement to collapse each variable to last non-missing value.  To reload the detailed values of the other variables re-read those observations.  You can use a pair of DO loops to do it in one data step.

 

First let's change the sample data to have a missing value in one of the variables you want excluded from the process, let's pick the second OUTCOME value for ID=2.

data sample;
  input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 .
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;

The first loop uses the UPDATE to statement to gather the last non-missing value. Basically it treats a empty version of dataset as the original dataset for which transactions are being applied and the whole dataset as the set of transactions.  Then in the second DO loop use the SET statement to re-read the other variables and the OUTPUT statement to generate every observation for this value of ID. 

 

You will need to list either the variables that need updating or those that don't. 

For example you could use a DROP= dataset option to exclude the variables we wanted to "fix" from being reloaded by the SET statement.

data want;
  do until(last.id);
    update sample(obs=0) sample;
    by id;
  end;
  do until(last.id);
    set sample(drop=sex race eth);
    by id;
    output;
  end;
run;

Results:

Obs    id    sex    race    eth    outcome

 1      1     0       2      1        0
 2      1     0       2      1        1
 3      2     1       1      0        1
 4      2     1       1      0        .
 5      3     0       0      1        0
 6      3     0       0      1        0
 7      3     0       0      1        1

If the list of variables to exclude is smaller than the list to include you might change the SET statement to use KEEP= instead.  Make sure to keep the ID variable so the BY statement will work.

    set sample(keep=id outcome);
Ksharp
Super User
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
proc sql;
create table want as
select id,
coalesce(sex,max(sex)) as sex,
coalesce(race,max(race)) as race,
coalesce(eth,max(eth)) as eth,
outcome
 from sample
  group by id;
quit;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1974 views
  • 1 like
  • 5 in conversation