topic Re: How to replace missing values with values from another observation by ID. in New SAS User

How to replace missing values with values from another observation by ID.

sasgorilla — Sat, 09 Nov 2024 04:05:25 GMT

I am using SAS 9.4 and have a dataset that has multiple observations for each ID. For whatever reason, some of these individuals have observations which have missing values for variables (i.e., sex, race, ethnicity) that are present in another observation for the same ID.

See below as an example:

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
RUN;

What I am trying to figure out is how to write code that will replace missing values for specific variables (i.e. sex, race, ethnicity) wherever there is a complete value for another observation for the same ID. Of note, there are some variables that may have differing values by observation (see outcome variable) that I do not want to be changed.

Thank you for your help!

Re: How to replace missing values with values from another observation by ID.

mkeintz — Sat, 09 Nov 2024 05:15:19 GMT

As I understand it, for each variable (SEX, RACE, ETH) you want to replace instances of missing values with whatever non-missing value is found for that variable and ID. This assumes that no ID has more than one distinct non-missing value for each of those variables.

Pass through each ID twice, first to save non-missing values, and the second to retrieved those saved values:

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
RUN;

data want (drop=i);
  set sample (in=firstpass)  sample (in=secondpass);
  by id;
  array tmp {3} _temporary_;
  array var {3} sex race eth;
  if first.id then call missing(of tmp{*});
  if firstpass then do i=1 to 3;
    tmp{i}=coalesce(tmp{i},var{i});
  end;
  if secondpass;
  do i=1 to 3;
    var{i}=tmp{i};
  end;
run;

Another, more compact mode, is through use of a MERGE statement, with a merge dataset argument for each variable that needs this type of update. No other code is needed:

data want2;
  merge sample (drop=sex race eth)
        sample (keep=id sex  where=(sex^=.))
        sample (keep=id race where=(race^=.))
        sample (keep=id eth  where=(eth^=.));
  by id;
run;

And if you have, say 20 variables needing this treatment, you can avoid naming the SAMPLE dataset 21 times in the merge statement. You can use the UPDATE statement to prepare one obs per ID, yielding the latest non-missing values for each ID (in dataset NEED). Then merge with the original dataset:

data need/view=need;
  update sample (keep=id sex race eth  obs=0)
         sample (keep=id sex race eth);
  by id;
run;
data want3;
  merge sample (drop=sex race eth)
        need;
  by id;
run;

Note that NEED is a dataset VIEW, not a dataset FILE. That means it isn't actually executed until NEED is accessed in a later step. As a result, SAMPLE is simultaneously accessed to satisfy the DATA NEED step and the DATA WANT step, minimizing the burden on disk retrieval in the case of large datasets.

Re: How to replace missing values with values from another observation by ID.

Kurt_Bremser — Sat, 09 Nov 2024 14:18:33 GMT

Use a double DO loop:

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;

data want;
do until (last.id);
  set sample;
  by id;
  _s = coalesce(_s,sex);
  _r = coalesce(_r,race);
  _e = coalesce(_e,eth);
end;
do until (last.id);
  set sample;
  by id;
  sex = _s;
  race = _r;
  eth = _e;
  output;
end;
drop _:;
run;

But since this makes the three variables redundant, you should consider to put them in their own dataset with only one obs per id. Join/merge them in whenever needed.

Re: How to replace missing values with values from another observation by ID.

sasgorilla — Sat, 09 Nov 2024 14:46:59 GMT

Thank you for the multiple options! I started with the middle "compact" code and although I did list out the 10+ variables that worked fine.

Thank you for the quick response.

Re: How to replace missing values with values from another observation by ID.

Tom — Sat, 09 Nov 2024 17:15:53 GMT

If the last non-missing value for each ID is acceptable then you can use the UPDATE statement to collapse each variable to last non-missing value. To reload the detailed values of the other variables re-read those observations. You can use a pair of DO loops to do it in one data step.

First let's change the sample data to have a missing value in one of the variables you want excluded from the process, let's pick the second OUTCOME value for ID=2.

data sample;
  input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 .
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;

The first loop uses the UPDATE to statement to gather the last non-missing value. Basically it treats a empty version of dataset as the original dataset for which transactions are being applied and the whole dataset as the set of transactions. Then in the second DO loop use the SET statement to re-read the other variables and the OUTPUT statement to generate every observation for this value of ID.

You will need to list either the variables that need updating or those that don't.

For example you could use a DROP= dataset option to exclude the variables we wanted to "fix" from being reloaded by the SET statement.

data want;
  do until(last.id);
    update sample(obs=0) sample;
    by id;
  end;
  do until(last.id);
    set sample(drop=sex race eth);
    by id;
    output;
  end;
run;

Results:

Obs    id    sex    race    eth    outcome

 1      1     0       2      1        0
 2      1     0       2      1        1
 3      2     1       1      0        1
 4      2     1       1      0        .
 5      3     0       0      1        0
 6      3     0       0      1        0
 7      3     0       0      1        1

If the list of variables to exclude is smaller than the list to include you might change the SET statement to use KEEP= instead. Make sure to keep the ID variable so the BY statement will work.

    set sample(keep=id outcome);

Re: How to replace missing values with values from another observation by ID.

Ksharp — Sun, 10 Nov 2024 01:48:20 GMT

data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
proc sql;
create table want as
select id,
coalesce(sex,max(sex)) as sex,
coalesce(race,max(race)) as race,
coalesce(eth,max(eth)) as eth,
outcome
 from sample
  group by id;
quit;