I am using SAS 9.4 and have a dataset with multiple observations per ID. For whatever reason, some of these individuals have observations with missing values for variables (e.g., sex, race, ethnicity) that are present in another observation for the same ID.
See below as an example:
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
RUN;
What I am trying to figure out is how to write code that will replace missing values for specific variables (e.g., sex, race, ethnicity) wherever there is a complete value in another observation for the same ID. Of note, some variables may have differing values by observation (see the outcome variable), and I do not want those to be changed.
Thank you for your help!
As I understand it, for each variable (SEX, RACE, ETH) you want to replace instances of missing values with whatever non-missing value is found for that variable and ID. This assumes that no ID has more than one distinct non-missing value for each of those variables.
Pass through each ID twice: first to save the non-missing values, and second to retrieve those saved values:
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
run;
data want (drop=i);
  set sample (in=firstpass) sample (in=secondpass);
  by id;
  array tmp {3} _temporary_;
  array var {3} sex race eth;
  if first.id then call missing(of tmp{*});
  if firstpass then do i=1 to 3;
    tmp{i}=coalesce(tmp{i},var{i});
  end;
  if secondpass;
  do i=1 to 3;
    var{i}=tmp{i};
  end;
run;
Another, more compact method uses a MERGE statement, with one merge dataset argument for each variable that needs this type of update. No other code is needed:
data want2;
  merge sample (drop=sex race eth)
        sample (keep=id sex where=(sex^=.))
        sample (keep=id race where=(race^=.))
        sample (keep=id eth where=(eth^=.));
  by id;
run;
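If many variables need this treatment, the repeated dataset arguments can be generated instead of typed out. Here's a sketch of that idea as a macro (the macro name %FILL_MERGE and its parameters are my own, not part of the original answer):

```sas
%macro fill_merge(data=, out=, id=, vars=);
  %local i v;
  data &out;
    merge &data(drop=&vars)
      %do i=1 %to %sysfunc(countw(&vars));
        %let v = %scan(&vars, &i);
        &data(keep=&id &v where=(&v ^= .))  /* one argument per variable */
      %end;
    ;
    by &id;
  run;
%mend fill_merge;

/* Equivalent to the WANT2 step above: */
%fill_merge(data=sample, out=want2, id=id, vars=sex race eth)
```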
And if you have, say, 20 variables needing this treatment, you can avoid naming the SAMPLE dataset 21 times in the MERGE statement. You can use the UPDATE statement to prepare one observation per ID, yielding the latest non-missing values for each ID (in dataset NEED). Then merge with the original dataset:
data need/view=need;
  update sample (keep=id sex race eth obs=0)
         sample (keep=id sex race eth);
  by id;
run;

data want3;
  merge sample (drop=sex race eth)
        need;
  by id;
run;
Note that NEED is a dataset VIEW, not a dataset FILE. That means it isn't actually executed until NEED is accessed in a later step. As a result, SAMPLE is accessed while the DATA WANT3 step runs, satisfying both steps at once and minimizing the burden on disk retrieval in the case of large datasets.
Thank you for the multiple options! I started with the middle "compact" code, and although I did have to list out the 10+ variables, it worked fine.
Thank you for the quick response.
Use a double DO loop:
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
data want;
  do until (last.id);
    set sample;
    by id;
    _s = coalesce(_s,sex);
    _r = coalesce(_r,race);
    _e = coalesce(_e,eth);
  end;
  do until (last.id);
    set sample;
    by id;
    sex = _s;
    race = _r;
    eth = _e;
    output;
  end;
  drop _:;
run;
But since this makes the three variables redundant within each ID, you should consider putting them in their own dataset with only one observation per ID, and join/merge them in whenever needed.
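A sketch of that suggestion, using the example data (DEMOG and ANALYSIS are hypothetical dataset names):

```sas
/* One observation per ID: UPDATE against an empty (OBS=0) copy of */
/* the dataset keeps the last non-missing value of each variable.  */
data demog;
  update sample(keep=id sex race eth obs=0)
         sample(keep=id sex race eth);
  by id;
run;

/* Join/merge the ID-level variables back in whenever needed. */
data analysis;
  merge sample(drop=sex race eth) demog;
  by id;
run;
```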
If the last non-missing value for each ID is acceptable, then you can use the UPDATE statement to collapse each variable to its last non-missing value. To reload the detail values of the other variables, re-read those observations. You can use a pair of DO loops to do it in one data step.
First let's change the sample data to have a missing value in one of the variables you want excluded from the process, let's pick the second OUTCOME value for ID=2.
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 .
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
The first loop uses the UPDATE statement to gather the last non-missing values. Basically it treats an empty version of the dataset (OBS=0) as the master dataset to which transactions are applied, and the whole dataset as the set of transactions. Then in the second DO loop, use the SET statement to re-read the other variables and the OUTPUT statement to generate every observation for this value of ID.
You will need to list either the variables that need updating or those that don't.
For example, you could use a DROP= dataset option to exclude the variables we want to "fix" from being reloaded by the SET statement.
data want;
  do until(last.id);
    update sample(obs=0) sample;
    by id;
  end;
  do until(last.id);
    set sample(drop=sex race eth);
    by id;
    output;
  end;
run;
Results:
Obs id sex race eth outcome
1 1 0 2 1 0
2 1 0 2 1 1
3 2 1 1 0 1
4 2 1 1 0 .
5 3 0 0 1 0
6 3 0 0 1 0
7 3 0 0 1 1
If the list of variables to exclude is larger than the list to include, you might change the SET statement to use KEEP= instead. Make sure to keep the ID variable so the BY statement will work.
set sample(keep=id outcome);
data sample;
input id sex race eth outcome;
datalines;
1 0 2 1 0
1 . 2 . 1
2 1 . 0 1
2 1 1 0 1
3 . . . 0
3 0 0 1 0
3 0 0 1 1
;
A PROC SQL approach: with GROUP BY id, MAX() returns the non-missing value within each ID, which SAS remerges onto every row; COALESCE then fills in only the cells that are missing:

proc sql;
  create table want as
  select id,
         coalesce(sex,max(sex)) as sex,
         coalesce(race,max(race)) as race,
         coalesce(eth,max(eth)) as eth,
         outcome
  from sample
  group by id;
quit;