Replacing missing character values with closest non-missing values wit...

tan-wongv · Posted 03-13-2024 07:16 PM

I tried to replace missing character values (alcohol_user) with the closest non-missing values (either before or after) within the same DMRN.

This is how my data set looks like:

DMRN DX_DATE ALCOHOL_USER

16 28FEB2017

16 13JUN2017 No

16 26SEP2017 No

24 12MAY2014 No

24 28MAR2016

24 25AUG2017

25 02SEP2021

25 29NOV2021 Yes

The outcome that I would like it to be:

DMRN DX_DATE ALCOHOL_USER

16 28FEB2017 No

16 13JUN2017 No

16 26SEP2017 No

24 12MAY2014 No

24 28MAR2016 No

24 25AUG2017 No

25 02SEP2021 Yes

25 29NOV2021 Yes

I tried this code but it didn't work out. Any help would be very appreciated. Thank you!

mkeintz

In the absence of sample data in the form of a working DATA step, this should work.

data vneed / view=vneed;
  set have (rename=(alcohol_user=_alc_user));
  by dmrn;
  where _alc_user^=' ';
  dx_date=ifn(first.dmrn=1,'01jan1900'd,mean(dx_date,lag(dx_date)),'01jan1900'd);
run;

data want (drop=_:);
  set have (in=inh)  vneed (in=inv keep=dmrn dx_date);
  by dmrn dx_date;
  if inv then set vneed (keep=_alc_user);
  if inh;
  if alcohol_user=' ' then alcohol_user=_alc_user;
run;

How does this work? Data set view VNEED has only non-missing alcohol_user data (renamed to _alc_user). But its dx_dates are changed. dx_date is reset to halfway between the current valid obs and the prior valid obs ("valid" means non-missing _alc_user for the same DMRN). And in the case of the first non-missing _ALC_USER for a given DMRN, the DX_DATE is set back to 01jan1900.

As a result, when VNEED is interleaved with HAVE (the SET HAVE ... VNEED; BY DMRN DX_DATE statements), one can use the IF INV condition to read in (and retain) the non-missing _ALC_USER variable, until the next IF INV is satisfied. And since the DX_DATEs in vneed are at halfway points between observed DX_DATEs, missing ALCOHOL_USER will always get the closest DX_DATE with non-missing values.

Important note: the first SET statement in DATA WANT, must NOT read in _ALC_USER. Therefore _ALC_USER is not automatically replaced, until the IF INV condition is satisfied.

Edited note: This assume data are sorted by DMRN/DX_DATE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick

@mkeintz Appears to be a "duplicate" question with an accepted solution here

mkeintz

@Patrick wrote:

@mkeintz Appears to be a "duplicate" question with an accepted solution here

True.

The code I offered above takes advantage of the fact that the data are sorted, which AFAICT, SQL does not. There is likely to be a significant performance benefit.

It uses the same technique as the "carry the future back" section in a presentation I gave at PharmaSUG 2019 (History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies), except it mixes a subset of a series with the whole series.

One of the reasons I have come to promote this technique is that it is fairly simple to expand mixing of two series to three or more series, which I think can become rather messy in SQL. Applying it to finding the "closest" value only explicitly occurred to me when contemplating this OP's post.

Plug: I'll be presenting content from the PharmaSUG paper at next month's Boston Area SAS User's group. And I hope to include an example of "finding the closest" as well.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Kurt_Bremser

Load your data into a date-indexed array:

%let start = %sysfunc(inputn(2000-01-01,yymmdd10.));
%let end = %sysfunc(inputn(2024-12-31,yymmdd10.));

data want;
array vals {&start.:&end.} $3 _temporary_;
do i = &start. to &end.;
  vals{i} = "";
end;
do until (last.dmrn);
  set have;
  by dmrn;
  vals{dx_date} = alcohol_user;
end;
do until (last.dmrn);
  set have;
  by dmrn;
  if alcohol_user = ""
  then do;
  dist = &end. - & start.;
  do i = &start. to &end.;
    if vals{i} ne "" and abs(i - dx_date) < dist
    then do;
      dist = abs(i - dx_date);
      alcohol_user = vals{i};
    end;
  end;
  output;
end;
drop dist i;
run;

Untested, posted from my tablet.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Re: Replacing missing character values with closest non-missing values within the same DMRN

Click image to register for webinar

Classroom Training Available!