I tried to replace missing character values (alcohol_user) with the closest non-missing values (either before or after) within the same DMRN.
This is how my data set looks like:
DMRN DX_DATE ALCOHOL_USER
16 28FEB2017
16 13JUN2017 No
16 26SEP2017 No
24 12MAY2014 No
24 28MAR2016
24 25AUG2017
25 02SEP2021
25 29NOV2021 Yes
The outcome that I would like it to be:
DMRN DX_DATE ALCOHOL_USER
16 28FEB2017 No
16 13JUN2017 No
16 26SEP2017 No
24 12MAY2014 No
24 28MAR2016 No
24 25AUG2017 No
25 02SEP2021 Yes
25 29NOV2021 Yes
I tried this code but it didn't work out. Any help would be very appreciated. Thank you!
In the absence of sample data in the form of a working DATA step, this should work.
data vneed / view=vneed;
set have (rename=(alcohol_user=_alc_user));
by dmrn;
where _alc_user^=' ';
dx_date=ifn(first.dmrn=1,'01jan1900'd,mean(dx_date,lag(dx_date)),'01jan1900'd);
run;
data want (drop=_:);
set have (in=inh) vneed (in=inv keep=dmrn dx_date);
by dmrn dx_date;
if inv then set vneed (keep=_alc_user);
if inh;
if alcohol_user=' ' then alcohol_user=_alc_user;
run;
How does this work? Data set view VNEED has only non-missing alcohol_user data (renamed to _alc_user). But its dx_dates are changed. dx_date is reset to halfway between the current valid obs and the prior valid obs ("valid" means non-missing _alc_user for the same DMRN). And in the case of the first non-missing _ALC_USER for a given DMRN, the DX_DATE is set back to 01jan1900.
As a result, when VNEED is interleaved with HAVE (the SET HAVE ... VNEED; BY DMRN DX_DATE statements), one can use the IF INV condition to read in (and retain) the non-missing _ALC_USER variable, until the next IF INV is satisfied. And since the DX_DATEs in vneed are at halfway points between observed DX_DATEs, missing ALCOHOL_USER will always get the closest DX_DATE with non-missing values.
Important note: the first SET statement in DATA WANT, must NOT read in _ALC_USER. Therefore _ALC_USER is not automatically replaced, until the IF INV condition is satisfied.
Edited note: This assume data are sorted by DMRN/DX_DATE.
@Patrick wrote:
@mkeintz Appears to be a "duplicate" question with an accepted solution here
True.
The code I offered above takes advantage of the fact that the data are sorted, which AFAICT, SQL does not. There is likely to be a significant performance benefit.
It uses the same technique as the "carry the future back" section in a presentation I gave at PharmaSUG 2019 (History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies), except it mixes a subset of a series with the whole series.
One of the reasons I have come to promote this technique is that it is fairly simple to expand mixing of two series to three or more series, which I think can become rather messy in SQL. Applying it to finding the "closest" value only explicitly occurred to me when contemplating this OP's post.
Plug: I'll be presenting content from the PharmaSUG paper at next month's Boston Area SAS User's group. And I hope to include an example of "finding the closest" as well.
Load your data into a date-indexed array:
%let start = %sysfunc(inputn(2000-01-01,yymmdd10.));
%let end = %sysfunc(inputn(2024-12-31,yymmdd10.));
data want;
array vals {&start.:&end.} $3 _temporary_;
do i = &start. to &end.;
vals{i} = "";
end;
do until (last.dmrn);
set have;
by dmrn;
vals{dx_date} = alcohol_user;
end;
do until (last.dmrn);
set have;
by dmrn;
if alcohol_user = ""
then do;
dist = &end. - & start.;
do i = &start. to &end.;
if vals{i} ne "" and abs(i - dx_date) < dist
then do;
dist = abs(i - dx_date);
alcohol_user = vals{i};
end;
end;
output;
end;
drop dist i;
run;
Untested, posted from my tablet.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.