I tried to replace missing character values (alcohol_user) with the closest non-missing values (either before or after) within the same DMRN.
This is how my data set looks like:
DMRN DX_DATE ALCOHOL_USER
16 28FEB2017
16 13JUN2017 No
16 26SEP2017 No
24 12MAY2014 No
24 28MAR2016
24 25AUG2017
25 02SEP2021
25 29NOV2021 Yes
The outcome that I would like it to be:
DMRN DX_DATE ALCOHOL_USER
16 28FEB2017 No
16 13JUN2017 No
16 26SEP2017 No
24 12MAY2014 No
24 28MAR2016 No
24 25AUG2017 No
25 02SEP2021 Yes
25 29NOV2021 Yes
I tried this code but it didn't work out. Any help would be very appreciated. Thank you!
In the absence of sample data in the form of a working DATA step, this should work.
data vneed / view=vneed;
set have (rename=(alcohol_user=_alc_user));
by dmrn;
where _alc_user^=' ';
dx_date=ifn(first.dmrn=1,'01jan1900'd,mean(dx_date,lag(dx_date)),'01jan1900'd);
run;
data want (drop=_:);
set have (in=inh) vneed (in=inv keep=dmrn dx_date);
by dmrn dx_date;
if inv then set vneed (keep=_alc_user);
if inh;
if alcohol_user=' ' then alcohol_user=_alc_user;
run;
How does this work? Data set view VNEED has only non-missing alcohol_user data (renamed to _alc_user). But its dx_dates are changed. dx_date is reset to halfway between the current valid obs and the prior valid obs ("valid" means non-missing _alc_user for the same DMRN). And in the case of the first non-missing _ALC_USER for a given DMRN, the DX_DATE is set back to 01jan1900.
As a result, when VNEED is interleaved with HAVE (the SET HAVE ... VNEED; BY DMRN DX_DATE statements), one can use the IF INV condition to read in (and retain) the non-missing _ALC_USER variable, until the next IF INV is satisfied. And since the DX_DATEs in vneed are at halfway points between observed DX_DATEs, missing ALCOHOL_USER will always get the closest DX_DATE with non-missing values.
Important note: the first SET statement in DATA WANT, must NOT read in _ALC_USER. Therefore _ALC_USER is not automatically replaced, until the IF INV condition is satisfied.
Edited note: This assume data are sorted by DMRN/DX_DATE.
@Patrick wrote:
@mkeintz Appears to be a "duplicate" question with an accepted solution here
True.
The code I offered above takes advantage of the fact that the data are sorted, which AFAICT, SQL does not. There is likely to be a significant performance benefit.
It uses the same technique as the "carry the future back" section in a presentation I gave at PharmaSUG 2019 (History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies), except it mixes a subset of a series with the whole series.
One of the reasons I have come to promote this technique is that it is fairly simple to expand mixing of two series to three or more series, which I think can become rather messy in SQL. Applying it to finding the "closest" value only explicitly occurred to me when contemplating this OP's post.
Plug: I'll be presenting content from the PharmaSUG paper at next month's Boston Area SAS User's group. And I hope to include an example of "finding the closest" as well.
Load your data into a date-indexed array:
%let start = %sysfunc(inputn(2000-01-01,yymmdd10.));
%let end = %sysfunc(inputn(2024-12-31,yymmdd10.));
data want;
array vals {&start.:&end.} $3 _temporary_;
do i = &start. to &end.;
vals{i} = "";
end;
do until (last.dmrn);
set have;
by dmrn;
vals{dx_date} = alcohol_user;
end;
do until (last.dmrn);
set have;
by dmrn;
if alcohol_user = ""
then do;
dist = &end. - & start.;
do i = &start. to &end.;
if vals{i} ne "" and abs(i - dx_date) < dist
then do;
dist = abs(i - dx_date);
alcohol_user = vals{i};
end;
end;
output;
end;
drop dist i;
run;
Untested, posted from my tablet.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.