BookmarkSubscribeRSS Feed
tan-wongv
Obsidian | Level 7

I tried to replace missing character values (alcohol_user) with the  closest non-missing values (either before or after) within the same DMRN. 

 

This is how my data set looks like:

DMRN          DX_DATE          ALCOHOL_USER

16                    28FEB2017 

16                    13JUN2017                 No

16                    26SEP2017                 No

24                   12MAY2014                No

24                   28MAR2016

24                   25AUG2017

25                   02SEP2021

25                   29NOV2021               Yes

 

The outcome that I would like it to be:

DMRN          DX_DATE          ALCOHOL_USER

16                    28FEB2017                No

16                    13JUN2017                 No

16                    26SEP2017                 No

24                   12MAY2014                No

24                   28MAR2016               No

24                   25AUG2017                No

25                   02SEP2021                Yes

25                   29NOV2021               Yes

 

I tried this code but it didn't work out. Any help would be very appreciated. Thank you!

Screenshot 2024-03-13 001420.png

4 REPLIES 4
mkeintz
PROC Star

In the absence of sample data in the form of a working DATA step, this should work.

 

data vneed / view=vneed;
  set have (rename=(alcohol_user=_alc_user));
  by dmrn;
  where _alc_user^=' ';
  dx_date=ifn(first.dmrn=1,'01jan1900'd,mean(dx_date,lag(dx_date)),'01jan1900'd);
run;

data want (drop=_:);
  set have (in=inh)  vneed (in=inv keep=dmrn dx_date);
  by dmrn dx_date;
  if inv then set vneed (keep=_alc_user);
  if inh;
  if alcohol_user=' ' then alcohol_user=_alc_user;
run;

How does this work?   Data set view VNEED has only non-missing alcohol_user data (renamed to _alc_user).  But its dx_dates are changed.   dx_date is reset to halfway between the current valid obs and the prior valid obs ("valid" means non-missing _alc_user for the same DMRN).   And in the case of the first non-missing _ALC_USER for a given DMRN, the DX_DATE is set back to 01jan1900.

 

As a result, when VNEED is interleaved with HAVE (the SET HAVE ... VNEED; BY DMRN DX_DATE statements), one can use the IF INV condition to read in (and retain) the non-missing _ALC_USER variable, until the next IF INV is satisfied.  And since the DX_DATEs in vneed are at halfway points between observed DX_DATEs, missing ALCOHOL_USER will always get the closest DX_DATE with non-missing values.

 

Important note:  the first SET statement in DATA WANT, must NOT read in _ALC_USER.  Therefore _ALC_USER is not automatically replaced, until the IF INV condition is satisfied.

 

Edited note: This assume data are sorted by DMRN/DX_DATE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Patrick
Opal | Level 21

@mkeintz Appears to be a "duplicate" question with an accepted solution here

mkeintz
PROC Star

@Patrick wrote:

@mkeintz Appears to be a "duplicate" question with an accepted solution here


True.  

 

The code I offered above takes advantage of the fact that the data are sorted, which AFAICT, SQL does not.  There is likely to be a significant performance benefit.

 

It uses the same technique as the "carry the future back" section in a presentation I gave at PharmaSUG 2019 (History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies), except it mixes a subset of a series with the whole series.

 

One of the reasons I have come to promote this technique is that it is fairly simple to expand mixing of two series to three or more series, which I think can become rather messy in SQL.  Applying it to finding the "closest" value only explicitly occurred to me when contemplating this OP's post.

 

Plug:  I'll be presenting content from the PharmaSUG paper at next month's Boston Area SAS User's group.  And I hope to include an example of "finding the closest" as well.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Kurt_Bremser
Super User

Load your data into a date-indexed array:

%let start = %sysfunc(inputn(2000-01-01,yymmdd10.));
%let end = %sysfunc(inputn(2024-12-31,yymmdd10.));

data want;
array vals {&start.:&end.} $3 _temporary_;
do i = &start. to &end.;
  vals{i} = "";
end;
do until (last.dmrn);
  set have;
  by dmrn;
  vals{dx_date} = alcohol_user;
end;
do until (last.dmrn);
  set have;
  by dmrn;
  if alcohol_user = ""
  then do;
  dist = &end. - & start.;
  do i = &start. to &end.;
    if vals{i} ne "" and abs(i - dx_date) < dist
    then do;
      dist = abs(i - dx_date);
      alcohol_user = vals{i};
    end;
  end;
  output;
end;
drop dist i;
run;

Untested, posted from my tablet.

sas-innovate-white.png

Special offer for SAS Communities members

Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 984 views
  • 0 likes
  • 4 in conversation