BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
tan-wongv
Obsidian | Level 7

I would like to replace missing value with closest value from the same DMRN, but the dx_date should be less than 365 days.

This is how my data look like:

DMRN         dx_date             BMI

30                04FEB2013      35.82

30                04JUN2013        .

30                04NOV2014    26.55

25                 22JAN2016     30.42

25                 25MAR2022      .

 

This is how I want it to be;

DMRN          dx_date           BMI

30                04FEB2013      35.82

30                04JUN2013      35.82

30                04NOV2013    26.55

25                 22JAN2016     30.42

25                 25MAR2022      .

What codes should I use?

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

If you don't have a lot of data with a lot of missing values then below should suffice.

data have;
 input DMRN dx_date:date9. BMI;
 format dx_date date9.;
 datalines;
30 04FEB2013 .
30 04MAR2013 99.10
30 03JUN2013 1
30 04JUN2013 .
30 05JUN2013 2
30 06JUN2013 .
30 07JUN2013 .
30 08JUN2013 3
30 04NOV2014 26.55
25 22JAN2016 30.42
25 25MAR2022 .
;

proc sql;
  select 
    l.*,
    case
      when missing(l.bmi) then (select r.bmi
                                from
                                  (
                                    select r.dx_date,r.bmi
                                    from have r 
                                    where r.dmrn=l.dmrn and not missing(r.bmi)
                                    having abs(r.dx_date-l.dx_date)=min(abs(r.dx_date-l.dx_date)) 
                                           and abs(r.dx_date-l.dx_date)<365
                                  ) r
                                  having min(r.dx_date)=r.dx_date
                                )
      else l.bmi
      end as bmi_2
  from have l
  order by l.dmrn, l.dx_date
  ;
quit;

If you need time series data in the end then consider using Proc Expand instead (requires SAS/ETS licensed).

View solution in original post

7 REPLIES 7
Patrick
Opal | Level 21

If you don't have a lot of data with a lot of missing values then below should suffice.

data have;
 input DMRN dx_date:date9. BMI;
 format dx_date date9.;
 datalines;
30 04FEB2013 .
30 04MAR2013 99.10
30 03JUN2013 1
30 04JUN2013 .
30 05JUN2013 2
30 06JUN2013 .
30 07JUN2013 .
30 08JUN2013 3
30 04NOV2014 26.55
25 22JAN2016 30.42
25 25MAR2022 .
;

proc sql;
  select 
    l.*,
    case
      when missing(l.bmi) then (select r.bmi
                                from
                                  (
                                    select r.dx_date,r.bmi
                                    from have r 
                                    where r.dmrn=l.dmrn and not missing(r.bmi)
                                    having abs(r.dx_date-l.dx_date)=min(abs(r.dx_date-l.dx_date)) 
                                           and abs(r.dx_date-l.dx_date)<365
                                  ) r
                                  having min(r.dx_date)=r.dx_date
                                )
      else l.bmi
      end as bmi_2
  from have l
  order by l.dmrn, l.dx_date
  ;
quit;

If you need time series data in the end then consider using Proc Expand instead (requires SAS/ETS licensed).

tan-wongv
Obsidian | Level 7
Thank you so much for your help!
mkeintz
PROC Star

@Tom pointed out that my original code did not accomodate situations where a DMRN group begins with a missing BMI value.  So here is a data step approach that requires presorted the data by DMRN,DX_DATE.

 

data have;
 input DMRN dx_date:date9. BMI;
 format dx_date date9.;
 datalines;
30 04FEB2013 .
30 04MAR2013 99.10
30 03JUN2013 1
30 04JUN2013 .
30 05JUN2013 2
30 06JUN2013 .
30 07JUN2013 .
30 08JUN2013 3
30 04NOV2014 26.55
25 22JAN2016 30.42
25 25MAR2022 .
;


proc sort data=have;
  by dmrn dx_date;
run;

data non_Missing_windows  /view=non_Missing_windows; 
  set have (rename=(dx_date=nm_dte bmi=nm_bmi) where=(nm_bmi^=.)); /*Read only non-missing BMI's*/
  by dmrn;
  dx_date=nm_dte - ifn(first.dmrn,365,min(365,0.5*dif(nm_dte)));
  format dx_date date9.;
run;

data want (drop=nm_:);
  set have (in=inhave)
      non_missing_windows (in=nmw keep=dmrn dx_date) ;
  by dmrn dx_date;
  if nmw then set non_missing_windows (keep=nm_dte nm_bmi);
  else if first.dmrn then call missing(of nm_:);

  if inhave;
  new_bmi=ifn(bmi=. and dx_date-365<=nm_dte,nm_bmi,bmi);
run;

The code below is the original response, but does not properly process DMRN's that start with a missing BMI:

data have;
input DMRN         dx_date  :date9.             BMI;
format dx_date date9.;
datalines;
30                04FEB2013      35.82
30                04JUN2013        .
30                04NOV2014      26.55
25                 22JAN2016     30.42
25                 25MAR2022      .
run;

data want (drop=prv_:  nxt_:);
  set have;
  by dmrn notsorted;

  if bmi^=. then do;  /*If non-missing BMI, update PRV_ and NXT_ */
    new_bmi=bmi;
    merge have (firstobs=1 rename=(dmrn=prv_dmrn dx_date=prv_dte bmi=prv_bmi) where=(prv_bmi^=.))
          have (firstobs=2 rename=(dmrn=nxt_dmrn dx_date=nxt_dte bmi=nxt_bmi) where=(nxt_bmi^=.)) ;
  end;
  else do;
    /* Eliminate Non-eligible or ancient history*/
    if prv_dmrn^=dmrn or prv_dte<dx_date-365 then call missing(of prv_:);

    /* Copy PREV, if next is too far or for another DMRN  */
    /* Copy NEXT if it is near enough and PREV is missing */
    if nxt_dmrn ^= dmrn or nxt_dte> dx_date+365 then new_bmi=prv_bmi; 
    else if prv_dte=. and nxt_dmrn=dmrn and nxt_dte <= dx_date+365 then new_bmi=nxt_bmi;

    else do;  /*But if Both prev and next qualify, compare distances */
      prv_nxt_mean_dte=mean(prv_dte,nxt_dte);
      if prv_nxt_mean_dte > dx_date then new_bmi=nxt_bmi;
      else new_bmi=prv_bmi;
    end;
  end;
run;

I've created a variable NEW_BMI so you can see the before and after date step values of BMI.

 

The primary tool here is the conditional use of the MERGE statement. 

  1.   First, because the MERGE is only run conditionally (i.e. when BMI^=.) the values it reads in are automatically "retained" across subsequent observations that have missing BMI.  That is, they are retained until the next MERGE.
  2. Second, one of the merge datasets is the HAVE dataset, but starting at firstobs=2, which allows access to the next non-missing value of BMI.
  3. Note both datasets in the MERGE have qualifying WHERE parameters to ensure that only non-missing BMI values are used for the previous and next values.

 

Then, when the current BMI is missing, one can examine the PRV (previous) and NXT non-missing values to get BMI from the closest date.  BTW, if the previous and next dates are equally distant, the code above takes the previous BMI.

 

This program assumes that each DMRN is sorted by DX_DATE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

I don't think you have thought through all of the possible combinations of missing BMI values.

 

For example if the first observation has a missing value of BMI then the NXT_.... variables are not populated since the MERGE never happens.

 

Not sure if there are others since I did not test every scenario.

mkeintz
PROC Star

@Tom wrote:

I don't think you have thought through all of the possible combinations of missing BMI values.

 

For example if the first observation has a missing value of BMI then the NXT_.... variables are not populated since the MERGE never happens.

 

Not sure if there are others since I did not test every scenario.


You're right.  Revision needed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tan-wongv
Obsidian | Level 7
Thank you so much for your response!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 552 views
  • 2 likes
  • 5 in conversation