Solved: Assign the same value to identical IDs with date difference less than ...

tan-wongv · Posted 03-31-2024 10:37 PM

I would like to replace missing value with closest value from the same DMRN, but the dx_date should be less than 365 days.

This is how my data look like:

DMRN dx_date BMI

30 04FEB2013 35.82

30 04JUN2013 .

30 04NOV2014 26.55

25 22JAN2016 30.42

25 25MAR2022 .

This is how I want it to be;

DMRN dx_date BMI

30 04FEB2013 35.82

30 04JUN2013 35.82

30 04NOV2013 26.55

25 22JAN2016 30.42

25 25MAR2022 .

What codes should I use?

Patrick · Posted 04-01-2024 01:54 AM

If you don't have a lot of data with a lot of missing values then below should suffice.

data have;
 input DMRN dx_date:date9. BMI;
 format dx_date date9.;
 datalines;
30 04FEB2013 .
30 04MAR2013 99.10
30 03JUN2013 1
30 04JUN2013 .
30 05JUN2013 2
30 06JUN2013 .
30 07JUN2013 .
30 08JUN2013 3
30 04NOV2014 26.55
25 22JAN2016 30.42
25 25MAR2022 .
;

proc sql;
  select 
    l.*,
    case
      when missing(l.bmi) then (select r.bmi
                                from
                                  (
                                    select r.dx_date,r.bmi
                                    from have r 
                                    where r.dmrn=l.dmrn and not missing(r.bmi)
                                    having abs(r.dx_date-l.dx_date)=min(abs(r.dx_date-l.dx_date)) 
                                           and abs(r.dx_date-l.dx_date)<365
                                  ) r
                                  having min(r.dx_date)=r.dx_date
                                )
      else l.bmi
      end as bmi_2
  from have l
  order by l.dmrn, l.dx_date
  ;
quit;

If you need time series data in the end then consider using Proc Expand instead (requires SAS/ETS licensed).

View solution in original post

Patrick · Posted 04-01-2024 01:54 AM

If you don't have a lot of data with a lot of missing values then below should suffice.

data have;
 input DMRN dx_date:date9. BMI;
 format dx_date date9.;
 datalines;
30 04FEB2013 .
30 04MAR2013 99.10
30 03JUN2013 1
30 04JUN2013 .
30 05JUN2013 2
30 06JUN2013 .
30 07JUN2013 .
30 08JUN2013 3
30 04NOV2014 26.55
25 22JAN2016 30.42
25 25MAR2022 .
;

proc sql;
  select 
    l.*,
    case
      when missing(l.bmi) then (select r.bmi
                                from
                                  (
                                    select r.dx_date,r.bmi
                                    from have r 
                                    where r.dmrn=l.dmrn and not missing(r.bmi)
                                    having abs(r.dx_date-l.dx_date)=min(abs(r.dx_date-l.dx_date)) 
                                           and abs(r.dx_date-l.dx_date)<365
                                  ) r
                                  having min(r.dx_date)=r.dx_date
                                )
      else l.bmi
      end as bmi_2
  from have l
  order by l.dmrn, l.dx_date
  ;
quit;

If you need time series data in the end then consider using Proc Expand instead (requires SAS/ETS licensed).

tan-wongv · Posted 04-02-2024 01:46 AM

Thank you so much for your help!

Kurt_Bremser · Posted 04-01-2024 02:46 AM

Do you want to only carry forward from previous events, or also backward from future events, if these are nearer?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 04-01-2024 08:02 AM

@Tom pointed out that my original code did not accomodate situations where a DMRN group begins with a missing BMI value. So here is a data step approach that requires presorted the data by DMRN,DX_DATE.

data have;
 input DMRN dx_date:date9. BMI;
 format dx_date date9.;
 datalines;
30 04FEB2013 .
30 04MAR2013 99.10
30 03JUN2013 1
30 04JUN2013 .
30 05JUN2013 2
30 06JUN2013 .
30 07JUN2013 .
30 08JUN2013 3
30 04NOV2014 26.55
25 22JAN2016 30.42
25 25MAR2022 .
;


proc sort data=have;
  by dmrn dx_date;
run;

data non_Missing_windows  /view=non_Missing_windows; 
  set have (rename=(dx_date=nm_dte bmi=nm_bmi) where=(nm_bmi^=.)); /*Read only non-missing BMI's*/
  by dmrn;
  dx_date=nm_dte - ifn(first.dmrn,365,min(365,0.5*dif(nm_dte)));
  format dx_date date9.;
run;

data want (drop=nm_:);
  set have (in=inhave)
      non_missing_windows (in=nmw keep=dmrn dx_date) ;
  by dmrn dx_date;
  if nmw then set non_missing_windows (keep=nm_dte nm_bmi);
  else if first.dmrn then call missing(of nm_:);

  if inhave;
  new_bmi=ifn(bmi=. and dx_date-365<=nm_dte,nm_bmi,bmi);
run;

The code below is the original response, but does not properly process DMRN's that start with a missing BMI:

data have;
input DMRN         dx_date  :date9.             BMI;
format dx_date date9.;
datalines;
30                04FEB2013      35.82
30                04JUN2013        .
30                04NOV2014      26.55
25                 22JAN2016     30.42
25                 25MAR2022      .
run;

data want (drop=prv_:  nxt_:);
  set have;
  by dmrn notsorted;

  if bmi^=. then do;  /*If non-missing BMI, update PRV_ and NXT_ */
    new_bmi=bmi;
    merge have (firstobs=1 rename=(dmrn=prv_dmrn dx_date=prv_dte bmi=prv_bmi) where=(prv_bmi^=.))
          have (firstobs=2 rename=(dmrn=nxt_dmrn dx_date=nxt_dte bmi=nxt_bmi) where=(nxt_bmi^=.)) ;
  end;
  else do;
    /* Eliminate Non-eligible or ancient history*/
    if prv_dmrn^=dmrn or prv_dte<dx_date-365 then call missing(of prv_:);

    /* Copy PREV, if next is too far or for another DMRN  */
    /* Copy NEXT if it is near enough and PREV is missing */
    if nxt_dmrn ^= dmrn or nxt_dte> dx_date+365 then new_bmi=prv_bmi; 
    else if prv_dte=. and nxt_dmrn=dmrn and nxt_dte <= dx_date+365 then new_bmi=nxt_bmi;

    else do;  /*But if Both prev and next qualify, compare distances */
      prv_nxt_mean_dte=mean(prv_dte,nxt_dte);
      if prv_nxt_mean_dte > dx_date then new_bmi=nxt_bmi;
      else new_bmi=prv_bmi;
    end;
  end;
run;

~~I've created a variable NEW_BMI so you can see the before and after date step values of BMI.~~

~~The primary tool here is the conditional use of the MERGE statement.~~

First, because the MERGE is only run conditionally (i.e. when BMI^=.) the values it reads in are automatically "retained" across subsequent observations that have missing BMI. That is, they are retained until the next MERGE.
~~Second, one of the merge datasets is the HAVE dataset, but starting at firstobs=2, which allows access to the next non-missing value of BMI.~~
~~Note both datasets in the MERGE have qualifying WHERE parameters to ensure that only non-missing BMI values are used for the previous and next values.~~

Then, when the current BMI is missing, one can examine the PRV (previous) and NXT non-missing values to get BMI from the closest date. BTW, if the previous and next dates are equally distant, the code above takes the previous BMI.

~~This program assumes that each DMRN is sorted by DX_DATE.~~

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 04-01-2024 03:15 PM

I don't think you have thought through all of the possible combinations of missing BMI values.

For example if the first observation has a missing value of BMI then the NXT_.... variables are not populated since the MERGE never happens.

Not sure if there are others since I did not test every scenario.

mkeintz · Posted 04-01-2024 03:24 PM

@Tom wrote:

I don't think you have thought through all of the possible combinations of missing BMI values.

For example if the first observation has a missing value of BMI then the NXT_.... variables are not populated since the MERGE never happens.

Not sure if there are others since I did not test every scenario.

You're right. Revision needed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

tan-wongv · Posted 04-02-2024 01:46 AM

Thank you so much for your response!

Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days

Re: Assign the same value to identical IDs with date difference less than 365 days