BookmarkSubscribeRSS Feed
moreka
Obsidian | Level 7

I'm having trouble merging datasets on date ranges. 

 

A couple problems encountered are 1) only selecting the first observation per dataset that meets the criteria, and 2) not duplicating that observation on other rows.

 

In this sample data, any follow-up (fu) and medication (med) observations which occur within 30 days of the OP date, would be merged once and then not used again. There are several datasets like this, so sequential merge/join statements would work fine.

 

data optest; input id op date_op MMDDYY10.; format date_op MMDDYY10.; datalines;
1 1 02/12/2020
1 1 02/13/2020
1 1 03/03/2021
1 1 03/05/2021
1 1 03/11/2021
;run;
proc print data=optest; run;
 
data futest; input id fu date_fu MMDDYY10.; format date_fu MMDDYY10.; datalines;
1 1 03/05/2020
1 1 03/16/2021
1 1 01/16/2022
;run;
proc print data=futest; run;
 
data medtest; input id med date_med MMDDYY10.; format date_med MMDDYY10.; datalines;
1 1 03/25/2021
;run;
proc print data=medtest; run;
 

 

Want:

 

 id    op    date_op        fu     date_fu            med     date_med

 

  1     1    02/12/2020    1      03/05/2020

  1     1    02/13/2020   

  1     1    03/03/2021    1      03/16/2021       1       03/25/2021

  1     1    03/05/2021

  1     1    03/11/2021

  1                                 1      01/16/2022

 

11 REPLIES 11
mkeintz
PROC Star

Two different OP's asking the same question with the same data?

 

Is this a homework assignment?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
moreka
Obsidian | Level 7
Nah, it's a problem I've been dealing with for a while and reached out with an example. He also worked on it for a while and encountered some of the same issues. I didn't know he'd posted about it earlier today.
GreyJoy
Obsidian | Level 7

The situation is that i have 3 datasets that are all logging different things about a subject. The initial visit, follow up visit and medication prescribe date. I need to make a "best guess" as to when the follow up visit corresponds to the initial and when meds corresponds to the follow up (or initial if follow up is unavailable.) It cant be the absolute value of the closest date because a follow up can not happen BEFORE the initial and meds cant be prescribed before a visit. Also, once a follow up visit have been used to proceed an initial it can not be used again, so I need a way to exclude if from being considered further. 

 

I have tried a few different approaches with merges, sets and SQLs but have not found anything that will get me to where i want to be. Does anyone have experience with this kind of merge or filtering. 

 

I am open to all possibilities so if there is a better path feel free.  

 


Want:
id   op     date_op      fu   date_fu       med   date_med

1     1  02/12/2020     1 03/05/2020
1     1 02/13/2020
1     1 03/03/2021      1 03/16/2021      1 03/25/2021
1     1 03/05/2021
1     1 03/11/2021
1                                1  01/16/2022

 

 

data optest; 
input id op date_op MMDDYY10.; format date_op MMDDYY10.;
datalines;
1 1 02/12/2020
1 1 02/13/2020
1 1 03/03/2021
1 1 03/05/2021
1 1 03/11/2021
;run;
proc print data=optest; run;
 
 
data futest; 
input id fu date_fu MMDDYY10.; format date_fu MMDDYY10.;
datalines;
1 1 03/05/2020
1 1 03/16/2021
1 1 01/16/2022
;run;
proc print data=futest; run;
 
 
data medtest; 
input id med date_med MMDDYY10.; format date_med MMDDYY10.;
datalines;
1 1 03/25/2021
;run;
proc print data=medtest; run;
 
 
mkeintz
PROC Star

I think this can best done by a data step.  But you need to establish a few rules of assignment of an futest and/or medtest record to a specific optest record.

 

For instance:

  1. Why is the 03/05/2020 futest record assigned to the 02/12/2020 optest record, instead of the closer 02/13/2020 optest record?   How far apart must the optest records be to use the one closer to the futest record?
  2. How long an interval must there be for a potential assignment to become "stale"?   You have your last futest (1/16/2022) record assigned to no optest record, presumably because the latest preceding optest record is ten months prior (3/11/2021).
  3. Can a date_med, while following a date_op, precede the associated date_fu?
  4. Can a date_fu  (or date_med) be equal to a date_op, or must it strictly follow date_op?   If it must strictly follow, is a single day later sufficient?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

This is the same question as 

Merging Inexact Dates on Range with Multiple Obser... - SAS Support Communities

 

Why are two OP's asking the same question?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
GreyJoy
Obsidian | Level 7
Possibly because they are working on the same problem and both had the idea of asking SAS communities for help.

Do you have any helpful input or just making random comments on posts?
mkeintz
PROC Star

@GreyJoy wrote:
Possibly because they are working on the same problem and both had the idea of asking SAS communities for help.

Do you have any helpful input or just making random comments on posts?

Then you both came to the right place.  But since you are working on the same problem, it will be better to work on this merged thread.

 

I'll repeat a couple of questions that I still have:

 

  1. Must a DATE_FU actually follow a DATE_PO, or is a link between equal dates allowed?

  2. Can a DATE_MED precede a DATE_FU and be linked to the same DATE_PO?

  3. The first DATE_FU in your sample data could be linked to either the first or second DATE_PO.  It appears from your sample WANT outcome, that you have a rule to match a DATE_FU to the earliest preceding DATE_PO, even if there is a closer preceding DATE_PO.  Is that correct?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

@moreka 

 

OK.  I've merged the other thread into this one.

 

You have answered one of the questions I posted on the other thread - namely that linked DATE_FU and DATE_MED dates must be within 30 days of the DATE_OP.  So 30 days gap makes a potential link "stale".

 

The other question is why is the 03/05/2020 FU_DATE linked to the 02/12/2020 DATE_OP, instead of the 02/13/2020 DATE_OP, which is closer?   Is the rule to associate a DATE_FU and/or DATE_MED with the earliest preceding  non-stale DATE_OP ?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
moreka
Obsidian | Level 7

Thanks for merging these threads.

 

Theoretically, the FU and MED would be linked to the earliest non-stale (30 days) OP date, because potentially this is the participant's first of several OP encounters.

 

However, in the absence of a reasonably straightforward solution, the closest OP to FU date is acceptable.

 

To address your previous questions:

1. A DATE_FU can equal or follow a DATE_OP.

2. A DATE_MED cannot precede a DATE_FU, but can be linked to the same DATE_OP, if for instance, they are all on the same date.

  Generally speaking:    OP <= FU <= MED

3. Answered above, but to reiterate, DATE_FU & DATE_MED are linked to the first non-stale OP instance, but the closest preceding OP is allowable.

 


@mkeintz wrote:
  1. Must a DATE_FU actually follow a DATE_PO, or is a link between equal dates allowed?
  2. Can a DATE_MED precede a DATE_FU and be linked to the same DATE_PO?
  3. The first DATE_FU in your sample data could be linked to either the first or second DATE_PO.  It appears from your sample WANT outcome, that you have a rule to match a DATE_FU to the earliest preceding DATE_PO, even if there is a closer preceding DATE_PO.  Is that correct?

 


 

 

mkeintz
PROC Star

If you could build a linktable that looks like this:

 

Link OP_dummy FU_dummy MED_dummy
1 1 1  
2 1    
3 1 1 1
4 1    
5 1    
6   1  

 

You could loop over the rows (L=1 to 6) with logic like this, where _nlink=6:

 

  do L=1 to _nlink;
    call missing (of _sentinel1--_sentinel2);
    if _linkdates{L,1}^=. then set optest;
    if _linkdates{L,2}^=. then set futest;
    if _linkdates{L,3}^=. then set medtest;
    output;
  end;

The above assumes the _sentinel1 variable is to the left of the variables from the 3 datasets, and _sentinel2 to the right, and the two-way array _linkdates includes only the rightmost 3 columns above.

 

So the task obviously is to build that link table in a first pass of each ID, so that the second pass could use the code above:  The below does that:

 

data optest; input id op date_op MMDDYY10.; format date_op MMDDYY10.; datalines;
1 1 02/12/2020
1 1 02/13/2020
1 1 03/03/2021
1 1 03/05/2021
1 1 03/11/2021
;run;
 
data futest; input id fu date_fu MMDDYY10.; format date_fu MMDDYY10.; datalines;
1 1 03/05/2020
1 1 03/16/2021
1 1 01/16/2022
;run;
 
data medtest; input id med date_med MMDDYY10.; format date_med MMDDYY10.; datalines;
1 1 03/25/2021

data want (drop=_: L) ;
  retain _sentinel1 . ;
  if 0 then set optest futest medtest;
  retain _sentinel2 . ;

  array _linkdates {40,3};     *Accomodate up to 40 links of 3 dates (date_op,date_fu,date_med)**;

  _nlink=0;
  do until (last.id);
    set optest  (keep=id date_op  rename=(date_op=_srtdate)  in=inop)
        futest  (keep=id date_fu  rename=(date_fu=_srtdate)  in=infu)
        medtest (keep=id date_med rename=(date_med=_srtdate) in=inmd) ;
    by id _srtdate;

    if inop then do;
      _nlink+1;     **New link**;
      _linkdates{_nlink,1}=_srtdate;   ** Renamed from DATE_OP **;
    end;
    else if infu then do;
      if _nlink=0 then L=1;           **In case FUTEST is 1st record for this id**;
      else do L=1 to _nlink until (_srtdate-30 <= _linkdates{L,1});
      end;
      if L>_nlink then _nlink=L;   **New link**;
      _linkdates{L,2}=_srtdate;    *DATE_FU in second column;
    end;

    else if inmd then do;
      if _nlink=0 then L=1;           **In case MEDTEST is 1st record for this id**;
      else do L=1 to _nlink until (_srtdate-30 <= _linkdates{L,1} and _srtdate>=_linkdates{L,2});
      end;
      if L>_nlink then _nlink=L;   **New link**;
      _linkdates{L,3}=_srtdate;   *DATE_MED in third column;
    end; 
  end;

  do L=1 to _nlink;
    call missing (of _sentinel1--_sentinel2);
    if _linkdates{L,1}^=. then set optest;
    if _linkdates{L,2}^=. then set futest;
    if _linkdates{L,3}^=. then set medtest;
    output;
  end;
run;

Be sure to set the row size of the _linkdates array to accomodated the largest number of linkages expected for any single ID.

 

Note the else do loops have no internal code.  They just have a stopping condition based on the date relationships you posted.  If none of the pre-existing links satisfy the stopping condition, then the index variable L will be exactly 1 greater than the prior _nlinks value - indicating need for a new link id.

 

Actually the _linkdates table in the program doesn't used dummy variables in the tables cells.  It uses date values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2737 views
  • 3 likes
  • 4 in conversation