I'm having trouble merging datasets on date ranges.
A couple problems encountered are 1) only selecting the first observation per dataset that meets the criteria, and 2) not duplicating that observation on other rows.
In this sample data, any follow-up (fu) and medication (med) observations which occur within 30 days of the OP date, would be merged once and then not used again. There are several datasets like this, so sequential merge/join statements would work fine.
data optest; input id op date_op MMDDYY10.; format date_op MMDDYY10.; datalines;
1 1 02/12/2020
1 1 02/13/2020
1 1 03/03/2021
1 1 03/05/2021
1 1 03/11/2021
;run;
proc print data=optest; run;
data futest; input id fu date_fu MMDDYY10.; format date_fu MMDDYY10.; datalines;
1 1 03/05/2020
1 1 03/16/2021
1 1 01/16/2022
;run;
proc print data=futest; run;
data medtest; input id med date_med MMDDYY10.; format date_med MMDDYY10.; datalines;
1 1 03/25/2021
;run;
proc print data=medtest; run;
Want:
id op date_op fu date_fu med date_med
1 1 02/12/2020 1 03/05/2020
1 1 02/13/2020
1 1 03/03/2021 1 03/16/2021 1 03/25/2021
1 1 03/05/2021
1 1 03/11/2021
1 1 01/16/2022
Looks like this is the same as this post
https://communities.sas.com/t5/SAS-Programming/Merging-Data-on-Closest-Date-Values-without-replaceme...
Two different OP's asking the same question with the same data?
Is this a homework assignment?
The situation is that i have 3 datasets that are all logging different things about a subject. The initial visit, follow up visit and medication prescribe date. I need to make a "best guess" as to when the follow up visit corresponds to the initial and when meds corresponds to the follow up (or initial if follow up is unavailable.) It cant be the absolute value of the closest date because a follow up can not happen BEFORE the initial and meds cant be prescribed before a visit. Also, once a follow up visit have been used to proceed an initial it can not be used again, so I need a way to exclude if from being considered further.
I have tried a few different approaches with merges, sets and SQLs but have not found anything that will get me to where i want to be. Does anyone have experience with this kind of merge or filtering.
I am open to all possibilities so if there is a better path feel free.
Want:
id op date_op fu date_fu med date_med
1 1 02/12/2020 1 03/05/2020
1 1 02/13/2020
1 1 03/03/2021 1 03/16/2021 1 03/25/2021
1 1 03/05/2021
1 1 03/11/2021
1 1 01/16/2022
data optest; input id op date_op MMDDYY10.; format date_op MMDDYY10.; datalines; 1 1 02/12/2020 1 1 02/13/2020 1 1 03/03/2021 1 1 03/05/2021 1 1 03/11/2021 ;run; proc print data=optest; run; data futest; input id fu date_fu MMDDYY10.; format date_fu MMDDYY10.; datalines; 1 1 03/05/2020 1 1 03/16/2021 1 1 01/16/2022 ;run; proc print data=futest; run; data medtest; input id med date_med MMDDYY10.; format date_med MMDDYY10.; datalines; 1 1 03/25/2021 ;run; proc print data=medtest; run;
I think this can best done by a data step. But you need to establish a few rules of assignment of an futest and/or medtest record to a specific optest record.
For instance:
This is the same question as
Merging Inexact Dates on Range with Multiple Obser... - SAS Support Communities
Why are two OP's asking the same question?
@GreyJoy wrote:
Possibly because they are working on the same problem and both had the idea of asking SAS communities for help.
Do you have any helpful input or just making random comments on posts?
Then you both came to the right place. But since you are working on the same problem, it will be better to work on this merged thread.
I'll repeat a couple of questions that I still have:
OK. I've merged the other thread into this one.
You have answered one of the questions I posted on the other thread - namely that linked DATE_FU and DATE_MED dates must be within 30 days of the DATE_OP. So 30 days gap makes a potential link "stale".
The other question is why is the 03/05/2020 FU_DATE linked to the 02/12/2020 DATE_OP, instead of the 02/13/2020 DATE_OP, which is closer? Is the rule to associate a DATE_FU and/or DATE_MED with the earliest preceding non-stale DATE_OP ?
Thanks for merging these threads.
Theoretically, the FU and MED would be linked to the earliest non-stale (30 days) OP date, because potentially this is the participant's first of several OP encounters.
However, in the absence of a reasonably straightforward solution, the closest OP to FU date is acceptable.
To address your previous questions:
1. A DATE_FU can equal or follow a DATE_OP.
2. A DATE_MED cannot precede a DATE_FU, but can be linked to the same DATE_OP, if for instance, they are all on the same date.
Generally speaking: OP <= FU <= MED
3. Answered above, but to reiterate, DATE_FU & DATE_MED are linked to the first non-stale OP instance, but the closest preceding OP is allowable.
@mkeintz wrote:
- Must a DATE_FU actually follow a DATE_PO, or is a link between equal dates allowed?
- Can a DATE_MED precede a DATE_FU and be linked to the same DATE_PO?
- The first DATE_FU in your sample data could be linked to either the first or second DATE_PO. It appears from your sample WANT outcome, that you have a rule to match a DATE_FU to the earliest preceding DATE_PO, even if there is a closer preceding DATE_PO. Is that correct?
If you could build a linktable that looks like this:
Link | OP_dummy | FU_dummy | MED_dummy |
1 | 1 | 1 | |
2 | 1 | ||
3 | 1 | 1 | 1 |
4 | 1 | ||
5 | 1 | ||
6 | 1 |
You could loop over the rows (L=1 to 6) with logic like this, where _nlink=6:
do L=1 to _nlink;
call missing (of _sentinel1--_sentinel2);
if _linkdates{L,1}^=. then set optest;
if _linkdates{L,2}^=. then set futest;
if _linkdates{L,3}^=. then set medtest;
output;
end;
The above assumes the _sentinel1 variable is to the left of the variables from the 3 datasets, and _sentinel2 to the right, and the two-way array _linkdates includes only the rightmost 3 columns above.
So the task obviously is to build that link table in a first pass of each ID, so that the second pass could use the code above: The below does that:
data optest; input id op date_op MMDDYY10.; format date_op MMDDYY10.; datalines;
1 1 02/12/2020
1 1 02/13/2020
1 1 03/03/2021
1 1 03/05/2021
1 1 03/11/2021
;run;
data futest; input id fu date_fu MMDDYY10.; format date_fu MMDDYY10.; datalines;
1 1 03/05/2020
1 1 03/16/2021
1 1 01/16/2022
;run;
data medtest; input id med date_med MMDDYY10.; format date_med MMDDYY10.; datalines;
1 1 03/25/2021
data want (drop=_: L) ;
retain _sentinel1 . ;
if 0 then set optest futest medtest;
retain _sentinel2 . ;
array _linkdates {40,3}; *Accomodate up to 40 links of 3 dates (date_op,date_fu,date_med)**;
_nlink=0;
do until (last.id);
set optest (keep=id date_op rename=(date_op=_srtdate) in=inop)
futest (keep=id date_fu rename=(date_fu=_srtdate) in=infu)
medtest (keep=id date_med rename=(date_med=_srtdate) in=inmd) ;
by id _srtdate;
if inop then do;
_nlink+1; **New link**;
_linkdates{_nlink,1}=_srtdate; ** Renamed from DATE_OP **;
end;
else if infu then do;
if _nlink=0 then L=1; **In case FUTEST is 1st record for this id**;
else do L=1 to _nlink until (_srtdate-30 <= _linkdates{L,1});
end;
if L>_nlink then _nlink=L; **New link**;
_linkdates{L,2}=_srtdate; *DATE_FU in second column;
end;
else if inmd then do;
if _nlink=0 then L=1; **In case MEDTEST is 1st record for this id**;
else do L=1 to _nlink until (_srtdate-30 <= _linkdates{L,1} and _srtdate>=_linkdates{L,2});
end;
if L>_nlink then _nlink=L; **New link**;
_linkdates{L,3}=_srtdate; *DATE_MED in third column;
end;
end;
do L=1 to _nlink;
call missing (of _sentinel1--_sentinel2);
if _linkdates{L,1}^=. then set optest;
if _linkdates{L,2}^=. then set futest;
if _linkdates{L,3}^=. then set medtest;
output;
end;
run;
Be sure to set the row size of the _linkdates array to accomodated the largest number of linkages expected for any single ID.
Note the else do loops have no internal code. They just have a stopping condition based on the date relationships you posted. If none of the pre-existing links satisfy the stopping condition, then the index variable L will be exactly 1 greater than the prior _nlinks value - indicating need for a new link id.
Actually the _linkdates table in the program doesn't used dummy variables in the tables cells. It uses date values.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.