I have two data sets (set1 set2) with a common id.
data set1;
input id $ d_start :date9. d_end :date9.
physcat $ a b c d e;
format d_start d_end date9.;
datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;
data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;
I am trying to join specific variables from set1 to set2 (physcat, a, b, c, d, e) by id where eventstart falls within a d_start and d_end date range in set1.
*In the case more than 1 observation from set1 had date ranges that include eventstart, I would want to take the variable values from the observation with the lowest b, c and d. These are temporal categorical variables where lower values would indicate earlier time (i.e. before a birthday or promotion), and they should increase uniformly together or singularly, but not reverse order. In the example above for s001, the first observation has lower values (2 3 1) than the second observation (3 3 1) so I would want the former if eventstart fell in that range. *
The desired dataset would look like this:
data merged;
input id $ eventstart :date9.
physcat $ a b c d e;
format eventstart date9.;
datalines;
s001 15JUN2020 A 1 2 3 1 0
s001 20JUL2021 B 1 3 3 2 1
s002 15AUG2019 A 2 1 2 1 1
s002 01JUL2020 C 2 2 2 2 0
s003 01JAN2021 B 1 3 4 1 0
s003 01JAN2023 . . . . . . /* I do not think there should be any cases without matches, but if there were I would want to retain with missing values */
;
run;
What is the simplest way to do this? I have tried a few different joins with proc sql but am obviously making mistakes, as they end up producing more observations for an id than exist in set2. Set2 should maintain the same number of observations prior to merge/join.
The following gives me your desired table. I created a cartesian join between matching id values and then compared the dates and then selected the first in a BY group.
data set1;
input id $ d_start :date9. d_end :date9.
physcat $ a b c d e;
format d_start d_end date9.;
datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;
data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;
proc sql;
create table combined as
select set1.id, set1.d_start, set1.d_end, set1.physcat, set1.a,
set1.b, set1.c, set1.d, set1.e, set2.eventstart
from set1, set2
where set1.id=set2.id;
quit;
data combined;
set combined;
where d_start le eventstart le d_end;
run;
proc sort data=combined;
by id d_start eventstart d_end;
run;
data combined;
set combined;
by id d_start eventstart d_end;
if first.d_end then output;
run;
proc print data=combined;
run;
Please update your question to show the PROC SQL join you have tried. That will help others help you.
@Quentin Thank you. I was having some formatting issues when copying and pasting code for some reason...
Here is a recent example that resulted in increasing the number of observations in set 2.
proc sql;
create table best_matches as
select
s2.id,
s2.eventstart,
s1.physcat,
s1.a,
s1.b,
s1.c,
s1.d,
s1.e,
s1.d_end
from set2 as s2
inner join set1 as s1
on s1.id = s2.id
and s2.eventstart between s1.d_start and s1.d_end
group by s2.id, s2.eventstart
having s1.d_end = min(s1.d_end)
;
create table merged as
select
s2.*,
bm.physcat,
bm.a, bm.b, bm.c, bm.d, bm.e
from set2 as s2
left join best_matches as bm
on s2.id = bm.id
and s2.eventstart = bm.eventstart
;
quit;
The following gives me your desired table. I created a cartesian join between matching id values and then compared the dates and then selected the first in a BY group.
data set1;
input id $ d_start :date9. d_end :date9.
physcat $ a b c d e;
format d_start d_end date9.;
datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;
data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;
proc sql;
create table combined as
select set1.id, set1.d_start, set1.d_end, set1.physcat, set1.a,
set1.b, set1.c, set1.d, set1.e, set2.eventstart
from set1, set2
where set1.id=set2.id;
quit;
data combined;
set combined;
where d_start le eventstart le d_end;
run;
proc sort data=combined;
by id d_start eventstart d_end;
run;
data combined;
set combined;
by id d_start eventstart d_end;
if first.d_end then output;
run;
proc print data=combined;
run;
Thanks @Kathryn_SAS for the quick response!
I'm testing if this worked... I lose a lot of observations in set2 when I run your code, but it appears I may have a lot of set2 eventstarts that are outside of the d_start d_end ranges. I did not know that to be the case so I'm working through that now to test.
Thanks everyone for all the help! I appreciate it!
It is not clear out you order by multiple variables. Perhaps just sort by them and take the first one?
Something like this?
proc sql;
create table joined as
select a.*,b.eventstart
from set1 a left join set2 b
on a.id=b.id
and b.eventstart between a.d_start and a.d_end
order by a.id,b.eventstart,a.physcat,a.a,a.b,a.c,a.d
;
quit;
data want;
set joined;
by id eventstart physcat a ;
if first.a;
run;
Hello @sasgorilla,
I would probably create a temporary variable the sort order of which allows for using a HAVING clause selecting the minimum (or maximum) value of that variable (as a "tie-breaker"). So the new variable would reflect your criterion "the observation with the lowest b, c and d."
Here is an example using _x=b+c+d for that purpose:
proc sql; create table merged(drop=_x) as select s2.id, eventstart, physcat, a, b, c, d, e, b+c+d as _x from set2 as s2 left join set1 as s1 on s1.id = s2.id and eventstart between d_start and d_end group by s2.id, eventstart having _x=min(_x); quit;
If your interpretation of "the lowest b, c and d" is rather hierarchical, i.e., lowest b is most important, followed by lowest c (if the values of b are equal) and so on, you could define _x as something like 10000*b+100*c+d (assuming non-missing integer values b, c, d between 0 and 99). And many more definitions are possible (e.g., adding criteria for missing values).
Thanks @FreelanceReinh , that is a great idea! I will test.
I also found there may be a pre-categorized age variable in set1. It's not a DOB or that would be easy to sort by, but an age at differing arbitray period starts. I'm trying to test that out since sorting by that may work as well.
This is a good use case for applying conditional SET statements in a data step.
Assuming:
then
data set1;
input id $ d_start :date9. d_end :date9.
physcat $ a b c d e;
format d_start d_end date9.;
datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
run;
data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
run;
data need /view=need /*Keep lowest B C D for each ID/D_START*/;
set set1;
by id d_start b c d;
if first.d;
run;
data want (drop=_:);
set need (keep=id d_start in=in1 rename=(d_start=_ref_date))
set2 ( in=in2 rename=(eventstart=_ref_date));
by id _ref_date;
retain _left_sentinel .;
if in1 then set need ;
retain _right_sentinel ' ';
if in2 ;
set set2;
if first.id or eventstart>d_end then call missing(of _left_sentinel--_right_sentinel);
run;
This will also accommodate multiple events within a date range.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.