Solved: Merge/join using a date from one dataset within a date range in anothe...

sasgorilla · Posted 06-23-2025 09:36 AM

I have two data sets (set1 set2) with a common id.

data set1;
    input id $ d_start :date9. d_end :date9.
          physcat $ a b c d e;
    format d_start d_end date9.;
    datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0

s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;

data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;

I am trying to join specific variables from set1 to set2 (physcat, a, b, c, d, e) by id where eventstart falls within a d_start and d_end date range in set1.

*In the case more than 1 observation from set1 had date ranges that include eventstart, I would want to take the variable values from the observation with the lowest b, c and d. These are temporal categorical variables where lower values would indicate earlier time (i.e. before a birthday or promotion), and they should increase uniformly together or singularly, but not reverse order. In the example above for s001, the first observation has lower values (2 3 1) than the second observation (3 3 1) so I would want the former if eventstart fell in that range. *

The desired dataset would look like this:

data merged;
    input id $ eventstart :date9.
          physcat $ a b c d e;
    format eventstart date9.;
    datalines;
s001 15JUN2020 A 1 2 3 1 0
s001 20JUL2021 B 1 3 3 2 1
s002 15AUG2019 A 2 1 2 1 1
s002 01JUL2020 C 2 2 2 2 0
s003 01JAN2021 B 1 3 4 1 0
s003 01JAN2023 . . . . . .   /* I do not think there should be any cases without matches, but if there were I would want to retain with missing values */
;
run;

What is the simplest way to do this? I have tried a few different joins with proc sql but am obviously making mistakes, as they end up producing more observations for an id than exist in set2. Set2 should maintain the same number of observations prior to merge/join.

Kathryn_SAS · Posted 06-23-2025 11:08 AM

The following gives me your desired table. I created a cartesian join between matching id values and then compared the dates and then selected the first in a BY group.

data set1;
    input id $ d_start :date9. d_end :date9.
          physcat $ a b c d e;
    format d_start d_end date9.;
    datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;

data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;

proc sql;
create table combined as 
 select set1.id, set1.d_start, set1.d_end, set1.physcat, set1.a,
  set1.b, set1.c, set1.d, set1.e, set2.eventstart
 from set1, set2
  where set1.id=set2.id;
quit;

data combined;
set combined;
where d_start le eventstart le d_end;
run;

proc sort data=combined;
by id d_start eventstart d_end;
run;

data combined;
set combined;
by id d_start eventstart d_end;
if first.d_end then output;
run;

proc print data=combined;
run;

View solution in original post

Quentin · Posted 06-23-2025 09:40 AM

Please update your question to show the PROC SQL join you have tried. That will help others help you.

sasgorilla · Posted 06-23-2025 09:45 AM

@Quentin Thank you. I was having some formatting issues when copying and pasting code for some reason...

Here is a recent example that resulted in increasing the number of observations in set 2.

proc sql;
    create table best_matches as
    select 
        s2.id,
        s2.eventstart,
        s1.physcat,
        s1.a,
        s1.b,
        s1.c,
        s1.d,
        s1.e,
        s1.d_end
    from set2 as s2
    inner join set1 as s1
        on s1.id = s2.id
        and s2.eventstart between s1.d_start and s1.d_end
    group by s2.id, s2.eventstart
    having s1.d_end = min(s1.d_end)
    ;

    create table merged as
    select 
        s2.*, 
        bm.physcat,
        bm.a, bm.b, bm.c, bm.d, bm.e
    from set2 as s2
    left join best_matches as bm
        on s2.id = bm.id
        and s2.eventstart = bm.eventstart
    ;
quit;

Kathryn_SAS · Posted 06-23-2025 11:08 AM

The following gives me your desired table. I created a cartesian join between matching id values and then compared the dates and then selected the first in a BY group.

data set1;
    input id $ d_start :date9. d_end :date9.
          physcat $ a b c d e;
    format d_start d_end date9.;
    datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;

data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;

proc sql;
create table combined as 
 select set1.id, set1.d_start, set1.d_end, set1.physcat, set1.a,
  set1.b, set1.c, set1.d, set1.e, set2.eventstart
 from set1, set2
  where set1.id=set2.id;
quit;

data combined;
set combined;
where d_start le eventstart le d_end;
run;

proc sort data=combined;
by id d_start eventstart d_end;
run;

data combined;
set combined;
by id d_start eventstart d_end;
if first.d_end then output;
run;

proc print data=combined;
run;

sasgorilla · Posted 06-23-2025 11:55 AM

Thanks @Kathryn_SAS for the quick response!

I'm testing if this worked... I lose a lot of observations in set2 when I run your code, but it appears I may have a lot of set2 eventstarts that are outside of the d_start d_end ranges. I did not know that to be the case so I'm working through that now to test.

sasgorilla · Posted 06-24-2025 12:44 PM

Thanks everyone for all the help! I appreciate it!

Tom · Posted 06-23-2025 11:28 AM

It is not clear out you order by multiple variables. Perhaps just sort by them and take the first one?

Something like this?

proc sql;
create table joined as 
  select a.*,b.eventstart 
  from set1 a left join set2 b
  on a.id=b.id 
    and b.eventstart between a.d_start and a.d_end
  order by a.id,b.eventstart,a.physcat,a.a,a.b,a.c,a.d
;
quit;

data want;
  set joined;
  by id eventstart physcat a ;
  if first.a;
run;

FreelanceReinh · Posted 06-23-2025 11:53 AM

Hello @sasgorilla,

I would probably create a temporary variable the sort order of which allows for using a HAVING clause selecting the minimum (or maximum) value of that variable (as a "tie-breaker"). So the new variable would reflect your criterion "the observation with the lowest b, c and d."

Here is an example using _x=b+c+d for that purpose:

proc sql;
create table merged(drop=_x) as
select s2.id, eventstart,
       physcat, a, b, c, d, e,
       b+c+d as _x
from set2 as s2
left join set1 as s1
on s1.id = s2.id and eventstart between d_start and d_end
group by s2.id, eventstart
having _x=min(_x);
quit;

If your interpretation of "the lowest b, c and d" is rather hierarchical, i.e., lowest b is most important, followed by lowest c (if the values of b are equal) and so on, you could define _x as something like 10000*b+100*c+d (assuming non-missing integer values b, c, d between 0 and 99). And many more definitions are possible (e.g., adding criteria for missing values).

sasgorilla · Posted 06-23-2025 11:57 AM

Thanks @FreelanceReinh , that is a great idea! I will test.

I also found there may be a pre-categorized age variable in set1. It's not a DOB or that would be easy to sort by, but an age at differing arbitray period starts. I'm trying to test that out since sorting by that may work as well.

mkeintz · Posted 06-23-2025 03:23 PM

This is a good use case for applying conditional SET statements in a data step.

Assuming:

SET1 is sorted by ID/d_start/b/c/d
SET1 has no instances of overlapping d_start-d_end date ranges
SET2 is sorted by ID/eventstart

then

data set1;
    input id $ d_start :date9. d_end :date9.
          physcat $ a b c d e;
    format d_start d_end date9.;
    datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
run;


data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
run;

data need /view=need  /*Keep lowest B C D for each ID/D_START*/;
  set set1;
  by id d_start b c d;
  if first.d;
run;

data want (drop=_:);
  set need (keep=id d_start in=in1 rename=(d_start=_ref_date))
      set2 (                in=in2 rename=(eventstart=_ref_date));
  by id _ref_date;

  retain _left_sentinel .;
  if in1 then set need ;
  retain _right_sentinel ' ';

  if in2 ;
  set set2;
  if first.id or eventstart>d_end then call missing(of _left_sentinel--_right_sentinel);
run;

This will also accommodate multiple events within a date range.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Re: Merge/join using a date from one dataset within a date range in another

Registration is open

Registration is open

SAS Training: Just a Click Away