Matching data by date intervals

SarahW13 · Posted 12-24-2019 01:33 PM

Hello all,

I have 2 datasets and am trying to figure out how to identify the rows in one dataset whose dates fall within the intervals in another dataset.

These are the datasets that I have:

Dataset A:

id SBP date

1 A 1/10/15

1 B 2/10/15

1 C 3/10/15

1 D 4/10/15

2 A 3/15/17

2 B 6/15/17

Dataset B:

id med start_date end_date

1 GroupA 1/1/15 3/1/15

1 GroupB 3/5/15 5/1/15

2 GroupA 1/15/17 2/15/17

2 GroupB 2/1/17 5/1/17

This is what I want:

id SBP date med start_date end_date

1 A 1/10/15 GroupA 1/1/15 3/1/15

1 B 2/10/15 GroupA 1/1/15 3/1/15

1 C 3/10/15 GroupB 3/5/15 5/1/15

1 D 4/10/15 GroupB 3/5/15 5/1/15

2 A 3/15/17 GroupB 2/1/17 5/1/17

2 B 6/15/17 - - -

Essentially I am trying to identify the rows from dataset A whose dates fall within the time intervals from dataset B. (Note that id#2's SBP of B did not fall within any of the date intervals from dataset B.)

Would appreciate any advice for how to do this!

novinosrin · Posted 12-24-2019 02:07 PM

data one;
input id   SBP $     date :mmddyy10.;
format date mmddyy10.;
cards;
1                     A          1/10/15
1                     B          2/10/15
1                     C          3/10/15
1                     D          4/10/15
2                     A          3/15/17
2                     B          6/15/17
;

data two;
input id med $ (start_date    end_date) (:mmddyy10.);
format start_date    end_date mmddyy10.;
cards;
1                   GroupA      1/1/15          3/1/15
1                    GroupB      3/5/15          5/1/15
2                    GroupA      1/15/17        2/15/17
2                    GroupB      2/1/17          5/1/17
;

proc sql;
create table want as
select a.*,med, start_date, end_date
from one a left join two b
on a.id=b.id and start_date<=date<=end_date;
quit;

novinosrin · Posted 12-24-2019 02:12 PM

Damn! the Proc SQL requires an ORDER BY


proc sql;
create table want as
select a.*,med, start_date, end_date
from one a left join two b
on a.id=b.id and start_date<=date<=end_date
order by a.id,sbp;
quit;

mkeintz · Posted 12-24-2019 07:11 PM

If your data set are sorted chronologically within ID, then this MERGE + conditional SETs will be faster than proc sql:

data EVENT;
input id   SBP $     date :mmddyy10.;
format date mmddyy10.;
cards;
1                     A          1/10/15
1                     B          2/10/15
1                     C          3/10/15
1                     D          4/10/15
2                     A          3/15/17
2                     B          6/15/17
run;

data RANGE;
input id med $ (start_date    end_date) (:mmddyy10.);
format start_date    end_date mmddyy10.;
cards;
1                   GroupA       1/1/15          3/1/15
1                    GroupB      3/5/15          5/1/15
2                    GroupA      1/15/17        2/15/17
2                    GroupB      2/1/17          5/1/17
run;


data want (drop=_:);
  set   EVENT (keep=id date       rename=(date=_sort_date)        in=inE)
        RANGE (keep=id start_date rename=(start_date=_sort_date)  in=inR);
  by id _sort_date;

  if inE then set EVENT;

  retain _pre_R .;
  if inR then set RANGE;
  retain _post_R .;

  if inE;
  if date < start_date or date > end_date then call missing(of _pre_R -- _post_R);
  output;
  if last.id then call missing(of _all_);

run;

This relies on the fact that any variable read in by the SET (or the MERGE statement) is, unlike newly created variables, retained by default. But since SET statements are usually executed in each and every iteration of the data step, you don't notice, since new values universally replace the preceding observations.

But in the above, only the ID and _SORT_DATE variables (renamed from DATE and START_DATE) are read in unconditionally. The subsequent IF ... THEN SET statements read in the corresponding dataset and its variables only conditionally. Those variables are retained until the condition is satisfied again. In particular the RANGE dataset will have it variables retained until the _SORT_DATE is populated by the next RANGE observation.

Then the only thing to worry about is the possibility of "stale" range variables (i.e. the event date is out of range). That's where the _PRE_R and _POST_R variables are used as sentinels, making them excellent arguments in the call missing (of ...) statement, where stale variables are set to missing.

Edited additional notes:

I should add that this program assumes that none of the ID's have observations in the RANGE dataset with overlapping ranges. I.e. START_DATE for observation i is always sometime after END_DATE for the previous observation for any given id.
I added the last two lines of the program (the explicit OUTPUT statement followed by the "if last.id then call missing ..." statement) to address situations in which the final range in one ID is followed by an EVENT starting the next ID which falls within that range (i.e. it precedes any ranges for it own id).

For those attending next years Global Forum, I will be doing a paper on exactly this sort of mixing of time series.

And I should add. Regarding the comparson to PROC SQL. Not only is this faster for sorted data sets, it is a LOT easier to extend to mixing 3 or more time series. Just think about having datasets YEAR, QTR, and MONTH, sorted by ID and date. And you want a monthly file with the most recent YEAR and QTR data, but possibly excluding YEAR or QTR data that are older than, say 24 months.

Or say you have irregular data sets ADMISSIONS, LABTESTS, and VISITS, also sorted by ID/date. You want a record for each event of any type, carrying the most recent data from the other types. MERGE by sort key, plus conditional SETs solves this problem.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 12-24-2019 07:33 PM

Very nice @mkeintz . Always learning.learned tonnes from you. However I am intrigued to think that MAGIC=102 aka SORT/MERGE join algortihm can be forced for SQL optimizer for small to medium size tables eliminating the potential nested loop through the tables one by one to be chosen by the optimizer. This could perhaps match or speed more surprisingly than a datastep.

proc sql magic=102;
create table want as
select a.*,med, start_date, end_date
from one a left join two b
on a.id=b.id and start_date<=date<=end_date;
quit;

The scope of datastep being a work horse is of course wide considering the utility but should the requirement limit to just what is stated and nothing beyond, the magic may really do super magic 🙂

mkeintz · Posted 12-24-2019 07:48 PM

Assuming the data sets are already sorted, I'd be interested in seeing situations in which the SQL solution is faster than MERGE plus conditional SETs, even if it's only 2 time series.

Does the SQL proc really take advantage of the fact that the data sets are sorted? If not, then it has to either do its own sort (presumably using a lot more memory) or it has to do a cartesian comparison of DATE against every START_DATE/END_DATE pair with a matching id. And in either case it likely needs more memory.

ps: I had to look up "magic=102". And Google helpfully provided a number of fm radio stations - had to include "sql" in the search phrase.

We must stand up to atavistic urges to immediately resort to proc sql. It can be good to be one with the data step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 12-24-2019 07:55 PM

I agree. I am keen to test, observe and learn in 2 different environments i.e.

1. If connecting a third party database, explicit vs implicit(libname) and the magic vs sorted source datasets in a datastep as you mentioned where we push the query into the database and still force the desired algorithm.

2. General SAS datasets, I think here yours should be the outright leader nonetheless, I would like to test

At this point, I honestly do not know the outcome to test as I do not have that environment to play with. But indeed a very interesting subject.

PS If there datasets were to be indexed by JOIN keys, more fun to watch for I guess

Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Registration is open

Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Re: Matching data by date intervals

Registration is open

SAS Training: Just a Click Away