Hello all,
I have 2 datasets and am trying to figure out how to identify the rows in one dataset whose dates fall within the intervals in another dataset.
These are the datasets that I have:
Dataset A:
id SBP date
1 A 1/10/15
1 B 2/10/15
1 C 3/10/15
1 D 4/10/15
2 A 3/15/17
2 B 6/15/17
Dataset B:
id med start_date end_date
1 GroupA 1/1/15 3/1/15
1 GroupB 3/5/15 5/1/15
2 GroupA 1/15/17 2/15/17
2 GroupB 2/1/17 5/1/17
This is what I want:
id SBP date med start_date end_date
1 A 1/10/15 GroupA 1/1/15 3/1/15
1 B 2/10/15 GroupA 1/1/15 3/1/15
1 C 3/10/15 GroupB 3/5/15 5/1/15
1 D 4/10/15 GroupB 3/5/15 5/1/15
2 A 3/15/17 GroupB 2/1/17 5/1/17
2 B 6/15/17 - - -
Essentially I am trying to identify the rows from dataset A whose dates fall within the time intervals from dataset B. (Note that id#2's SBP of B did not fall within any of the date intervals from dataset B.)
Would appreciate any advice for how to do this!
data one;
input id SBP $ date :mmddyy10.;
format date mmddyy10.;
cards;
1 A 1/10/15
1 B 2/10/15
1 C 3/10/15
1 D 4/10/15
2 A 3/15/17
2 B 6/15/17
;
data two;
input id med $ (start_date end_date) (:mmddyy10.);
format start_date end_date mmddyy10.;
cards;
1 GroupA 1/1/15 3/1/15
1 GroupB 3/5/15 5/1/15
2 GroupA 1/15/17 2/15/17
2 GroupB 2/1/17 5/1/17
;
proc sql;
create table want as
select a.*,med, start_date, end_date
from one a left join two b
on a.id=b.id and start_date<=date<=end_date;
quit;
Damn! the Proc SQL requires an ORDER BY
proc sql;
create table want as
select a.*,med, start_date, end_date
from one a left join two b
on a.id=b.id and start_date<=date<=end_date
order by a.id,sbp;
quit;
If your data set are sorted chronologically within ID, then this MERGE + conditional SETs will be faster than proc sql:
data EVENT;
input id SBP $ date :mmddyy10.;
format date mmddyy10.;
cards;
1 A 1/10/15
1 B 2/10/15
1 C 3/10/15
1 D 4/10/15
2 A 3/15/17
2 B 6/15/17
run;
data RANGE;
input id med $ (start_date end_date) (:mmddyy10.);
format start_date end_date mmddyy10.;
cards;
1 GroupA 1/1/15 3/1/15
1 GroupB 3/5/15 5/1/15
2 GroupA 1/15/17 2/15/17
2 GroupB 2/1/17 5/1/17
run;
data want (drop=_:);
set EVENT (keep=id date rename=(date=_sort_date) in=inE)
RANGE (keep=id start_date rename=(start_date=_sort_date) in=inR);
by id _sort_date;
if inE then set EVENT;
retain _pre_R .;
if inR then set RANGE;
retain _post_R .;
if inE;
if date < start_date or date > end_date then call missing(of _pre_R -- _post_R);
output;
if last.id then call missing(of _all_);
run;
This relies on the fact that any variable read in by the SET (or the MERGE statement) is, unlike newly created variables, retained by default. But since SET statements are usually executed in each and every iteration of the data step, you don't notice, since new values universally replace the preceding observations.
But in the above, only the ID and _SORT_DATE variables (renamed from DATE and START_DATE) are read in unconditionally. The subsequent IF ... THEN SET statements read in the corresponding dataset and its variables only conditionally. Those variables are retained until the condition is satisfied again. In particular the RANGE dataset will have it variables retained until the _SORT_DATE is populated by the next RANGE observation.
Then the only thing to worry about is the possibility of "stale" range variables (i.e. the event date is out of range). That's where the _PRE_R and _POST_R variables are used as sentinels, making them excellent arguments in the call missing (of ...) statement, where stale variables are set to missing.
Edited additional notes:
For those attending next years Global Forum, I will be doing a paper on exactly this sort of mixing of time series.
And I should add. Regarding the comparson to PROC SQL. Not only is this faster for sorted data sets, it is a LOT easier to extend to mixing 3 or more time series. Just think about having datasets YEAR, QTR, and MONTH, sorted by ID and date. And you want a monthly file with the most recent YEAR and QTR data, but possibly excluding YEAR or QTR data that are older than, say 24 months.
Or say you have irregular data sets ADMISSIONS, LABTESTS, and VISITS, also sorted by ID/date. You want a record for each event of any type, carrying the most recent data from the other types. MERGE by sort key, plus conditional SETs solves this problem.
Very nice @mkeintz . Always learning.learned tonnes from you. However I am intrigued to think that MAGIC=102 aka SORT/MERGE join algortihm can be forced for SQL optimizer for small to medium size tables eliminating the potential nested loop through the tables one by one to be chosen by the optimizer. This could perhaps match or speed more surprisingly than a datastep.
proc sql magic=102;
create table want as
select a.*,med, start_date, end_date
from one a left join two b
on a.id=b.id and start_date<=date<=end_date;
quit;
The scope of datastep being a work horse is of course wide considering the utility but should the requirement limit to just what is stated and nothing beyond, the magic may really do super magic 🙂
Assuming the data sets are already sorted, I'd be interested in seeing situations in which the SQL solution is faster than MERGE plus conditional SETs, even if it's only 2 time series.
Does the SQL proc really take advantage of the fact that the data sets are sorted? If not, then it has to either do its own sort (presumably using a lot more memory) or it has to do a cartesian comparison of DATE against every START_DATE/END_DATE pair with a matching id. And in either case it likely needs more memory.
ps: I had to look up "magic=102". And Google helpfully provided a number of fm radio stations - had to include "sql" in the search phrase.
We must stand up to atavistic urges to immediately resort to proc sql. It can be good to be one with the data step.
I agree. I am keen to test, observe and learn in 2 different environments i.e.
1. If connecting a third party database, explicit vs implicit(libname) and the magic vs sorted source datasets in a datastep as you mentioned where we push the query into the database and still force the desired algorithm.
2. General SAS datasets, I think here yours should be the outright leader nonetheless, I would like to test
At this point, I honestly do not know the outcome to test as I do not have that environment to play with. But indeed a very interesting subject.
PS If there datasets were to be indexed by JOIN keys, more fun to watch for I guess
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.