Solved: Compare Two Datasets and Search for Overlapping Date Ranges

djbateman · Posted 10-23-2020 01:32 PM

I have 2 datasets: planned hospitalizations (PHOSP) and unplanned hospitalizations (UHOSP). A subject can be found in both tables and have multiple hospitalizations in each table. I want to compare the two tables to see if any of the planned dates overlap with unplanned dates and retain the overlapping records.

In my mind, I would transpose UHOSP twice: once for the start dates and again for the end dates. Then I would merge PHOSP with the transposed start dates and transposed end dates by SUBJID. I could then see for each PHOSP sequence if the start/end dates overlap with any of the UHOSP start/end dates by looping through the maximum number of possible UHOSP sequences. I feel that there is a much easier, simpler, more efficient way to do this.

Below is some test data. You can see that SUBJID=102-711-401 has no overlaps between PHOSP and UHSOP, so they will not show up in the output. But SUBJID=102-715-401 does overlap between PHSEQ=1 and UHSEQ=2. I would like to then output these 2 records stacked on top of each other.

Are there any ideas out there?

data phosp;
	length SUBJID $11;
	input SUBJID $ PHSEQ PHSTDT PHENDT;
	informat PHSTDT PHENDT date9.;
	format PHSTDT PHENDT date9.;
	cards;
102-711-401	1	07JAN2020	17JAN2020
102-711-401	2	02JUL2020	09JUL2020
102-715-401	1	01APR2020	15APR2020
102-715-401	2	13JAN2020	16JAN2020
;
run;
data uhosp;
	length SUBJID $11;
	input SUBJID $ UHSEQ UHSTDT UHENDT;
	informat UHSTDT UHENDT date9.;
	format UHSTDT UHENDT date9.;
	cards;
102-711-401	1	27JAN2019	06MAY2019
102-711-401	2	05MAR2020	09MAR2020
102-715-401	1	31OCT2019	03NOV2019
102-715-401	2	15JAN2020	19JAN2020
102-715-401	3	04JUL2020	12JUL2020
;
run;


proc transpose data=uhosp out=uhosp_st (drop=_name_) prefix=UHST;
	id uhseq;
	by subjid;
	var uhstdt;
run;
proc transpose data=uhosp out=uhosp_en (drop=_name_) prefix=UHEN;
	id uhseq;
	by subjid;
	var uhendt;
run;

proc sql noprint;
	select max(uhseq) into :maxseq from uhosp;
	%let maxseq=%sysfunc(compress(&maxseq.));
quit;
data compare;
	merge phosp uhosp_st uhosp_en;
	by subjid;
	array start {&maxseq.} UHST1-UHST&maxseq.;
	array end {&maxseq.} UHEN1-UHEN&maxseq.;
	do i=1 to &maxseq.;
		if start(i)<=phstdt<=end(i) | start(i)<=phendt<=end(i) then flag=i;
	end;
run;

ballardw · Posted 10-23-2020 02:15 PM

I would suggest something more along these lines:

proc sql;
   create table overlap as
   select a.subjid, a.PHSEQ, a.PHSTDT, a.PHENDT
         ,b.UHSEQ, b.UHSTDT, b.UHENDT 
   from  phosp as a
         left join
         uhosp as b
         on a.subjid=b.subjid
   where  ( a.PHSTDT le b.UHSTDT le a.PHENDT)
       or ( a.PHSTDT le b.UHENDT le a.PHENDT) 
       or ( b.UHSTDT le a.PHSTDT le b.UHENDT)
       or ( b.UHSTDT le a.PHENDT le b.UHENDT) 
   ;
quit;

Which gets ungainly if comparing many more intervals but no macro variables needed so a bit easier to follow.

Since you are only interested in overlaps starting with a left or right join on the id means only matched ids are considered at all. The WHERE filters on the date ranges.

View solution in original post

ballardw · Posted 10-23-2020 02:15 PM

I would suggest something more along these lines:

proc sql;
   create table overlap as
   select a.subjid, a.PHSEQ, a.PHSTDT, a.PHENDT
         ,b.UHSEQ, b.UHSTDT, b.UHENDT 
   from  phosp as a
         left join
         uhosp as b
         on a.subjid=b.subjid
   where  ( a.PHSTDT le b.UHSTDT le a.PHENDT)
       or ( a.PHSTDT le b.UHENDT le a.PHENDT) 
       or ( b.UHSTDT le a.PHSTDT le b.UHENDT)
       or ( b.UHSTDT le a.PHENDT le b.UHENDT) 
   ;
quit;

Which gets ungainly if comparing many more intervals but no macro variables needed so a bit easier to follow.

Since you are only interested in overlaps starting with a left or right join on the id means only matched ids are considered at all. The WHERE filters on the date ranges.

djbateman · Posted 10-23-2020 02:25 PM

I knew SQL could come to my rescue! There is so much in SQL that I don't think of. I just rely too heavily on data step programming. I need to branch out. But this is much easier to follow and makes my code look so much cleaner. Thank you so much!

ballardw · Posted 10-23-2020 03:48 PM

@djbateman wrote:
I knew SQL could come to my rescue! There is so much in SQL that I don't think of. I just rely too heavily on data step programming. I need to branch out. But this is much easier to follow and makes my code look so much cleaner. Thank you so much!

One of the main things to indicate SQL might be in the solution is that you have multiples of the identification variable(s) in both sets. That means that a simple data step merge won't work. Second is that you need to compare ALL of the identification values from one with all of the other in the second set.

A data step approach that might work would be rename the date variables when combining the data sets with SET statement; add a source variable using either the In data set option or the INDSNAME set option; sort by id and the start date and then look for changes from source to source for the id but that would be bit trickier to confirm the overlap with use of LAG values.

If you haven't use the INDSNAME option a pseudo example:

Data new;
    set data1 data2 indsname=somevar;
   length source $ 41;
   source = somevar;
run;

I use a length of 41 because the SAS temporary variable created by the indsname would have the library.datasetname. Which has 8 characters for the library, 1 for the . and 32 for the data set name: 8+1+32 = 41. You may not need all of them when you copy the code from another project your data set names don't get truncated.

Data set option approach:

data new;
   set data1 (in=in1)
        data2 (in=in2)
   ;
   if in1 then source='data1';
   else if in2 then source='data2';
run;

Or create source as numeric an any of a number of ways. Slick but not obvious is something like:

source = in1*1 + in2*2;

The in variables are boolean 1/0 when the current record comes from the indicated data set. The main advantage over this approach is you can easily add source values as long as you get the correct in variable and the order value. I know I had least one project where I wanted this indicator to be numeric but don't remember the exact reason why at the moment.

Note IN variables and the INDSNAME variables are all temporary and will be dropped by SAS. If you need to keep the information you are responsible for assigning it to a new variable to keep in the data.

Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Re: Compare Two Datasets and Search for Overlapping Date Ranges

Click image to register for webinar

Classroom Training Available!