Solved: Combining continuous date ranges by ID

thelowendHz · Posted 06-11-2024 10:45 AM

Here is a small sample of a data set that looks like this:

DATA HAVE;
INPUT ID $10. START_DT :MMDDYY10. END_DT :MMDDYY10.;
format START_DT MMDDYY10. END_DT MMDDYY10.;
datalines;
1004064609 04/01/2023 12/31/2023
1004064609 01/01/2024 12/31/3999
1004064623 04/01/2023 12/31/2023
1004064623 01/01/2024 12/31/3999
1004064703 03/06/2024 03/28/2024
1004064703 03/29/2024 12/31/3999
1004064706 04/01/2023 12/31/2023
1004064706 01/01/2024 12/31/3999
1004064724 04/01/2023 06/30/2023
1004064724 07/01/2023 11/30/2023
1004064724 12/01/2023 12/31/2023
1004064724 01/01/2024 12/31/3999
1004064726 04/01/2023 10/31/2023
1004064726 12/05/2023 12/31/2023
1004064726 01/01/2024 05/31/2024
1004064726 06/01/2024 12/31/3999
1004064727 04/01/2023 12/31/2023
1004064727 01/01/2024 12/31/3999
;
RUN;

For each unique ID I want to collapse continuous date spans whenever possible. For example, for the first ID 1004064609 I would want a single record where START_DT = 04/01/2023 AND END_DT = 12/31/3999. Note that some IDs will have multiple date spans that aren't continuous and those should continue to exist as separate records. What's a solution to do this? Thank you.

data_null__ · Posted 06-11-2024 11:57 AM

I changed your infinity date to 2050 from 3999 (too far). This method expands the ranges and then looks for gaps GT 1.

DATA HAVE;
   INPUT ID $10. START_DT :MMDDYY10. END_DT :MMDDYY10.;
   format START_DT MMDDYY10. END_DT MMDDYY10.;
   datalines;
1004064609 04/01/2023 12/31/2023
1004064609 01/01/2024 12/31/2050
1004064623 04/01/2023 12/31/2023
1004064623 01/01/2024 12/31/2050
1004064703 03/06/2024 03/28/2024
1004064703 03/29/2024 12/31/2050
1004064706 04/01/2023 12/31/2023
1004064706 01/01/2024 12/31/2050
1004064724 04/01/2023 06/30/2023
1004064724 07/01/2023 11/30/2023
1004064724 12/01/2023 12/31/2023
1004064724 01/01/2024 12/31/2050
1004064726 04/01/2023 10/31/2023
1004064726 12/05/2023 12/31/2023
1004064726 01/01/2024 05/31/2024
1004064726 06/01/2024 12/31/2050
1004064727 04/01/2023 12/31/2023
1004064727 01/01/2024 12/31/2050
;
   RUN;

data expand / view=expand;
   set have;
   do date=start_dt to end_dt;
      output;
      end;
   keep id date;
   format date yymmdd10.;
   run;

proc summary data=expand nway;
   by id;
   class date;
   output out=unique(drop=_:);
   run;
data group / view=group;
   set unique;
   by id;
   if first.id then group=0;
   if dif(date) ne 1 then group+1;
   run;
proc summary data=group nway missing;
   by id;
   class group;
   output out=range(drop=_:) min(date)=start max(date)=end range(date)=Range;
   format range: ; 
   run;

View solution in original post

data_null__ · Posted 06-11-2024 11:57 AM

I changed your infinity date to 2050 from 3999 (too far). This method expands the ranges and then looks for gaps GT 1.

DATA HAVE;
   INPUT ID $10. START_DT :MMDDYY10. END_DT :MMDDYY10.;
   format START_DT MMDDYY10. END_DT MMDDYY10.;
   datalines;
1004064609 04/01/2023 12/31/2023
1004064609 01/01/2024 12/31/2050
1004064623 04/01/2023 12/31/2023
1004064623 01/01/2024 12/31/2050
1004064703 03/06/2024 03/28/2024
1004064703 03/29/2024 12/31/2050
1004064706 04/01/2023 12/31/2023
1004064706 01/01/2024 12/31/2050
1004064724 04/01/2023 06/30/2023
1004064724 07/01/2023 11/30/2023
1004064724 12/01/2023 12/31/2023
1004064724 01/01/2024 12/31/2050
1004064726 04/01/2023 10/31/2023
1004064726 12/05/2023 12/31/2023
1004064726 01/01/2024 05/31/2024
1004064726 06/01/2024 12/31/2050
1004064727 04/01/2023 12/31/2023
1004064727 01/01/2024 12/31/2050
;
   RUN;

data expand / view=expand;
   set have;
   do date=start_dt to end_dt;
      output;
      end;
   keep id date;
   format date yymmdd10.;
   run;

proc summary data=expand nway;
   by id;
   class date;
   output out=unique(drop=_:);
   run;
data group / view=group;
   set unique;
   by id;
   if first.id then group=0;
   if dif(date) ne 1 then group+1;
   run;
proc summary data=group nway missing;
   by id;
   class group;
   output out=range(drop=_:) min(date)=start max(date)=end range(date)=Range;
   format range: ; 
   run;

ballardw · Posted 06-11-2024 03:43 PM

If you don't have overlapping intervals then perhaps:

/* assumes Have is sorted by ID and Start_dt*/
data temp;
   set have;
   by id ;
   retain tstart;
   lend = lag(end_dt);
   if first.id then tstart=start_dt;
   else if intck('day',lend,start_dt)=1 then start_dt=tstart;
   else tstart=start_dt;
   drop tstart lend;
run;
data want;
  set temp;
  by id start_dt;
  if last.start_dt;
run;

JOL · Posted 06-11-2024 04:27 PM

This solution uses PROC SORT and Group processing in the DATA STEP.

proc sort data=have out=have2;
by id start_dt end_dt;
run;

data have2;
set have2;
retain id2 start_dt2;
by id start_dt end_dt;

if first.id then
do;
id2=id;
start_dt2=start_dt;
end;

if last.id then
end_dt2=end_dt;

if last.id;
format start_dt2 end_dt2 mmddyy10.;
keep id2 start_dt2 end_dt2;
run;

Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Re: Combining continuous date ranges by ID

Registration is open

SAS Training: Just a Click Away