- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is a small sample of a data set that looks like this:
DATA HAVE;
INPUT ID $10. START_DT :MMDDYY10. END_DT :MMDDYY10.;
format START_DT MMDDYY10. END_DT MMDDYY10.;
datalines;
1004064609 04/01/2023 12/31/2023
1004064609 01/01/2024 12/31/3999
1004064623 04/01/2023 12/31/2023
1004064623 01/01/2024 12/31/3999
1004064703 03/06/2024 03/28/2024
1004064703 03/29/2024 12/31/3999
1004064706 04/01/2023 12/31/2023
1004064706 01/01/2024 12/31/3999
1004064724 04/01/2023 06/30/2023
1004064724 07/01/2023 11/30/2023
1004064724 12/01/2023 12/31/2023
1004064724 01/01/2024 12/31/3999
1004064726 04/01/2023 10/31/2023
1004064726 12/05/2023 12/31/2023
1004064726 01/01/2024 05/31/2024
1004064726 06/01/2024 12/31/3999
1004064727 04/01/2023 12/31/2023
1004064727 01/01/2024 12/31/3999
;
RUN;
For each unique ID I want to collapse continuous date spans whenever possible. For example, for the first ID 1004064609 I would want a single record where START_DT = 04/01/2023 AND END_DT = 12/31/3999. Note that some IDs will have multiple date spans that aren't continuous and those should continue to exist as separate records. What's a solution to do this? Thank you.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I changed your infinity date to 2050 from 3999 (too far). This method expands the ranges and then looks for gaps GT 1.
DATA HAVE;
INPUT ID $10. START_DT :MMDDYY10. END_DT :MMDDYY10.;
format START_DT MMDDYY10. END_DT MMDDYY10.;
datalines;
1004064609 04/01/2023 12/31/2023
1004064609 01/01/2024 12/31/2050
1004064623 04/01/2023 12/31/2023
1004064623 01/01/2024 12/31/2050
1004064703 03/06/2024 03/28/2024
1004064703 03/29/2024 12/31/2050
1004064706 04/01/2023 12/31/2023
1004064706 01/01/2024 12/31/2050
1004064724 04/01/2023 06/30/2023
1004064724 07/01/2023 11/30/2023
1004064724 12/01/2023 12/31/2023
1004064724 01/01/2024 12/31/2050
1004064726 04/01/2023 10/31/2023
1004064726 12/05/2023 12/31/2023
1004064726 01/01/2024 05/31/2024
1004064726 06/01/2024 12/31/2050
1004064727 04/01/2023 12/31/2023
1004064727 01/01/2024 12/31/2050
;
RUN;
data expand / view=expand;
set have;
do date=start_dt to end_dt;
output;
end;
keep id date;
format date yymmdd10.;
run;
proc summary data=expand nway;
by id;
class date;
output out=unique(drop=_:);
run;
data group / view=group;
set unique;
by id;
if first.id then group=0;
if dif(date) ne 1 then group+1;
run;
proc summary data=group nway missing;
by id;
class group;
output out=range(drop=_:) min(date)=start max(date)=end range(date)=Range;
format range: ;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I changed your infinity date to 2050 from 3999 (too far). This method expands the ranges and then looks for gaps GT 1.
DATA HAVE;
INPUT ID $10. START_DT :MMDDYY10. END_DT :MMDDYY10.;
format START_DT MMDDYY10. END_DT MMDDYY10.;
datalines;
1004064609 04/01/2023 12/31/2023
1004064609 01/01/2024 12/31/2050
1004064623 04/01/2023 12/31/2023
1004064623 01/01/2024 12/31/2050
1004064703 03/06/2024 03/28/2024
1004064703 03/29/2024 12/31/2050
1004064706 04/01/2023 12/31/2023
1004064706 01/01/2024 12/31/2050
1004064724 04/01/2023 06/30/2023
1004064724 07/01/2023 11/30/2023
1004064724 12/01/2023 12/31/2023
1004064724 01/01/2024 12/31/2050
1004064726 04/01/2023 10/31/2023
1004064726 12/05/2023 12/31/2023
1004064726 01/01/2024 05/31/2024
1004064726 06/01/2024 12/31/2050
1004064727 04/01/2023 12/31/2023
1004064727 01/01/2024 12/31/2050
;
RUN;
data expand / view=expand;
set have;
do date=start_dt to end_dt;
output;
end;
keep id date;
format date yymmdd10.;
run;
proc summary data=expand nway;
by id;
class date;
output out=unique(drop=_:);
run;
data group / view=group;
set unique;
by id;
if first.id then group=0;
if dif(date) ne 1 then group+1;
run;
proc summary data=group nway missing;
by id;
class group;
output out=range(drop=_:) min(date)=start max(date)=end range(date)=Range;
format range: ;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you don't have overlapping intervals then perhaps:
/* assumes Have is sorted by ID and Start_dt*/ data temp; set have; by id ; retain tstart; lend = lag(end_dt); if first.id then tstart=start_dt; else if intck('day',lend,start_dt)=1 then start_dt=tstart; else tstart=start_dt; drop tstart lend; run; data want; set temp; by id start_dt; if last.start_dt; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This solution uses PROC SORT and Group processing in the DATA STEP.
proc sort data=have out=have2;
by id start_dt end_dt;
run;
data have2;
set have2;
retain id2 start_dt2;
by id start_dt end_dt;
if first.id then
do;
id2=id;
start_dt2=start_dt;
end;
if last.id then
end_dt2=end_dt;
if last.id;
format start_dt2 end_dt2 mmddyy10.;
keep id2 start_dt2 end_dt2;
run;