Hi everyone,
I have an input data that looks like this. The data has two subjects with ID1 and ID2
ID Start_date End_date
1 1/1/2010 4/1/2010
1 4/24/2010 6/1/2010
1 9/1/2010 10/15/2010
2 3/1/2010 5/1/2010
2 6/1/2010 10/1/2010
Basically, I want to collapse the multiple start and end dates for each subject into one contiguous start and end date per each subject. However, there is a twist. If the gap between the end date and next start date for the subject is greater than 45 days, then the dates are not collapsed for that row of observation. I have highlighted the desired start and end dates for each subject.
The output dataset should look like
ID Start_date End_date
1 1/1/2010 6/1/2010
1 9/1/2010 10/15/2010
2 3/1/2010 10/1/2010
Note that subject 1 has a gap greater than 45 days (from 6/1/2010 to 9/1/2010), so the row from 9/1/2010 to 10/15/2010 is not collapsed with the first two rows for subject 1.
Subject 2 has no gap greater than 45 days, so his two rows of start and end dates are collapsed.
Thank you so much for your help.
gary
Post test data in the form of a datastep!
As such I am not typing that in to test this, and I assume that is a date.
data want; set have; retain start_date_ end_date_; by id; if first.id or start_date_=. then start_date_=start_date; else do; if last.id or end_date-lag(end_date) > 45 then do; end_date_=end_date; output; start_date_=.; end_date_=.; end; end; run;
Note the output variables - the want, is with a _ after.
@RW9 I'm not sure that quite works.
If I do this:
Data Input;
input id start_date:mmddyy10. end_date:mmddyy10.;
datalines;
1 1/1/2010 4/1/2010
1 4/24/2010 6/1/2010
1 9/1/2010 10/15/2010
2 3/1/2010 5/1/2010
2 6/1/2010 10/1/2010
;
run;
data output;
set input;
retain start_date_ end_date_;
by id;
if first.id or start_date_=. then start_date_=start_date;
else do;
if last.id or end_date-lag(end_date) > 45 then do;
end_date_=end_date;
output;
start_date_=.;
end_date_=.;
end;
end;
run;
proc print data = output noobs;
var id start_date_ end_date_;
format start_date_ date9. end_date_ date9.;
run;
I get this:
The SAS System
start_
id date_ end_date_
1 01JAN2010 15OCT2010
2 01MAR2010 01OCT2010
which I don't think is exactly what OP wants. It seems to skip over the 6/1 to 9/1 break for ID 1 in which OP is interested.
I think it would be easiest for you to follow the logic if you were to break this into two steps (even though it can be done in one). Here is an example of the first step. Assuming your data is already sorted by ID START_DATE:
data have2;
set have;
by id;
prior_end = lag(end_date);
if first.id or (start_date - prior_end) > 45 then grouping + 1;
run;
Then you can take a look at the result, and determine whether GROUPING matches your expectations. If it does, continue with the second step:
data want;
set have2;
by id grouping;
if first.grouping then new_start = start_date;
retain new_start;
drop start_date;
rename new_start = start_date;
if last.grouping;
run;
Using the data from above:
data have2;
set input;
by id;
prior_end = lag(end_date);
if first.id or (start_date - prior_end) > 45 then grouping + 1;
run;
data want;
set have2;
by id grouping;
if first.grouping then new_start = start_date;
retain new_start;
drop start_date;
rename new_start = start_date;
if last.grouping;
run;
proc print data = want noobs;
var id start_date end_date;
format start_date date9. end_date date9.;
run;
Seems to yield
The SAS System
start_
id date end_date
1 01JAN2010 01JUN2010
1 01SEP2010 15OCT2010
2 01MAR2010 01OCT2010
I wish I understood either of the above approaches. Lol.
@HB,
To understand how this solution works, inspect the intermediate data set more carefully. See if you understand how GROUPING is being created. (Note that it might in theory be the same for 3 or more observations, not just for two observations.)
Once you understand what is in GROUPING, think about how you could use GROUPING to create the final data set. Then compare your thoughts to the second DATA part of the solution.
Left to my own devices, my approach would have been to restructure the data to look like
Data AlternateInput;
input id mydate:mmddyy10. datetype;
datalines;
1 1/1/2010 start
1 4/1/2010 end
1 4/24/2010 start
1 6/1/2010 end
1 9/1/2010 start
1 10/15/2010 end
2 3/1/2010 start
2 5/1/2010 end
2 6/1/2010 start
2 10/1/2010 end
;
and then try to figure out how to walk the ID's to find the periods. Probably best I'm not left to solve it myself. Ha.
Here is a single DATA step processs that takes advantage of executing lag functions selectively, as well as a self-merge with FIRSTOBS=2:
data have;
input ID (Start_date End_date) (:mmddyy10.);
format start_date end_date date9.;
datalines;
1 1/1/2010 4/1/2010
1 4/24/2010 6/1/2010
1 9/1/2010 10/15/2010
2 3/1/2010 5/1/2010
2 6/1/2010 10/1/2010
run;
data want (drop=next_:);
merge have
have (firstobs=2 keep=id start_date rename=(id=next_id start_date=next_start));
if next_id^=id or next_start>end_date+45;
start_date=lag(next_start);
if start_date=. then set have (keep=start_date obs=1);
run;
Notes:
Thank you RW9, HB, Astounding and Mkeintz for your big help.
lag function is very useful.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.