11-13-2017 11:18 AM
I have an input data that looks like this. The data has two subjects with ID1 and ID2
ID Start_date End_date
1 1/1/2010 4/1/2010
1 4/24/2010 6/1/2010
1 9/1/2010 10/15/2010
2 3/1/2010 5/1/2010
2 6/1/2010 10/1/2010
Basically, I want to collapse the multiple start and end dates for each subject into one contiguous start and end date per each subject. However, there is a twist. If the gap between the end date and next start date for the subject is greater than 45 days, then the dates are not collapsed for that row of observation. I have highlighted the desired start and end dates for each subject.
The output dataset should look like
ID Start_date End_date
1 1/1/2010 6/1/2010
1 9/1/2010 10/15/2010
2 3/1/2010 10/1/2010
Note that subject 1 has a gap greater than 45 days (from 6/1/2010 to 9/1/2010), so the row from 9/1/2010 to 10/15/2010 is not collapsed with the first two rows for subject 1.
Subject 2 has no gap greater than 45 days, so his two rows of start and end dates are collapsed.
Thank you so much for your help.
11-13-2017 11:34 AM
Post test data in the form of a datastep!
As such I am not typing that in to test this, and I assume that is a date.
data want; set have; retain start_date_ end_date_; by id; if first.id or start_date_=. then start_date_=start_date; else do; if last.id or end_date-lag(end_date) > 45 then do; end_date_=end_date; output; start_date_=.; end_date_=.; end; end; run;
Note the output variables - the want, is with a _ after.
11-13-2017 12:19 PM
@RW9 I'm not sure that quite works.
If I do this:
Data Input; input id start_date:mmddyy10. end_date:mmddyy10.; datalines; 1 1/1/2010 4/1/2010 1 4/24/2010 6/1/2010 1 9/1/2010 10/15/2010 2 3/1/2010 5/1/2010 2 6/1/2010 10/1/2010 ; run; data output; set input; retain start_date_ end_date_; by id; if first.id or start_date_=. then start_date_=start_date; else do; if last.id or end_date-lag(end_date) > 45 then do; end_date_=end_date; output; start_date_=.; end_date_=.; end; end; run; proc print data = output noobs; var id start_date_ end_date_; format start_date_ date9. end_date_ date9.; run;
I get this:
The SAS System start_ id date_ end_date_ 1 01JAN2010 15OCT2010 2 01MAR2010 01OCT2010
which I don't think is exactly what OP wants. It seems to skip over the 6/1 to 9/1 break for ID 1 in which OP is interested.
11-13-2017 11:57 AM
I think it would be easiest for you to follow the logic if you were to break this into two steps (even though it can be done in one). Here is an example of the first step. Assuming your data is already sorted by ID START_DATE:
prior_end = lag(end_date);
if first.id or (start_date - prior_end) > 45 then grouping + 1;
Then you can take a look at the result, and determine whether GROUPING matches your expectations. If it does, continue with the second step:
by id grouping;
if first.grouping then new_start = start_date;
rename new_start = start_date;
11-13-2017 12:28 PM - edited 11-13-2017 12:29 PM
Using the data from above:
data have2; set input; by id; prior_end = lag(end_date); if first.id or (start_date - prior_end) > 45 then grouping + 1; run; data want; set have2; by id grouping; if first.grouping then new_start = start_date; retain new_start; drop start_date; rename new_start = start_date; if last.grouping; run; proc print data = want noobs; var id start_date end_date; format start_date date9. end_date date9.; run;
Seems to yield
The SAS System start_ id date end_date 1 01JAN2010 01JUN2010 1 01SEP2010 15OCT2010 2 01MAR2010 01OCT2010
I wish I understood either of the above approaches. Lol.
11-13-2017 01:28 PM
To understand how this solution works, inspect the intermediate data set more carefully. See if you understand how GROUPING is being created. (Note that it might in theory be the same for 3 or more observations, not just for two observations.)
Once you understand what is in GROUPING, think about how you could use GROUPING to create the final data set. Then compare your thoughts to the second DATA part of the solution.
11-13-2017 02:45 PM
Left to my own devices, my approach would have been to restructure the data to look like
Data AlternateInput; input id mydate:mmddyy10. datetype; datalines; 1 1/1/2010 start 1 4/1/2010 end 1 4/24/2010 start 1 6/1/2010 end 1 9/1/2010 start 1 10/15/2010 end 2 3/1/2010 start 2 5/1/2010 end 2 6/1/2010 start 2 10/1/2010 end ;
and then try to figure out how to walk the ID's to find the periods. Probably best I'm not left to solve it myself. Ha.
11-13-2017 01:09 PM
Here is a single DATA step processs that takes advantage of executing lag functions selectively, as well as a self-merge with FIRSTOBS=2:
data have; input ID (Start_date End_date) (:mmddyy10.); format start_date end_date date9.; datalines; 1 1/1/2010 4/1/2010 1 4/24/2010 6/1/2010 1 9/1/2010 10/15/2010 2 3/1/2010 5/1/2010 2 6/1/2010 10/1/2010 run; data want (drop=next_:); merge have have (firstobs=2 keep=id start_date rename=(id=next_id start_date=next_start)); if next_id^=id or next_start>end_date+45; start_date=lag(next_start); if start_date=. then set have (keep=start_date obs=1); run;