Re: Collapsing Dates across rows of observations with a twist.

pegasus1225 · Posted 11-13-2017 11:18 AM

Hi everyone,

I have an input data that looks like this. The data has two subjects with ID1 and ID2

ID Start_date End_date

1 1/1/2010 4/1/2010

1 4/24/2010 6/1/2010

1 9/1/2010 10/15/2010

2 3/1/2010 5/1/2010

2 6/1/2010 10/1/2010

Basically, I want to collapse the multiple start and end dates for each subject into one contiguous start and end date per each subject. However, there is a twist. If the gap between the end date and next start date for the subject is greater than 45 days, then the dates are not collapsed for that row of observation. I have highlighted the desired start and end dates for each subject.

The output dataset should look like

ID Start_date End_date

1 1/1/2010 6/1/2010

1 9/1/2010 10/15/2010

2 3/1/2010 10/1/2010

Note that subject 1 has a gap greater than 45 days (from 6/1/2010 to 9/1/2010), so the row from 9/1/2010 to 10/15/2010 is not collapsed with the first two rows for subject 1.

Subject 2 has no gap greater than 45 days, so his two rows of start and end dates are collapsed.

Thank you so much for your help.

gary

RW9 · Posted 11-13-2017 11:34 AM

Post test data in the form of a datastep!

As such I am not typing that in to test this, and I assume that is a date.

data want;
  set have;
  retain start_date_ end_date_;
  by id;
  if first.id or start_date_=. then start_date_=start_date;
  else do;
    if last.id or end_date-lag(end_date) > 45 then do;
      end_date_=end_date;
      output;
      start_date_=.;
      end_date_=.;
    end;
   end;
run;

Note the output variables - the want, is with a _ after.

HB · Posted 11-13-2017 12:19 PM

@RW9 I'm not sure that quite works.

If I do this:

Data Input;
  input id start_date:mmddyy10. end_date:mmddyy10.;
  datalines;
1 1/1/2010 4/1/2010
1 4/24/2010 6/1/2010
1 9/1/2010 10/15/2010 
2 3/1/2010 5/1/2010
2 6/1/2010 10/1/2010
;
run;

data output;
  set input;
  retain start_date_ end_date_;
  by id;
  if first.id or start_date_=. then start_date_=start_date;
  else do;
    if last.id or end_date-lag(end_date) > 45 then do;
      end_date_=end_date;
      output;
      start_date_=.;
      end_date_=.;
    end;
   end;
run;

proc print data = output noobs;
	var id start_date_ end_date_;
	format start_date_ date9. end_date_ date9.;
run;

I get this:

                                       The SAS System      

                                           start_
                                  id        date_    end_date_

                                   1    01JAN2010    15OCT2010
                                   2    01MAR2010    01OCT2010

which I don't think is exactly what OP wants. It seems to skip over the 6/1 to 9/1 break for ID 1 in which OP is interested.

Astounding · Posted 11-13-2017 11:57 AM

I think it would be easiest for you to follow the logic if you were to break this into two steps (even though it can be done in one). Here is an example of the first step. Assuming your data is already sorted by ID START_DATE:

data have2;

set have;

by id;

prior_end = lag(end_date);

if first.id or (start_date - prior_end) > 45 then grouping + 1;

run;

Then you can take a look at the result, and determine whether GROUPING matches your expectations. If it does, continue with the second step:

data want;

set have2;

by id grouping;

if first.grouping then new_start = start_date;

retain new_start;

drop start_date;

rename new_start = start_date;

if last.grouping;

run;

HB · Posted 11-13-2017 12:28 PM

Using the data from above:

data have2;
	set input;
	by id;
	prior_end = lag(end_date);
	if first.id or (start_date - prior_end) > 45 then grouping + 1;
run;

data want;
	set have2;
	by id grouping;
	if first.grouping then new_start = start_date;
	retain new_start;
	drop start_date;
	rename new_start = start_date;
	if last.grouping;
run;

proc print data = want noobs;
	var id start_date end_date;
	format start_date date9. end_date date9.;
run;

Seems to yield

                                         The SAS System 

                                           start_
                                  id         date     end_date

                                   1    01JAN2010    01JUN2010
                                   1    01SEP2010    15OCT2010
                                   2    01MAR2010    01OCT2010

I wish I understood either of the above approaches. Lol.

Astounding · Posted 11-13-2017 01:28 PM

@HB,

To understand how this solution works, inspect the intermediate data set more carefully. See if you understand how GROUPING is being created. (Note that it might in theory be the same for 3 or more observations, not just for two observations.)

Once you understand what is in GROUPING, think about how you could use GROUPING to create the final data set. Then compare your thoughts to the second DATA part of the solution.

HB · Posted 11-13-2017 02:45 PM

@Astounding

Left to my own devices, my approach would have been to restructure the data to look like

Data AlternateInput;
  input id mydate:mmddyy10. datetype;
  datalines;
1 1/1/2010 start
1 4/1/2010 end
1 4/24/2010 start
1 6/1/2010 end
1 9/1/2010 start
1 10/15/2010 end 
2 3/1/2010 start
2 5/1/2010 end
2 6/1/2010 start
2 10/1/2010 end
;

and then try to figure out how to walk the ID's to find the periods. Probably best I'm not left to solve it myself. Ha.

mkeintz · Posted 11-13-2017 01:09 PM

Here is a single DATA step processs that takes advantage of executing lag functions selectively, as well as a self-merge with FIRSTOBS=2:

data have;
 input ID     (Start_date      End_date) (:mmddyy10.);
 format start_date end_date date9.;
datalines;
1       1/1/2010         4/1/2010
1       4/24/2010      6/1/2010
1      9/1/2010       10/15/2010 
2       3/1/2010       5/1/2010
2       6/1/2010      10/1/2010
run;


data want (drop=next_:);

  merge have
        have (firstobs=2 keep=id start_date rename=(id=next_id start_date=next_start));

  if next_id^=id or next_start>end_date+45;

  start_date=lag(next_start);
  if start_date=. then set have (keep=start_date obs=1);
run;

Notes:

The merge statement matches record N of HAVE with record N+1 (yielding new vars NEXT_ID and NEXT_START.
The subsetting IF statement only allow records from the end of a qualified period - therefore providing the desired END_DATE
The start_date=lag(next_date) only update the lag(next_date) queue when a complete period has been identified.
The lag(next_date) value is missing for the first desired period, so observation 1 is re-read only for that instance.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

pegasus1225 · Posted 11-13-2017 04:09 PM

Thank you RW9, HB, Astounding and Mkeintz for your big help.

lag function is very useful.

Classroom Training Available!