BookmarkSubscribeRSS Feed
pegasus1225
Calcite | Level 5

Hi everyone,

 

I have an input data that looks like this. The data has two subjects with ID1 and ID2

 

ID     Start_date      End_date

1       1/1/2010         4/1/2010

1       4/24/2010      6/1/2010

1      9/1/2010       10/15/2010 

2       3/1/2010       5/1/2010

2       6/1/2010      10/1/2010

 

Basically, I want to collapse the multiple start and end dates for each subject into one contiguous start and end date per each subject. However, there is a twist.  If the gap between the end date and next start date for the subject is greater than 45 days, then the dates are not collapsed for that row of observation.  I have highlighted the desired start and end dates for each subject.

 

The output dataset should look like

ID     Start_date      End_date

1       1/1/2010         6/1/2010

1      9/1/2010       10/15/2010

2       3/1/2010       10/1/2010

 

Note that subject 1 has a gap greater than 45 days (from 6/1/2010 to 9/1/2010), so the row from 9/1/2010  to 10/15/2010 is not collapsed with the first two rows for subject 1.

 

Subject 2 has no gap greater than 45 days, so his two rows of start and end dates are collapsed. 

 

Thank you so much for your help.

gary

 

8 REPLIES 8
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Post test data in the form of a datastep!

As such I am not typing that in to test this, and I assume that is a date.

data want;
  set have;
  retain start_date_ end_date_;
  by id;
  if first.id or start_date_=. then start_date_=start_date;
  else do;
    if last.id or end_date-lag(end_date) > 45 then do;
      end_date_=end_date;
      output;
      start_date_=.;
      end_date_=.;
    end;
   end;
run;
      

Note the output variables - the want, is with a _ after.

HB
Barite | Level 11 HB
Barite | Level 11

@RW9 I'm not sure that quite works.

 

If I do this:

Data Input;
  input id start_date:mmddyy10. end_date:mmddyy10.;
  datalines;
1 1/1/2010 4/1/2010
1 4/24/2010 6/1/2010
1 9/1/2010 10/15/2010 
2 3/1/2010 5/1/2010
2 6/1/2010 10/1/2010
;
run;

data output;
  set input;
  retain start_date_ end_date_;
  by id;
  if first.id or start_date_=. then start_date_=start_date;
  else do;
    if last.id or end_date-lag(end_date) > 45 then do;
      end_date_=end_date;
      output;
      start_date_=.;
      end_date_=.;
    end;
   end;
run;

proc print data = output noobs;
	var id start_date_ end_date_;
	format start_date_ date9. end_date_ date9.;
run;

I get this:

 

                                       The SAS System      

                                           start_
                                  id        date_    end_date_

                                   1    01JAN2010    15OCT2010
                                   2    01MAR2010    01OCT2010

which I don't think is exactly what OP wants.  It seems to skip over the 6/1 to 9/1 break for ID 1 in which OP is interested. 

Astounding
PROC Star

I think it would be easiest for you to follow the logic if you were to break this into two steps (even though it can be done in one).  Here is an example of the first step.  Assuming your data is already sorted by ID START_DATE:

 

data have2;

set have;

by id;

prior_end = lag(end_date);

if first.id or (start_date - prior_end) > 45 then grouping + 1;

run;

 

Then you can take a look at the result, and determine whether GROUPING matches your expectations.  If it does, continue with the second step:

 

data want;

set have2;

by id grouping;

if first.grouping then new_start = start_date;

retain new_start;

drop start_date;

rename new_start = start_date;

if last.grouping;

run;

 

HB
Barite | Level 11 HB
Barite | Level 11

Using the data from above:

 

data have2;
	set input;
	by id;
	prior_end = lag(end_date);
	if first.id or (start_date - prior_end) > 45 then grouping + 1;
run;

data want;
	set have2;
	by id grouping;
	if first.grouping then new_start = start_date;
	retain new_start;
	drop start_date;
	rename new_start = start_date;
	if last.grouping;
run;

proc print data = want noobs;
	var id start_date end_date;
	format start_date date9. end_date date9.;
run;

Seems to yield

                                         The SAS System 

                                           start_
                                  id         date     end_date

                                   1    01JAN2010    01JUN2010
                                   1    01SEP2010    15OCT2010
                                   2    01MAR2010    01OCT2010

 

 

I wish I understood either of the above approaches.  Lol. 

Astounding
PROC Star

@HB,

 

To understand how this solution works, inspect the intermediate data set more carefully.  See if  you understand how GROUPING is being created.  (Note that it might in theory be the same for 3 or more observations, not just for two observations.)

 

Once you understand what is in GROUPING, think about how you could use GROUPING to create the final data set.  Then compare your thoughts to the second DATA part of the solution.

HB
Barite | Level 11 HB
Barite | Level 11

@Astounding

 

Left to my own devices, my approach would have been to restructure the data to look like

 

Data AlternateInput;
  input id mydate:mmddyy10. datetype;
  datalines;
1 1/1/2010 start
1 4/1/2010 end
1 4/24/2010 start
1 6/1/2010 end
1 9/1/2010 start
1 10/15/2010 end 
2 3/1/2010 start
2 5/1/2010 end
2 6/1/2010 start
2 10/1/2010 end
;

and then try to figure out how to walk the ID's to find the periods.  Probably best I'm not left to solve it myself. Ha.

mkeintz
PROC Star

Here is a single DATA step processs that takes advantage of executing lag functions selectively, as well as a self-merge with FIRSTOBS=2:

 

data have;
 input ID     (Start_date      End_date) (:mmddyy10.);
 format start_date end_date date9.;
datalines;
1       1/1/2010         4/1/2010
1       4/24/2010      6/1/2010
1      9/1/2010       10/15/2010 
2       3/1/2010       5/1/2010
2       6/1/2010      10/1/2010
run;


data want (drop=next_:);

  merge have
        have (firstobs=2 keep=id start_date rename=(id=next_id start_date=next_start));

  if next_id^=id or next_start>end_date+45;

  start_date=lag(next_start);
  if start_date=. then set have (keep=start_date obs=1);
run;

 

 

Notes:

  1. The merge statement matches record N of HAVE with record N+1 (yielding new vars NEXT_ID and NEXT_START.
  2. The subsetting IF statement only allow records from the end of a qualified period - therefore providing the desired END_DATE
  3. The start_date=lag(next_date) only update the lag(next_date) queue when a complete period has been identified. 
  4. The lag(next_date) value is missing for the first desired period, so observation 1 is re-read only for that instance.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
pegasus1225
Calcite | Level 5

Thank you 

 

 

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2318 views
  • 0 likes
  • 5 in conversation