Solved: Re: Collapsing table based on conditions

rlafond · Hi there,

Hi there,

I'm struggling to come up with a solution to this problem and could use some input.

For each ID, I have a variable amount of date ranges. I am trying to create a single row for each ID that shows 1) the most recent end date (DateThru); and 2) the earliest start date (DateFrom) without a break greater than 1 day (using the most recent DateThru as an anchor).

Here is an idea of what my data looks like:

ID	DateFrom	DateThru
123	10/1/2004	5/25/2006
123	5/26/2006	8/28/2007
123	8/29/2007	3/17/2008
123	3/18/2008	10/14/2008
123	12/7/2015	12/31/2015
123	2/12/2016	12/31/9999

In this example I would want to only output the last row. The second newest date range ended 12/31/15, so it ended greater than 1 day before 2/12/16. All other records are immaterial.

Here is another example:

ID	DateFrom	DateThru
221	1/1/2005	5/26/2008
221	11/11/2011	8/1/2015
221	8/2/2015	10/12/2015
221	10/13/2015	6/30/2018
221	7/1/2018	12/31/9999

In this example I would want to see a single row for ID=221 that shows DateFrom = 11/11/2011 and DateThru = 12/31/9999. All but the first row show continuous date ranges with gaps of 1 day only.

Any thoughts?

SASKiwi · Posted 12-26-2018 06:42 PM

@rlafond - and here is yet a third way that doesn't require any joining of tables:

data have;
  input id datefrom :mmddyy10. datethru :mmddyy10.;
  format date: date9.;
datalines;
123 10/1/2004 5/25/2006
123 5/26/2006 8/28/2007
123 8/29/2007 3/17/2008
123 3/18/2008 10/14/2008
123 12/7/2015 12/31/2015
123 2/12/2016 12/31/9999
221 1/1/2005 5/26/2008 
221 11/11/2011 8/1/2015 
221 8/2/2015 10/12/2015 
221 10/13/2015 6/30/2018 
221 7/1/2018 12/31/9999
; 
run;

proc sort data =  have;
  by id datefrom;
run;

data want;
  keep id datefrom_new datethru;
  rename datefrom_new = datefrom; 
  set have;
  by id;
  retain datefrom_new datethru_last;
  format datefrom_new date9.;
  if first.id then do;
    datethru_last = .;
    datefrom_new = datefrom;
  end;
  else do;
    if datefrom - datethru_last > 1 then datefrom_new = datefrom;
  end;
  if last.id then output; 
  datethru_last = datethru;
run;

View solution in original post

mkeintz · Posted 12-26-2018 04:01 PM

What do you mean by "a break greater than one day"? Does that mean the DATEFROM in a record must start on the day immediately following DATETHRU in the preceding record? Or does it mean that DATEFROM can be the 2nd day after the prior DATETHRU, leaving exactly 1 day uncovered? I.e. what's a "break"?

Assuming you want no uncovered days, then you want to generate continuous time spans constructed from a sequence of records, and keep only the last of those time spans. I assume your data are sorted by ID DATEFROM, and that DATETHRU is always greater than or equal to DATEFROM:

data have;
  input id datefrom :mmddyy10. datethru :mmddyy10.;
  format date: date9.;
datalines;
123 10/1/2004 5/25/2006
123 5/26/2006 8/28/2007
123 8/29/2007 3/17/2008
123 3/18/2008 10/14/2008
123 12/7/2015 12/31/2015
123 2/12/2016 12/31/9999
221 1/1/2005 5/26/2008 
221 11/11/2011 8/1/2015 
221 8/2/2015 10/12/2015 
221 10/13/2015 6/30/2018 
221 7/1/2018 12/31/9999 
run;

data want (drop=_:);
  set have (keep=id);
  by id;
  merge have 
        have (firstobs=2 keep=datefrom rename=(datefrom=_nxt_from));

  if first.id=1 or datefrom-1>lag(datethru) then _datefrom=datefrom;
  else datefrom=_datefrom;

  if last.id=1;
  retain _datefrom;
run;

The task here is to detect whenever the incoming record is either the beginning of an id, or is more than 1 day after the preceding record. In that case store the current DATEFROM value into a retained variable (_DATEFROM), otherwise assign the retained _DATEFROM value into the current DATEFROM variable. No modifications are needed to DATETHRU. Then just keep the last incoming record, which would have its DATEFROM value properly updated, if needed.

Note if you actually will permit uncovered single days in successive records, just change

if first.id=1 or datefrom-1>lag(datethru) then ...

to

if first.id=1 or datefrom-2>lag(datethru) then ...

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 12-26-2018 04:45 PM


data have;
input ID	(DateFrom 	DateThru) (:mmddyy10.);
format datefrom datethru mmddyy10.;
cards;
123	10/1/2004	5/25/2006
123	5/26/2006	8/28/2007
123	8/29/2007	3/17/2008
123	3/18/2008	10/14/2008
123	12/7/2015	12/31/2015
123	2/12/2016	12/31/9999
221	1/1/2005	5/26/2008
221	11/11/2011	8/1/2015
221	8/2/2015	10/12/2015
221	10/13/2015	6/30/2018
221	7/1/2018	12/31/9999
;
data temp;
set have;
by id;
if first.id then f=1; 
If not first.id and DateFrom ne lag(DateThru)+1 then f+1;
run;

proc sql;
create table want as
select ID, min(datefrom) as Datefrom format=mmddyy10., max(DateThru) as Datethru format=mmddyy10.
from (select * from temp group by id having f=max(f))
group by id, f ;
quit;

SASKiwi · Posted 12-26-2018 06:42 PM

@rlafond - and here is yet a third way that doesn't require any joining of tables:

data have;
  input id datefrom :mmddyy10. datethru :mmddyy10.;
  format date: date9.;
datalines;
123 10/1/2004 5/25/2006
123 5/26/2006 8/28/2007
123 8/29/2007 3/17/2008
123 3/18/2008 10/14/2008
123 12/7/2015 12/31/2015
123 2/12/2016 12/31/9999
221 1/1/2005 5/26/2008 
221 11/11/2011 8/1/2015 
221 8/2/2015 10/12/2015 
221 10/13/2015 6/30/2018 
221 7/1/2018 12/31/9999
; 
run;

proc sort data =  have;
  by id datefrom;
run;

data want;
  keep id datefrom_new datethru;
  rename datefrom_new = datefrom; 
  set have;
  by id;
  retain datefrom_new datethru_last;
  format datefrom_new date9.;
  if first.id then do;
    datethru_last = .;
    datefrom_new = datefrom;
  end;
  else do;
    if datefrom - datethru_last > 1 then datefrom_new = datefrom;
  end;
  if last.id then output; 
  datethru_last = datethru;
run;

rlafond · Posted 12-27-2018 09:36 AM

This wasn't the first answer, but I'm a sucker for DO loops. Thank you!

learsaas · Posted 12-26-2018 08:51 PM

data result;
	set a;
        by id datefrom;
	retain nstart nend;
	if first.id or datefrom^=nend+1 then nstart=datefrom;
	nend=datethru;
	if last.id then do;
		datefrom=nstart;output;
	end;
	drop nstart nend;
run;

https://communities.sas.com/t5/New-SAS-User/How-do-I-reduce-duplicate-rows-observations-by-keeping-r...

SAS Innovate 2025: Call for Content

Classroom Training Available!