Solved: Re: How to combine data from multiple rows into a single row based on ...

arorata · Posted 09-23-2025 03:26 PM

Hi have the following data:

data have;
	input id$ facility$  start end;
	length id facility $1 start end 4.;
	informat start end date9.;
	format start end mmddyy10.;
	datalines;
	1 A 01jan2010 31jan2010
	1 A 01mar2010 31mar2010
	1 A 01apr2010 30apr2010
	1 A 15jun2010 30jun2010
	1 B 15jul2010 31aug2010
	1 B 01nov2010 31dec2010
	1 B 31jan2011 01mar2011
	1 B 03mar2011 31aug2011
	1 A 01sep2011 31oct2011
	1 A 01dec2011 31dec2011
	;
run;

I would like to be able to combine all rows into a single row within a facility if difference between start date of current row and end date of previous row is =< 31. So the new start date would be the start date of earliest of combined rows and end date would be the end date of last row within a facility. The data I want looks like:

ID	Facility	Start	End	Comment
1	A	`01jan2010`	`30apr2010`	Combines rows 1-3 as gap between current start date and previous end date =< 31 days
1	A	`15jun2010`	`30jun2010`	New row as gap between current start date and previous end date > 31 days
1	B	`15jul2010`	`31aug2010`	New row as facility changes even if gap condition is satisfied
1	B	`01nov2010`	`31aug2011`	New row even if same facility as previous one but the gap is > 31 days
1	A	`01sep2011`	`31dec2011`	Rows 9-10 combined as they are within same facility and gape between start date of row 10 and end date of row 9 is =< 31 days

Original data has ~400 million records and is sorted by ID and START but not by FACILITY. I would prefer to avoid resorting and merging. Records from one ID should not be retained into a different ID. Any suggestion are highly appreciated. Thanks

Tom · Posted 09-24-2025 09:03 AM

So assuming you do not want to collapse stays that use multiple facilities you can use the NOTSORTED keyword on the BY statement.

First let's setup the example data and make sure it sorted as you indicated.

Spoiler

data have;
	length id facility $1 start end 4;
	input id facility start end;
	informat start end date.;
	format start end yymmdd10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;

proc sort; by id start ; run;

data have; length id facility $1 start end 4; input id facility start end; informat start end date.; format start end yymmdd10.; datalines; 1 A 01jan2010 31jan2010 1 A 01mar2010 31mar2010 1 A 01apr2010 30apr2010 1 A 15jun2010 30jun2010 1 B 15jul2010 31aug2010 1 B 01nov2010 31dec2010 1 B 31jan2011 01mar2011 1 B 03mar2011 31aug2011 1 A 01sep2011 31oct2011 1 A 01dec2011 31dec2011 ; proc sort; by id start ; run;

Let's but the GAP value into a macro variable so we can more easily change it.

%let gap=31 ;

Now just process the data by ID and FACILITY. Remember the first start and latest end and output a record when you hit a gap or the last record in the facility.

data stays;
 do until(last.facility);
   set have;
   by id facility notsorted ;
   length stay first_start last_stop 4 ;
   format first_start last_stop yymmdd10.;
   if first.facility then link next;
   else if start > (last_stop + &gap) then do;
      output;
      link next;
   end;
   else last_stop=end;
  end;
  output;
return;
next:
  stay=sum(stay,1);
  first_start=start; 
  last_stop=end ; 
return;
  drop start end ;
run;

Results

If you set the allowed GAP to only 15 days then will find more distinct stays.

View solution in original post

LinusH · Posted 09-24-2025 02:04 AM

Some kind of sort operation is needed, regardless of technique chosen.

I would go for the approach of having the data sorted by id (assume facilities within each id is what you are looking for), facility and start date.

Then use REATAIN for two new variables to keep your start and and dates across observations.

And then an explicit OUTPUT when the gap is greater than 31 days or your reach an new id/facility combination.

Data never sleeps

arorata · Posted 09-24-2025 11:45 AM

Hi LinusH,

Thanks for responding. As I mentioned in my original question, I would strongly prefer a solution without needing another sort on the data. It is already sorted by ID and start date. Original dataset has about 400 million records and 100s of columns.

Ksharp · Posted 09-24-2025 04:08 AM

I think This question is more complicated than you can expected.

Anyway ,the following code could give you a good start.

data have;
	input id$ facility$  start end;
	length id facility $1 start end 4.;
	informat start end date9.;
	format start end mmddyy10.;
	datalines;
	1 A 01jan2010 31jan2010
	1 A 01mar2010 31mar2010
	1 A 01apr2010 30apr2010
	1 A 15jun2010 30jun2010
	1 B 15jul2010 31aug2010
	1 B 01nov2010 31dec2010
	1 B 31jan2011 01mar2011
	1 B 03mar2011 31aug2011
	1 A 01sep2011 31oct2011
	1 A 01dec2011 31dec2011
	;
run;
data temp;
 set have;
 do date=start to end;
  output;
 end;
 format date date9.;
 keep id facility date;
run;
proc sort data=temp nodupkey;
by id date;
run;

data temp2;
 set temp;
 by id facility notsorted;
 if first.facility or dif(date)>31 then group+1;
run;

proc sql;
create table want as
select group,min(id) as id,min(facility) as facility,
   min(date) as start format=date9.,max(date) as end format=date9.
    from temp2
	 group by group;
quit;

arorata · Posted 09-24-2025 11:54 AM

Hi Ksharp,

Thanks for your suggestion. I think this could work but as I mentioned in my original post that the actual dataset has about 400 million rows and 100+ columns. Each data step takes 7-9 hours depending on complexity. I was thinking it could be done if there was a lead() like function where you could load the value of facility into current row from the subsequent row, it could in combination with lag() to compute gap could achieve what I want.

Thanks,

Arorata

Tom · Posted 09-24-2025 12:45 PM

Not sure what you plan to do with the values of those other hundred variables, but it you need to keep the values from the final observation for a group then using "lead" might be what you need.

In that case use two SET statements with the second one offset by one.

Let's add a ROW variable to simulate what would happen to those auxiliary variables.

data want;
  set have ;
  by id facility notsorted;
  row+1;
  set have(firstobs=2 keep=start rename=(start=next_start)) have(obs=1 drop=_all_);
  if first.facility then first_start=start;
  format first_start yymmdd10.;
  retain first_start ;
  if not last.facility then gap=next_start-end;
  if (gap>31) or last.facility then do;
    start=first_start;
    output;
    first_start=next_start;
  end;
  drop first_start next_start;
run;

Result

arorata · Posted 09-24-2025 01:06 PM

Hi Tom,

I just mentioned the number of columns because some of the solutions suggested resorting the original dataset or using multiple data steps, which I wanted to avoid given the huge size. You are correct that I do not need the last values for other variables so your original solution works for me. Just out of curiosity, does having two set statements offsetting by a row in order to "look into the future" have almost double the overhead in terms of computation time compared to the usual single set statement commonly used?

Kind regards,

Tom · Posted 09-24-2025 01:15 PM

No, because both SAS and your operating system will cache the file. So the second SET will get its observation from the cache most of the time. And when it doesn't the the next time the first SET runs it will find the observation in the cache.

arorata · Posted 09-24-2025 01:31 PM

Thanks for answering that. Much appreciated.

Ksharp · Posted 09-25-2025 03:05 AM

If your data looks like so clear as what you posted, that would be very easy.

data have;
	input id$ facility$  start end;
	length id facility $ 8;
	informat start end date9.;
	format start end mmddyy10.;
	datalines;
	1 A 01jan2010 31jan2010
	1 A 01mar2010 31mar2010
	1 A 01apr2010 30apr2010
	1 A 15jun2010 30jun2010
	1 B 15jul2010 31aug2010
	1 B 01nov2010 31dec2010
	1 B 31jan2011 01mar2011
	1 B 03mar2011 31aug2011
	1 A 01sep2011 31oct2011
	1 A 01dec2011 31dec2011
	;
run;
data temp;
 set have;
 by id facility notsorted;
 if first.facility or (start-lag(end))>31 then _group+1;
run;
data want;
set temp(rename=(start=_start));
by _group;
retain start .;
if first._group then start=_start;
if last._group;
drop _group _start;
format end start date9.;
run;

arorata · Posted 09-25-2025 11:31 AM

Hi Ksharp,
Thanks for the program. It works. If I could I would mark this too as a solution. Appreciate the help.
Kind regards,

Quentin · Posted 09-24-2025 08:24 AM

If there were two visits to facility A within 31 days of each other, but there was a visit to facility B in between, would you want to collapse the two visits to A? or leave them as separate?

Thinking of data like:

  1 A 01jan2010 31jan2010
  1 B 01feb2010 07feb2010
  1 A 08feb2010 28feb2010

arorata · Posted 09-24-2025 11:57 AM

Hi Quentin,

Thanks for the response. I would leave them separate. Only consecutive visits within same facility need to be collapsed.

Kind regards,

Tom · Posted 09-24-2025 09:03 AM

So assuming you do not want to collapse stays that use multiple facilities you can use the NOTSORTED keyword on the BY statement.

First let's setup the example data and make sure it sorted as you indicated.

Spoiler

data have;
	length id facility $1 start end 4;
	input id facility start end;
	informat start end date.;
	format start end yymmdd10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;

proc sort; by id start ; run;

data have; length id facility $1 start end 4; input id facility start end; informat start end date.; format start end yymmdd10.; datalines; 1 A 01jan2010 31jan2010 1 A 01mar2010 31mar2010 1 A 01apr2010 30apr2010 1 A 15jun2010 30jun2010 1 B 15jul2010 31aug2010 1 B 01nov2010 31dec2010 1 B 31jan2011 01mar2011 1 B 03mar2011 31aug2011 1 A 01sep2011 31oct2011 1 A 01dec2011 31dec2011 ; proc sort; by id start ; run;

Let's but the GAP value into a macro variable so we can more easily change it.

%let gap=31 ;

Now just process the data by ID and FACILITY. Remember the first start and latest end and output a record when you hit a gap or the last record in the facility.

data stays;
 do until(last.facility);
   set have;
   by id facility notsorted ;
   length stay first_start last_stop 4 ;
   format first_start last_stop yymmdd10.;
   if first.facility then link next;
   else if start > (last_stop + &gap) then do;
      output;
      link next;
   end;
   else last_stop=end;
  end;
  output;
return;
next:
  stay=sum(stay,1);
  first_start=start; 
  last_stop=end ; 
return;
  drop start end ;
run;

Results

If you set the allowed GAP to only 15 days then will find more distinct stays.

arorata · Posted 09-24-2025 01:00 PM

Hi Tom,

Thanks for your suggestion. I think this will work for me with very minor tweaks. I will mark this as a solution.

Kind regards,

How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Re: How to combine data from multiple rows into a single row based on conditions

Registration is open

Registration is open

SAS Training: Just a Click Away