Hi have the following data:
data have;
input id$ facility$ start end;
length id facility $1 start end 4.;
informat start end date9.;
format start end mmddyy10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;
run;
I would like to be able to combine all rows into a single row within a facility if difference between start date of current row and end date of previous row is =< 31. So the new start date would be the start date of earliest of combined rows and end date would be the end date of last row within a facility. The data I want looks like:
ID | Facility | Start | End | Comment |
1 | A | 01jan2010 | 30apr2010 | Combines rows 1-3 as gap between current start date and previous end date =< 31 days |
1 | A | 15jun2010 | 30jun2010 | New row as gap between current start date and previous end date > 31 days |
1 | B | 15jul2010 | 31aug2010 | New row as facility changes even if gap condition is satisfied |
1 | B | 01nov2010 | 31aug2011 | New row even if same facility as previous one but the gap is > 31 days |
1 | A | 01sep2011 | 31dec2011 | Rows 9-10 combined as they are within same facility and gape between start date of row 10 and end date of row 9 is =< 31 days |
Original data has ~400 million records and is sorted by ID and START but not by FACILITY. I would prefer to avoid resorting and merging. Records from one ID should not be retained into a different ID. Any suggestion are highly appreciated. Thanks
So assuming you do not want to collapse stays that use multiple facilities you can use the NOTSORTED keyword on the BY statement.
First let's setup the example data and make sure it sorted as you indicated.
data have;
length id facility $1 start end 4;
input id facility start end;
informat start end date.;
format start end yymmdd10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;
proc sort; by id start ; run;
Let's but the GAP value into a macro variable so we can more easily change it.
%let gap=31 ;
Now just process the data by ID and FACILITY. Remember the first start and latest end and output a record when you hit a gap or the last record in the facility.
data stays;
do until(last.facility);
set have;
by id facility notsorted ;
length stay first_start last_stop 4 ;
format first_start last_stop yymmdd10.;
if first.facility then link next;
else if start > (last_stop + &gap) then do;
output;
link next;
end;
else last_stop=end;
end;
output;
return;
next:
stay=sum(stay,1);
first_start=start;
last_stop=end ;
return;
drop start end ;
run;
Results
If you set the allowed GAP to only 15 days then will find more distinct stays.
Some kind of sort operation is needed, regardless of technique chosen.
I would go for the approach of having the data sorted by id (assume facilities within each id is what you are looking for), facility and start date.
Then use REATAIN for two new variables to keep your start and and dates across observations.
And then an explicit OUTPUT when the gap is greater than 31 days or your reach an new id/facility combination.
Hi LinusH,
Thanks for responding. As I mentioned in my original question, I would strongly prefer a solution without needing another sort on the data. It is already sorted by ID and start date. Original dataset has about 400 million records and 100s of columns.
I think This question is more complicated than you can expected.
Anyway ,the following code could give you a good start.
data have;
input id$ facility$ start end;
length id facility $1 start end 4.;
informat start end date9.;
format start end mmddyy10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;
run;
data temp;
set have;
do date=start to end;
output;
end;
format date date9.;
keep id facility date;
run;
proc sort data=temp nodupkey;
by id date;
run;
data temp2;
set temp;
by id facility notsorted;
if first.facility or dif(date)>31 then group+1;
run;
proc sql;
create table want as
select group,min(id) as id,min(facility) as facility,
min(date) as start format=date9.,max(date) as end format=date9.
from temp2
group by group;
quit;
Hi Ksharp,
Thanks for your suggestion. I think this could work but as I mentioned in my original post that the actual dataset has about 400 million rows and 100+ columns. Each data step takes 7-9 hours depending on complexity. I was thinking it could be done if there was a lead() like function where you could load the value of facility into current row from the subsequent row, it could in combination with lag() to compute gap could achieve what I want.
Thanks,
Arorata
Not sure what you plan to do with the values of those other hundred variables, but it you need to keep the values from the final observation for a group then using "lead" might be what you need.
In that case use two SET statements with the second one offset by one.
Let's add a ROW variable to simulate what would happen to those auxiliary variables.
data want;
set have ;
by id facility notsorted;
row+1;
set have(firstobs=2 keep=start rename=(start=next_start)) have(obs=1 drop=_all_);
if first.facility then first_start=start;
format first_start yymmdd10.;
retain first_start ;
if not last.facility then gap=next_start-end;
if (gap>31) or last.facility then do;
start=first_start;
output;
first_start=next_start;
end;
drop first_start next_start;
run;
Result
Hi Tom,
I just mentioned the number of columns because some of the solutions suggested resorting the original dataset or using multiple data steps, which I wanted to avoid given the huge size. You are correct that I do not need the last values for other variables so your original solution works for me. Just out of curiosity, does having two set statements offsetting by a row in order to "look into the future" have almost double the overhead in terms of computation time compared to the usual single set statement commonly used?
Kind regards,
No, because both SAS and your operating system will cache the file. So the second SET will get its observation from the cache most of the time. And when it doesn't the the next time the first SET runs it will find the observation in the cache.
Thanks for answering that. Much appreciated.
If your data looks like so clear as what you posted, that would be very easy.
data have;
input id$ facility$ start end;
length id facility $ 8;
informat start end date9.;
format start end mmddyy10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;
run;
data temp;
set have;
by id facility notsorted;
if first.facility or (start-lag(end))>31 then _group+1;
run;
data want;
set temp(rename=(start=_start));
by _group;
retain start .;
if first._group then start=_start;
if last._group;
drop _group _start;
format end start date9.;
run;
If there were two visits to facility A within 31 days of each other, but there was a visit to facility B in between, would you want to collapse the two visits to A? or leave them as separate?
Thinking of data like:
1 A 01jan2010 31jan2010 1 B 01feb2010 07feb2010 1 A 08feb2010 28feb2010
Hi Quentin,
Thanks for the response. I would leave them separate. Only consecutive visits within same facility need to be collapsed.
Kind regards,
So assuming you do not want to collapse stays that use multiple facilities you can use the NOTSORTED keyword on the BY statement.
First let's setup the example data and make sure it sorted as you indicated.
data have;
length id facility $1 start end 4;
input id facility start end;
informat start end date.;
format start end yymmdd10.;
datalines;
1 A 01jan2010 31jan2010
1 A 01mar2010 31mar2010
1 A 01apr2010 30apr2010
1 A 15jun2010 30jun2010
1 B 15jul2010 31aug2010
1 B 01nov2010 31dec2010
1 B 31jan2011 01mar2011
1 B 03mar2011 31aug2011
1 A 01sep2011 31oct2011
1 A 01dec2011 31dec2011
;
proc sort; by id start ; run;
Let's but the GAP value into a macro variable so we can more easily change it.
%let gap=31 ;
Now just process the data by ID and FACILITY. Remember the first start and latest end and output a record when you hit a gap or the last record in the facility.
data stays;
do until(last.facility);
set have;
by id facility notsorted ;
length stay first_start last_stop 4 ;
format first_start last_stop yymmdd10.;
if first.facility then link next;
else if start > (last_stop + &gap) then do;
output;
link next;
end;
else last_stop=end;
end;
output;
return;
next:
stay=sum(stay,1);
first_start=start;
last_stop=end ;
return;
drop start end ;
run;
Results
If you set the allowed GAP to only 15 days then will find more distinct stays.
Hi Tom,
Thanks for your suggestion. I think this will work for me with very minor tweaks. I will mark this as a solution.
Kind regards,
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.