Hello,
I have monthly enrollment data and would like to roll them up into continuous enrollment segments. Note that some overlap in monthly enrollment data can happen because they are from different sources. I tried to create a date variable, that is, do date=start to end (30 lines if monthly enrollment is Sep01 to Sep30) and check if dif(date)>1 for triggering discontinuation of current enrollment. However, my data set gets extremely large (300B obs) and run time is several hours. I would appreciate if anyone could provide a much efficient approach to tackle this. An example of data input and desired output are shown below. Thanks!
Data input
PatientID Start End
00001 01/01/2018 01/31/2018
00001 02/01/2018 02/28/2018
00001 04/01/2018 04/30/2018
00001 05/01/2018 05/20/2018
00001 05/15/2018 05/31/2018
Output
PatientID Start End
00001 01/01/2018 02/28/2018
00001 04/01/2018 05/31/2018
Hi,
Yes, the code needed some minor changes, to reflect the difference in the input data. Also it needed a change to cope with multiple patient_id
Here's updated code, I think this should do what you are looking for
data source ;
infile cards ;
input pid $ start :mmddyy. end :mmddyy. ;
format start date9. end date9. ;
cards ;
00001 01/01/2018 01/31/2018
00001 02/01/2018 02/28/2018
00001 04/01/2018 04/30/2018
00001 05/01/2018 05/21/2018
00001 05/15/2018 05/31/2018
00002 01/01/2018 01/31/2018
00002 02/01/2018 02/28/2018
00002 04/01/2018 04/30/2018
00002 05/01/2018 05/31/2018
00002 05/15/2018 05/20/2018
;
run ;
proc sort ;
by pid start end ;
run ;
data output ;
retain start_date end_date ;
format start_date date9. end_date date9. ;
set source ;
by pid start end ;
if first.pid=1 then do ;
start_date=start ;
end_date=end ;
end ;
else do ;
if start>end_date+1 then do ;
output ;
start_date=start ;
end_date=end ;
end ;
else if end>end_date then do ;
end_date=end ;
end ;
end ;
if last.pid then
output ;
run ;
Hi,
This can be achieved by using the retain statement and only outputting when there's a gap in the dates
Here's an example using your sample data:
data source ; infile cards ; input pid $ start :mmddyy. end :mmddyy. ; format start date9. end date9. ; cards ; 00001 01/01/2018 01/31/2018 00001 02/01/2018 02/28/2018 00001 04/01/2018 04/30/2018 00001 05/01/2018 05/20/2018 00001 05/15/2018 05/31/2018 ; run ; data output ; retain start_date end_date ; format start_date date9. end_date date9. ; set source end=eof ; if _n_=1 then do ; start_date=start ; end_date=end ; end ; else do ; if start>end_date+1 then do ; output ; start_date=start ; end_date=end ; end ; else do ; end_date=end ; end ; end ; if eof then output ; run ;
Thanks AMSAS! I do have a question. There will be some time periods with overlaps (see below for another example), how should I sort the dataset in order to use the code you provided? As you can see, when SAS processes the last record, it will keep the start_date previously stored but reassign end_date to be 05/20/2018, which is supposed to be 05/31/2018.
00001 01/01/2018 01/31/2018 00001 02/01/2018 02/28/2018 00001 04/01/2018 04/30/2018 00001 05/01/2018 05/31/2018 00001 05/15/2018 05/20/2018
Hi,
Yes, the code needed some minor changes, to reflect the difference in the input data. Also it needed a change to cope with multiple patient_id
Here's updated code, I think this should do what you are looking for
data source ;
infile cards ;
input pid $ start :mmddyy. end :mmddyy. ;
format start date9. end date9. ;
cards ;
00001 01/01/2018 01/31/2018
00001 02/01/2018 02/28/2018
00001 04/01/2018 04/30/2018
00001 05/01/2018 05/21/2018
00001 05/15/2018 05/31/2018
00002 01/01/2018 01/31/2018
00002 02/01/2018 02/28/2018
00002 04/01/2018 04/30/2018
00002 05/01/2018 05/31/2018
00002 05/15/2018 05/20/2018
;
run ;
proc sort ;
by pid start end ;
run ;
data output ;
retain start_date end_date ;
format start_date date9. end_date date9. ;
set source ;
by pid start end ;
if first.pid=1 then do ;
start_date=start ;
end_date=end ;
end ;
else do ;
if start>end_date+1 then do ;
output ;
start_date=start ;
end_date=end ;
end ;
else if end>end_date then do ;
end_date=end ;
end ;
end ;
if last.pid then
output ;
run ;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.