BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
buszhangsy
Calcite | Level 5
 

Hello,

 

I have monthly enrollment data and would like to roll them up into continuous enrollment segments. Note that some overlap in monthly enrollment data can happen because they are from different sources. I tried to create a date variable, that is, do date=start to end (30 lines if monthly enrollment is Sep01 to Sep30) and check if dif(date)>1 for triggering discontinuation of current enrollment. However, my data set gets extremely large (300B obs) and run time is several hours. I would appreciate if anyone could provide a much efficient approach to tackle this. An example of data input and desired output are shown below. Thanks!

 

Data input

 

PatientID  Start                 End            

00001      01/01/2018      01/31/2018       

00001      02/01/2018      02/28/2018         

00001      04/01/2018      04/30/2018      

00001      05/01/2018      05/20/2018    

00001      05/15/2018      05/31/2018   

 

Output

PatientID  Start                 End

00001       01/01/2018    02/28/2018

00001       04/01/2018    05/31/2018

1 ACCEPTED SOLUTION

Accepted Solutions
AMSAS
SAS Super FREQ

Hi, 
Yes, the code needed some minor changes, to reflect the difference in the input data. Also it needed a change to cope with multiple patient_id 
Here's updated code, I think this should do what you are looking for

data source ;
	infile cards ;
	input pid $ start :mmddyy. end :mmddyy. ;
	format start date9. end date9. ;
cards ;
00001      01/01/2018      01/31/2018       
00001      02/01/2018      02/28/2018         
00001      04/01/2018      04/30/2018      
00001      05/01/2018      05/21/2018    
00001      05/15/2018      05/31/2018   
00002      01/01/2018      01/31/2018       
00002      02/01/2018      02/28/2018         
00002      04/01/2018      04/30/2018      
00002      05/01/2018      05/31/2018    
00002      05/15/2018      05/20/2018   
;
run ;

proc sort ;
	by pid start end ;
run ;

data output ;
	retain start_date end_date ;
	format start_date date9. end_date date9. ;
	set source ;
	by pid start end ;
	if first.pid=1 then do ;
		start_date=start ;
		end_date=end ;
	end ;
	else do ;
		if start>end_date+1 then do ;
			output ;
			start_date=start ;
			end_date=end ;
		end ;
		else if end>end_date then do ;
			end_date=end ;
		end ;
	end ;
	if last.pid then
		output ;
run ;




View solution in original post

3 REPLIES 3
AMSAS
SAS Super FREQ

Hi,

This can be achieved by using the retain statement and only outputting when there's a gap in the dates

Here's an example using your sample data:

data source ;
	infile cards ;
	input pid $ start :mmddyy. end :mmddyy. ;
	format start date9. end date9. ;
cards ;
00001      01/01/2018      01/31/2018       
00001      02/01/2018      02/28/2018         
00001      04/01/2018      04/30/2018      
00001      05/01/2018      05/20/2018    
00001      05/15/2018      05/31/2018   
;
run ;

data output ;
	retain start_date end_date ;
	format start_date date9. end_date date9. ;
	set source end=eof ;
	if _n_=1 then do ;
		start_date=start ;
		end_date=end ;
	end ;
	else do ;
		if start>end_date+1 then do ;
			output ;
			start_date=start ;
			end_date=end ;
		end ;
		else do ;
			end_date=end ;
		end ;
	end ;
	if eof then
		output ;
run ;
buszhangsy
Calcite | Level 5

Thanks AMSAS! I do have a question. There will be some time periods with overlaps (see below for another example), how should I sort the dataset in order to use the code you provided? As you can see, when SAS processes the last record, it will keep the start_date previously stored but reassign end_date to be 05/20/2018, which is supposed to be 05/31/2018.

 

00001      01/01/2018      01/31/2018       
00001      02/01/2018      02/28/2018         
00001      04/01/2018      04/30/2018      
00001      05/01/2018      05/31/2018    
00001      05/15/2018      05/20/2018   

 

AMSAS
SAS Super FREQ

Hi, 
Yes, the code needed some minor changes, to reflect the difference in the input data. Also it needed a change to cope with multiple patient_id 
Here's updated code, I think this should do what you are looking for

data source ;
	infile cards ;
	input pid $ start :mmddyy. end :mmddyy. ;
	format start date9. end date9. ;
cards ;
00001      01/01/2018      01/31/2018       
00001      02/01/2018      02/28/2018         
00001      04/01/2018      04/30/2018      
00001      05/01/2018      05/21/2018    
00001      05/15/2018      05/31/2018   
00002      01/01/2018      01/31/2018       
00002      02/01/2018      02/28/2018         
00002      04/01/2018      04/30/2018      
00002      05/01/2018      05/31/2018    
00002      05/15/2018      05/20/2018   
;
run ;

proc sort ;
	by pid start end ;
run ;

data output ;
	retain start_date end_date ;
	format start_date date9. end_date date9. ;
	set source ;
	by pid start end ;
	if first.pid=1 then do ;
		start_date=start ;
		end_date=end ;
	end ;
	else do ;
		if start>end_date+1 then do ;
			output ;
			start_date=start ;
			end_date=end ;
		end ;
		else if end>end_date then do ;
			end_date=end ;
		end ;
	end ;
	if last.pid then
		output ;
run ;




hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 3042 views
  • 0 likes
  • 2 in conversation