Hi all,
I need to create a continous enrollment dataset from a dataset which has enrollment for all months across 2011-2017. The dataset I have looks like this :
enrolid dtstart dtend
420 07/02/2011 07/31/2011
420 08/01/2011 08/31/2011
420 09/01/2011 09/30/2011
500 07/02/2011 07/31/2011
I only want a dataset looking like this:
enrolid period_start period_stop
420 07/01/2011 09/30/2011
How can i do this using retain and lag functions ?
There should be no gaps in the dates.. they should be continous
I am using Batch SAS
Thank you @PaigeMiller. I wasn't paying sufficient attention.
Now lag helps in detecting start of a new time span.
data want;
set have;
by enrolid;
merge have ( keep=dtend rename=(dtend=dtend_last))
have (firstobs=2 keep=dtstart rename=(dtstart=dtstart_next));
if first.enrolid or dtstart>sum(lag(dtend),1) then dtstart_beg=dtstart;
retain dtstart_beg;
if last.enrolid or dtstart_next>dtend+1;
drop dtstart dtend dtstart_next;
run;
data have;
input enrolid dtstart :mmddyy10. dtend :mmddyy10. ;
format dtstart dtend mmddyy10. ;
cards;
420 07/02/2011 07/31/2011
420 08/01/2011 08/31/2011
420 09/01/2011 09/30/2011
500 07/02/2011 07/31/2011
;
proc sql;
create table want as
select enrolid, min(dtstart) as period_start format= mmddyy10. ,max(dtend) as period_stop format= mmddyy10.
from have
group by enrolid;
quit;
I am new to sql could you show how to do this using SAS retain and lag functions ?
data want;
set have;
by enrolid;
retain _t;
if first.enrolid then _t=dtstart;
if last.enrolid then do;period_start=_t;
period_stop=dtend;output;end;
format period_start period_stop mmddyy10.;
drop dt: _:;
run;
Note: This assumes your dataset is sorted
Are you guaranteeing that there are no gaps in the dates?
If there are gaps in the dates would you want a second observation for the same ID?
There are gaps in the data.
if the gap is more than a day we have another row for the same enrolid..
There are gaps in the dates but whenever they are there we have another row for the enrollment id. gap means more than 1 day of no enrollment
I don't see a need for RETAIN here.
Assumes data is sorted by ENROLID and DTSTART.
*** UNTESTED CODE ***
data want;
set have;
by enrolid;
prev_dtend=lag(dtend);
if dtstart ^= prev_dtend+1 then continuous_dates+1;
run;
proc summary nway data=want;
class enrolid continuous_dates;
var dtstart dtend;
output out=want2 min(dtstart)=period_start max(dtend)=period_stop;
run;
You don't need a lag for this task. But retain is useful:
data want;
set have (rename=(dtend=dtend_last));
by enrolid;
if first.enrolid then dtstart_beg=dtstart;
retain dtstart_beg;
if last.enrolid;
run;
This doesn't meet the user requirement
There should be no gaps in the dates.. they should be continous
Thank you @PaigeMiller. I wasn't paying sufficient attention.
Now lag helps in detecting start of a new time span.
data want;
set have;
by enrolid;
merge have ( keep=dtend rename=(dtend=dtend_last))
have (firstobs=2 keep=dtstart rename=(dtstart=dtstart_next));
if first.enrolid or dtstart>sum(lag(dtend),1) then dtstart_beg=dtstart;
retain dtstart_beg;
if last.enrolid or dtstart_next>dtend+1;
drop dtstart dtend dtstart_next;
run;
what does
dtstart>sum(lag(dtend),1)
mean ?
and also i am not clear what this does ?
if last.enrolid or dtstart_next>dtend+1;
@manya92 wrote:
what does
dtstart>sum(lag(dtend),1)
mean ?
and also i am not clear what this does ?
if last.enrolid or dtstart_next>dtend+1;
dtstart>sum*lag(dtend),1)
is almost identical to
dtstart>lag(dtend)+1
except the latter will generate an error note for the first observation - because lag of anything is missing for the first obs.
The IF statement above is a "subsetting if" - i.e. the data step only keeps cases that satisfy the if expression.
The SET statement starts one data stream, and the MERGE statement starts another data stream. And they are synchronized streams.
Of course if you have two streams of data from data set HAVE, you'll have any common variable being read once from each stream. But you can't have two values occupy one variable, so I use the KEEP= and RENAME= parameters to avoid such collisions.
The SET statement is used for nothing more than to generate FIRST.ENROLID and LAST.ENROLID variables for easy detection of boundaries for each ID. The MERGE statement has one of the merged sub-streams start at FIRSTOBS=2 so that you can look ahead one observation. This makes it easy to tell if the next observation starts more than one day after the current obs ends.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.