Hi all,
So I have a new problem to solve with the data I've been working on. I am looking at clients that can have multiple episodes and each episode can have multiple events. I want to make it so that the dates on concurrent events have no gaps and do not overlap.
Here is what a simplified version of the data looks like:
Client_ID | Episode_ID | Event_ID | Begin_Date | End_Date | Event_Plcmnt |
1 | 1 | 1 | 20-May-19 | 23-May-19 | GC |
1 | 1 | 2 | 22-May-19 | 23-May-19 | OTHER |
1 | 1 | 3 | 23-May-19 | 23-Oct-19 | GC |
1 | 1 | 4 | 29-Jun-19 | 30-Jun-19 | OTHER |
1 | 1 | 6 | 14-Jul-19 | 15-Jul-19 | OTHER |
1 | 1 | 7 | 19-Jul-19 | 23-Jul-19 | OTHER |
1 | 1 | 5 | 25-Jul-19 | 8-Aug-19 | OTHER |
1 | 1 | 8 | 9-Aug-19 | 10-Aug-19 | OTHER |
1 | 1 | 9 | 11-Aug-19 | 12-Aug-19 | OTHER |
1 | 2 | 10 | 1-Nov-19 | 16-Apr-20 | URC |
2 | 3 | 11 | 19-Feb-16 | 1-Mar-16 | LRC |
2 | 3 | 12 | 5-Mar-16 | 14-Jun-16 | URC |
Here is what I want it to look like after:
Client_ID | Episode_ID | Event_ID | Begin_Date | End_Date | Event_Plcmnt |
1 | 1 | 1 | 20-May-19 | 22-May-19 | GC |
1 | 1 | 2 | 22-May-19 | 23-May-19 | OTHER |
1 | 1 | 3 | 23-May-19 | 29-Jun-19 | GC |
1 | 1 | 4 | 29-Jun-19 | 30-Jun-19 | OTHER |
1 | 1 | 3 | 30-Jun-19 | 14-Jul-19 | GC |
1 | 1 | 6 | 14-Jul-19 | 15-Jul-19 | OTHER |
1 | 1 | 3 | 15-Jul-19 | 19-Jul-19 | GC |
1 | 1 | 7 | 19-Jul-19 | 23-Jul-19 | OTHER |
1 | 1 | 3 | 23-Jul-19 | 25-Jul-19 | GC |
1 | 1 | 5 | 25-Jul-19 | 8-Aug-19 | OTHER |
1 | 1 | 3 | 8-Aug-19 | 9-Aug-19 | GC |
1 | 1 | 8 | 9-Aug-19 | 10-Aug-19 | OTHER |
1 | 1 | 3 | 10-Aug-19 | 11-Aug-19 | GC |
1 | 1 | 9 | 11-Aug-19 | 12-Aug-19 | OTHER |
1 | 1 | 3 | 12-Aug-19 | 23-Oct-19 | GC |
1 | 2 | 10 | 1-Nov-19 | 16-Apr-20 | URC |
2 | 3 | 11 | 19-Feb-16 | 5-Mar-16 | LRC |
2 | 3 | 12 | 5-Mar-16 | 14-Jun-16 | URC |
I've highlighted the new rows and changes to existing rows green.
The rules I am using:
1) If two events that are in the same episode that overlap the end_date of the older episode becomes the begin_date of the newer episode (see episode 3).
2) If there is a gap days between events in the same episode the end_date of the older episode becomes the begin_date of the newer episode (see episode 11).
3) If the dates of an event are encompassed by another event in the same episode then the end_date of the encompassing episode becomes the begin_date of the encompassed episode and a new row is created with the Event_ID and the Event_Plcmnt of the compassing episode that has the begin_date of the encompassed episode and either the end_date of the original encompassing event or, if there is more than one encompassed event, the begin_date of the next encompassed event.
Some notes about the data:
Here is the data in a usable form:
data have;
infile datalines truncover;
input Client_ID $ Episode_ID $ Event_ID $ Begin_Date End_Date Event_Plcmnt;
datalines;
001 001 001 20MAY2019 23MAY2019 GC
001 001 002 22MAY2019 23MAY2019 OTHER
001 001 003 23MAY2019 23OCT2019 GC
001 001 004 29JUN2019 30JUN2019 OTHER
001 001 006 14JUL2019 15JUL2019 OTHER
001 001 007 19JUL2019 23JUL2019 OTHER
001 001 005 25JUL2019 08AUG2019 OTHER
001 001 008 09AUG2019 10AUG2019 OTHER
001 001 009 11AUG2019 12AUG2019 OTHER
001 002 010 01NOV2019 16APR2020 URC
002 003 011 19FEB2016 01MAR2016 LRC
002 003 012 05MAR2016 14JUN2016 URC
;
Hopefully this is clear enough. I tried using the lag function but there can sometimes be 10 or more event nested inside another it got too complex to keep track of.
Best,
Marc
Please alter to answer your needs:
data BOUNDARIES(index=(A=(CLIENT_ID EPISODE_ID INTERVAL_START)));
merge HAVE
HAVE(firstobs=2 keep=CLIENT_ID EPISODE_ID BEGIN_DATE rename=(CLIENT_ID=C EPISODE_ID=E BEGIN_DATE=B));
retain INTERVAL_START INTERVAL_END;
format INTERVAL_END INTERVAL_START date9.;
if INTERVAL_END=. then INTERVAL_START=BEGIN_DATE;
if INTERVAL_END=END_DATE then END_FLAG+1;
INTERVAL_END=max(INTERVAL_END,END_DATE);
if END_FLAG %* max end date matched ;
| B > INTERVAL_END %* next start date > end date ;
| (E ne EPISODE_ID and C ne CLIENT_ID) %* new client or new episode ;
then do;
output;
call missing(INTERVAL_END, END_FLAG);
INTERVAL_NB+1;
end;
keep CLIENT_ID EPISODE_ID INTERVAL_START INTERVAL_END INTERVAL_NB;
run;
proc sort data=HAVE(keep=CLIENT_ID EPISODE_ID END_DATE) out=ENDS;
by CLIENT_ID EPISODE_ID END_DATE;
run;
data ALL_INTERVALS;
set HAVE(keep=CLIENT_ID EPISODE_ID BEGIN_DATE rename=(BEGIN_DATE=END_DATE))
ENDS ;
by CLIENT_ID EPISODE_ID END_DATE;
BEGIN_DATE=lag(END_DATE);
if ^first.EPISODE_ID and BEGIN_DATE ne END_DATE then output;
format BEGIN_DATE END_DATE date9.;
run;
data WANT;
merge HAVE (drop=END_DATE)
ALL_INTERVALS ;
by CLIENT_ID EPISODE_ID BEGIN_DATE;
set BOUNDARIES(rename=(INTERVAL_START=BEGIN_DATE)) key=A;
if _IORC_ then _ERROR_=0;
INTERVAL_END=coalesce(INTERVAL_END,lag(INTERVAL_END));
output;
if END_DATE = INTERVAL_END then INTERVAL_END=.;
retain INTERVAL:;
run;
Period MAY2020 - APR2021 |
Obs | CLIENT_ID | EPISODE_ID | EVENT_ID | BEGIN_DATE | EVENT_PLCMNT | END_DATE | INTERVAL_END | INTERVAL_NB |
---|---|---|---|---|---|---|---|---|
1 | 001 | 001 | 001 | 20MAY2019 | GC | 22MAY2019 | 23MAY2019 | 0 |
2 | 001 | 001 | 002 | 22MAY2019 | OTHER | 23MAY2019 | 23MAY2019 | 0 |
3 | 001 | 001 | 003 | 23MAY2019 | GC | 29JUN2019 | 23OCT2019 | 1 |
4 | 001 | 001 | 004 | 29JUN2019 | OTHER | 30JUN2019 | 23OCT2019 | 1 |
5 | 001 | 001 | 30JUN2019 | 14JUL2019 | 23OCT2019 | 1 | ||
6 | 001 | 001 | 006 | 14JUL2019 | OTHER | 15JUL2019 | 23OCT2019 | 1 |
7 | 001 | 001 | 15JUL2019 | 19JUL2019 | 23OCT2019 | 1 | ||
8 | 001 | 001 | 007 | 19JUL2019 | OTHER | 23JUL2019 | 23OCT2019 | 1 |
9 | 001 | 001 | 23JUL2019 | 25JUL2019 | 23OCT2019 | 1 | ||
10 | 001 | 001 | 005 | 25JUL2019 | OTHER | 08AUG2019 | 23OCT2019 | 1 |
11 | 001 | 001 | 08AUG2019 | 09AUG2019 | 23OCT2019 | 1 | ||
12 | 001 | 001 | 008 | 09AUG2019 | OTHER | 10AUG2019 | 23OCT2019 | 1 |
13 | 001 | 001 | 10AUG2019 | 11AUG2019 | 23OCT2019 | 1 | ||
14 | 001 | 001 | 009 | 11AUG2019 | OTHER | 12AUG2019 | 23OCT2019 | 1 |
15 | 001 | 001 | 12AUG2019 | 23OCT2019 | 23OCT2019 | 1 | ||
16 | 001 | 002 | 010 | 01NOV2019 | URC | 16APR2020 | 16APR2020 | 2 |
17 | 002 | 003 | 011 | 19FEB2016 | LRC | 01MAR2016 | 01MAR2016 | 3 |
18 | 002 | 003 | 01MAR2016 | 05MAR2016 | 01MAR2016 | 3 | ||
19 | 002 | 003 | 012 | 05MAR2016 | URC | 14JUN2016 | 14JUN2016 | 4 |
1. Your code does not run. Always paste it back to SAS and vet it please.
2. Please use the appropriate icon to paste code.
3. I would extract the dates, and then merge back by start date to the original data:
proc sort data=HAVE(keep=CLIENT_ID END_DATE) out=END;
by CLIENT_ID END_DATE;
run;
data INTERVALS;
format BEGIN_DATE END_DATE date9.;
set HAVE(keep=CLIENT_ID BEGIN_DATE rename=(BEGIN_DATE=END_DATE))
END ;
by CLIENT_ID END_DATE;
BEGIN_DATE=lag(END_DATE);
if ^first.CLIENT_ID then output;
run;
CLIENT_ID | BEGIN_DATE | END_DATE |
---|---|---|
001 | 20MAY2019 | 22MAY2019 |
001 | 22MAY2019 | 23MAY2019 |
001 | 23MAY2019 | 23MAY2019 |
001 | 23MAY2019 | 23MAY2019 |
001 | 23MAY2019 | 29JUN2019 |
001 | 29JUN2019 | 30JUN2019 |
001 | 30JUN2019 | 14JUL2019 |
001 | 14JUL2019 | 15JUL2019 |
001 | 15JUL2019 | 19JUL2019 |
001 | 19JUL2019 | 23JUL2019 |
001 | 23JUL2019 | 25JUL2019 |
001 | 25JUL2019 | 08AUG2019 |
001 | 08AUG2019 | 09AUG2019 |
001 | 09AUG2019 | 10AUG2019 |
001 | 10AUG2019 | 11AUG2019 |
001 | 11AUG2019 | 12AUG2019 |
001 | 12AUG2019 | 23OCT2019 |
001 | 23OCT2019 | 01NOV2019 |
001 | 01NOV2019 | 16APR2020 |
002 | 19FEB2016 | 01MAR2016 |
002 | 01MAR2016 | 05MAR2016 |
002 | 05MAR2016 | 14JUN2016 |
Sorry @ChrisNZ . I was rushing and I apologize. I will try not to let happen again. I'll try to run your code in the morning.
data have;
infile datalines truncover;
input Client_ID $ Episode_ID $ Event_ID $ Begin_Date date9. End_Date date9. Event_Plcmnt $;
datalines;
001 001 001 20MAY2019 23MAY2019 GC
001 001 002 22MAY2019 23MAY2019 OTHER
001 001 003 23MAY2019 23OCT2019 GC
001 001 004 29JUN2019 30JUN2019 OTHER
001 001 006 14JUL2019 15JUL2019 OTHER
001 001 007 19JUL2019 23JUL2019 OTHER
001 001 005 25JUL2019 08AUG2019 OTHER
001 001 008 09AUG2019 10AUG2019 OTHER
001 001 009 11AUG2019 12AUG2019 OTHER
001 002 010 01NOV2019 16APR2020 URC
002 003 011 19FEB2016 01MAR2016 LRC
002 003 012 05MAR2016 14JUN2016 URC
;
@
It seems to have generated some extra lines.(in red)..is there any way to fix it?
CLIENT_ID BEGIN_DATE END_DATE
001 | 20MAY2019 | 22MAY2019 |
001 | 22MAY2019 | 23MAY2019 |
001 | 23MAY2019 | 23MAY2019 |
001 | 23MAY2019 | 23MAY2019 |
001 | 23MAY2019 | 29JUN2019 |
001 | 29JUN2019 | 30JUN2019 |
001 | 30JUN2019 | 14JUL2019 |
001 | 14JUL2019 | 15JUL2019 |
001 | 15JUL2019 | 19JUL2019 |
001 | 19JUL2019 | 23JUL2019 |
001 | 23JUL2019 | 25JUL2019 |
001 | 25JUL2019 | 08AUG2019 |
001 | 08AUG2019 | 09AUG2019 |
001 | 09AUG2019 | 10AUG2019 |
001 | 10AUG2019 | 11AUG2019 |
001 | 11AUG2019 | 12AUG2019 |
001 | 12AUG2019 | 23OCT2019 |
001 | 23OCT2019 | 01NOV2019 |
001 | 01NOV2019 | 16APR2020 |
002 | 19FEB2016 | 01MAR2016 |
002 | 01MAR2016 | 05MAR2016 |
002 | 05MAR2016 | 14JUN2016 |
The first 2 are easy to remove.
if ^first.CLIENT_ID and START_DATE ne END_DATE then output;
I don't know why the last 2 are unwanted.
Sorry I highlighted the wrong one. This is what I meant to do:
CLIENT_ID | BEGIN_DATE | END_DATE |
1 | 20-May-19 | 22-May-19 |
1 | 22-May-19 | 23-May-19 |
1 | 23-May-19 | 23-May-19 |
1 | 23-May-19 | 23-May-19 |
1 | 23-May-19 | 29-Jun-19 |
1 | 29-Jun-19 | 30-Jun-19 |
1 | 30-Jun-19 | 14-Jul-19 |
1 | 14-Jul-19 | 15-Jul-19 |
1 | 15-Jul-19 | 19-Jul-19 |
1 | 19-Jul-19 | 23-Jul-19 |
1 | 23-Jul-19 | 25-Jul-19 |
1 | 25-Jul-19 | 8-Aug-19 |
1 | 8-Aug-19 | 9-Aug-19 |
1 | 9-Aug-19 | 10-Aug-19 |
1 | 10-Aug-19 | 11-Aug-19 |
1 | 11-Aug-19 | 12-Aug-19 |
1 | 12-Aug-19 | 23-Oct-19 |
1 | 23-Oct-19 | 1-Nov-19 |
1 | 1-Nov-19 | 16-Apr-20 |
2 | 19-Feb-16 | 1-Mar-16 |
2 | 1-Mar-16 | 5-Mar-16 |
2 | 5-Mar-16 | 14-Jun-16 |
The row with the begin_date 23OCT19 shouldn't be there because the event with the 1NOV19 begin_date has a different ID_Episode than the event with the begin_date of 12AUG19.
The row with the begin_date 01MAR16 shouldn't be there because the event with the begin_date 19FEB16 doesn't overlap the event with the begin date 05MAR16. There is a gap between in the dates between the events. When this occurs I need the end_date of the row beginning 19FEB16 to change to the begin_date of the following event (so 05MAR16). I hope I am being clear...it is hard to explain these thing properly via this forum for me.
Please alter to answer your needs:
data BOUNDARIES(index=(A=(CLIENT_ID EPISODE_ID INTERVAL_START)));
merge HAVE
HAVE(firstobs=2 keep=CLIENT_ID EPISODE_ID BEGIN_DATE rename=(CLIENT_ID=C EPISODE_ID=E BEGIN_DATE=B));
retain INTERVAL_START INTERVAL_END;
format INTERVAL_END INTERVAL_START date9.;
if INTERVAL_END=. then INTERVAL_START=BEGIN_DATE;
if INTERVAL_END=END_DATE then END_FLAG+1;
INTERVAL_END=max(INTERVAL_END,END_DATE);
if END_FLAG %* max end date matched ;
| B > INTERVAL_END %* next start date > end date ;
| (E ne EPISODE_ID and C ne CLIENT_ID) %* new client or new episode ;
then do;
output;
call missing(INTERVAL_END, END_FLAG);
INTERVAL_NB+1;
end;
keep CLIENT_ID EPISODE_ID INTERVAL_START INTERVAL_END INTERVAL_NB;
run;
proc sort data=HAVE(keep=CLIENT_ID EPISODE_ID END_DATE) out=ENDS;
by CLIENT_ID EPISODE_ID END_DATE;
run;
data ALL_INTERVALS;
set HAVE(keep=CLIENT_ID EPISODE_ID BEGIN_DATE rename=(BEGIN_DATE=END_DATE))
ENDS ;
by CLIENT_ID EPISODE_ID END_DATE;
BEGIN_DATE=lag(END_DATE);
if ^first.EPISODE_ID and BEGIN_DATE ne END_DATE then output;
format BEGIN_DATE END_DATE date9.;
run;
data WANT;
merge HAVE (drop=END_DATE)
ALL_INTERVALS ;
by CLIENT_ID EPISODE_ID BEGIN_DATE;
set BOUNDARIES(rename=(INTERVAL_START=BEGIN_DATE)) key=A;
if _IORC_ then _ERROR_=0;
INTERVAL_END=coalesce(INTERVAL_END,lag(INTERVAL_END));
output;
if END_DATE = INTERVAL_END then INTERVAL_END=.;
retain INTERVAL:;
run;
Period MAY2020 - APR2021 |
Obs | CLIENT_ID | EPISODE_ID | EVENT_ID | BEGIN_DATE | EVENT_PLCMNT | END_DATE | INTERVAL_END | INTERVAL_NB |
---|---|---|---|---|---|---|---|---|
1 | 001 | 001 | 001 | 20MAY2019 | GC | 22MAY2019 | 23MAY2019 | 0 |
2 | 001 | 001 | 002 | 22MAY2019 | OTHER | 23MAY2019 | 23MAY2019 | 0 |
3 | 001 | 001 | 003 | 23MAY2019 | GC | 29JUN2019 | 23OCT2019 | 1 |
4 | 001 | 001 | 004 | 29JUN2019 | OTHER | 30JUN2019 | 23OCT2019 | 1 |
5 | 001 | 001 | 30JUN2019 | 14JUL2019 | 23OCT2019 | 1 | ||
6 | 001 | 001 | 006 | 14JUL2019 | OTHER | 15JUL2019 | 23OCT2019 | 1 |
7 | 001 | 001 | 15JUL2019 | 19JUL2019 | 23OCT2019 | 1 | ||
8 | 001 | 001 | 007 | 19JUL2019 | OTHER | 23JUL2019 | 23OCT2019 | 1 |
9 | 001 | 001 | 23JUL2019 | 25JUL2019 | 23OCT2019 | 1 | ||
10 | 001 | 001 | 005 | 25JUL2019 | OTHER | 08AUG2019 | 23OCT2019 | 1 |
11 | 001 | 001 | 08AUG2019 | 09AUG2019 | 23OCT2019 | 1 | ||
12 | 001 | 001 | 008 | 09AUG2019 | OTHER | 10AUG2019 | 23OCT2019 | 1 |
13 | 001 | 001 | 10AUG2019 | 11AUG2019 | 23OCT2019 | 1 | ||
14 | 001 | 001 | 009 | 11AUG2019 | OTHER | 12AUG2019 | 23OCT2019 | 1 |
15 | 001 | 001 | 12AUG2019 | 23OCT2019 | 23OCT2019 | 1 | ||
16 | 001 | 002 | 010 | 01NOV2019 | URC | 16APR2020 | 16APR2020 | 2 |
17 | 002 | 003 | 011 | 19FEB2016 | LRC | 01MAR2016 | 01MAR2016 | 3 |
18 | 002 | 003 | 01MAR2016 | 05MAR2016 | 01MAR2016 | 3 | ||
19 | 002 | 003 | 012 | 05MAR2016 | URC | 14JUN2016 | 14JUN2016 | 4 |
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.