Hello,
I am trying to do a vertical comparison of my data and am unable to figure it out.
I have a dataset with patient claim data. Some claims are grouped into episodes while the others are left ungrouped(episode=0) by a previous step. An episode can contain one or more rows. Here's a mockup of my data:
data test;
input patID order startdt enddt Episode rank;
informat startdt yymmdd. enddt yymmdd.;
format startdt yymmddn8. enddt yymmddn8.;
cards;
101 1 20120120 20120120 0 0
101 2 20120131 20120131 97 145
101 3 20120211 20120211 96 197
101 4 20120213 20120319 92 510
101 5 20120226 20120301 97 145
101 6 20120227 20120227 95 229
101 7 20120301 20120401 97 145
101 8 20120317 20120617 0 0
101 9 20120421 20120421 0 0
101 10 20120501 20120701 0 0
101 11 20120602 20120902 98 513
101 12 20120630 20120630 0 0
101 13 20120701 20120701 0 0
101 15 20120804 20120804 0 0
;
run;
The goal is to assign these claims with episode=0 to existing episodes based on the following steps:
1. For each episode, the start date=min(startdt) of all claims in the episode and the end date=max(enddt) of claims in the episode. So, for episode 97, episode startdt=20120131 and enddt=20120401.
2. The start date of an orphan claim should lie between episode startdt and episode enddt. If there is more than one episode that satisfies this condition, then pick the one with the lowest rank.
3. Once the orphan claim is added to an episode, recalculate the episode enddt and then proceed to the next claim.
So, for the data above, the first row is an orphan claim. But since its dates do not overlap with any episode dates, we leave it as is and then go to the next claim with episode=0. For row 8, the startdt=20120317. So, it can either be added to episode 92(start=20120213, end=20120319) or episode 97(start=20120131, end=20120401). Since episode 97 has the lowest rank of the two, row8 is assigned to episode 97. Now, episodeID for row8 is 97 instead of 0, and enddt for episode 97 is updated to 20120617 since that is the maximum. For row9, the episode is set to 97 as well since its startdt(20120421) falls between start and updated end of episode 97(20120131, 20120617). This is continued till we reach the last row.
Here's what I tried: I created a lookup dataset with non zero episodes, date and rank. Then I made a left join with my table to create a cartesian product. Then I picked the row that satisfied the date overlap condition. This works to an extent, but not completely as I am unable to dynamically update the episode end date to check the overlap. Any advice of how to go about achieving this would be greatly beneficial. Thank you!
Something like this?
You'll have to iterate until the table ORPHANS is empty.
proc sql;
create table ORPHANS as
select a.PATID
, a.ORDER
, a.STARTDT
, b.ENDDT
, b.EPISODE
, b.RANK
from HAVE(where=(EPISODE=0)) a
left join
HAVE(where=(EPISODE>0)) b
on a.PATID=b.PATID
and a.STARTDT between b.STARTDT and b.ENDDT
group by a.PATID, a.ORDER
having b.RANK=min(b.RANK)
order by PATID, ORDER, RANK
;
create table OUT as
select a.PATID
, a.ORDER
, a.STARTDT
, max(a.ENDDT ,b.ENDDT ) as ENDDT format=yymmdd8.
, max(a.EPISODE,b.EPISODE) as EPISODE
, max(a.RANK ,b.RANK ) as RANK
from HAVE a
left join
ORPHANS(where=(RANK)) b
on a.PATID=b.PATID
and a.ORDER=b.ORDER;
Something like this?
You'll have to iterate until the table ORPHANS is empty.
proc sql;
create table ORPHANS as
select a.PATID
, a.ORDER
, a.STARTDT
, b.ENDDT
, b.EPISODE
, b.RANK
from HAVE(where=(EPISODE=0)) a
left join
HAVE(where=(EPISODE>0)) b
on a.PATID=b.PATID
and a.STARTDT between b.STARTDT and b.ENDDT
group by a.PATID, a.ORDER
having b.RANK=min(b.RANK)
order by PATID, ORDER, RANK
;
create table OUT as
select a.PATID
, a.ORDER
, a.STARTDT
, max(a.ENDDT ,b.ENDDT ) as ENDDT format=yymmdd8.
, max(a.EPISODE,b.EPISODE) as EPISODE
, max(a.RANK ,b.RANK ) as RANK
from HAVE a
left join
ORPHANS(where=(RANK)) b
on a.PATID=b.PATID
and a.ORDER=b.ORDER;
Hi @ChrisNZ ,
This is great, thanks! I will put this in a loop that ends when the orphans table is unchanged between two iterations since it's okay to still have orphan claims at this end of this stage (the first row in my example will remain unmerged into any of the episodes since its date range doesn't overlap with any existing episode). Thank you very much for helping me out!
Isn't this what is taken care of by the iterations ?
Also maybe the join that creates ORPHANS should be an inner join.
@ChrisNZ No, because the iterations only look at rows where episode=0. Once an episode value is assigned to a row, it does not update the row in any way. The orphans table only takes the rows with episode=0 into account.
For simplicity's sake, consider this dataset:
data have;
input patID order startdt enddt Episode rank;
informat startdt yymmdd. enddt yymmdd.;
format startdt yymmddn8. enddt yymmddn8.;
cards;
101 1 20120120 20120120 0 0
101 2 20120131 20120131 97 145
101 4 20120213 20120422 92 510
101 5 20120226 20120301 97 145
101 7 20120301 20120401 97 145
101 8 20120317 20120617 0 0
101 9 20120421 20120421 0 0
;
run;
When I run the code, orphans dataset contains rows 8 and 9. When we perform the left join to create OUT dataset, this is what we get:
patID order startdt ENDDT EPISODE RANK
101 1 20120120 20120120 0 0
101 2 20120131 20120131 97 145
101 4 20120213 20120422 92 510
101 5 20120226 20120301 97 145
101 7 20120301 20120401 97 145
101 8 20120317 20120617 97 145
101 9 20120421 20120422 92 510
Here, row8 contains episode=97 and row9 contains episode=92 because of the dates in the input. But, row8 being added to episode97 should expand the end date to 20120617 thus pulling row9 into the same episode as well. So, row9 should be episode=97, not episode=92(since out of 92 and 97, the lowest rank is of 97). So, before we exit the data step , is there a way to update the dates after each row?
Something like this then:
create table OUT as
select a.PATID
, a.ORDER
, a.STARTDT
, max(c.ENDDT ,b.ENDDT ) as ENDDT format=yymmdd8.
, max(a.EPISODE,b.EPISODE) as EPISODE
, max(a.RANK ,b.RANK ) as RANK
from HAVE a
left join
ORPHANS(where=(RANK)) b
on a.PATID=b.PATID
and a.ORDER=b.ORDER
left join
(select PATID, EPISODE, max(ENDDT) as ENDDT from HAVE group by 1,2 ) c
on a.PATID=c.PATID and max(a.EPISODE,b.EPISODE)=c.EPISODE;
Or maybe it's better to redefine the end dates in a third query rather than an additional left join.
Hello,
Can you post the output table you want?
Thanks.
Hi @kelxxx ,
Here's the output I want to get:
data want;
input patID order startdt enddt Episode rank;
informat startdt yymmdd. enddt yymmdd.;
format startdt yymmddn8. enddt yymmddn8.;
cards;
101 1 20120120 20120120 0 0
101 2 20120131 20120131 97 145
101 3 20120211 20120211 96 197
101 4 20120213 20120319 92 510
101 5 20120226 20120301 97 145
101 6 20120227 20120227 95 229
101 7 20120301 20120401 97 145
101 8 20120317 20120617 97 145
101 9 20120421 20120421 97 145
101 10 20120501 20120701 97 145
101 11 20120602 20120902 98 513
101 12 20120630 20120630 98 513
101 13 20120701 20120701 98 513
101 15 20120804 20120804 98 513
;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.