Solved: Re: Assigning rows with blank groupID to existing groups

tk89 · Posted 04-14-2020 01:59 AM

Hello,

I am trying to do a vertical comparison of my data and am unable to figure it out.

I have a dataset with patient claim data. Some claims are grouped into episodes while the others are left ungrouped(episode=0) by a previous step. An episode can contain one or more rows. Here's a mockup of my data:

data test;
input patID order startdt enddt Episode rank;
informat startdt yymmdd. enddt yymmdd.;
format startdt yymmddn8. enddt yymmddn8.;
cards;
101 1 20120120 20120120 0 0
101 2 20120131 20120131 97 145
101 3 20120211 20120211 96 197
101 4 20120213 20120319 92 510
101 5 20120226 20120301 97 145
101 6 20120227 20120227 95 229
101 7 20120301 20120401 97 145
101 8 20120317 20120617 0 0 
101 9 20120421 20120421 0 0
101 10 20120501 20120701 0 0
101 11 20120602 20120902 98 513
101 12 20120630 20120630 0 0
101 13 20120701 20120701 0 0
101 15 20120804 20120804 0 0
;
run;

The goal is to assign these claims with episode=0 to existing episodes based on the following steps:

1. For each episode, the start date=min(startdt) of all claims in the episode and the end date=max(enddt) of claims in the episode. So, for episode 97, episode startdt=20120131 and enddt=20120401.

2. The start date of an orphan claim should lie between episode startdt and episode enddt. If there is more than one episode that satisfies this condition, then pick the one with the lowest rank.

3. Once the orphan claim is added to an episode, recalculate the episode enddt and then proceed to the next claim.

So, for the data above, the first row is an orphan claim. But since its dates do not overlap with any episode dates, we leave it as is and then go to the next claim with episode=0. For row 8, the startdt=20120317. So, it can either be added to episode 92(start=20120213, end=20120319) or episode 97(start=20120131, end=20120401). Since episode 97 has the lowest rank of the two, row8 is assigned to episode 97. Now, episodeID for row8 is 97 instead of 0, and enddt for episode 97 is updated to 20120617 since that is the maximum. For row9, the episode is set to 97 as well since its startdt(20120421) falls between start and updated end of episode 97(20120131, 20120617). This is continued till we reach the last row.

Here's what I tried: I created a lookup dataset with non zero episodes, date and rank. Then I made a left join with my table to create a cartesian product. Then I picked the row that satisfied the date overlap condition. This works to an extent, but not completely as I am unable to dynamically update the episode end date to check the overlap. Any advice of how to go about achieving this would be greatly beneficial. Thank you!

ChrisNZ · Posted 04-15-2020 02:19 AM

Something like this?

You'll have to iterate until the table ORPHANS is empty.

proc sql;
  create table ORPHANS as 
  select a.PATID
       , a.ORDER
       , a.STARTDT 
       , b.ENDDT
       , b.EPISODE
       , b.RANK 
  from HAVE(where=(EPISODE=0))  a
         left join
       HAVE(where=(EPISODE>0))  b
         on  a.PATID=b.PATID
         and a.STARTDT between b.STARTDT and b.ENDDT
  group by a.PATID, a.ORDER
  having b.RANK=min(b.RANK)
  order by PATID, ORDER, RANK
  ;
  create table OUT as
  select a.PATID
       , a.ORDER
       , a.STARTDT 
       , max(a.ENDDT  ,b.ENDDT  ) as ENDDT format=yymmdd8.
       , max(a.EPISODE,b.EPISODE) as EPISODE
       , max(a.RANK   ,b.RANK   ) as RANK
  from HAVE                    a
         left join
       ORPHANS(where=(RANK))   b 
         on  a.PATID=b.PATID
         and a.ORDER=b.ORDER;

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 04-15-2020 02:19 AM

Something like this?

You'll have to iterate until the table ORPHANS is empty.

proc sql;
  create table ORPHANS as 
  select a.PATID
       , a.ORDER
       , a.STARTDT 
       , b.ENDDT
       , b.EPISODE
       , b.RANK 
  from HAVE(where=(EPISODE=0))  a
         left join
       HAVE(where=(EPISODE>0))  b
         on  a.PATID=b.PATID
         and a.STARTDT between b.STARTDT and b.ENDDT
  group by a.PATID, a.ORDER
  having b.RANK=min(b.RANK)
  order by PATID, ORDER, RANK
  ;
  create table OUT as
  select a.PATID
       , a.ORDER
       , a.STARTDT 
       , max(a.ENDDT  ,b.ENDDT  ) as ENDDT format=yymmdd8.
       , max(a.EPISODE,b.EPISODE) as EPISODE
       , max(a.RANK   ,b.RANK   ) as RANK
  from HAVE                    a
         left join
       ORPHANS(where=(RANK))   b 
         on  a.PATID=b.PATID
         and a.ORDER=b.ORDER;

High-Performance SAS Coding - Third Edition

tk89 · Posted 04-20-2020 04:27 PM

Hi @ChrisNZ ,

This is great, thanks! I will put this in a loop that ends when the orphans table is unchanged between two iterations since it's okay to still have orphan claims at this end of this stage (the first row in my example will remain unmerged into any of the episodes since its date range doesn't overlap with any existing episode). Thank you very much for helping me out!

tk89 · Posted 04-22-2020 06:24 PM

Hi Chris,
I jumped the gun here, this does solve part of my problem, but not completely. The reason is the iteration here. For example, in the sample above, let's say the enddt for episode 92 is 20120422 instead of 20120319. Now, in the first iteration, the orphans table would be created with row9 containing episode=92 and that would be in the output. But, it should be actually added to episode 97 since row8 being added to episode97 has now expanded the enddt and it is of the lower rank. I am unable to think of a way I can do this dynamically before going to the next row.
Thanks.

ChrisNZ · Posted 04-24-2020 06:12 PM

Isn't this what is taken care of by the iterations ?

Also maybe the join that creates ORPHANS should be an inner join.

High-Performance SAS Coding - Third Edition

tk89 · Posted 04-29-2020 05:44 PM

@ChrisNZ No, because the iterations only look at rows where episode=0. Once an episode value is assigned to a row, it does not update the row in any way. The orphans table only takes the rows with episode=0 into account.

For simplicity's sake, consider this dataset:

data have;
   input patID order startdt enddt Episode rank;
   informat startdt yymmdd. enddt yymmdd.;
   format startdt yymmddn8. enddt yymmddn8.;
   cards;
   101 1 20120120 20120120 0 0
   101 2 20120131 20120131 97 145
   101 4 20120213 20120422 92 510
   101 5 20120226 20120301 97 145
   101 7 20120301 20120401 97 145
   101 8 20120317 20120617 0 0 
   101 9 20120421 20120421 0 0
   ;
run;

When I run the code, orphans dataset contains rows 8 and 9. When we perform the left join to create OUT dataset, this is what we get:

patID order startdt ENDDT EPISODE RANK
101 1 20120120 20120120 0 0
101 2 20120131 20120131 97 145
101 4 20120213 20120422 92 510
101 5 20120226 20120301 97 145
101 7 20120301 20120401 97 145
101 8 20120317 20120617 97 145
101 9 20120421 20120422 92 510

Here, row8 contains episode=97 and row9 contains episode=92 because of the dates in the input. But, row8 being added to episode97 should expand the end date to 20120617 thus pulling row9 into the same episode as well. So, row9 should be episode=97, not episode=92(since out of 92 and 97, the lowest rank is of 97). So, before we exit the data step , is there a way to update the dates after each row?

ChrisNZ · Posted 04-29-2020 06:22 PM

Something like this then:

  create table OUT as
  select a.PATID
       , a.ORDER
       , a.STARTDT 
       , max(c.ENDDT  ,b.ENDDT  ) as ENDDT format=yymmdd8.
       , max(a.EPISODE,b.EPISODE) as EPISODE
       , max(a.RANK   ,b.RANK   ) as RANK
  from HAVE                    a
         left join
       ORPHANS(where=(RANK))   b 
         on  a.PATID=b.PATID
         and a.ORDER=b.ORDER
         left join
       (select PATID, EPISODE, max(ENDDT) as ENDDT from HAVE group by 1,2 ) c 
         on a.PATID=c.PATID and max(a.EPISODE,b.EPISODE)=c.EPISODE;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-29-2020 06:24 PM

Or maybe it's better to redefine the end dates in a third query rather than an additional left join.

High-Performance SAS Coding - Third Edition

tk89 · Posted 05-01-2020 02:08 PM

This removes the need for the second iteration, but still doesn't give the result I posted, which is the row with order=9 being assigned to episode 97 and not 92. I will keep at it and see if there's any other way this can be done. Thank you for your help 🙂

kelxxx · Posted 04-15-2020 04:03 AM

Hello,

Can you post the output table you want?

Thanks.

tk89 · Posted 04-20-2020 04:22 PM

Hi @kelxxx ,

Here's the output I want to get:

  data want;
   input patID order startdt enddt Episode rank;
   informat startdt yymmdd. enddt yymmdd.;
   format startdt yymmddn8. enddt yymmddn8.;
   cards;
   101 1 20120120 20120120 0 0
   101 2 20120131 20120131 97 145
   101 3 20120211 20120211 96 197
   101 4 20120213 20120319 92 510
   101 5 20120226 20120301 97 145
   101 6 20120227 20120227 95 229
   101 7 20120301 20120401 97 145
   101 8 20120317 20120617 97 145
   101 9 20120421 20120421 97 145
   101 10 20120501 20120701 97 145
   101 11 20120602 20120902 98 513
   101 12 20120630 20120630 98 513
   101 13 20120701 20120701 98 513
   101 15 20120804 20120804 98 513
   ;
run;

Registration is open

SAS Training: Just a Click Away