BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chuakp
Obsidian | Level 7

I have an insurance claims dataset and am looking to use the enrollment file to subset to people who were enrolled for at least 14 days after a particular procedure (the date of the procedure is variable PROCDATE).  The enrollment file is set up where each observation is a particular enrollment period for a month, demarcated by the variables START and END.  People who are enrolled in all twelve months have twelve observations, one for each month.  However, in some cases it's also possible for someone to have more than twelve observations (for example, if they are enrolled January 1, 2010 to January 10, 2010 and then January 20, 2010 to January 31, 2010, they would have two observations for the month of January).  

 

Here's what the data look like.  ID 1 and 2 should be included, as should ID 4, but ID 3 should be excluded since fourteen days beyond the procedure date of 8/29/2015 is 9/12/2015, and their enrollment stopped 8/31/2015.

 

I've tried playing with proc transpose but no luck.  Any advice would be appreciated.  Thanks!

 

ID PROCDATE START END
1 3/31/2015 1/1/2015 1/31/2015
1 3/31/2015 2/1/2015 2/28/2015
1 3/31/2015 3/1/2015 3/31/2015
1 3/31/2015 4/1/2015 4/30/2015
1 3/31/2015 5/1/2015 5/31/2015
1 3/31/2015 6/1/2015 6/30/2015
1 3/31/2015 7/1/2015 7/31/2015
1 3/31/2015 8/1/2015 8/31/2015
1 3/31/2015 9/1/2015 9/30/2015
1 3/31/2015 10/1/2015 10/31/2015
1 3/31/2015 11/1/2015 11/30/2015
1 3/31/2015 12/1/2015 12/31/2015
2 7/25/2015 1/1/2015 1/10/2015
2 7/25/2015 1/20/2015 1/31/2015
2 7/25/2015 2/1/2015 2/28/2015
2 7/25/2015 3/1/2015 3/31/2015
2 7/25/2015 4/1/2015 4/30/2015
2 7/25/2015 5/1/2015 5/31/2015
2 7/25/2015 6/1/2015 6/30/2015
2 7/25/2015 7/1/2015 7/31/2015
2 7/25/2015 8/1/2015 8/31/2015
2 7/25/2015 9/1/2015 9/30/2015
2 7/25/2015 10/1/2015 10/31/2015
2 7/25/2015 11/1/2015 11/30/2015
2 7/25/2015 12/1/2015 12/31/2015
3 8/29/2015 8/1/2015 8/31/2015
4 1/20/2015 1/1/2015 1/31/2015
4 1/20/2015 2/1/2015 2/28/2015

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You want to be able to consider a date range from procdate to procdate+13, where that range fits entirely in a single record, or possibly a pair of records (where the 2nd record in the starts one day after the first of the pair).  I guess you could have 3 successive records in the time span if each record only had a few days, and they are "contiguous".

 

The "trick", I think is to include only records that are both 

  1. Late enough (i.e. end>=procdate)
  2. Early enought (start<=procdate+13)

Then with those records you populate a 14-element array (TDATE) with 1's for dates that are found in the record and are between procdate+0 and procdate+13:

 

data have;
  input ID PROCDATE :mmddyy10. START  :mmddyy10. END  :mmddyy10. ;
  format procdate start end yymmddn8.;
datalines;
1 3/31/2015 1/1/2015 1/31/2015 
1 3/31/2015 2/1/2015 2/28/2015 
1 3/31/2015 3/1/2015 3/31/2015 
1 3/31/2015 4/1/2015 4/30/2015 
1 3/31/2015 5/1/2015 5/31/2015 
1 3/31/2015 6/1/2015 6/30/2015 
1 3/31/2015 7/1/2015 7/31/2015 
1 3/31/2015 8/1/2015 8/31/2015 
1 3/31/2015 9/1/2015 9/30/2015 
1 3/31/2015 10/1/2015 10/31/2015 
1 3/31/2015 11/1/2015 11/30/2015 
1 3/31/2015 12/1/2015 12/31/2015 
2 7/25/2015 1/1/2015 1/10/2015 
2 7/25/2015 1/20/2015 1/31/2015 
2 7/25/2015 2/1/2015 2/28/2015 
2 7/25/2015 3/1/2015 3/31/2015 
2 7/25/2015 4/1/2015 4/30/2015 
2 7/25/2015 5/1/2015 5/31/2015 
2 7/25/2015 6/1/2015 6/30/2015 
2 7/25/2015 7/1/2015 7/31/2015 
2 7/25/2015 8/1/2015 8/31/2015 
2 7/25/2015 9/1/2015 9/30/2015 
2 7/25/2015 10/1/2015 10/31/2015 
2 7/25/2015 11/1/2015 11/30/2015 
2 7/25/2015 12/1/2015 12/31/2015 
3 8/29/2015 8/1/2015 8/31/2015 
4 1/20/2015 1/1/2015 1/31/2015 
4 1/20/2015 2/1/2015 2/28/2015 
run;

data want (keep=id procdate);
  array tdate{0:13} _temporary_;
  set have;
  by id;
  where procdate<=end;            /* Exclude early periods*/
  where also start<=procdate+13;  /* Exclude late periods */

  if first.id then do;
    call missing(of tdate{*});
    dstart=procdate;
  end;
  else dstart=start;

  do d=dstart to min(procdate+13,end);
    tdate{d-procdate}=1;
  end;
  
  if last.id and sum(of tdate{*})=14;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

10 REPLIES 10
LaurieF
Barite | Level 11

I'm not sure exactly what you're striving for - is this it?

data want;
set have;
where procdate + 14 between start and end;
run;

 

chuakp
Obsidian | Level 7

Sorry for any confusion.  Basically, for ID 1, PROCDATE is 3/31/2015.  You can see from observation 3 that this person was enrolled between 3/1/2015 and 3/31/2015, and from observation 4 that this person was enrolled from 4/1/2015 and 4/30/2015.  The 14-day period after PROCDATE is 3/31/2015-4/13/2015, so this person would be kept in the dataset since they had 14 days of coverage after the procedure.  

 

The tricky part is trying to tell SAS when to look at the next observation's START and END dates when the 14 day period crosses months.  One thought would be  to set up variables describing the periods of enrollment for each ID (e.g., for ID 1, this person was continuously enrolled for 365 days; the start date of enrollment would be 1/1/2015 and the end date would be 12/31/2015.  If I had the data set up this way, I could just look in between the start and end dates to see if the 14-day period after PROCDATE was contained within these dates.

 

Hope this makes sense.  Thanks.

LaurieF
Barite | Level 11

Well, how about this? By turning the source table into a slowly changing dimension, as long as start and end dates are always contiguous, you could have multiple procdates per id and it would still work.

proc sql;
create table have_collapse as
   select id,
          min(start) as start format=date9.,
          max(end) as end format=date9.
     from have
    group by id;
quit;

data id_proc;
set have;
by id procdate;
if first.procdate;
keep id procdate;
run;

proc sql;
create table want as
   select ip.id,
          ip.procdate
     from id_proc as ip
    inner join have_collapse as hc
       on ip.id = hc.id
      and ip.procdate + 14 between hc.start and hc.end
     order by ip.id,
              ip.procdate;
quit;

 

(I use date9. format because, not being USian, mmddyy. does my head in! <grin>) 

chuakp
Obsidian | Level 7

Thanks for this idea.  Looking for PROCDATE to be between the minimum of START and the maximum of END is fine in most cases, but as you point out, you could have a weird situation where the dates are not contiguous from one observation to the next.  For example:

 

ID    PROCDATE          START            END

5      3/31/2015             3/1/2015          3/31/2015 

5      3/31/2015             5/1/2015          5/30/2015

 

Here the minimum of START is 3/1/2015 and the maximum of END is 5/30/2015.  PROCDATE is between these two but the person was not enrolled in the 14 days after 3/31/2015.  

 

Thanks.

LaurieF
Barite | Level 11

Fair enough - <shudder> this means generating rows for missing periods. I hate that, but it's reality I s'pose. Ahem. This may appear to be going around the houses, but it's really efficient, and will always work, no matter your data volumes. Unless I've created a bug somewhere.

 

/*
   Sort by descending end date. We can't predict the future, but we can predict the past. 
*/
proc sort data=have(keep=id start end)
     out=have_collapse noequals;
by id descending end;
run;

data have_collapse;
set have_collapse;
by id;
retain save_date;
format save_date date9.;
attrib enrolled length=$ 1;
enrolled = 'E';
output;
if not first.id then 
   if end ne save_date - 1 then do;
      start = end + 1;
      end = save_date - 1;
      enrolled = 'U';
      output;
      end;
save_date = start;
keep id start end enrolled;
run;

proc sort data=have_collapse noequals;
by id start;
run;

/*
   The *notsorted* on the by statement is because the dates aren't important. This is
   just to look for where enrolled has changed, allowing first.enrolled and last.enrolled.
   Of course it's possible for there to be first/last.enrolled on the same row.
*/
data have_collapse;
set have_collapse;
by id enrolled notsorted; 
retain save_date;
format save_date date9.;
if first.enrolled then
   save_date = start;
if last.enrolled;
start = save_date;
keep id start end enrolled;
run;

data id_proc;
set have;
by id procdate;
if first.procdate;
keep id procdate;
run;

proc sql;
create table want as
   select ip.id,
          ip.procdate
     from id_proc as ip
    inner join have_collapse as hc
       on ip.id = hc.id
      and ip.procdate + 14 between hc.start and hc.end
      and hc.enrolled = 'E'
     order by ip.id,
              ip.procdate;
quit;
mkeintz
PROC Star

You want to be able to consider a date range from procdate to procdate+13, where that range fits entirely in a single record, or possibly a pair of records (where the 2nd record in the starts one day after the first of the pair).  I guess you could have 3 successive records in the time span if each record only had a few days, and they are "contiguous".

 

The "trick", I think is to include only records that are both 

  1. Late enough (i.e. end>=procdate)
  2. Early enought (start<=procdate+13)

Then with those records you populate a 14-element array (TDATE) with 1's for dates that are found in the record and are between procdate+0 and procdate+13:

 

data have;
  input ID PROCDATE :mmddyy10. START  :mmddyy10. END  :mmddyy10. ;
  format procdate start end yymmddn8.;
datalines;
1 3/31/2015 1/1/2015 1/31/2015 
1 3/31/2015 2/1/2015 2/28/2015 
1 3/31/2015 3/1/2015 3/31/2015 
1 3/31/2015 4/1/2015 4/30/2015 
1 3/31/2015 5/1/2015 5/31/2015 
1 3/31/2015 6/1/2015 6/30/2015 
1 3/31/2015 7/1/2015 7/31/2015 
1 3/31/2015 8/1/2015 8/31/2015 
1 3/31/2015 9/1/2015 9/30/2015 
1 3/31/2015 10/1/2015 10/31/2015 
1 3/31/2015 11/1/2015 11/30/2015 
1 3/31/2015 12/1/2015 12/31/2015 
2 7/25/2015 1/1/2015 1/10/2015 
2 7/25/2015 1/20/2015 1/31/2015 
2 7/25/2015 2/1/2015 2/28/2015 
2 7/25/2015 3/1/2015 3/31/2015 
2 7/25/2015 4/1/2015 4/30/2015 
2 7/25/2015 5/1/2015 5/31/2015 
2 7/25/2015 6/1/2015 6/30/2015 
2 7/25/2015 7/1/2015 7/31/2015 
2 7/25/2015 8/1/2015 8/31/2015 
2 7/25/2015 9/1/2015 9/30/2015 
2 7/25/2015 10/1/2015 10/31/2015 
2 7/25/2015 11/1/2015 11/30/2015 
2 7/25/2015 12/1/2015 12/31/2015 
3 8/29/2015 8/1/2015 8/31/2015 
4 1/20/2015 1/1/2015 1/31/2015 
4 1/20/2015 2/1/2015 2/28/2015 
run;

data want (keep=id procdate);
  array tdate{0:13} _temporary_;
  set have;
  by id;
  where procdate<=end;            /* Exclude early periods*/
  where also start<=procdate+13;  /* Exclude late periods */

  if first.id then do;
    call missing(of tdate{*});
    dstart=procdate;
  end;
  else dstart=start;

  do d=dstart to min(procdate+13,end);
    tdate{d-procdate}=1;
  end;
  
  if last.id and sum(of tdate{*})=14;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
chuakp
Obsidian | Level 7

Thanks so much.  I have played around with this code and believe this is an elegant solution.

Astounding
PROC Star

If your data set is of reasonable size, I would transform it into both:

 

(a) a format that translates ID + date into 'COVERED', and

(b) a list of all ID + PROCDATE combinations

 

For example:

 

data format_me;

set have;

do date=start to end;

   output;

end;

keep id date;

run;

proc sort data=format_me NODUPKEY;   /* necessary because of overlap at the endpoints */

by id date;

run;

data format_me2;

set format_me;

fmtname='$covered';

label='COVERED';

start = catx('+', ID, date);

run;

proc format cntlin=format_me2;

run;

 

This much gives you a format callled $covered that translates every combination of ID + date into the word "COVERED".

 

Then get the PROC DATE list:

 

proc sort data=have (keep=id procdate) out=want NODUPKEY;

by id procdate;

run;

 

data want;

set want;

status='INCLUDED';

do date=procdate + 1 to procdate + 14 until (status='EXCLUDED');

   if put( catx('+', id, date), $covered.) ne 'COVERED' then status='EXCLUDED';

end;

drop date;

run;

 

It's a bit of work, but it's a tricky situation to both span the 14 days while at the same time allowing the coverage to be found across multiple rows of the original data set.

 

If your data set is huge, you can reduce the format size by grouping into ranges of continuous days rather than formatting every single day.  But I left out that complication as (at least for now) unnecessary.

chuakp
Obsidian | Level 7

Hi, this is a great approach, but as you suspected I am running into memory issues.  Is there a workaround?  Thanks.

 

Astounding
PROC Star

You can try to work with FORMAT_ME2 before feeding it into PROC FORMAT.

 

The basic idea would be to collapse the days into ranges, rather than individual dates.  Right now, every ID + date value is stored in the variable START.  You would need to create a second variable END, so that each START / END combination represents a range of dates for that ID. 

 

The processing for that is trickier than it might seem, because of the need to detect when the next date belongs in the same range (i.e., when the next date is one greater than the current date).  The easiest approach is probably to go through the data set FORMAT_ME2 twice.  The first time assign RANGE which merely changes value each time a new range is encountered.  Then the second time use RANGE to shrink the data set to a manageable size, computing START and END.  Give it a shot, and let me know if you need help with that part.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 3559 views
  • 0 likes
  • 4 in conversation