BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kbanders
Calcite | Level 5

Hello everyone!
I am working with a large dataset of medical claims trying to see how many of my patients are getting skin cancer treated within 4 months of diagnosis. So for each patient, I only want to keep observations that contain biopsied lesion code (1) and skin cancer code (2) if they are within 4 months of each other. Here’s an example of the data set:

 

Patient IDDiagnosis Date 
112012/3
122012/5
122012/5
122012/5
112013/4
112014/5
112014/6
122014/10
212012/6
212013/6
212014/6
212014/8
212015/6
312013/5
312013/6
322013/8
312015/2
322015/8

 

The final data set would look something like this:

Patient IDDiagnosis Date 
112012/3
122012/5
122012/5
122012/5
112014/6
122014/10
312013/5
312013/6
322013/8

 

Not all biopsied lesions are connected to a skin cancer, which is why I am having difficulty dropping unwanted observations. 

If anyone could help write this code, I would really appreciate it! Thank you. 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

This problem is more tractable if your claims have some sort of ID (i.e. there are no duplicates) Then you can use simple SQL, like this:

 

data have;
input ID  Diag   Date:anydtdte.;
claimId + 1; /* Make the claims distinct */
format date yymmdd10.;
datalines; 
1   1   2012/3
1   2   2012/5
1   2   2012/5
1   2   2012/5
1   1   2013/4
1   1   2014/5
1   1   2014/6
1   2   2014/10
2   1   2012/6
2   1   2013/6
2   1   2014/6
2   1   2014/8
2   1   2015/6
3   1   2013/5
3   1   2013/6
3   2   2013/8
3   1   2015/2
3   2   2015/8
;

proc sql;
create table want as
(select * from have as a where diag=1 and 
    exists (select * from have where diag=2 and id=a.id and intck("month", a.date, date) between 0 and 4))
union
(select * from have as b where diag=2 and 
    exists (select * from have where diag=1 and id=b.id and intck("month", date, b.date) between 0 and 4))
order by id, date, diag;
quit;
PG

View solution in original post

3 REPLIES 3
mkeintz
PROC Star

This is a good example for:

  1. Reading each patient_id's data twice.  The first time to record use the diagnosis/year/month in hand to record a list of acceptable combinations.  If the current diagnosis is 1, then step forward to record acceptable combinations for diagnosis 2.  If it is a 2, the step backward to record acceptable combinations of diagnosis 1.

    The second pass through the patient_id compares the diagnosis/year/month in-hand to the recorded list of acceptable combinations.

  2. Using a 3-dimensional array (called KEEP below) to record the keep dummies from #1 above.  The dimensions would be diagnosis {1 through 2}, year {2012 through 2019} and month {1 through 12}.

 

data want (drop=_:);
  array keep {1:2,2012:2019,1:12} _temporary_;
  call missing(of keep{*});

  do until (last.patient_id);
    set have;
    by patient_id;

    /*if diagnosis=1 then the month increment is +1, otherwise it is -1 */
    _increment=ifn(diagnosis=1,1,-1); 
    /*if diagnosis=1 then set keep dummies for plane=2 (diagnosis 2), otherwise set for plane=1 */
    _plane=3-diagnosis;

    /* For each incoming record, set dummies for the complementary diagnosis */
    do _i=0 to 4;
      keep{_plane,year,month}=1;
      month=month+_increment;
/* If month is out of range, the re-set it, and also increment the year */ if not (1 <= month <= 12) then do; year=year + _increment; if month=13 then month=1; else month=12; end; end; end; /* Now re-read the patient_id and output records with corresponding keep dummies */ do until (last.patient_id); set have; by patient_id; if keep{diagnosis,year,month}=1 then output; end; run;

You didn't provide the starting data in the format of a working data step, so I don't offer validation of this program.

 

Notice I set the lower and upper bounds of the 2nd dimension of the array to accommodate a year range from 2012 to 2019.

 

Edited addition:

I assume the data are sorted by ID  (though not necessarily date year/month within id).

 

Although the data step reads each patient_id twice, it is NOT twice the amount of input disk activity, because the recently read observations will have been cached in memory when retrieved for the second pass.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
kbanders
Calcite | Level 5

Thank you both for your help!!

PGStats
Opal | Level 21

This problem is more tractable if your claims have some sort of ID (i.e. there are no duplicates) Then you can use simple SQL, like this:

 

data have;
input ID  Diag   Date:anydtdte.;
claimId + 1; /* Make the claims distinct */
format date yymmdd10.;
datalines; 
1   1   2012/3
1   2   2012/5
1   2   2012/5
1   2   2012/5
1   1   2013/4
1   1   2014/5
1   1   2014/6
1   2   2014/10
2   1   2012/6
2   1   2013/6
2   1   2014/6
2   1   2014/8
2   1   2015/6
3   1   2013/5
3   1   2013/6
3   2   2013/8
3   1   2015/2
3   2   2015/8
;

proc sql;
create table want as
(select * from have as a where diag=1 and 
    exists (select * from have where diag=2 and id=a.id and intck("month", a.date, date) between 0 and 4))
union
(select * from have as b where diag=2 and 
    exists (select * from have where diag=1 and id=b.id and intck("month", date, b.date) between 0 and 4))
order by id, date, diag;
quit;
PG
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1518 views
  • 2 likes
  • 3 in conversation