Hello everyone!
I am working with a large dataset of medical claims trying to see how many of my patients are getting skin cancer treated within 4 months of diagnosis. So for each patient, I only want to keep observations that contain biopsied lesion code (1) and skin cancer code (2) if they are within 4 months of each other. Here’s an example of the data set:
Patient ID | Diagnosis | Date |
1 | 1 | 2012/3 |
1 | 2 | 2012/5 |
1 | 2 | 2012/5 |
1 | 2 | 2012/5 |
1 | 1 | 2013/4 |
1 | 1 | 2014/5 |
1 | 1 | 2014/6 |
1 | 2 | 2014/10 |
2 | 1 | 2012/6 |
2 | 1 | 2013/6 |
2 | 1 | 2014/6 |
2 | 1 | 2014/8 |
2 | 1 | 2015/6 |
3 | 1 | 2013/5 |
3 | 1 | 2013/6 |
3 | 2 | 2013/8 |
3 | 1 | 2015/2 |
3 | 2 | 2015/8 |
The final data set would look something like this:
Patient ID | Diagnosis | Date |
1 | 1 | 2012/3 |
1 | 2 | 2012/5 |
1 | 2 | 2012/5 |
1 | 2 | 2012/5 |
1 | 1 | 2014/6 |
1 | 2 | 2014/10 |
3 | 1 | 2013/5 |
3 | 1 | 2013/6 |
3 | 2 | 2013/8 |
Not all biopsied lesions are connected to a skin cancer, which is why I am having difficulty dropping unwanted observations.
If anyone could help write this code, I would really appreciate it! Thank you.
This problem is more tractable if your claims have some sort of ID (i.e. there are no duplicates) Then you can use simple SQL, like this:
data have;
input ID Diag Date:anydtdte.;
claimId + 1; /* Make the claims distinct */
format date yymmdd10.;
datalines;
1 1 2012/3
1 2 2012/5
1 2 2012/5
1 2 2012/5
1 1 2013/4
1 1 2014/5
1 1 2014/6
1 2 2014/10
2 1 2012/6
2 1 2013/6
2 1 2014/6
2 1 2014/8
2 1 2015/6
3 1 2013/5
3 1 2013/6
3 2 2013/8
3 1 2015/2
3 2 2015/8
;
proc sql;
create table want as
(select * from have as a where diag=1 and
exists (select * from have where diag=2 and id=a.id and intck("month", a.date, date) between 0 and 4))
union
(select * from have as b where diag=2 and
exists (select * from have where diag=1 and id=b.id and intck("month", date, b.date) between 0 and 4))
order by id, date, diag;
quit;
This is a good example for:
data want (drop=_:);
array keep {1:2,2012:2019,1:12} _temporary_;
call missing(of keep{*});
do until (last.patient_id);
set have;
by patient_id;
/*if diagnosis=1 then the month increment is +1, otherwise it is -1 */
_increment=ifn(diagnosis=1,1,-1);
/*if diagnosis=1 then set keep dummies for plane=2 (diagnosis 2), otherwise set for plane=1 */
_plane=3-diagnosis;
/* For each incoming record, set dummies for the complementary diagnosis */
do _i=0 to 4;
keep{_plane,year,month}=1;
month=month+_increment;
/* If month is out of range, the re-set it, and also increment the year */
if not (1 <= month <= 12) then do;
year=year + _increment;
if month=13 then month=1;
else month=12;
end;
end;
end;
/* Now re-read the patient_id and output records with corresponding keep dummies */
do until (last.patient_id);
set have;
by patient_id;
if keep{diagnosis,year,month}=1 then output;
end;
run;
You didn't provide the starting data in the format of a working data step, so I don't offer validation of this program.
Notice I set the lower and upper bounds of the 2nd dimension of the array to accommodate a year range from 2012 to 2019.
Edited addition:
I assume the data are sorted by ID (though not necessarily date year/month within id).
Although the data step reads each patient_id twice, it is NOT twice the amount of input disk activity, because the recently read observations will have been cached in memory when retrieved for the second pass.
Thank you both for your help!!
This problem is more tractable if your claims have some sort of ID (i.e. there are no duplicates) Then you can use simple SQL, like this:
data have;
input ID Diag Date:anydtdte.;
claimId + 1; /* Make the claims distinct */
format date yymmdd10.;
datalines;
1 1 2012/3
1 2 2012/5
1 2 2012/5
1 2 2012/5
1 1 2013/4
1 1 2014/5
1 1 2014/6
1 2 2014/10
2 1 2012/6
2 1 2013/6
2 1 2014/6
2 1 2014/8
2 1 2015/6
3 1 2013/5
3 1 2013/6
3 2 2013/8
3 1 2015/2
3 2 2015/8
;
proc sql;
create table want as
(select * from have as a where diag=1 and
exists (select * from have where diag=2 and id=a.id and intck("month", a.date, date) between 0 and 4))
union
(select * from have as b where diag=2 and
exists (select * from have where diag=1 and id=b.id and intck("month", date, b.date) between 0 and 4))
order by id, date, diag;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.