BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Calcite | Level 5

## How do To keep only observations within 4 months of less of each other

Hello everyone!
I am working with a large dataset of medical claims trying to see how many of my patients are getting skin cancer treated within 4 months of diagnosis. So for each patient, I only want to keep observations that contain biopsied lesion code (1) and skin cancer code (2) if they are within 4 months of each other. Here’s an example of the data set:

 Patient ID Diagnosis Date 1 1 2012/3 1 2 2012/5 1 2 2012/5 1 2 2012/5 1 1 2013/4 1 1 2014/5 1 1 2014/6 1 2 2014/10 2 1 2012/6 2 1 2013/6 2 1 2014/6 2 1 2014/8 2 1 2015/6 3 1 2013/5 3 1 2013/6 3 2 2013/8 3 1 2015/2 3 2 2015/8

The final data set would look something like this:

 Patient ID Diagnosis Date 1 1 2012/3 1 2 2012/5 1 2 2012/5 1 2 2012/5 1 1 2014/6 1 2 2014/10 3 1 2013/5 3 1 2013/6 3 2 2013/8

Not all biopsied lesions are connected to a skin cancer, which is why I am having difficulty dropping unwanted observations.

If anyone could help write this code, I would really appreciate it! Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Opal | Level 21

## Re: How do To keep only observations within 4 months of less of each other

This problem is more tractable if your claims have some sort of ID (i.e. there are no duplicates) Then you can use simple SQL, like this:

``````data have;
input ID  Diag   Date:anydtdte.;
claimId + 1; /* Make the claims distinct */
format date yymmdd10.;
datalines;
1   1   2012/3
1   2   2012/5
1   2   2012/5
1   2   2012/5
1   1   2013/4
1   1   2014/5
1   1   2014/6
1   2   2014/10
2   1   2012/6
2   1   2013/6
2   1   2014/6
2   1   2014/8
2   1   2015/6
3   1   2013/5
3   1   2013/6
3   2   2013/8
3   1   2015/2
3   2   2015/8
;

proc sql;
create table want as
(select * from have as a where diag=1 and
exists (select * from have where diag=2 and id=a.id and intck("month", a.date, date) between 0 and 4))
union
(select * from have as b where diag=2 and
exists (select * from have where diag=1 and id=b.id and intck("month", date, b.date) between 0 and 4))
order by id, date, diag;
quit;``````
PG
3 REPLIES 3
PROC Star

## Re: How do To keep only observations within 4 months of less of each other

This is a good example for:

1. Reading each patient_id's data twice.  The first time to record use the diagnosis/year/month in hand to record a list of acceptable combinations.  If the current diagnosis is 1, then step forward to record acceptable combinations for diagnosis 2.  If it is a 2, the step backward to record acceptable combinations of diagnosis 1.

The second pass through the patient_id compares the diagnosis/year/month in-hand to the recorded list of acceptable combinations.

2. Using a 3-dimensional array (called KEEP below) to record the keep dummies from #1 above.  The dimensions would be diagnosis {1 through 2}, year {2012 through 2019} and month {1 through 12}.

``````data want (drop=_:);
array keep {1:2,2012:2019,1:12} _temporary_;
call missing(of keep{*});

do until (last.patient_id);
set have;
by patient_id;

/*if diagnosis=1 then the month increment is +1, otherwise it is -1 */
_increment=ifn(diagnosis=1,1,-1);
/*if diagnosis=1 then set keep dummies for plane=2 (diagnosis 2), otherwise set for plane=1 */
_plane=3-diagnosis;

/* For each incoming record, set dummies for the complementary diagnosis */
do _i=0 to 4;
keep{_plane,year,month}=1;
month=month+_increment;      /* If month is out of range, the re-set it, and also increment the year */
if not (1 <= month <= 12)  then do;
year=year + _increment;
if month=13 then month=1;
else month=12;
end;
end;
end;

/* Now re-read the patient_id and output records with corresponding keep dummies */
do until (last.patient_id);
set have;
by patient_id;
if keep{diagnosis,year,month}=1 then output;
end;
run;``````

You didn't provide the starting data in the format of a working data step, so I don't offer validation of this program.

Notice I set the lower and upper bounds of the 2nd dimension of the array to accommodate a year range from 2012 to 2019.

Edited addition:

I assume the data are sorted by ID  (though not necessarily date year/month within id).

Although the data step reads each patient_id twice, it is NOT twice the amount of input disk activity, because the recently read observations will have been cached in memory when retrieved for the second pass.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Calcite | Level 5

## Re: How do To keep only observations within 4 months of less of each other

Thank you both for your help!!

Opal | Level 21

## Re: How do To keep only observations within 4 months of less of each other

This problem is more tractable if your claims have some sort of ID (i.e. there are no duplicates) Then you can use simple SQL, like this:

``````data have;
input ID  Diag   Date:anydtdte.;
claimId + 1; /* Make the claims distinct */
format date yymmdd10.;
datalines;
1   1   2012/3
1   2   2012/5
1   2   2012/5
1   2   2012/5
1   1   2013/4
1   1   2014/5
1   1   2014/6
1   2   2014/10
2   1   2012/6
2   1   2013/6
2   1   2014/6
2   1   2014/8
2   1   2015/6
3   1   2013/5
3   1   2013/6
3   2   2013/8
3   1   2015/2
3   2   2015/8
;

proc sql;
create table want as
(select * from have as a where diag=1 and
exists (select * from have where diag=2 and id=a.id and intck("month", a.date, date) between 0 and 4))
union
(select * from have as b where diag=2 and
exists (select * from have where diag=1 and id=b.id and intck("month", date, b.date) between 0 and 4))
order by id, date, diag;
quit;``````
PG
Discussion stats
• 3 replies
• 560 views
• 2 likes
• 3 in conversation