Hi, I need some help with programming a counting procedure. Specifically, I am working with health record data and I need to create a binary variable (yes vs. no) for each patient in the data set to indicate the following:
- whether the patient had 2 or more encounters separated by at least 90 days, but not more than 365 days, with Diagnosis "A"
Of note: the occurrence of the diagnosis can be present before 90 days of its last occurrence, but there must be at least 90 days between any two occurrences of the diagnosis.
Any recommendations on how to code this using a data step procedure with a first.variable/last.variable statement would be much appreciated.
The other issue is not I have a a large group of diagnoses (~100) that I would like to do this for, and if ANY of the diagnoses within this group meets the condition above, then I would like to create a variable to indicate so. For example: IF diagnosis A=1 or diagnosis B=1 or diagnosis C=1.... then var=1. However, is there an efficient way of coding this other than doing the coding for the condition above for each diagnosis prior to combining ~100 of these conditions? Any suggestions for coding this would be much appreciated.
Below is an example of how my data set is organized. Each row represents a separate encounter for each patient.
Patient_ID diagnosis diagnosis_date
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
.....
This may get a bit closer.
data work.test; length pid $1.; length diag_cd $1.; length diag_dt 8.; format diag_dt yymmdd10.; infile datalines ; input pid $ diag_cd $ diag_dt date11.; datalines; 1 A 07JAN2015 1 B 11FEB2016 1 A 04MAY2015 1 C 07JAN2015 1 B 07MAR2015 2 X 23FEB2017 2 X 25FEB2018 2 C 13JAN2016 2 F 20OCT2014 2 T 11NOV2013 3 E 07JAN2015 3 R 07JAN2015 3 B 07FEB2016 3 B 25APR2016 3 B 02JUN2016 ; run; proc sql; create table temp as select a.pid, a.diag_cd,a.diag_dt ,(90 le (b.diag_dt-a.diag_dt) le 365) as newvar from work.test as a join work.test as b on a.pid=b.pid and a.diag_cd=b.diag_cd where a.diag_dt < b.diag_dt ; quit;
The above code compares all values that match PID and DIAG_CD when the dates aren't the same. There can be multiple outputs and the more often a particular diag_cd is used the more output records. Rules for selecting which date(s) should appear in the output weren't supplied.
This may get what you want. If there is a newvar=1 then one of the possible many will be output.
proc sort data=temp; by pid diag_cd descending newvar ; run; data want; set temp; by pid diag_cd; if first.pid; run;
Hi @wj2 If you could post the expected OUTPUT WANT sample for your sample INPUT while explaining the logic and why would help.
Please provide what you would expect the output data to look like as well. Since you setting a value based on multiple records there are several different ways the output might look like.
Your second bit about different diagnosis is not very clear. Do you mean to compare Diagnosis=A with Diagnosis=B for the 90/ 365 day interval?
And for clarification of your 90 day rule does the following (NOT your example data)
Patient_ID diagnosis diagnosis_date 1 A 07JAN2015 1 A 11MAR2015 1 A 14APR2015
14Apr2015 count as more than 90 because it is more than 90 days from the 07JAN2015?
@novinosrin @ballardw Yes, I would like the final expected output to include a variable (e.g., newvar) for each patient indicating whether any diagnosis within a specified group of diagnoses (let's say A, B, C, or D) meets the condition above. For example, if diagnosis A meets the condition for patient 1 and diagnosis B meets the condition for patient 3, then newvar=1 for both patients. So something like this:
Patient_ID diagnosis diagnosis_date newvar
1 A 07JAN2015 1
2 X 23FEB2017 0
3 B 02JUN2016 1
.....
@ballardw, yes- your example would meet the 90 day rule since there are two occurrences of the diagnosis, and one occurrence (14Apr2015) is more than 90 days than another occurrence (07JAN2015).
Thank you both for the help!
@wj2 wrote:
@novinosrin @ballardw Yes, I would like the final expected output to include a variable (e.g., newvar) for each patient indicating whether any diagnosis within a specified group of diagnoses (let's say A, B, C, or D) meets the condition above. For example, if diagnosis A meets the condition for patient 1 and diagnosis B meets the condition for patient 3, then newvar=1 for both patients. So something like this:
Patient_ID diagnosis diagnosis_date newvar
1 A 07JAN2015 1
2 X 23FEB2017 0
3 B 02JUN2016 1
.....
@ballardw, yes- your example would meet the 90 day rule since there are two occurrences of the diagnosis, and one occurrence (14Apr2015) is more than 90 days than another occurrence (07JAN2015).
Thank you both for the help!
And if two (or more) separate diagnosis for a single patient meet the condition what should the output look like.
@ballardw, the output would also be newvar=1 if two or more separate diagnoses for a single patient meets the condition.
Please let me know if any other clarification is needed.
Ok.
I thought i'd have more time to mock this up for you but nope. So i will share what i have for now.
This is the kind of healthcare data and work I do day in/out and know.
Usually with VRDC CMS claims data and lots of it. I am not posting this to start debates about uses (or not) of sql, merge, join.
I'm simply crafting a possible solution to your case/scenario and one which i feel is flexible if you have MORE than one diagnosis condition to seek and/or much more data to throw at it.
If you care to discuss further - reach back out. But i'm pretty swamped leading up to SGF. I'll be attending and presenting.
I also live, work, code in Chicago. Look me up via WCSUG.com
So here is my code to share with you...
/*** read mock data ***/
data work.test; length pid $1.; length diag_cd $1.; length diag_dt 8.; format diag_dt yymmdd10.; infile datalines ; input pid $ diag_cd $ diag_dt date11.; datalines; 1 A 07JAN2015 1 B 11FEB2016 1 A 04MAY2015 1 C 07JAN2015 1 B 07MAR2015 2 X 23FEB2017 2 X 25FEB2018 2 C 13JAN2016 2 F 20OCT2014 2 T 11NOV2013 3 E 07JAN2015 3 R 07JAN2015 3 B 07FEB2016 3 B 25APR2016 3 B 02JUN2016 ; run; /*** create format per target set of diag conditions to find ***/
/*** this might be for allergies ***/ proc format; invalue $dgtarga "A" = 1 "A.1" = 1 "A.10" = 1 other = 0; run; /*** this might be for high blood pressure ***/ proc format; invalue $dgtargb "B" = 1 "B.3" = 1 "B.40" = 1 other = 0; run;
/*** push that data thru the formats to identify which PID and thus dates to then check in next step ***/
/*** in this case since each target diag is different condition - i can separate out tables ***/
/*** by keeping PID and date (of diag) - in my next step - i can run a min/max on the diff between dates for that PID ***/ data work.test_a (keep=pid diag_dt) work.test_b (keep=pid diag_dt); set work.test;
/*** if i have other diag groups to target - i'd just have more target vars ***/ format target_a 8.; format target_b 8.; target_a=input(diag_cd,$dgtarga.); target_b=input(diag_cd,$dgtargb.); if target_a=1 then output work.test_a; if target_b=1 then output work.test_b; run;
/*** what i couldnt get to is the next step which transposes and/or does a by group (on pid) ***/
/*** for the days diff between the dates ***/
So again sorry i didnt get to more of this but i have a meeting coming up and a call that came in and snagged some of my time.
Best
Zeke Torres
Thank you very much @zekeT_sasaholic for the start. I would be very interested for hearing your suggestions for coding the time interval criteria and including multiple diagnoses for this scenario as well. I am a new SAS user and I am most familiar with using first/last variable coding for counting, so an approach using that method would be preferable. @ballardw @novinosrin: please let me know if you would like additional clarification for providing any suggestions on your end as well.
Hi @wj2 I am afraid I haven't understood the required output well and so I was basically waiting for @ballardw to perhaps take the lead if he was working on it.
So my understanding goes as follows,
See the below table, the diff column, count. Are you able to make out where my thinking is going??
It's one of those things that leads to assumptions when I do not have clear understanding .
Patient_ID | diagnosis | diagnosis_date | d | Flag | count | diff |
---|---|---|---|---|---|---|
1 | A | 07JAN2015 | 07JAN2015 | 0 | 0 | . |
1 | C | 07JAN2015 | 07JAN2015 | 0 | 0 | 0 |
1 | B | 07MAR2015 | 07JAN2015 | 0 | 0 | 59 |
1 | A | 04MAY2015 | 04MAY2015 | 0 | 1 | 117 |
1 | B | 11FEB2016 | 11FEB2016 | 1 | 2 | 283 |
2 | T | 11NOV2013 | 11NOV2013 | 0 | 0 | . |
2 | F | 20OCT2014 | 20OCT2014 | 0 | 1 | 343 |
2 | C | 13JAN2016 | 20OCT2014 | 0 | 1 | 450 |
2 | X | 23FEB2017 | 20OCT2014 | 0 | 1 | 857 |
2 | X | 25FEB2018 | 20OCT2014 | 0 | 1 | 1224 |
3 | E | 07JAN2015 | 07JAN2015 | 0 | 0 | . |
3 | R | 07JAN2015 | 07JAN2015 | 0 | 0 | 0 |
3 | B | 07FEB2016 | 07JAN2015 | 0 | 0 | 396 |
3 | B | 25APR2016 | 07JAN2015 | 0 | 0 | 474 |
3 | B | 02JUN2016 | 07JAN2015 | 0 | 0 | 512 |
This may get a bit closer.
data work.test; length pid $1.; length diag_cd $1.; length diag_dt 8.; format diag_dt yymmdd10.; infile datalines ; input pid $ diag_cd $ diag_dt date11.; datalines; 1 A 07JAN2015 1 B 11FEB2016 1 A 04MAY2015 1 C 07JAN2015 1 B 07MAR2015 2 X 23FEB2017 2 X 25FEB2018 2 C 13JAN2016 2 F 20OCT2014 2 T 11NOV2013 3 E 07JAN2015 3 R 07JAN2015 3 B 07FEB2016 3 B 25APR2016 3 B 02JUN2016 ; run; proc sql; create table temp as select a.pid, a.diag_cd,a.diag_dt ,(90 le (b.diag_dt-a.diag_dt) le 365) as newvar from work.test as a join work.test as b on a.pid=b.pid and a.diag_cd=b.diag_cd where a.diag_dt < b.diag_dt ; quit;
The above code compares all values that match PID and DIAG_CD when the dates aren't the same. There can be multiple outputs and the more often a particular diag_cd is used the more output records. Rules for selecting which date(s) should appear in the output weren't supplied.
This may get what you want. If there is a newvar=1 then one of the possible many will be output.
proc sort data=temp; by pid diag_cd descending newvar ; run; data want; set temp; by pid diag_cd; if first.pid; run;
Hi @ballardw, thank you for the suggestion! I will give this a try.
Would you be able to also provide a suggestion using a data step procedure with first/last variables? I am little more familiar with this approach as a newer SAS user.
To respond to @novinosrin's comment: I have provided additional clarification below on what the output should be like. Please note diff should be the difference between diagnosis dates of the SAME diagnosis. Please also note that the criterion is that the variable will be counted if the difference between ANY TWO diagnosis dates is >=90 days and <=365 days (not necessarily the difference between the earliest date and the last date). Please see Patient_ID 4 for an example of this scenario below (highlighted in red).
Patient_ID | diagnosis | diagnosis_date | d | diff | count |
1 | A | 07JAN2015 | 07JAN2015 | 0 | 0 |
1 | C | 07JAN2015 | 07JAN2015 | 0 | 0 |
1 | B | 07MAR2015 | 07MAR2015 | 0 | 0 |
1 | A | 04MAY2015 | 07JAN2015 | 117 | 1 |
1 | B | 11FEB2016 | 07MAR2015 | 333 | 2 |
2 | T | 11NOV2013 | 11NOV2013 | 0 | 0 |
2 | F | 20OCT2014 | 20OCT2014 | 0 | 0 |
2 | C | 13JAN2016 | 13JAN2016 | 0 | 0 |
2 | X | 23FEB2017 | 23FEB2017 | 0 | 0 |
2 | X | 25FEB2018 | 23FEB2017 | 367 | 1 |
3 | E | 07JAN2015 | 07JAN2015 | 0 | 0 |
3 | R | 16JAN2017 | 16JAN2017 | 0 | 0 |
3 | B | 07FEB2016 | 07FEB2016 | 0 | 0 |
3 | B | 25APR2016 | 07FEB2016 | 78 | 0 |
3 | B | 02JUN2016 | 07FEB2016 | 116 | 1 |
4 | C | 03MAR2017 | 03MAR2017 | 0 | 0 |
4 | C | 01MAY2017 | 03MAR2017 | 59 | 0 |
4 | C | 15APR2018 | 01MAY2017 | 349 | 1 |
@wj2 OK got it. Thank you. It's almost bed time CST and I don't have SAS software at home. I'll work on this 1st thing in the morning i.e that is if nobody takes a stab at it. I honestly believe somebody would have solved while I'm sleeping. 🙂
Good morning @wj2
data have;
input Patient_ID diagnosis $ diagnosis_date :date9.;
format diagnosis_date date9.;
cards;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
4 C 3-Mar-17 3-Mar-17
4 C 1-May-17 3-Mar-17
4 C 15-Apr-18 1-May-17
;
proc sql;
create table want as
select a.*,sum(90<=intck('days',a.diagnosis_date,b.diagnosis_date)<=365)>=2 as flag
from have a , have b
where a.Patient_ID=b.Patient_ID and a.diagnosis=b.diagnosis and b.diagnosis_date>a.diagnosis_date
group by a.patient_id
order by a.Patient_ID,a.diagnosis,diagnosis_date;
quit;
And if you want the HAVE table as is in full , we switch to left join and distinct
proc sql;
create table want1 as
select distinct a.*,sum(90<=intck('days',a.diagnosis_date,b.diagnosis_date)<=365)>=2 as flag
from have a left join have b
on a.Patient_ID=b.Patient_ID and a.diagnosis=b.diagnosis and b.diagnosis_date>a.diagnosis_date
group by a.patient_id
order by a.Patient_ID,a.diagnosis,diagnosis_date;
quit;
A Datastep is of course more intuitive for its sequential process
data have;
input Patient_ID diagnosis $ diagnosis_date :date9.;
format diagnosis_date date9.;
cards;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
4 C 3-Mar-17 3-Mar-17
4 C 1-May-17 3-Mar-17
4 C 15-Apr-18 1-May-17
;
proc sort data=have out=_have;
by Patient_ID diagnosis diagnosis_date;
run;
data want;
if _n_=1 then do;
if 0 then set have(rename=(diagnosis_date=d));
dcl hash H (dataset:'have(rename=(diagnosis_date=d))',multidata:'y') ;
h.definekey ("Patient_ID","diagnosis") ;
h.definedata ("d") ;
h.definedone () ;
end;
set _have;
c=0;
by Patient_ID diagnosis ;
if first.Patient_ID then c1=0;
do rc=h.find() by 0 while(rc=0);
if d>diagnosis_date and 90<=intck('days',diagnosis_date,d)<=365 then c+1;
rc=h.find_next();
end;
c1+c;
drop rc d;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.