Solved: Re: Counting variables

wj2 · Posted 04-10-2019 01:57 PM

Hi, I need some help with programming a counting procedure. Specifically, I am working with health record data and I need to create a binary variable (yes vs. no) for each patient in the data set to indicate the following:

- whether the patient had 2 or more encounters separated by at least 90 days, but not more than 365 days, with Diagnosis "A"

Of note: the occurrence of the diagnosis can be present before 90 days of its last occurrence, but there must be at least 90 days between any two occurrences of the diagnosis.

Any recommendations on how to code this using a data step procedure with a first.variable/last.variable statement would be much appreciated.

The other issue is not I have a a large group of diagnoses (~100) that I would like to do this for, and if ANY of the diagnoses within this group meets the condition above, then I would like to create a variable to indicate so. For example: IF diagnosis A=1 or diagnosis B=1 or diagnosis C=1.... then var=1. However, is there an efficient way of coding this other than doing the coding for the condition above for each diagnosis prior to combining ~100 of these conditions? Any suggestions for coding this would be much appreciated.

Below is an example of how my data set is organized. Each row represents a separate encounter for each patient.

Patient_ID diagnosis diagnosis_date

1 A 07JAN2015

1 B 11FEB2016

1 A 04MAY2015

1 C 07JAN2015

1 B 07MAR2015

2 X 23FEB2017

2 X 25FEB2018

2 C 13JAN2016

2 F 20OCT2014

2 T 11NOV2013

3 E 07JAN2015

3 R 07JAN2015

3 B 07FEB2016

3 B 25APR2016

3 B 02JUN2016

.....

ballardw · Posted 04-10-2019 07:01 PM

This may get a bit closer.

data work.test;
length pid $1.;
length diag_cd $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag_cd $ diag_dt date11.;
datalines;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
;
run;


proc sql;
   create table temp as
   select a.pid, a.diag_cd,a.diag_dt
          ,(90 le (b.diag_dt-a.diag_dt) le 365) as newvar
   from work.test as a
        join 
        work.test as b
        on a.pid=b.pid and a.diag_cd=b.diag_cd
   where a.diag_dt < b.diag_dt
   ;
quit;

The above code compares all values that match PID and DIAG_CD when the dates aren't the same. There can be multiple outputs and the more often a particular diag_cd is used the more output records. Rules for selecting which date(s) should appear in the output weren't supplied.

This may get what you want. If there is a newvar=1 then one of the possible many will be output.

proc sort data=temp;
   by pid diag_cd descending newvar ;
run;

data want;
   set temp;
   by pid diag_cd;
   if first.pid;
run;

View solution in original post

novinosrin · Posted 04-10-2019 02:07 PM

Hi @wj2 If you could post the expected OUTPUT WANT sample for your sample INPUT while explaining the logic and why would help.

ballardw · Posted 04-10-2019 02:15 PM

Please provide what you would expect the output data to look like as well. Since you setting a value based on multiple records there are several different ways the output might look like.

Your second bit about different diagnosis is not very clear. Do you mean to compare Diagnosis=A with Diagnosis=B for the 90/ 365 day interval?

And for clarification of your 90 day rule does the following (NOT your example data)

Patient_ID diagnosis diagnosis_date 
 1         A         07JAN2015
 1         A         11MAR2015
 1         A         14APR2015

14Apr2015 count as more than 90 because it is more than 90 days from the 07JAN2015?

wj2 · Posted 04-10-2019 02:31 PM

@novinosrin @ballardw Yes, I would like the final expected output to include a variable (e.g., newvar) for each patient indicating whether any diagnosis within a specified group of diagnoses (let's say A, B, C, or D) meets the condition above. For example, if diagnosis A meets the condition for patient 1 and diagnosis B meets the condition for patient 3, then newvar=1 for both patients. So something like this:

Patient_ID diagnosis diagnosis_date newvar

1 A 07JAN2015 1

2 X 23FEB2017 0

3 B 02JUN2016 1

.....

@ballardw, yes- your example would meet the 90 day rule since there are two occurrences of the diagnosis, and one occurrence (14Apr2015) is more than 90 days than another occurrence (07JAN2015).

Thank you both for the help!

ballardw · Posted 04-10-2019 02:36 PM

@wj2 wrote:

@novinosrin @ballardw Yes, I would like the final expected output to include a variable (e.g., newvar) for each patient indicating whether any diagnosis within a specified group of diagnoses (let's say A, B, C, or D) meets the condition above. For example, if diagnosis A meets the condition for patient 1 and diagnosis B meets the condition for patient 3, then newvar=1 for both patients. So something like this:

Patient_ID diagnosis diagnosis_date newvar

1 A 07JAN2015 1

2 X 23FEB2017 0

3 B 02JUN2016 1

.....

@ballardw, yes- your example would meet the 90 day rule since there are two occurrences of the diagnosis, and one occurrence (14Apr2015) is more than 90 days than another occurrence (07JAN2015).

Thank you both for the help!

And if two (or more) separate diagnosis for a single patient meet the condition what should the output look like.

wj2 · Posted 04-10-2019 02:40 PM

@ballardw, the output would also be newvar=1 if two or more separate diagnoses for a single patient meets the condition.

Please let me know if any other clarification is needed.

zekeT_sasaholic · Posted 04-10-2019 03:33 PM

Ok.

I thought i'd have more time to mock this up for you but nope. So i will share what i have for now.

This is the kind of healthcare data and work I do day in/out and know.

Usually with VRDC CMS claims data and lots of it. I am not posting this to start debates about uses (or not) of sql, merge, join.

I'm simply crafting a possible solution to your case/scenario and one which i feel is flexible if you have MORE than one diagnosis condition to seek and/or much more data to throw at it.

If you care to discuss further - reach back out. But i'm pretty swamped leading up to SGF. I'll be attending and presenting.

I also live, work, code in Chicago. Look me up via WCSUG.com

So here is my code to share with you...

/*** read mock data ***/
data work.test;
length pid $1.;
length diag_cd $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag_cd $ diag_dt date11.;
datalines;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
;
run;

/*** create format per target set of diag conditions to find ***/
/*** this might be for allergies ***/
proc format;
invalue $dgtarga
"A" = 1
"A.1" = 1
"A.10" = 1
other = 0;
run;

/*** this might be for high blood pressure ***/
proc format;
invalue $dgtargb
"B" = 1
"B.3" = 1
"B.40" = 1
other = 0;
run;


/*** push that data thru the formats to identify which PID and thus dates to then check in next step ***/
/*** in this case since each target diag is different condition - i can separate out tables    ***/
/*** by keeping PID and date (of diag) - in my next step - i can run a min/max on the diff between dates for that PID ***/
data work.test_a (keep=pid diag_dt)
     work.test_b (keep=pid diag_dt);
set work.test;

/*** if i have other diag groups to target - i'd just have more target vars ***/
format target_a 8.;
format target_b 8.;
target_a=input(diag_cd,$dgtarga.);
target_b=input(diag_cd,$dgtargb.);
if target_a=1 then output work.test_a;
if target_b=1 then output work.test_b;
run;

/*** what i couldnt get to is the next step which transposes and/or does a by group (on pid) ***/
/*** for the days diff between the dates ***/

So again sorry i didnt get to more of this but i have a meeting coming up and a call that came in and snagged some of my time.

Best

Zeke Torres

wj2 · Posted 04-10-2019 04:36 PM

Thank you very much @zekeT_sasaholic for the start. I would be very interested for hearing your suggestions for coding the time interval criteria and including multiple diagnoses for this scenario as well. I am a new SAS user and I am most familiar with using first/last variable coding for counting, so an approach using that method would be preferable. @ballardw @novinosrin: please let me know if you would like additional clarification for providing any suggestions on your end as well.

novinosrin · Posted 04-10-2019 04:43 PM

Hi @wj2 I am afraid I haven't understood the required output well and so I was basically waiting for @ballardw to perhaps take the lead if he was working on it.

So my understanding goes as follows,

See the below table, the diff column, count. Are you able to make out where my thinking is going??

It's one of those things that leads to assumptions when I do not have clear understanding .

Patient_ID	diagnosis	diagnosis_date	d	Flag	count	diff
1	A	07JAN2015	07JAN2015	0	0	.
1	C	07JAN2015	07JAN2015	0	0	0
1	B	07MAR2015	07JAN2015	0	0	59
1	A	04MAY2015	04MAY2015	0	1	117
1	B	11FEB2016	11FEB2016	1	2	283
2	T	11NOV2013	11NOV2013	0	0	.
2	F	20OCT2014	20OCT2014	0	1	343
2	C	13JAN2016	20OCT2014	0	1	450
2	X	23FEB2017	20OCT2014	0	1	857
2	X	25FEB2018	20OCT2014	0	1	1224
3	E	07JAN2015	07JAN2015	0	0	.
3	R	07JAN2015	07JAN2015	0	0	0
3	B	07FEB2016	07JAN2015	0	0	396
3	B	25APR2016	07JAN2015	0	0	474
3	B	02JUN2016	07JAN2015	0	0	512

ballardw · Posted 04-10-2019 07:01 PM

This may get a bit closer.

data work.test;
length pid $1.;
length diag_cd $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag_cd $ diag_dt date11.;
datalines;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
;
run;


proc sql;
   create table temp as
   select a.pid, a.diag_cd,a.diag_dt
          ,(90 le (b.diag_dt-a.diag_dt) le 365) as newvar
   from work.test as a
        join 
        work.test as b
        on a.pid=b.pid and a.diag_cd=b.diag_cd
   where a.diag_dt < b.diag_dt
   ;
quit;

The above code compares all values that match PID and DIAG_CD when the dates aren't the same. There can be multiple outputs and the more often a particular diag_cd is used the more output records. Rules for selecting which date(s) should appear in the output weren't supplied.

This may get what you want. If there is a newvar=1 then one of the possible many will be output.

proc sort data=temp;
   by pid diag_cd descending newvar ;
run;

data want;
   set temp;
   by pid diag_cd;
   if first.pid;
run;

wj2 · Posted 04-10-2019 08:24 PM

Hi @ballardw, thank you for the suggestion! I will give this a try.

Would you be able to also provide a suggestion using a data step procedure with first/last variables? I am little more familiar with this approach as a newer SAS user.

To respond to @novinosrin's comment: I have provided additional clarification below on what the output should be like. Please note diff should be the difference between diagnosis dates of the SAME diagnosis. Please also note that the criterion is that the variable will be counted if the difference between ANY TWO diagnosis dates is >=90 days and <=365 days (not necessarily the difference between the earliest date and the last date). Please see Patient_ID 4 for an example of this scenario below (highlighted in red).

Patient_ID	diagnosis	diagnosis_date	d	diff	count
1	A	07JAN2015	07JAN2015	0	0
1	C	07JAN2015	07JAN2015	0	0
1	B	07MAR2015	07MAR2015	0	0
1	A	04MAY2015	07JAN2015	117	1
1	B	11FEB2016	07MAR2015	333	2
2	T	11NOV2013	11NOV2013	0	0
2	F	20OCT2014	20OCT2014	0	0
2	C	13JAN2016	13JAN2016	0	0
2	X	23FEB2017	23FEB2017	0	0
2	X	25FEB2018	23FEB2017	367	1
3	E	07JAN2015	07JAN2015	0	0
3	R	16JAN2017	16JAN2017	0	0
3	B	07FEB2016	07FEB2016	0	0
3	B	25APR2016	07FEB2016	78	0
3	B	02JUN2016	07FEB2016	116	1
4	C	03MAR2017	03MAR2017	0	0
4	C	01MAY2017	03MAR2017	59	0
4	C	15APR2018	01MAY2017	349	1

novinosrin · Posted 04-10-2019 09:46 PM

@wj2 OK got it. Thank you. It's almost bed time CST and I don't have SAS software at home. I'll work on this 1st thing in the morning i.e that is if nobody takes a stab at it. I honestly believe somebody would have solved while I'm sleeping. 🙂

novinosrin · Posted 04-11-2019 09:28 AM

Good morning @wj2


data have;
input Patient_ID        diagnosis $      diagnosis_date  :date9.; 
format  diagnosis_date  date9.; 
cards;
      1                         A                  07JAN2015
      1                         B                 11FEB2016
      1                         A                  04MAY2015
      1                         C                  07JAN2015
      1                         B                  07MAR2015
      2                         X                  23FEB2017
      2                         X                  25FEB2018
      2                         C                 13JAN2016
      2                         F                  20OCT2014
      2                         T                   11NOV2013
      3                         E                  07JAN2015
      3                         R                  07JAN2015
      3                         B                  07FEB2016
      3                         B                  25APR2016
      3                         B                  02JUN2016
	  4	C	3-Mar-17	3-Mar-17
4	C	1-May-17	3-Mar-17
4	C	15-Apr-18	1-May-17
	  ;

	  
proc sql;
create table want as
select a.*,sum(90<=intck('days',a.diagnosis_date,b.diagnosis_date)<=365)>=2 as flag
from have a , have b
where a.Patient_ID=b.Patient_ID and a.diagnosis=b.diagnosis and b.diagnosis_date>a.diagnosis_date
group by a.patient_id
order by a.Patient_ID,a.diagnosis,diagnosis_date;
quit;

novinosrin · Posted 04-11-2019 09:32 AM

And if you want the HAVE table as is in full , we switch to left join and distinct

proc sql;
create table want1 as
select distinct a.*,sum(90<=intck('days',a.diagnosis_date,b.diagnosis_date)<=365)>=2 as flag
from have a left join have b
on a.Patient_ID=b.Patient_ID and a.diagnosis=b.diagnosis and b.diagnosis_date>a.diagnosis_date
group by a.patient_id
order by a.Patient_ID,a.diagnosis,diagnosis_date;
quit;

novinosrin · Posted 04-11-2019 10:13 AM

A Datastep is of course more intuitive for its sequential process


data have;
input Patient_ID        diagnosis $      diagnosis_date  :date9.; 
format  diagnosis_date  date9.; 
cards;
      1                         A                  07JAN2015
      1                         B                 11FEB2016
      1                         A                  04MAY2015
      1                         C                  07JAN2015
      1                         B                  07MAR2015
      2                         X                  23FEB2017
      2                         X                  25FEB2018
      2                         C                 13JAN2016
      2                         F                  20OCT2014
      2                         T                   11NOV2013
      3                         E                  07JAN2015
      3                         R                  07JAN2015
      3                         B                  07FEB2016
      3                         B                  25APR2016
      3                         B                  02JUN2016
	  4	C	3-Mar-17	3-Mar-17
4	C	1-May-17	3-Mar-17
4	C	15-Apr-18	1-May-17
	  ;

	  proc sort data=have out=_have;
by Patient_ID diagnosis diagnosis_date;
run;

data want;
if _n_=1 then do;
if 0 then set have(rename=(diagnosis_date=d));
   dcl hash H (dataset:'have(rename=(diagnosis_date=d))',multidata:'y') ;
   h.definekey  ("Patient_ID","diagnosis") ;
   h.definedata ("d") ;
   h.definedone () ;
end;
set _have;
c=0;
by Patient_ID diagnosis ; 
if first.Patient_ID then c1=0;
do rc=h.find() by 0 while(rc=0);
if d>diagnosis_date and 90<=intck('days',diagnosis_date,d)<=365 then c+1;
rc=h.find_next();
end;
c1+c;
drop rc d;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!