BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Hi, I need some help with programming a counting procedure. Specifically, I am working with health record data and I need to create a binary variable (yes vs. no) for each patient in the data set to indicate the following: 

 

-  whether the patient had 2 or more encounters separated by at least 90 days, but not more than 365 days, with Diagnosis "A"

Of note: the occurrence of the diagnosis can be present before 90 days of its last occurrence, but there must be at least 90 days between any two occurrences of the diagnosis. 

 

Any recommendations on how to code this using a data step procedure with a first.variable/last.variable statement would be much appreciated.  

 

The other issue is not I have a a large group of diagnoses (~100) that I would like to do this for, and if ANY of the diagnoses within this group meets the condition above, then I would like to create a variable to indicate so. For example: IF diagnosis A=1 or diagnosis B=1 or diagnosis C=1.... then var=1. However, is there an efficient way of coding this other than doing the coding for the condition above for each diagnosis prior to combining ~100 of these conditions? Any suggestions for coding this would be much appreciated. 

 

Below is an example of how my data set is organized. Each row represents a separate encounter for each patient. 

 

Patient_ID        diagnosis       diagnosis_date      

      1                         A                  07JAN2015

      1                         B                 11FEB2016

      1                         A                  04MAY2015

      1                         C                  07JAN2015

      1                         B                  07MAR2015

      2                         X                  23FEB2017

      2                         X                  25FEB2018

      2                         C                 13JAN2016

      2                         F                  20OCT2014

      2                         T                   11NOV2013

      3                         E                  07JAN2015

      3                         R                  07JAN2015

      3                         B                  07FEB2016

      3                         B                  25APR2016

      3                         B                  02JUN2016 

     .....

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

This may get a bit closer.

data work.test;
length pid $1.;
length diag_cd $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag_cd $ diag_dt date11.;
datalines;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
;
run;


proc sql;
   create table temp as
   select a.pid, a.diag_cd,a.diag_dt
          ,(90 le (b.diag_dt-a.diag_dt) le 365) as newvar
   from work.test as a
        join 
        work.test as b
        on a.pid=b.pid and a.diag_cd=b.diag_cd
   where a.diag_dt < b.diag_dt
   ;
quit;

The above code compares all values that match PID and DIAG_CD when the dates aren't the same. There can be multiple outputs and the more often a particular diag_cd is used the more output records. Rules for selecting which date(s) should appear in the output weren't supplied.

 

This may get what you want. If there is a newvar=1 then one of the possible many will be output.

proc sort data=temp;
   by pid diag_cd descending newvar ;
run;

data want;
   set temp;
   by pid diag_cd;
   if first.pid;
run;

View solution in original post

24 REPLIES 24
novinosrin
Tourmaline | Level 20

Hi @wj2  If you could post the expected OUTPUT WANT sample for your sample INPUT while explaining the logic and why would help. 

ballardw
Super User

Please provide what you would expect the output data to look like as well. Since you setting a value based on multiple records there are several different ways the output might look like.

 

Your second bit about different diagnosis is not very clear. Do you mean to compare Diagnosis=A with Diagnosis=B for the 90/ 365 day interval?

 

And for clarification of your 90 day rule does the following (NOT your example data)

Patient_ID diagnosis diagnosis_date 
 1         A         07JAN2015
 1         A         11MAR2015
 1         A         14APR2015

14Apr2015 count as more than 90 because it is more than 90 days from the 07JAN2015?

wj2
Quartz | Level 8 wj2
Quartz | Level 8

@novinosrin @ballardw Yes, I would like the final expected output to include a variable (e.g., newvar) for each patient indicating whether any diagnosis within a specified group of diagnoses (let's say A, B, C, or D) meets the condition above. For example, if diagnosis A meets the condition for patient 1 and diagnosis B meets the condition for patient 3, then newvar=1 for both patients. So something like this: 

Patient_ID        diagnosis       diagnosis_date        newvar      

      1                         A                  07JAN2015                  1

      2                         X                  23FEB2017                 0

      3                         B                  02JUN2016                 1

     .....

 

@ballardw, yes- your example would meet the 90 day rule since there are two occurrences of the diagnosis, and one occurrence (14Apr2015) is more than 90 days than another occurrence (07JAN2015)

 

Thank you both for the help!

ballardw
Super User

@wj2 wrote:

@novinosrin @ballardw Yes, I would like the final expected output to include a variable (e.g., newvar) for each patient indicating whether any diagnosis within a specified group of diagnoses (let's say A, B, C, or D) meets the condition above. For example, if diagnosis A meets the condition for patient 1 and diagnosis B meets the condition for patient 3, then newvar=1 for both patients. So something like this: 

Patient_ID        diagnosis       diagnosis_date        newvar      

      1                         A                  07JAN2015                  1

      2                         X                  23FEB2017                 0

      3                         B                  02JUN2016                 1

     .....

 

@ballardw, yes- your example would meet the 90 day rule since there are two occurrences of the diagnosis, and one occurrence (14Apr2015) is more than 90 days than another occurrence (07JAN2015)

 

Thank you both for the help!


And if two (or more) separate diagnosis for a single patient meet the condition what should the output look like.

wj2
Quartz | Level 8 wj2
Quartz | Level 8

@ballardw, the output would also be newvar=1 if two or more separate diagnoses for a single patient meets the condition. 

Please let me know if any other clarification is needed. 

zekeT_sasaholic
Quartz | Level 8

Ok.

I thought i'd have more time to mock this up for you but nope.  So i will share what i have for now.

This is the kind of healthcare data and work I do day in/out and know.

Usually with VRDC CMS claims data and lots of it.  I am not posting this to start debates about uses (or not) of sql, merge, join.

I'm simply crafting a possible solution to  your case/scenario and one which i feel is flexible if you have MORE than one diagnosis condition to seek and/or much more data to throw at it.

 

If you care to discuss further - reach back out.  But i'm pretty swamped leading up to SGF.  I'll be attending and presenting.

I also live, work, code in Chicago.  Look me up via WCSUG.com

 

So here is my code to share with you...

/*** read mock data ***/
data work.test; length pid $1.; length diag_cd $1.; length diag_dt 8.; format diag_dt yymmdd10.; infile datalines ; input pid $ diag_cd $ diag_dt date11.; datalines; 1 A 07JAN2015 1 B 11FEB2016 1 A 04MAY2015 1 C 07JAN2015 1 B 07MAR2015 2 X 23FEB2017 2 X 25FEB2018 2 C 13JAN2016 2 F 20OCT2014 2 T 11NOV2013 3 E 07JAN2015 3 R 07JAN2015 3 B 07FEB2016 3 B 25APR2016 3 B 02JUN2016 ; run; /*** create format per target set of diag conditions to find ***/
/*** this might be for allergies ***/ proc format; invalue $dgtarga "A" = 1 "A.1" = 1 "A.10" = 1 other = 0; run; /*** this might be for high blood pressure ***/ proc format; invalue $dgtargb "B" = 1 "B.3" = 1 "B.40" = 1 other = 0; run;

/*** push that data thru the formats to identify which PID and thus dates to then check in next step ***/
/*** in this case since each target diag is different condition - i can separate out tables ***/
/*** by keeping PID and date (of diag) - in my next step - i can run a min/max on the diff between dates for that PID ***/ data work.test_a (keep=pid diag_dt) work.test_b (keep=pid diag_dt); set work.test;

/*** if i have other diag groups to target - i'd just have more target vars ***/ format target_a 8.; format target_b 8.; target_a=input(diag_cd,$dgtarga.); target_b=input(diag_cd,$dgtargb.); if target_a=1 then output work.test_a; if target_b=1 then output work.test_b; run;
/*** what i couldnt get to is the next step which transposes and/or does a by group (on pid) ***/
/*** for the days diff between the dates ***/

So again sorry i didnt get to more of this but i have a meeting coming up and a call that came in and snagged some of my time.

 

Best

Zeke Torres

LinkedIn

 

 

 

wj2
Quartz | Level 8 wj2
Quartz | Level 8

Thank you very much @zekeT_sasaholic for the start. I would be very interested for hearing your suggestions for coding the time interval criteria and including multiple diagnoses for this scenario as well. I am a new SAS user and I am most familiar with using first/last variable coding for counting, so an approach using that method would be preferable. @ballardw @novinosrin: please let me know if you would like additional clarification for providing any suggestions on your end as well. 

novinosrin
Tourmaline | Level 20

Hi @wj2   I am afraid I haven't understood the required output well and so I was basically waiting for @ballardw  to perhaps take the lead if he was working on it. 

 

So my understanding goes as follows, 

See the below table, the diff column, count. Are you able to make out where my thinking is going??

It's one of those things that leads to assumptions when I do not have clear understanding .

 

Patient_ID diagnosis diagnosis_date d Flag count diff
1 A 07JAN2015 07JAN2015 0 0 .
1 C 07JAN2015 07JAN2015 0 0 0
1 B 07MAR2015 07JAN2015 0 0 59
1 A 04MAY2015 04MAY2015 0 1 117
1 B 11FEB2016 11FEB2016 1 2 283
2 T 11NOV2013 11NOV2013 0 0 .
2 F 20OCT2014 20OCT2014 0 1 343
2 C 13JAN2016 20OCT2014 0 1 450
2 X 23FEB2017 20OCT2014 0 1 857
2 X 25FEB2018 20OCT2014 0 1 1224
3 E 07JAN2015 07JAN2015 0 0 .
3 R 07JAN2015 07JAN2015 0 0 0
3 B 07FEB2016 07JAN2015 0 0 396
3 B 25APR2016 07JAN2015 0 0 474
3 B 02JUN2016 07JAN2015 0 0 512
ballardw
Super User

This may get a bit closer.

data work.test;
length pid $1.;
length diag_cd $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag_cd $ diag_dt date11.;
datalines;
1 A 07JAN2015
1 B 11FEB2016
1 A 04MAY2015
1 C 07JAN2015
1 B 07MAR2015
2 X 23FEB2017
2 X 25FEB2018
2 C 13JAN2016
2 F 20OCT2014
2 T 11NOV2013
3 E 07JAN2015
3 R 07JAN2015
3 B 07FEB2016
3 B 25APR2016
3 B 02JUN2016
;
run;


proc sql;
   create table temp as
   select a.pid, a.diag_cd,a.diag_dt
          ,(90 le (b.diag_dt-a.diag_dt) le 365) as newvar
   from work.test as a
        join 
        work.test as b
        on a.pid=b.pid and a.diag_cd=b.diag_cd
   where a.diag_dt < b.diag_dt
   ;
quit;

The above code compares all values that match PID and DIAG_CD when the dates aren't the same. There can be multiple outputs and the more often a particular diag_cd is used the more output records. Rules for selecting which date(s) should appear in the output weren't supplied.

 

This may get what you want. If there is a newvar=1 then one of the possible many will be output.

proc sort data=temp;
   by pid diag_cd descending newvar ;
run;

data want;
   set temp;
   by pid diag_cd;
   if first.pid;
run;
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Hi @ballardw, thank you for the suggestion! I will give this a try. 

Would you be able to also provide a suggestion using a data step procedure with first/last variables? I am little more familiar with this approach as a newer SAS user. 

 

To respond to @novinosrin's comment: I have provided additional clarification below on what the output should be like. Please note diff should be the difference between diagnosis dates of the SAME diagnosis. Please also note that the criterion is that the variable will be counted if the difference between ANY TWO diagnosis dates is >=90 days and <=365 days (not necessarily the difference between the earliest date and the last date). Please see Patient_ID 4 for an example of this scenario below (highlighted in red). 

 

Patient_ID

diagnosis

diagnosis_date

d

diff

count

1

A

07JAN2015

07JAN2015

0

0

1

C

07JAN2015

07JAN2015

0

0

1

B

07MAR2015

07MAR2015

0

0

1

A

04MAY2015

07JAN2015

117

1

1

B

11FEB2016

07MAR2015

333

2

2

T

11NOV2013

11NOV2013

0

0

2

F

20OCT2014

20OCT2014

0

0

2

C

13JAN2016

13JAN2016

0

0

2

X

23FEB2017

23FEB2017

0

0

2

X

25FEB2018

23FEB2017

367

1

3

E

07JAN2015

07JAN2015

0

0

3

R

16JAN2017

16JAN2017

0

0

3

B

07FEB2016

07FEB2016

0

0

3

B

25APR2016

07FEB2016

78

0

3

B

02JUN2016

07FEB2016

116

1

4

C

03MAR2017

03MAR2017

0

0

4

C

01MAY2017

03MAR2017

59

0

4

C

15APR2018

01MAY2017

349

1

novinosrin
Tourmaline | Level 20

@wj2   OK got it. Thank you. It's almost bed time CST and I don't have SAS software at home. I'll work on this 1st thing in the morning i.e that is if nobody takes a stab at it. I honestly believe somebody would have solved while I'm sleeping. 🙂

novinosrin
Tourmaline | Level 20

Good morning @wj2  

 


data have;
input Patient_ID        diagnosis $      diagnosis_date  :date9.; 
format  diagnosis_date  date9.; 
cards;
      1                         A                  07JAN2015
      1                         B                 11FEB2016
      1                         A                  04MAY2015
      1                         C                  07JAN2015
      1                         B                  07MAR2015
      2                         X                  23FEB2017
      2                         X                  25FEB2018
      2                         C                 13JAN2016
      2                         F                  20OCT2014
      2                         T                   11NOV2013
      3                         E                  07JAN2015
      3                         R                  07JAN2015
      3                         B                  07FEB2016
      3                         B                  25APR2016
      3                         B                  02JUN2016
	  4	C	3-Mar-17	3-Mar-17
4	C	1-May-17	3-Mar-17
4	C	15-Apr-18	1-May-17
	  ;

	  
proc sql;
create table want as
select a.*,sum(90<=intck('days',a.diagnosis_date,b.diagnosis_date)<=365)>=2 as flag
from have a , have b
where a.Patient_ID=b.Patient_ID and a.diagnosis=b.diagnosis and b.diagnosis_date>a.diagnosis_date
group by a.patient_id
order by a.Patient_ID,a.diagnosis,diagnosis_date;
quit;
novinosrin
Tourmaline | Level 20

And if you want the HAVE table as is in full , we switch to left join and distinct 

 

proc sql;
create table want1 as
select distinct a.*,sum(90<=intck('days',a.diagnosis_date,b.diagnosis_date)<=365)>=2 as flag
from have a left join have b
on a.Patient_ID=b.Patient_ID and a.diagnosis=b.diagnosis and b.diagnosis_date>a.diagnosis_date
group by a.patient_id
order by a.Patient_ID,a.diagnosis,diagnosis_date;
quit;
novinosrin
Tourmaline | Level 20

A Datastep is of course more intuitive for its sequential process

 

 


data have;
input Patient_ID        diagnosis $      diagnosis_date  :date9.; 
format  diagnosis_date  date9.; 
cards;
      1                         A                  07JAN2015
      1                         B                 11FEB2016
      1                         A                  04MAY2015
      1                         C                  07JAN2015
      1                         B                  07MAR2015
      2                         X                  23FEB2017
      2                         X                  25FEB2018
      2                         C                 13JAN2016
      2                         F                  20OCT2014
      2                         T                   11NOV2013
      3                         E                  07JAN2015
      3                         R                  07JAN2015
      3                         B                  07FEB2016
      3                         B                  25APR2016
      3                         B                  02JUN2016
	  4	C	3-Mar-17	3-Mar-17
4	C	1-May-17	3-Mar-17
4	C	15-Apr-18	1-May-17
	  ;

	  proc sort data=have out=_have;
by Patient_ID diagnosis diagnosis_date;
run;

data want;
if _n_=1 then do;
if 0 then set have(rename=(diagnosis_date=d));
   dcl hash H (dataset:'have(rename=(diagnosis_date=d))',multidata:'y') ;
   h.definekey  ("Patient_ID","diagnosis") ;
   h.definedata ("d") ;
   h.definedone () ;
end;
set _have;
c=0;
by Patient_ID diagnosis ; 
if first.Patient_ID then c1=0;
do rc=h.find() by 0 while(rc=0);
if d>diagnosis_date and 90<=intck('days',diagnosis_date,d)<=365 then c+1;
rc=h.find_next();
end;
c1+c;
drop rc d;
run;

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 6709 views
  • 4 likes
  • 4 in conversation