BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Hello Community,

 

I am working with a large electronic health record data set (~1 million patients) and I need to create a variable that indicates whether patients have ANY diagnosis in a group of diagnoses (e.g., A, B, or C) within 365 days (before or after) the date of ANY diagnosis within a different group of diagnoses (e.g., D, E, or F).

 

In other words: (diagnosis date of diagnosis A or B or C) – (diagnosis date of D or E or F) = le 365 days or ge 365 days. The condition can be met for any pair of diagnoses between the two groups (e.g., D-A, D-B, D-C, E-A, etc.).

 

Another critical issue is that one of the groups of diagnoses that I am interested in contains approximately 2,500 different diagnoses and corresponding diagnosis dates. Therefore, I need the program to be able to somehow incorporate this large amount of data in an efficient manner in order to limit CPU resources (maybe via creating a separate data set for these diagnoses to read in?).

 

Below is a simplified example data set with some different cases for when the criterion should and should not be met. patient id = pid; diagnosis code= diag_cd; diagnosis date=diag_dt.

data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017 
2 J 02JUN2017
2 B 01JUL2017 /*condition met because diagnosis dates of F and B are within 365 days*/
2 J 01JUL2017
3 E 23OCT2012
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days of C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016 
4 E 07JAN2016 /*condition met because E occurs within 365 days of A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 F 01JAN2013
7 D 08JUN2013
7 C 07OCT2014 /*condition not met because C occurs after 365 days from F and D*/
;
run;

Ultimately, I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of A or B or C - diagnosis date of D or E or F = le/ge 365 days). Below is an example based on the example data set above.

pid

newvar

1

1

2

1

3

1

4

1

5

0

6

1

7

0

 

Any help with this problem would be much appreciated. Please let me know if I can provide any additional clarification or information. Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @wj2 

 

This can be done with a simple data step, where each diagnosis_date in a group (ABC) or DEF) is compared to the last previous diagnosis_date in the other group. No need for preprocessing, lag functions or hash.

 

The only reguirement is that the input data set is sorted on pid and diag_dt, so I have included that in the code as a precaution in case your real data set is unsorted.

 

I only one group is present for a given pid, newvar will be set to 0. I hope this vill solve your problem.

 

proc sort data=test;
by pid diag_dt;
run;
data want (keep=pid newvar); set test; by pid; retain lastABC lastDEF newvar;
if first.pid then do; lastABC = 0; lastDEF = 0; newvar = 0; end;
if diag in ('A','B','C') then do; if lastDEF > 0 then newvar = sum(newvar,(diag_dt-lastDEF) <= 365); lastABC = diag_dt; end;
else if diag in ('D','E','F') then do; if lastABC > 0 then newvar = sum(newvar,(diag_dt-lastABC) <= 365); lastDEF = diag_dt; end;
if last.pid then do; newvar = (newvar > 0); output; end; run;

 

 

View solution in original post

9 REPLIES 9
Reeza
Super User
If you use SQL to first filter your data set to just those set of diagnosis how many records do you have?

proc sql;
create table simplified as
select * from table1 where diag in (select diag from table2);
quit;
wj2
Quartz | Level 8 wj2
Quartz | Level 8

@Reeza after filtering, it looks like there are 35,324 records for the first set of diagnoses and 9,576,319 records for the second set of diagnoses. Any suggestions would be much appreciated. 

Reeza
Super User
You need to post one of each of your input data sets. But honestly this question gets asked a lot on here there's many solutions. Usually the most efficient is a hash solution but I don't code in hash 😞

The second option, which is pretty simple is a SQL join.

It would look something like this assuming you've filtered each data set for what you need. Modify this for what your data and if you have issues let us know.

proc sql;
create table want as
select a.*, b.diag as diag_second, b.date as date_second_event, b.diag_date - a.diag_date as date_diff
from have1 as a
left join have2 as b
on a.patientid=b.patientid

where b.date-a.date < 365
order by a.pid, a.diag_date, b.diag_date;
quit;

You may end up with multiple matches here - ie if you have 2 diagnosis within 365 days it will list all three.
wj2
Quartz | Level 8 wj2
Quartz | Level 8

@Reeza thank you! In the past I have had disk space problems with hash solutions so I appreciate the SQL approach.

 

I would actually like to have the newvar indicate whether the condition is met among the entire sample. That is, newvar=0 for those where the condition is not met. Would you mind suggesting how the code could be modified for that? Also, how would I indicate the two sets of diagnoses (e.g., set 1 containing diagnoses A, B, and C and set 2 containing diagnoses D, E, and F) that I am interested in finding the difference between diagnosis dates? Would an INTCK function be necessary to handle negative values in the difference between dates? Thanks again!

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @wj2 

 

You might give my code a try. It is intended to be very efficient (much more than any SQL approach with between-conditions) and return your wanted-data set in the final form with 1/0 covering the entire sample. It should work on your test data without modifications.

wj2
Quartz | Level 8 wj2
Quartz | Level 8

Hi @ErikLund_Jensen Yes, thank you! Your approach appears to be exactly what I need but I will test it to be sure today. As a new SAS learner, I am just interested in learning multiple approaches so I was wondering about a SQL approach as well! 

mkeintz
PROC Star

I can imagine a couple ways to implement the tests you want to do, but what to you want your output  data set to look like?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
wj2
Quartz | Level 8 wj2
Quartz | Level 8

@mkeintz I would love to hear your thoughts. I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of A or B or C - diagnosis date of D or E or F = le/ge 365 days) for every  patient in the dataset. Below is an example based on the example data set above.

 

pid

newvar

1

1

2

1

3

1

4

1

5

0

6

1

7

0

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @wj2 

 

This can be done with a simple data step, where each diagnosis_date in a group (ABC) or DEF) is compared to the last previous diagnosis_date in the other group. No need for preprocessing, lag functions or hash.

 

The only reguirement is that the input data set is sorted on pid and diag_dt, so I have included that in the code as a precaution in case your real data set is unsorted.

 

I only one group is present for a given pid, newvar will be set to 0. I hope this vill solve your problem.

 

proc sort data=test;
by pid diag_dt;
run;
data want (keep=pid newvar); set test; by pid; retain lastABC lastDEF newvar;
if first.pid then do; lastABC = 0; lastDEF = 0; newvar = 0; end;
if diag in ('A','B','C') then do; if lastDEF > 0 then newvar = sum(newvar,(diag_dt-lastDEF) <= 365); lastABC = diag_dt; end;
else if diag in ('D','E','F') then do; if lastABC > 0 then newvar = sum(newvar,(diag_dt-lastABC) <= 365); lastDEF = diag_dt; end;
if last.pid then do; newvar = (newvar > 0); output; end; run;

 

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 597 views
  • 1 like
  • 4 in conversation