Hello Community,
I am working with a large electronic health record data set (~1 million patients) and I need to create a variable that indicates whether patients have ANY diagnosis in a group of diagnoses (e.g., A, B, or C) within 365 days (before or after) the date of ANY diagnosis within a different group of diagnoses (e.g., D, E, or F).
In other words: (diagnosis date of diagnosis A or B or C) – (diagnosis date of D or E or F) = le 365 days or ge 365 days. The condition can be met for any pair of diagnoses between the two groups (e.g., D-A, D-B, D-C, E-A, etc.).
Another critical issue is that one of the groups of diagnoses that I am interested in contains approximately 2,500 different diagnoses and corresponding diagnosis dates. Therefore, I need the program to be able to somehow incorporate this large amount of data in an efficient manner in order to limit CPU resources (maybe via creating a separate data set for these diagnoses to read in?).
Below is a simplified example data set with some different cases for when the criterion should and should not be met. patient id = pid; diagnosis code= diag_cd; diagnosis date=diag_dt.
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition met because diagnosis dates of F and B are within 365 days*/
2 J 01JUL2017
3 E 23OCT2012
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days of C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days of A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 F 01JAN2013
7 D 08JUN2013
7 C 07OCT2014 /*condition not met because C occurs after 365 days from F and D*/
;
run;
Ultimately, I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of A or B or C - diagnosis date of D or E or F = le/ge 365 days). Below is an example based on the example data set above.
pid | newvar |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 0 |
6 | 1 |
7 | 0 |
Any help with this problem would be much appreciated. Please let me know if I can provide any additional clarification or information. Thank you!
Hi @wj2
This can be done with a simple data step, where each diagnosis_date in a group (ABC) or DEF) is compared to the last previous diagnosis_date in the other group. No need for preprocessing, lag functions or hash.
The only reguirement is that the input data set is sorted on pid and diag_dt, so I have included that in the code as a precaution in case your real data set is unsorted.
I only one group is present for a given pid, newvar will be set to 0. I hope this vill solve your problem.
proc sort data=test;
by pid diag_dt;
run;
data want (keep=pid newvar);
set test;
by pid;
retain lastABC lastDEF newvar;
if first.pid then do;
lastABC = 0; lastDEF = 0; newvar = 0;
end;
if diag in ('A','B','C') then do;
if lastDEF > 0 then newvar = sum(newvar,(diag_dt-lastDEF) <= 365);
lastABC = diag_dt;
end;
else if diag in ('D','E','F') then do;
if lastABC > 0 then newvar = sum(newvar,(diag_dt-lastABC) <= 365);
lastDEF = diag_dt;
end;
if last.pid then do;
newvar = (newvar > 0);
output;
end;
run;
@Reeza after filtering, it looks like there are 35,324 records for the first set of diagnoses and 9,576,319 records for the second set of diagnoses. Any suggestions would be much appreciated.
@Reeza thank you! In the past I have had disk space problems with hash solutions so I appreciate the SQL approach.
I would actually like to have the newvar indicate whether the condition is met among the entire sample. That is, newvar=0 for those where the condition is not met. Would you mind suggesting how the code could be modified for that? Also, how would I indicate the two sets of diagnoses (e.g., set 1 containing diagnoses A, B, and C and set 2 containing diagnoses D, E, and F) that I am interested in finding the difference between diagnosis dates? Would an INTCK function be necessary to handle negative values in the difference between dates? Thanks again!
Hi @wj2
You might give my code a try. It is intended to be very efficient (much more than any SQL approach with between-conditions) and return your wanted-data set in the final form with 1/0 covering the entire sample. It should work on your test data without modifications.
Hi @ErikLund_Jensen Yes, thank you! Your approach appears to be exactly what I need but I will test it to be sure today. As a new SAS learner, I am just interested in learning multiple approaches so I was wondering about a SQL approach as well!
I can imagine a couple ways to implement the tests you want to do, but what to you want your output data set to look like?
@mkeintz I would love to hear your thoughts. I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of A or B or C - diagnosis date of D or E or F = le/ge 365 days) for every patient in the dataset. Below is an example based on the example data set above.
pid | newvar |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 0 |
6 | 1 |
7 | 0 |
Hi @wj2
This can be done with a simple data step, where each diagnosis_date in a group (ABC) or DEF) is compared to the last previous diagnosis_date in the other group. No need for preprocessing, lag functions or hash.
The only reguirement is that the input data set is sorted on pid and diag_dt, so I have included that in the code as a precaution in case your real data set is unsorted.
I only one group is present for a given pid, newvar will be set to 0. I hope this vill solve your problem.
proc sort data=test;
by pid diag_dt;
run;
data want (keep=pid newvar);
set test;
by pid;
retain lastABC lastDEF newvar;
if first.pid then do;
lastABC = 0; lastDEF = 0; newvar = 0;
end;
if diag in ('A','B','C') then do;
if lastDEF > 0 then newvar = sum(newvar,(diag_dt-lastDEF) <= 365);
lastABC = diag_dt;
end;
else if diag in ('D','E','F') then do;
if lastABC > 0 then newvar = sum(newvar,(diag_dt-lastABC) <= 365);
lastDEF = diag_dt;
end;
if last.pid then do;
newvar = (newvar > 0);
output;
end;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.