Hello community,
I am working with a large electronic health record data set (~1 million patients) and I need to determine the number of patients that have ANY diagnosis in a group of diagnoses (e.g., D, E, or F) within 365 days AFTER the date of ANY diagnosis within a different group of diagnoses (e.g., A, B, or C).
In other words: diagnosis date of diagnosis D or E or F - diagnosis date of A or B or C = le 365 days. The condition can be met for any pair of diagnoses between the two groups (e.g., D-A, D-B, D-C, E-A, etc.).
Below is an example data set with some different cases for when the criterion should and should not be met. patient id = pid; diagnosis code= diag_cd; diagnosis date=diag_dt.
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
Ultimately, I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of D or E or F - diagnosis date of A or B or C = le 365 days). Below is an example based on the example data set above.
pid | newvar |
1 | 1 |
2 | 0 |
3 | 1 |
4 | 1 |
5 | 1 |
6 | 0 |
7 | 1 |
Any help with this problem would be much appreciated. I'm assuming it would require some combination of proc sql and a data step with first/last variables; however, my programming skills are not yet advanced enough. Please let me know if I can provide any additional clarification or information. Thank you!
Should be
5 |
0 |
6 |
1 |
?
How about this one ?
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
format diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
proc sort data=test out=have;
by pid diag_dt;
run;
data want;
array _c{99} $ 32 _temporary_;
array _n{99} _temporary_;
do i=1 by 1 until(last.pid);
set have;
by pid;
_c{i}=diag;
_n{i}=diag_dt;
end;
new_var=0;
do m=1 to i-1;
do n=m+1 to i;
if _c{n} in ('D' 'E' 'F') and _c{m} in ('A' 'B' 'C') then do;
dif_day=_n{n}-_n{m};
if dif_day le 365 then do;new_var=1;yes=1;leave;end;
end;
if yes then leave;
end;
end;
keep pid new_var;
run;
Should be
5 |
0 |
6 |
1 |
?
How about this one ?
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
format diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
proc sort data=test out=have;
by pid diag_dt;
run;
data want;
array _c{99} $ 32 _temporary_;
array _n{99} _temporary_;
do i=1 by 1 until(last.pid);
set have;
by pid;
_c{i}=diag;
_n{i}=diag_dt;
end;
new_var=0;
do m=1 to i-1;
do n=m+1 to i;
if _c{n} in ('D' 'E' 'F') and _c{m} in ('A' 'B' 'C') then do;
dif_day=_n{n}-_n{m};
if dif_day le 365 then do;new_var=1;yes=1;leave;end;
end;
if yes then leave;
end;
end;
keep pid new_var;
run;
@Ksharp, this is perfect! Thank you!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.