Hello community,
I am working with a large electronic health record data set (~1 million patients) and I need to determine the number of patients that have ANY diagnosis in a group of diagnoses (e.g., D, E, or F) within 365 days AFTER the date of ANY diagnosis within a different group of diagnoses (e.g., A, B, or C).
In other words: diagnosis date of diagnosis D or E or F - diagnosis date of A or B or C = le 365 days. The condition can be met for any pair of diagnoses between the two groups (e.g., D-A, D-B, D-C, E-A, etc.).
Below is an example data set with some different cases for when the criterion should and should not be met. patient id = pid; diagnosis code= diag_cd; diagnosis date=diag_dt.
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
Ultimately, I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of D or E or F - diagnosis date of A or B or C = le 365 days). Below is an example based on the example data set above.
pid | newvar |
1 | 1 |
2 | 0 |
3 | 1 |
4 | 1 |
5 | 1 |
6 | 0 |
7 | 1 |
Any help with this problem would be much appreciated. I'm assuming it would require some combination of proc sql and a data step with first/last variables; however, my programming skills are not yet advanced enough. Please let me know if I can provide any additional clarification or information. Thank you!
Should be
5 |
0 |
6 |
1 |
?
How about this one ?
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
format diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
proc sort data=test out=have;
by pid diag_dt;
run;
data want;
array _c{99} $ 32 _temporary_;
array _n{99} _temporary_;
do i=1 by 1 until(last.pid);
set have;
by pid;
_c{i}=diag;
_n{i}=diag_dt;
end;
new_var=0;
do m=1 to i-1;
do n=m+1 to i;
if _c{n} in ('D' 'E' 'F') and _c{m} in ('A' 'B' 'C') then do;
dif_day=_n{n}-_n{m};
if dif_day le 365 then do;new_var=1;yes=1;leave;end;
end;
if yes then leave;
end;
end;
keep pid new_var;
run;
Should be
5 |
0 |
6 |
1 |
?
How about this one ?
data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
format diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
proc sort data=test out=have;
by pid diag_dt;
run;
data want;
array _c{99} $ 32 _temporary_;
array _n{99} _temporary_;
do i=1 by 1 until(last.pid);
set have;
by pid;
_c{i}=diag;
_n{i}=diag_dt;
end;
new_var=0;
do m=1 to i-1;
do n=m+1 to i;
if _c{n} in ('D' 'E' 'F') and _c{m} in ('A' 'B' 'C') then do;
dif_day=_n{n}-_n{m};
if dif_day le 365 then do;new_var=1;yes=1;leave;end;
end;
if yes then leave;
end;
end;
keep pid new_var;
run;
@Ksharp, this is perfect! Thank you!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.