Solved: Re: Counting groups of variables within time frame

wj2 · Posted 04-22-2019 09:05 AM

Hello community,

I am working with a large electronic health record data set (~1 million patients) and I need to determine the number of patients that have ANY diagnosis in a group of diagnoses (e.g., D, E, or F) within 365 days AFTER the date of ANY diagnosis within a different group of diagnoses (e.g., A, B, or C).

In other words: diagnosis date of diagnosis D or E or F - diagnosis date of A or B or C = le 365 days. The condition can be met for any pair of diagnoses between the two groups (e.g., D-A, D-B, D-C, E-A, etc.).

Below is an example data set with some different cases for when the criterion should and should not be met. patient id = pid; diagnosis code= diag_cd; diagnosis date=diag_dt.

data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017 
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016 
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;

Ultimately, I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of D or E or F - diagnosis date of A or B or C = le 365 days). Below is an example based on the example data set above.

pid	newvar
1	1
2	0
3	1
4	1
5	1
6	0
7	1

Any help with this problem would be much appreciated. I'm assuming it would require some combination of proc sql and a data step with first/last variables; however, my programming skills are not yet advanced enough. Please let me know if I can provide any additional clarification or information. Thank you!

Ksharp · Posted 04-22-2019 10:40 AM

Should be

5	0
6	1

?

How about this one ?

data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
format diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017 
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016 
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
proc sort data=test out=have;
by pid diag_dt;
run;
data want;
 array _c{99} $ 32 _temporary_;
 array _n{99} _temporary_;
 do i=1 by 1 until(last.pid);
  set have;
  by pid;
  _c{i}=diag;
  _n{i}=diag_dt;
 end;
new_var=0;
 do m=1 to i-1;
   do n=m+1 to i;
    if _c{n} in ('D' 'E' 'F') and _c{m} in ('A' 'B' 'C') then do;
     dif_day=_n{n}-_n{m};
	 if dif_day le 365  then do;new_var=1;yes=1;leave;end;
	end;
	if yes then leave;
   end;
 end;
 keep pid new_var;
 run;

View solution in original post

Ksharp · Posted 04-22-2019 10:40 AM

Should be

5	0
6	1

?

How about this one ?

data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
format diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017 
2 J 02JUN2017
2 B 01JUL2017 /*condition NOT met because diagnosis F occurs before B*/
2 J 01JUL2017
3 D 23OCT2014
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days after C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016 
4 E 07JAN2016 /*condition met because E occurs within 365 days after A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 A 01JAN2013
7 A 08JAN2014
7 D 07MAY2014 /*condition met because D occurs within 365 days of A on 08JAN2014 (but not A on 01JAN2013*/
7 A 15JUN2015
;
run;
proc sort data=test out=have;
by pid diag_dt;
run;
data want;
 array _c{99} $ 32 _temporary_;
 array _n{99} _temporary_;
 do i=1 by 1 until(last.pid);
  set have;
  by pid;
  _c{i}=diag;
  _n{i}=diag_dt;
 end;
new_var=0;
 do m=1 to i-1;
   do n=m+1 to i;
    if _c{n} in ('D' 'E' 'F') and _c{m} in ('A' 'B' 'C') then do;
     dif_day=_n{n}-_n{m};
	 if dif_day le 365  then do;new_var=1;yes=1;leave;end;
	end;
	if yes then leave;
   end;
 end;
 keep pid new_var;
 run;

wj2 · Posted 04-23-2019 02:18 PM

@Ksharp, this is perfect! Thank you!

Counting groups of variables within time frame

Re: Counting groups of variables within time frame

Re: Counting groups of variables within time frame

Re: Counting groups of variables within time frame

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away