Solved: Re: Counting large groups of variables within time frame

wj2 · Posted 11-21-2019 11:10 PM

Hello Community,

I am working with a large electronic health record data set (~1 million patients) and I need to create a variable that indicates whether patients have ANY diagnosis in a group of diagnoses (e.g., A, B, or C) within 365 days (before or after) the date of ANY diagnosis within a different group of diagnoses (e.g., D, E, or F).

In other words: (diagnosis date of diagnosis A or B or C) – (diagnosis date of D or E or F) = le 365 days or ge 365 days. The condition can be met for any pair of diagnoses between the two groups (e.g., D-A, D-B, D-C, E-A, etc.).

Another critical issue is that one of the groups of diagnoses that I am interested in contains approximately 2,500 different diagnoses and corresponding diagnosis dates. Therefore, I need the program to be able to somehow incorporate this large amount of data in an efficient manner in order to limit CPU resources (maybe via creating a separate data set for these diagnoses to read in?).

Below is a simplified example data set with some different cases for when the criterion should and should not be met. patient id = pid; diagnosis code= diag_cd; diagnosis date=diag_dt.

data work.test;
length pid $1.;
length diag $1.;
length diag_dt 8.; format diag_dt yymmdd10.;
infile datalines ;
input pid $ diag $ diag_dt date11.;
datalines;
1 A 10JAN2013
1 D 10JAN2013 /*condition met because diagnosis D occurs on same date as A*/
1 G 10JAN2013
2 F 02JUN2017 
2 J 02JUN2017
2 B 01JUL2017 /*condition met because diagnosis dates of F and B are within 365 days*/
2 J 01JUL2017
3 E 23OCT2012
3 C 17DEC2014
3 X 11JAN2015
3 D 25FEB2015 /*condition met because D occurs within 365 days of C*/
4 A 12MAR2015
4 Z 12MAR2015
4 A 20FEB2016 
4 E 07JAN2016 /*condition met because E occurs within 365 days of A on 12MAR2015*/
5 C 16JAN2017
5 A 28JAN2017
5 F 01FEB2018 /*condition NOT met because F occurs after 365 days from C and A*/
6 B 13NOV2016
6 B 10DEC2016
6 D 10DEC2016 /*condition met because D occurs within 365 days of B on 13NOV2016 or 10DEC2016*/
6 B 12DEC2017
7 F 01JAN2013
7 D 08JUN2013
7 C 07OCT2014 /*condition not met because C occurs after 365 days from F and D*/
;
run;

Ultimately, I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of A or B or C - diagnosis date of D or E or F = le/ge 365 days). Below is an example based on the example data set above.

pid	newvar
1	1
2	1
3	1
4	1
5	0
6	1
7	0

Any help with this problem would be much appreciated. Please let me know if I can provide any additional clarification or information. Thank you!

ErikLund_Jensen · Posted 11-26-2019 02:37 PM

Hi @wj2

This can be done with a simple data step, where each diagnosis_date in a group (ABC) or DEF) is compared to the last previous diagnosis_date in the other group. No need for preprocessing, lag functions or hash.

The only reguirement is that the input data set is sorted on pid and diag_dt, so I have included that in the code as a precaution in case your real data set is unsorted.

I only one group is present for a given pid, newvar will be set to 0. I hope this vill solve your problem.

proc sort data=test;
        by pid diag_dt;
run;

data want (keep=pid newvar); 
	set test;
	by pid;
	retain lastABC lastDEF newvar;

	if first.pid then do;
		lastABC = 0; lastDEF = 0; newvar = 0;
	end;

	if diag in ('A','B','C') then do;
		if lastDEF > 0 then newvar = sum(newvar,(diag_dt-lastDEF) <= 365);
		lastABC = diag_dt;
	end;

	else if diag in ('D','E','F') then do;
		if lastABC > 0 then newvar = sum(newvar,(diag_dt-lastABC) <= 365);
		lastDEF = diag_dt;
	end;

	if last.pid then do;
		newvar = (newvar > 0);
		output;
	end;
run;

View solution in original post

Reeza · Posted 11-21-2019 11:24 PM

If you use SQL to first filter your data set to just those set of diagnosis how many records do you have?

proc sql;
create table simplified as
select * from table1 where diag in (select diag from table2);
quit;

wj2 · Posted 11-26-2019 08:56 AM

@Reeza after filtering, it looks like there are 35,324 records for the first set of diagnoses and 9,576,319 records for the second set of diagnoses. Any suggestions would be much appreciated.

Reeza · Posted 11-26-2019 11:13 AM

You need to post one of each of your input data sets. But honestly this question gets asked a lot on here there's many solutions. Usually the most efficient is a hash solution but I don't code in hash 😞

The second option, which is pretty simple is a SQL join.

It would look something like this assuming you've filtered each data set for what you need. Modify this for what your data and if you have issues let us know.

proc sql;
create table want as
select a.*, b.diag as diag_second, b.date as date_second_event, b.diag_date - a.diag_date as date_diff
from have1 as a
left join have2 as b
on a.patientid=b.patientid

where b.date-a.date < 365
order by a.pid, a.diag_date, b.diag_date;
quit;

You may end up with multiple matches here - ie if you have 2 diagnosis within 365 days it will list all three.

wj2 · Posted 11-27-2019 09:40 AM

@Reeza thank you! In the past I have had disk space problems with hash solutions so I appreciate the SQL approach.

I would actually like to have the newvar indicate whether the condition is met among the entire sample. That is, newvar=0 for those where the condition is not met. Would you mind suggesting how the code could be modified for that? Also, how would I indicate the two sets of diagnoses (e.g., set 1 containing diagnoses A, B, and C and set 2 containing diagnoses D, E, and F) that I am interested in finding the difference between diagnosis dates? Would an INTCK function be necessary to handle negative values in the difference between dates? Thanks again!

ErikLund_Jensen · Posted 11-27-2019 09:46 AM

Hi @wj2

You might give my code a try. It is intended to be very efficient (much more than any SQL approach with between-conditions) and return your wanted-data set in the final form with 1/0 covering the entire sample. It should work on your test data without modifications.

wj2 · Posted 11-27-2019 09:51 AM

Hi @ErikLund_Jensen Yes, thank you! Your approach appears to be exactly what I need but I will test it to be sure today. As a new SAS learner, I am just interested in learning multiple approaches so I was wondering about a SQL approach as well!

mkeintz · Posted 11-22-2019 01:05 AM

I can imagine a couple ways to implement the tests you want to do, but what to you want your output data set to look like?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

wj2 · Posted 11-22-2019 08:11 AM

@mkeintz I would love to hear your thoughts. I would like the output data set to have a single row per each patient with a new variable (e.g., newvar) indicating whether the condition is met (diagnosis date of A or B or C - diagnosis date of D or E or F = le/ge 365 days) for every patient in the dataset. Below is an example based on the example data set above.

pid	newvar
1	1
2	1
3	1
4	1
5	0
6	1
7	0

ErikLund_Jensen · Posted 11-26-2019 02:37 PM

Hi @wj2

This can be done with a simple data step, where each diagnosis_date in a group (ABC) or DEF) is compared to the last previous diagnosis_date in the other group. No need for preprocessing, lag functions or hash.

The only reguirement is that the input data set is sorted on pid and diag_dt, so I have included that in the code as a precaution in case your real data set is unsorted.

I only one group is present for a given pid, newvar will be set to 0. I hope this vill solve your problem.

proc sort data=test;
        by pid diag_dt;
run;

data want (keep=pid newvar); 
	set test;
	by pid;
	retain lastABC lastDEF newvar;

	if first.pid then do;
		lastABC = 0; lastDEF = 0; newvar = 0;
	end;

	if diag in ('A','B','C') then do;
		if lastDEF > 0 then newvar = sum(newvar,(diag_dt-lastDEF) <= 365);
		lastABC = diag_dt;
	end;

	else if diag in ('D','E','F') then do;
		if lastABC > 0 then newvar = sum(newvar,(diag_dt-lastABC) <= 365);
		lastDEF = diag_dt;
	end;

	if last.pid then do;
		newvar = (newvar > 0);
		output;
	end;
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away