Hello,
I am trying to define concomitant use of drugs.
I have a great solution to my problem with a SAS data step at the link below:
However, the only issue is that I have a very large dataset (~600K rows) and those codes require a ton of memory and my computer crashes. I wonder if anybody could suggest an alternative, perhaps a proc sql solution, that would not require as much memory and computational capacity.
Thank you very much! Any help would be greatly appreciated!
Katie
This is not exactly elegant but it works:
data have;
input ID DRUG $ START_DT MMDDYY10. DAYS_SUPP END_DT MMDDYY10. ;
FORMAT START_DT END_DT MMDDYY10.;
cards;
1 A 2/17/10 30 3/19/10
1 B 5/6/09 30 6/5/09
1 C 7/9/11 60 9/7/11
1 E 3/1/10 90 5/30/10
1 B 1/1/09 90 4/1/09
1 D 2/1/09 30 3/3/09
1 C 5/6/12 90 8/4/12
2 B 4/1/12 60 5/31/12
2 A 7/1/10 30 7/31/10
2 C 8/3/10 90 11/1/10
2 D 11/1/13 90 1/30/14
2 E 12/5/13 90 3/5/14
2 A 2/1/11 90 5/2/11
;
RUN;
PROC SQL;
CREATE TABLE HAVE AS
SELECT MONOTONIC() AS ROWID, * FROM HAVE;
QUIT;
%let fromDate = 01Jan2000;
%let toDate = 31DEC2016;
data alldates;
length key 8;
do date = "&fromDate"d to "&toDate"d;
key + 1;
RSA_WorkdayInd = ( 2 <= weekday(date) <= 6);
WeekendInd = (not RSA_WorkdayInd);
Calendar_Week_Number = week(date, "V");
DayOfWeek = put(date, downame3.);
Fin_Year = year(intnx("year.7", date, 0));
output;
end;
format
date date9.
;
run;
PROC SQL;
CREATE TABLE OVERLAP AS
SELECT HAVE.*, HAVE2.ROWID AS ROWID2, HAVE2.ID AS ID2, HAVE2.START_DT AS START2, HAVE2.DAYS_SUPP AS DAYS2, HAVE2.END_DT AS END2
, (SELECT COUNT(*) FROM alldates WHERE DATE >= HAVE.START_DT AND DATE >= HAVE2.START_DT AND DATE <= HAVE.END_DT AND DATE <= HAVE2.END_DT) AS OVERLAP
FROM HAVE, HAVE AS HAVE2
WHERE HAVE.ROWID NE HAVE2.ROWID
AND HAVE.ID = HAVE2.ID
AND (HAVE.START_DT <= HAVE2.END_DT) and (HAVE2.START_DT <= HAVE.END_DT);
QUIT;
600K rows should not be causing issues.
Can you identify which step causes the issues?
The first step can be slightly optimized as:
data days(sortedby=id);
length drug $1;
set have;
do day = start_dt to end_dt;
output;
end;
format day yymmdd10.;
keep id drug day;
run;
The sortedby= dataset option might improve the sort performance of the following step.
Droping the useless variable dum saves space. (It was a remnant of a previous version of the code)
Defining the length of drug as 1 character will also save space.
See if you can get the first step and sort to work. If so, we will try to optimize the rest.
Regarding your link ...
The first step is a good idea at least in theory. But it does expand the data set to perhaps 40M rows. Is that where the process breaks down?
The second step is also a good idea (sorting the 40M rows). Again, is this the source of the problem?
I'm hoping the problems lie in the DATA step that follows sorting. Questions for that:
Thank you for your reply. Yes, the process breaks down at the first step due to the large number of rows it tries to generate. Regarding your second point, if a patient has been on combination AB for 10 days and combination CD for 20 days, then I would disregard those combinations because in both cases, the patient has been on a unique drug combination for <30 days. However, if a patient has been on AB 10 days but ABC for >=30 days, then I will count ABC.
I am not sure about the maximum number of overlaps. Probably about 5-6. I will shorten the names of the drugs to single letters or single digits in place of the full drug name. Thank you very much!
I still don't understand how it would crash. 40 million rows shouldn't be too much, it's annoying and large - takes about <20 mins to process analytically our sites on a standard desktop. If its failing at the sort, you could change the sortsize option to help with that issue.
For your consideration, proc sql code that displays number of days overlap among single-subject drug periods that overlap:
data have;
input ID DRUG $ START_DT MMDDYY10. DAYS_SUPP END_DT MMDDYY10. ;
FORMAT START_DT END_DT MMDDYY10.;
cards;
1 A 2/17/10 30 3/19/10
1 B 5/6/09 30 6/5/09
1 C 7/9/11 60 9/7/11
1 E 3/1/10 90 5/30/10
1 B 1/1/09 90 4/1/09
1 D 2/1/09 30 3/3/09
1 C 5/6/12 90 8/4/12
2 B 4/1/12 60 5/31/12
2 A 7/1/10 30 7/31/10
2 C 8/3/10 90 11/1/10
2 D 11/1/13 90 1/30/14
2 E 12/5/13 90 3/5/14
2 A 2/1/11 90 5/2/11
;
RUN;
PROC SQL;
CREATE TABLE HAVE AS
SELECT MONOTONIC() AS ROWID, * FROM HAVE;
QUIT;
PROC SQL;
CREATE TABLE OVERLAP AS
SELECT HAVE.*, HAVE2.ROWID AS ROWID2, HAVE2.ID AS ID2, HAVE2.START_DT AS START2, HAVE2.DAYS_SUPP AS DAYS2, HAVE2.END_DT AS END2
, CASE WHEN HAVE.START_DT > HAVE2.START_DT THEN HAVE2.END_DT - HAVE.START_DT ELSE HAVE.END_DT - HAVE2.START_DT END AS OVERLAP
FROM HAVE, HAVE AS HAVE2
WHERE HAVE.ROWID NE HAVE2.ROWID
AND HAVE.ID = HAVE2.ID
AND (HAVE.START_DT <= HAVE2.END_DT) and (HAVE2.START_DT <= HAVE.END_DT);
QUIT;
Thank you for your response. This is very helpful. I noticed an error in your output in that, in rows 3 and 4 of the output, the actual overlap duration between the 2 drugs are between 2/1/2009 and 3/3/2009, which is 30 days, not 59 days. I was wondering how I could use CASE WHEN to take care of the other two scenarios where duration of drugA can be completely within duration of drugB and vice versa.
This is not exactly elegant but it works:
data have;
input ID DRUG $ START_DT MMDDYY10. DAYS_SUPP END_DT MMDDYY10. ;
FORMAT START_DT END_DT MMDDYY10.;
cards;
1 A 2/17/10 30 3/19/10
1 B 5/6/09 30 6/5/09
1 C 7/9/11 60 9/7/11
1 E 3/1/10 90 5/30/10
1 B 1/1/09 90 4/1/09
1 D 2/1/09 30 3/3/09
1 C 5/6/12 90 8/4/12
2 B 4/1/12 60 5/31/12
2 A 7/1/10 30 7/31/10
2 C 8/3/10 90 11/1/10
2 D 11/1/13 90 1/30/14
2 E 12/5/13 90 3/5/14
2 A 2/1/11 90 5/2/11
;
RUN;
PROC SQL;
CREATE TABLE HAVE AS
SELECT MONOTONIC() AS ROWID, * FROM HAVE;
QUIT;
%let fromDate = 01Jan2000;
%let toDate = 31DEC2016;
data alldates;
length key 8;
do date = "&fromDate"d to "&toDate"d;
key + 1;
RSA_WorkdayInd = ( 2 <= weekday(date) <= 6);
WeekendInd = (not RSA_WorkdayInd);
Calendar_Week_Number = week(date, "V");
DayOfWeek = put(date, downame3.);
Fin_Year = year(intnx("year.7", date, 0));
output;
end;
format
date date9.
;
run;
PROC SQL;
CREATE TABLE OVERLAP AS
SELECT HAVE.*, HAVE2.ROWID AS ROWID2, HAVE2.ID AS ID2, HAVE2.START_DT AS START2, HAVE2.DAYS_SUPP AS DAYS2, HAVE2.END_DT AS END2
, (SELECT COUNT(*) FROM alldates WHERE DATE >= HAVE.START_DT AND DATE >= HAVE2.START_DT AND DATE <= HAVE.END_DT AND DATE <= HAVE2.END_DT) AS OVERLAP
FROM HAVE, HAVE AS HAVE2
WHERE HAVE.ROWID NE HAVE2.ROWID
AND HAVE.ID = HAVE2.ID
AND (HAVE.START_DT <= HAVE2.END_DT) and (HAVE2.START_DT <= HAVE.END_DT);
QUIT;
Thank you very much for your help once again. I was wondering if there is a way to modify the sql code so that it outputs 2 additional columns containing the 2 dates that were used to compute that difference/overlapping days. I can do it for two combinations in a simple separate data step, but I found that it gets quite hairy when considering 3-5 combinations. Thank you very much!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.