BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
m_o
Obsidian | Level 7 m_o
Obsidian | Level 7

Hello,

I am trying to define concomitant use of drugs. 

I have a great solution to my problem with a SAS data step at the link below:

https://communities.sas.com/t5/Base-SAS-Programming/Concomitant-drug-medication-use/m-p/351710#M8188...

However, the only issue is that I have a very large dataset (~600K rows) and those codes require a ton of memory and my computer crashes. I wonder if anybody could suggest an alternative, perhaps a proc sql solution, that would not require as much memory and computational capacity. 

Thank you very much! Any help would be greatly appreciated!

 

Katie

 

1 ACCEPTED SOLUTION

Accepted Solutions
thomp7050
Pyrite | Level 9

This is not exactly elegant but it works:

 

data have;  
input ID DRUG $ START_DT MMDDYY10. DAYS_SUPP  END_DT MMDDYY10. ;
FORMAT START_DT END_DT MMDDYY10.;
cards; 
1	A	2/17/10	30	3/19/10
1	B	5/6/09	30	6/5/09
1	C	7/9/11	60	9/7/11
1	E	3/1/10	90	5/30/10
1	B	1/1/09	90	4/1/09
1	D	2/1/09	30	3/3/09
1	C	5/6/12	90	8/4/12
2	B	4/1/12	60	5/31/12
2	A	7/1/10	30	7/31/10
2	C	8/3/10	90	11/1/10
2	D	11/1/13	90	1/30/14
2	E	12/5/13	90	3/5/14
2	A	2/1/11	90	5/2/11
;
RUN;

PROC SQL;
CREATE TABLE HAVE AS
SELECT MONOTONIC() AS ROWID, * FROM HAVE;
QUIT;

%let fromDate = 01Jan2000;
%let toDate = 31DEC2016;
data alldates;
  length key 8; 
  do date = "&fromDate"d to "&toDate"d; 
    key + 1; 
    RSA_WorkdayInd = ( 2 <= weekday(date) <= 6);
    WeekendInd = (not RSA_WorkdayInd);
    Calendar_Week_Number = week(date, "V");
    DayOfWeek = put(date, downame3.);
    Fin_Year = year(intnx("year.7", date, 0));
    output; 
  end; 

  format 
    date date9.
  ;
run; 

PROC SQL;
CREATE TABLE OVERLAP AS
SELECT HAVE.*, HAVE2.ROWID AS ROWID2, HAVE2.ID AS ID2, HAVE2.START_DT AS START2, HAVE2.DAYS_SUPP AS DAYS2, HAVE2.END_DT AS END2 
, (SELECT COUNT(*) FROM alldates WHERE DATE >= HAVE.START_DT AND DATE >= HAVE2.START_DT AND DATE <= HAVE.END_DT AND DATE <= HAVE2.END_DT) AS OVERLAP
FROM HAVE, HAVE AS HAVE2
WHERE HAVE.ROWID NE HAVE2.ROWID
AND HAVE.ID = HAVE2.ID
AND (HAVE.START_DT <= HAVE2.END_DT) and (HAVE2.START_DT <= HAVE.END_DT);
QUIT;

View solution in original post

12 REPLIES 12
Reeza
Super User

600K rows should not be causing issues. 

Can you identify which step causes the issues?

m_o
Obsidian | Level 7 m_o
Obsidian | Level 7
The first step. Thank you.
PGStats
Opal | Level 21

The first step can be slightly optimized as:

 

data days(sortedby=id);
length drug $1;
set have;
do day = start_dt to end_dt;
    output;
    end;
format day yymmdd10.;
keep id drug day;
run;

The sortedby= dataset option might improve the sort performance of the following step.

Droping the useless variable dum saves space. (It was a remnant of a previous version of the code)

Defining the length of drug as 1 character will also save space.

 

See if you can get the first step and sort to work. If so, we will try to optimize the rest.

PG
m_o
Obsidian | Level 7 m_o
Obsidian | Level 7
Thank you very much for your response. I tried as you suggested but my SAS stopped working after processing about a billion rows. I might go the proc sql route instead. Thanks again!
Astounding
PROC Star

Regarding your link ...

 

The first step is a good idea at least in theory.  But it does expand the data set to perhaps 40M rows.  Is that where the process breaks down?

 

The second step is also a good idea (sorting the 40M rows).  Again, is this the source of the problem?

 

I'm hoping the problems lie in the DATA step that follows sorting.  Questions for that:

 

  • Are the drug names really A, B, C, D, E?  (If not, how many characters are required for each name?)
  • Do you know the maximum number of drugs per patient?  (It might be 5, but I just wanted to check.)
  • If there is a combination of AB for 10 days, then CD for the next 20 days, does that still count as a block of 30 days with 2+ drugs?  (If not, a little more explanation would be in order.)

 

 

m_o
Obsidian | Level 7 m_o
Obsidian | Level 7

Thank you for your reply. Yes, the process breaks down at the first step due to the large number of rows it tries to generate. Regarding your second point, if a patient has been on combination AB for 10 days and combination CD for 20 days, then I would disregard those combinations because in both cases, the patient has been on a unique drug combination for <30 days. However, if a patient has been on AB 10 days but ABC for >=30 days, then I will count ABC.

I am not sure about the maximum number of overlaps. Probably about 5-6. I will shorten the names of the drugs to single letters or single digits in place of the full drug name. Thank you very much!

Reeza
Super User

I still don't understand how it would crash. 40 million rows shouldn't be too much, it's annoying and large - takes about <20 mins to process analytically our sites on a standard desktop. If its failing at the sort, you could change the sortsize option to help with that issue.

thomp7050
Pyrite | Level 9

For your consideration, proc sql code that displays number of days overlap among single-subject drug periods that overlap:

 

data have;  
input ID DRUG $ START_DT MMDDYY10. DAYS_SUPP  END_DT MMDDYY10. ;
FORMAT START_DT END_DT MMDDYY10.;
cards; 
1	A	2/17/10	30	3/19/10
1	B	5/6/09	30	6/5/09
1	C	7/9/11	60	9/7/11
1	E	3/1/10	90	5/30/10
1	B	1/1/09	90	4/1/09
1	D	2/1/09	30	3/3/09
1	C	5/6/12	90	8/4/12
2	B	4/1/12	60	5/31/12
2	A	7/1/10	30	7/31/10
2	C	8/3/10	90	11/1/10
2	D	11/1/13	90	1/30/14
2	E	12/5/13	90	3/5/14
2	A	2/1/11	90	5/2/11
;
RUN;

PROC SQL;
CREATE TABLE HAVE AS
SELECT MONOTONIC() AS ROWID, * FROM HAVE;
QUIT;

PROC SQL;
CREATE TABLE OVERLAP AS
SELECT HAVE.*, HAVE2.ROWID AS ROWID2, HAVE2.ID AS ID2, HAVE2.START_DT AS START2, HAVE2.DAYS_SUPP AS DAYS2, HAVE2.END_DT AS END2 
, CASE WHEN HAVE.START_DT > HAVE2.START_DT THEN HAVE2.END_DT - HAVE.START_DT  ELSE  HAVE.END_DT -  HAVE2.START_DT END AS OVERLAP
FROM HAVE, HAVE AS HAVE2
WHERE HAVE.ROWID NE HAVE2.ROWID
AND HAVE.ID = HAVE2.ID
AND (HAVE.START_DT <= HAVE2.END_DT) and (HAVE2.START_DT <= HAVE.END_DT);
QUIT;
m_o
Obsidian | Level 7 m_o
Obsidian | Level 7

Thank you for your response. This is very helpful. I noticed an error in your output in that, in rows 3 and 4 of the output, the actual overlap duration between the 2 drugs are between 2/1/2009 and 3/3/2009, which is 30 days, not 59 days. I was wondering how I could use CASE WHEN to take care of the other two scenarios where duration of drugA can be completely within duration of drugB and vice versa.

thomp7050
Pyrite | Level 9

This is not exactly elegant but it works:

 

data have;  
input ID DRUG $ START_DT MMDDYY10. DAYS_SUPP  END_DT MMDDYY10. ;
FORMAT START_DT END_DT MMDDYY10.;
cards; 
1	A	2/17/10	30	3/19/10
1	B	5/6/09	30	6/5/09
1	C	7/9/11	60	9/7/11
1	E	3/1/10	90	5/30/10
1	B	1/1/09	90	4/1/09
1	D	2/1/09	30	3/3/09
1	C	5/6/12	90	8/4/12
2	B	4/1/12	60	5/31/12
2	A	7/1/10	30	7/31/10
2	C	8/3/10	90	11/1/10
2	D	11/1/13	90	1/30/14
2	E	12/5/13	90	3/5/14
2	A	2/1/11	90	5/2/11
;
RUN;

PROC SQL;
CREATE TABLE HAVE AS
SELECT MONOTONIC() AS ROWID, * FROM HAVE;
QUIT;

%let fromDate = 01Jan2000;
%let toDate = 31DEC2016;
data alldates;
  length key 8; 
  do date = "&fromDate"d to "&toDate"d; 
    key + 1; 
    RSA_WorkdayInd = ( 2 <= weekday(date) <= 6);
    WeekendInd = (not RSA_WorkdayInd);
    Calendar_Week_Number = week(date, "V");
    DayOfWeek = put(date, downame3.);
    Fin_Year = year(intnx("year.7", date, 0));
    output; 
  end; 

  format 
    date date9.
  ;
run; 

PROC SQL;
CREATE TABLE OVERLAP AS
SELECT HAVE.*, HAVE2.ROWID AS ROWID2, HAVE2.ID AS ID2, HAVE2.START_DT AS START2, HAVE2.DAYS_SUPP AS DAYS2, HAVE2.END_DT AS END2 
, (SELECT COUNT(*) FROM alldates WHERE DATE >= HAVE.START_DT AND DATE >= HAVE2.START_DT AND DATE <= HAVE.END_DT AND DATE <= HAVE2.END_DT) AS OVERLAP
FROM HAVE, HAVE AS HAVE2
WHERE HAVE.ROWID NE HAVE2.ROWID
AND HAVE.ID = HAVE2.ID
AND (HAVE.START_DT <= HAVE2.END_DT) and (HAVE2.START_DT <= HAVE.END_DT);
QUIT;
m_o
Obsidian | Level 7 m_o
Obsidian | Level 7
Thank you! This works! I will try to modify the sql code to also identify three or more overlaps, but this is wonderful. Thank you very much!! 🙂
m_o
Obsidian | Level 7 m_o
Obsidian | Level 7

Thank you very much for your help once again. I was wondering if there is a way to modify the sql code so that it outputs 2 additional columns containing the 2 dates that were used to compute that difference/overlapping days. I can do it for two combinations in a simple separate data step, but I found that it gets quite hairy when considering 3-5 combinations. Thank you very much!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 6004 views
  • 2 likes
  • 5 in conversation