Hi everyone,
I've done quite a bit of work in SAS with dates and drug overlap, but this one has me stumped. I am looking at combinations of drugs. Each combination includes a main drug, M, and then there are three other drugs - A, E, and R - that can be combined. Each person will be on drug M the entire time, but the additional drugs can be combined in many different ways.
Currently, the data is setup so that I have several rows per patient. Each row lists the drug regimen along with its start and end dates. However, some of them overlap (e.g. M+E will overlap partially with M+E+R, but there could be unique M+E time prior to/after the M+E+R time). What I want are the order and lengths of time that patients were on unique drug regimens. A picture is probably helpful here:
(Note: I don't have the boundaries demarcated by the black dotted lines in the data, I just put them in that part of the diagram to make it easier to visualize).
Initially, I tried putting the regimens in order of start date (and end date so that if they started on the same date, the one that ended first would come first) and used lag to look at when the previous segment started/ended. I then reversed it and used lag to find when the next regimen started/ended. However, that didn't give me all of the info I needed. For the example I posted above, if I'm just looking at previous and next drug, it would look like there is a unique period of M+E after the M+E+R period ends. However, I need to look a bit further down to discover that there is no unique M+E time because drug A is added to the mix.
It seems like every time I look at another pattern in the data, another complicated situation comes up. For example, someone could take M+E alone for 2 months, then add A for 2 months, go back down to just M+E for another month, add R for 2 months, and then go back to M+E. They would have 5 distinct periods: M+E, M+E+A, M+E, M+E+R, and M+E.
That's another complication - each patient could have anywhere from 1 period (e.g. they only took M+R for 3 months and that's it, easy) up to...I think the most I've seen looking through it is 7, but I haven't done the math to determine the absolute maximum.
A coworker suggested creating an array to look at what drug is being used at any given time. However, this data spans 2006-2014, and the person I'm working with is interested in knowing how many days they were on each regimen. That would be over 3,000 days with over 7,000 patients in the dataset! I suppose if that's the best way to do it, I can, but it would be great to find another way. Also, there is a very wide variety in the amount of time people spend on these drugs. The minimum is 28 days and the maximum is several years.
Here's a fake dataset I made to illustrate some of the trickier cases that I've come across:
data drugs; infile datalines; input id:$1. regimen:$7. start:MMDDYY10. end:MMDDYY10.; format start MMDDYY10. end MMDDYY10.; datalines; A M+E 6/1/2010 11/15/2010 A M+E+R 6/1/2010 9/1/2010 A M+E+R+A 8/1/2010 9/1/2010 A M+A 8/1/2010 12/15/2010 A M+E+A 8/1/2010 11/15/2010 B M+E 1/1/2012 9/1/2012 B M+E+R 3/1/2012 7/1/2012 C M+E+R 1/1/2011 4/1/2011 C M+E 1/1/2011 9/1/2011 C M+E+A 6/1/2011 7/1/2011 D M+R 8/1/2012 10/1/2012 D M+R+E 8/3/2012 10/1/2012 D M+E 8/3/2012 12/15/2012 D M+E+A 10/15/2012 12/1/2012 E M+E 5/1/2011 3/1/2012 E M+E+R 5/15/2011 3/1/2012 E M+R 5/15/2011 5/1/2012 E M+E+R+A 8/1/2011 9/1/2011 ;
If anyone has any advice, I would really appreciate it. Please let me know if there is anything I can clarify.
The array solution might be efficient enough if you are not doing this transformation on a routine basis. Here is how it can be done in a single data step (plus a little preparation) :
/* get the array bounds */
proc sql;
select min(start)-1, max(end)+1 into :start, :end
from drugs;
quit;
data want;
array _d{&start.:&end.} $7 _temporary_;
array _n{&start.:&end.} _temporary_;
call missing(of _d{*});
call missing(of _n{*});
do until(last.id);
    set drugs; by id;
    nbDrugs = 1 + countc(regimen, "+");
    do d = start to end;
        if nbDrugs > _n{d} then do;
            _d{d} = regimen;
            _n{d} = nbDrugs;
            end;
        end;
    end;
call missing(start);
do d = lbound(_d)+1 to hbound(_d);
    if _d{d} ne _d{d-1} then do;
        if not missing(start) then do;
            end = d-1;
            regimen = _d{d-1};
            output;
            end;
        if not missing(_d{d}) then start = d;
        else call missing(start);
        end;
    end;
keep id regimen start end;
run;id regimen start end A M+E+R 2010-06-01 2010-07-31 A M+E+R+A 2010-08-01 2010-09-01 A M+E+A 2010-09-02 2010-11-15 A M+A 2010-11-16 2010-12-15 B M+E 2012-01-01 2012-02-29 B M+E+R 2012-03-01 2012-07-01 B M+E 2012-07-02 2012-09-01 C M+E+R 2011-01-01 2011-04-01 C M+E 2011-04-02 2011-05-31 C M+E+A 2011-06-01 2011-07-01 C M+E 2011-07-02 2011-09-01 D M+R 2012-08-01 2012-08-02 D M+R+E 2012-08-03 2012-10-01 D M+E 2012-10-02 2012-10-14 D M+E+A 2012-10-15 2012-12-01 D M+E 2012-12-02 2012-12-15 E M+E 2011-05-01 2011-05-14 E M+E+R 2011-05-15 2011-07-31 E M+E+R+A 2011-08-01 2011-09-01 E M+E+R 2011-09-02 2012-03-01 E M+R 2012-03-02 2012-05-01
data drugs;
infile datalines;
input id:$1. regimen:$7. start:MMDDYY10. end:MMDDYY10.;
format start MMDDYY10. end day MMDDYY10.;
do day = start to end;
output;
end;
datalines;
A M+E 6/1/2010 11/15/2010
A M+E+R 6/1/2010 9/1/2010
A M+E+R+A 8/1/2010 9/1/2010
A M+A 8/1/2010 12/15/2010
A M+E+A 8/1/2010 11/15/2010
B M+E 1/1/2012 9/1/2012
B M+E+R 3/1/2012 7/1/2012
C M+E+R 1/1/2011 4/1/2011
C M+E 1/1/2011 9/1/2011
C M+E+A 6/1/2011 7/1/2011
D M+R 8/1/2012 10/1/2012
D M+R+E 8/3/2012 10/1/2012
D M+E 8/3/2012 12/15/2012
D M+E+A 10/15/2012 12/1/2012
E M+E 5/1/2011 3/1/2012
E M+E+R 5/15/2011 3/1/2012
E M+R 5/15/2011 5/1/2012
E M+E+R+A 8/1/2011 9/1/2011
;;;
run;
proc means data=drugs nway missing noprint; 
class id day;
var day;
output out=test n=;
run;
proc sort data=drugs out=drugs2 nodupkey;
by id regimen start end;
run;
data test2;
set test;
if _freq_=1;
run;
proc sql;
create table test3 as
select a.*,b.day
from drugs2(drop=day) a,
test2 b
where a.id=b.id and
b.day between a.start and a.end;
quit;
proc means data=test3 nway missing noprint; 
class id regimen start end;
var day;
output out=test4 min(day) = uniq_str max(day)=uniq_end;
run;data drugs;
infile datalines;
input id:$1. regimen:$7. start:MMDDYY10. end:MMDDYY10.;
format start MMDDYY10. end day MMDDYY10.;
do day = start to end;
 output;
end;
datalines;
A	M+E	6/1/2010	11/15/2010
A	M+E+R	6/1/2010	9/1/2010
A	M+E+R+A	8/1/2010	9/1/2010
A	M+A	8/1/2010	12/15/2010
A	M+E+A	8/1/2010	11/15/2010
B	M+E	1/1/2012	9/1/2012
B	M+E+R	3/1/2012	7/1/2012
C	M+E+R	1/1/2011	4/1/2011
C	M+E	1/1/2011	9/1/2011
C	M+E+A	6/1/2011	7/1/2011
D	M+R	8/1/2012	10/1/2012
D	M+R+E	8/3/2012	10/1/2012
D	M+E	8/3/2012	12/15/2012
D	M+E+A	10/15/2012	12/1/2012
E	M+E	5/1/2011	3/1/2012
E	M+E+R	5/15/2011	3/1/2012
E	M+R	5/15/2011	5/1/2012
E	M+E+R+A	8/1/2011	9/1/2011
;;;
run;
proc means data=drugs nway missing noprint; 
class id day;
var day;
output out=test n=;
run;
proc sort data=drugs out=drugs2 nodupkey;
by id regimen start end;
run;
data test2;
set test;
if _freq_=1;
run;
proc sql;
create table test3 as
select a.*,b.day
from drugs2(drop=day) a,
test2 b
where a.id=b.id and
 b.day between a.start and a.end;
quit;
proc means data=test3 nway missing noprint; 
class id regimen start end;
var day;
output out=test4 min(day) = uniq_str max(day)=uniq_end;
run;The array solution might be efficient enough if you are not doing this transformation on a routine basis. Here is how it can be done in a single data step (plus a little preparation) :
/* get the array bounds */
proc sql;
select min(start)-1, max(end)+1 into :start, :end
from drugs;
quit;
data want;
array _d{&start.:&end.} $7 _temporary_;
array _n{&start.:&end.} _temporary_;
call missing(of _d{*});
call missing(of _n{*});
do until(last.id);
    set drugs; by id;
    nbDrugs = 1 + countc(regimen, "+");
    do d = start to end;
        if nbDrugs > _n{d} then do;
            _d{d} = regimen;
            _n{d} = nbDrugs;
            end;
        end;
    end;
call missing(start);
do d = lbound(_d)+1 to hbound(_d);
    if _d{d} ne _d{d-1} then do;
        if not missing(start) then do;
            end = d-1;
            regimen = _d{d-1};
            output;
            end;
        if not missing(_d{d}) then start = d;
        else call missing(start);
        end;
    end;
keep id regimen start end;
run;id regimen start end A M+E+R 2010-06-01 2010-07-31 A M+E+R+A 2010-08-01 2010-09-01 A M+E+A 2010-09-02 2010-11-15 A M+A 2010-11-16 2010-12-15 B M+E 2012-01-01 2012-02-29 B M+E+R 2012-03-01 2012-07-01 B M+E 2012-07-02 2012-09-01 C M+E+R 2011-01-01 2011-04-01 C M+E 2011-04-02 2011-05-31 C M+E+A 2011-06-01 2011-07-01 C M+E 2011-07-02 2011-09-01 D M+R 2012-08-01 2012-08-02 D M+R+E 2012-08-03 2012-10-01 D M+E 2012-10-02 2012-10-14 D M+E+A 2012-10-15 2012-12-01 D M+E 2012-12-02 2012-12-15 E M+E 2011-05-01 2011-05-14 E M+E+R 2011-05-15 2011-07-31 E M+E+R+A 2011-08-01 2011-09-01 E M+E+R 2011-09-02 2012-03-01 E M+R 2012-03-02 2012-05-01
Before I saw @PGStats 's solution, I made some adaptations to what @smantha suggested. This is what I got to work, but @PGStats gave a much more elegant solution!
proc means data=drugs nway missing noprint; 
class id day;
var day;
output out=test n=;
run;
proc sort data=drugs out=drugs2 nodupkey;
by id regimen start end;
run;
*figure out when the number of regimens switches;
data test2;
set test;
by id;
lagfreq = lag(_freq_);
if first.id then lagfreq = .;
if _freq_ = lagfreq then delete;
run;
proc sql;
create table test3 as
select a.*, b.regimen
from test2 a
left join drugs b
on (a.id=b.id) and (a.day=b.day);
quit;
*if a given day has multiple combinations of drugs, take the most complicated combo;
data test3;
set test3;
if regimen in ("M+E", "M+A", "M+R") then combo_n = 2;
if regimen in ("M+E+R", "M+E+A", "M+R+A") then combo_n = 3;
if regimen = "M+E+R+A" then combo_n = 4;
run;
proc sql;
create table test4 as
select *, max(combo_n) as maxcombo
from test3
group by id, day
having combo_n = maxcombo;
quit;
*determine when the next regimen started;
proc sort data=test4; by id descending day; run;
data test5;
set test4;
format nextdate MMDDYY10.;
by id;
nextdate=lag(day);
if first.id then nextdate=.;
run;
proc sort data=test5; by id day; run;
*get the final end date for each person;
proc sql;
create table ends as
select id, max(end) as final format = MMDDYY10.
from drugs2
group by id
having end=final;
quit;
proc sql;
create table regimens as
select a.id, a.regimen, a.day, a.nextdate, b.final
from test5 a
left join ends b
on (a.id=b.id);
quit;
data regimens;
set regimens;
format start end MMDDYY10.;
start = day;
if nextdate = . then end = final;
else end = nextdate;
regimentime = end - start;
drop day nextdate final;
run;Thanks to both of you for your help!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
