Hello everyone,
I have a data like this:
patient_ID drug_dose date_start date_stop
1 10 04/26/2008 11/10/2008
1 5 06/16/2008 02/02/2009
1 7 10/13/2008 11/12/2008
2 3 04/30/2008 10/27/2008
2 11 08/17/2008 05/14/2009
...
where all start/stop dates have closed intervals. Then, I want to have a data like this:
patient_ID drug_dose date_start date_stop
1 10 04/26/2008 06/15/2008
1 15 06/16/2008 10/12/2008
1 22 10/13/2008 11/10/2008
1 12 11/11/2008 11/12/2008
1 5 11/13/2008 02/02/2009
2 3 04/30/2008 08/16/2008
2 14 08/17/2008 10/27/2008
2 11 10/28/2008 05/14/2009
...
In other words, I want to obtain the whole drug dosage for every time interval for each patient.
I'd be very thankful if anyone has any idea/suggestion!
OK then, the data set is large, but 3 years is not that many days. So here goes an array solution:
data have;
input patient_ID drug_dose (date_start date_stop) (:mmddyy10.);
format date_start date_stop yymmdd10.;
datalines;
1 10 04/26/2008 11/10/2008
1 5 06/16/2008 02/02/2009
1 7 10/13/2008 11/12/2008
2 3 04/30/2008 10/27/2008
2 11 08/17/2008 05/14/2009
;
data want;
/* Array subscripts should start before earliest date_start and
end after latest date_stop */
%let dstart=%sysfunc(mdy(12,31,2007));
%let dend=%sysfunc(mdy(1,1,2010));
array d{&dstart. : &dend.};
do until(last.patient_id);
set have; by patient_id;
do date = date_start to date_stop;
d{date} = sum(d{date}, drug_dose);
end;
end;
do date = &dstart.+1 to &dend.;
if d{date} ne d{date-1} then do;
if not missing(date_start) then do;
date_stop = date-1;
drug_dose = d{date-1};
if drug_dose > 0 then output;
call missing(date_start, drug_dose);
end;
date_start = date;
end;
end;
format date yymmdd10.;
keep patient_id date_start date_stop drug_dose;
run;
If your data isn't too huge, you can simply expand and sum, like this:
data have;
input patient_ID drug_dose (date_start date_stop) (:mmddyy10.);
format date_start date_stop yymmdd10.;
datalines;
1 10 04/26/2008 11/10/2008
1 5 06/16/2008 02/02/2009
1 7 10/13/2008 11/12/2008
2 3 04/30/2008 10/27/2008
2 11 08/17/2008 05/14/2009
;
data expand;
set have;
do date = date_start to date_stop;
output;
end;
format date yymmdd10.;
keep patient_id date drug_dose;
run;
proc sql;
create table sumDoses as
select patient_id, date, sum(drug_dose) as drug_dose
from expand
group by patient_id, date
order by patient_id, date;
quit;
data want;
set sumDoses;
by patient_id drug_dose notsorted;
retain date_start;
if first.drug_dose then date_start = date;
if last.drug_dose and drug_dose > 0 then do;
date_stop = date;
output;
end;
format date_start date_stop yymmdd10.;
keep patient_id drug_dose date_start date_stop;
run;
Otherwise, if your time period isn't too long, you could use an array (one element per day).
@PGStats Thank you very much! However, based on your comments, my data has 2 problems: I have >= 40 million rows, and the time period is 3 years. Any recommendation?
OK then, the data set is large, but 3 years is not that many days. So here goes an array solution:
data have;
input patient_ID drug_dose (date_start date_stop) (:mmddyy10.);
format date_start date_stop yymmdd10.;
datalines;
1 10 04/26/2008 11/10/2008
1 5 06/16/2008 02/02/2009
1 7 10/13/2008 11/12/2008
2 3 04/30/2008 10/27/2008
2 11 08/17/2008 05/14/2009
;
data want;
/* Array subscripts should start before earliest date_start and
end after latest date_stop */
%let dstart=%sysfunc(mdy(12,31,2007));
%let dend=%sysfunc(mdy(1,1,2010));
array d{&dstart. : &dend.};
do until(last.patient_id);
set have; by patient_id;
do date = date_start to date_stop;
d{date} = sum(d{date}, drug_dose);
end;
end;
do date = &dstart.+1 to &dend.;
if d{date} ne d{date-1} then do;
if not missing(date_start) then do;
date_stop = date-1;
drug_dose = d{date-1};
if drug_dose > 0 then output;
call missing(date_start, drug_dose);
end;
date_start = date;
end;
end;
format date yymmdd10.;
keep patient_id date_start date_stop drug_dose;
run;
@PGStats Thank you very much again!
When I run your code for the period from 12/31/2007 to 1/7/2011, it only gave the results for date_start = '01/01/2008' and date_stop = "01/02/2008'.
It also gave me this error i nthe log window:
ERROR: Array subscript out of range at line 317 column 23.
--> line 317 was related to this line of code: d{date} = sum(d{date}, MME_day);
d1=. d2=. d3=. d4=. d5=. .....d1099=0 d1100=0 d1101=0 d1102=0 d1103=0 d1104=0
last.ENROLID=1 ENROLID=41102 MME_day=0
date_start=08/10/2010 date_stop=02/06/2011
FIRST.ENROLID=0 date=01/08/2011 _ERROR_=1 _N_=7
Given that you have a date_stop of 2011-02-06, I think you meant the array limits to be
%let dstart=%sysfunc(mdy(12,31,2007));
%let dend=%sysfunc(mdy(7,1,2011));
Make sure these date limits encompass all of your data.
@PGStats Thank you very much! It resolved my problem.
If you don't mind, I have another question: If I have another drug (lets call it drug 2) with different dose (i.e., drug2_dose), how the code above will be changed?
How would you represent cases where a start-stop pair applies to one drug and not the other in your data structure?
Extension to two drugs is pretty straitforward. The data could also include zeros for drug1_dose or drug2_dose if only one was given over a certain period.
data have;
input patient_ID drug1_dose drug2_dose (date_start date_stop) (:mmddyy10.);
format date_start date_stop yymmdd10.;
datalines;
1 10 1 04/26/2008 11/10/2008
1 5 15 06/16/2008 02/02/2009
1 7 17 10/13/2008 11/12/2008
2 3 13 04/30/2008 10/27/2008
2 11 1 08/17/2008 05/14/2009
;
data want;
/* Array subscripts should start before earliest date_start and
end after latest date_stop */
%let dstart=%sysfunc(mdy(12,31,2007));
%let dend=%sysfunc(mdy(7,1,2011));
array d1{&dstart. : &dend.};
array d2{&dstart. : &dend.};
do until(last.patient_id);
set have; by patient_id;
do date = date_start to date_stop;
d1{date} = sum(d1{date}, drug1_dose);
d2{date} = sum(d2{date}, drug2_dose);
end;
end;
do date = &dstart.+1 to &dend.;
if d1{date} ne d1{date-1} or d2{date} ne d2{date-1} then do;
if not missing(date_start) then do;
date_stop = date-1;
drug1_dose = d1{date-1};
drug2_dose = d2{date-1};
if drug1_dose > 0 or drug2_dose > 0 then output;
call missing(date_start, drug1_dose, drug2_dose);
end;
date_start = date;
end;
end;
format date yymmdd10.;
keep patient_id date_start date_stop drug1_dose drug2_dose;
run;
@PGStats Regarding the last code, what happens if I have another variable. Let say, this is the data that I have (where I add variable "category"):
data have;
input patient_ID category drug_dose (date_start date_stop) (:mmddyy10.);
format date_start date_stop yymmdd10.;
datalines;
1 1 10 04/26/2008 11/10/2008
1 1 5 06/16/2008 02/02/2009
1 2 7 10/13/2008 11/12/2008
2 1 3 04/30/2008 10/27/2008
2 2 11 08/17/2008 05/14/2009
;
and this the data I want to have
data want;
input patient_ID category drug_dose (date_start date_stop) (:mmddyy10.);
format date_start date_stop yymmdd10.;
datalines;
1 1 10 04/26/2008 06/15/2008
1 1 15 06/16/2008 11/10/2008
1 1 5 11/11/2008 02/02/2009
1 2 7 10/13/2008 11/12/2008
2 1 3 04/30/2008 10/27/2008
2 2 11 08/17/2008 05/14/2009
;
I tried the second option you provided (using sql and expanding the data), but my data is very huge and I run out of space. So, I'd be very thankful if you direct me in this case.
If your category takes values 1 and 2 then the last solution I provided (for drug1 and drug2) is simple to adapt and will not consume much memory.
You could also process each category separately for each patient by sorting by patientId and category and loop until(last.category).
If you run into problems, post the code you have tried and the messages and/or results that you got.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.