Solved: Re: Calculating Cumulative days with Overlap

serena13lee · Posted 07-14-2019 01:36 AM

I have a data set of patients that start their medication on a certain date and end on a certain date. I would like to calculate the cumulative number of days that they were on medications given that some of them took more than one medication.

data have;
    input ID $ medication	start_date :mmddyy10. end_date :mmddyy10.;
    datalines;
A001 1 8/26/2015 9/3/2015
A001 1 9/1/2015 9/17/2017
A002 1 10/2/2015 10/2/2015
A003 1 10/30/2015 10/30/2015
A003 2 11/24/2015 11/24/2015
A004 1 3/17/2016 3/25/2016
A004 2 3/24/2016 3/27/2016
A004 3 3/31/2016 3/31/2016
;

In this case, patient A003 took two medications without overlapping but patient A004 took 3 overlapping medications. How can I create a 5th column that lists the number of cummulative days that they've taken medications? The only idea I have is to create a flag that the patient is taking multiple drugs and that I should subtract the last end date from the first start date, but I wouldn't know how to get rid of the days that the patients didn't take medication in between.

FreelanceReinh · Posted 07-14-2019 07:31 AM

Hi @serena13lee,

If you just need the cumulative number of days on any drug, you can keep the code simple (at the expense of several seconds of run time for an input dataset with, say, 500,000 observations from 50,000 patients with a mean duration of 10 years on drug):

/* Determine earliest start date and latest end date */

proc sql noprint;
select min(start_date), max(end_date)
into :firstdate, :lastdate
from have;
quit;

/* Compute cumulative number of days on drug */

data want(drop=_:);
array _d[&firstdate:&lastdate];
do until(last.id);
  set have;
  by id;
  do _i=start_date to end_date;
    _d[_i]=1;
  end;
  days=sum(of _d[*]);
  output;
end;
label days='Cumulative number of days on any drug';
run;

View solution in original post

koyelghosh · Posted 07-14-2019 04:01 AM

Please see the screenshot, if this is what you wanted. There is column for CumulativeDaysPerMed (number of days on each medicine) and CumulativeDaysPerPatient (number of days for each patient for all medicines). If this is what you wanted then the code follows the image.

Output

DATA WANT (DROP=CurrDay);
	SET have;
	FORMAT start_date mmddyy10. end_date mmddyy10. CurrDay 8. CumulativeDayPerMed 8.;
	BY ID medication;
	RETAIN CumulativeDayPerPatient 0;
	RETAIN CumulativeDayPerMed 0;
	IF LAST.medication AND LAST.ID THEN DO;
		CurrDay = INTCK('day',start_date,end_date);
		CumulativeDayPerMed + CurrDay;
		CumulativeDayPerPatient + CumulativeDayPerMed;
		OUTPUT;
		CumulativeDayPerMed = 0;
		CumulativeDayPerPatient = 0;
	END;
	ELSE DO;
		CurrDay = INTCK('day',start_date,end_date);
		CumulativeDayPerMed + CurrDay;
		CumulativeDayPerPatient + CumulativeDayPerMed;
		OUTPUT;
	END;	
RUN;
	
PROC PRINT DATA=Want;
RUN;

Please let me know if I was able to read the problem requirement correctly.

Thank you.

koyelghosh · Posted 07-14-2019 04:56 AM

@serena13lee I am sorry, my previous post was not correct, for two reasons. Please see the code and the output as below.

The output of the modified code is below

Modified output

DATA WANT (DROP=CurrDay);
	SET have;
	FORMAT start_date mmddyy10. end_date mmddyy10. CurrDay 8. CumulativeDayPerMed 8. CumulativeDayPerPatient 8.;
	BY ID medication;
	RETAIN CumulativeDayPerPatient 0;
	RETAIN CumulativeDayPerMed 0;
	IF LAST.medication AND NOT LAST.ID THEN DO;
		CurrDay = INTCK('day',start_date,end_date)+1;
		CumulativeDayPerMed + CurrDay;
		CumulativeDayPerPatient + CumulativeDayPerMed;
		OUTPUT;
		CumulativeDayPerMed = 0;
	END;
	ELSE IF LAST.medication AND LAST.ID THEN DO;
		CurrDay = INTCK('day',start_date,end_date)+1;
		CumulativeDayPerMed + CurrDay;
		CumulativeDayPerPatient + CumulativeDayPerMed;
		OUTPUT;
		CumulativeDayPerMed = 0;
		CumulativeDayPerPatient = 0;
	END;	
	ELSE DO;
		CurrDay = INTCK('day',start_date,end_date)+1;
		CumulativeDayPerMed + CurrDay;
		CumulativeDayPerPatient + CumulativeDayPerMed;
		OUTPUT;
	END;	
RUN;
	
PROC PRINT DATA=Want;
RUN;

Hopefully this will suit what you wanted.

singhsahab · Posted 07-14-2019 04:04 AM

Hello ,

Hope so below code will work for your requirement. Kindly let's know how you want to handle missing values.

data have;
    input ID $ medication	start_date :mmddyy10. end_date :mmddyy10.;
    datalines;
A001 1 8/26/2015 9/3/2015
A001 1 9/1/2015 9/17/2017
A002 1 10/2/2015 10/2/2015
A003 1 10/30/2015 10/30/2015
A003 2 11/24/2015 11/24/2015
A004 1 3/17/2016 3/25/2016
A004 2 3/24/2016 3/27/2016
A004 3 3/31/2016 3/31/2016
;
run;

proc sql;
create table want as
select a.id,a.medication,a.start_date format=date9.,a.end_date format=date9. from have as a , have as b
where (a.id=b.id and a.medication<>b.medication and a.start_date between b.start_date and b.end_date) 
or
(a.id=b.id and a.medication <> b.medication and b.start_date between a.start_date and a.end_date)
or
(a.id=b.id and a.medication <> b.medication and a.end_date  between b.end_date and b.start_date)
or
(a.id=b.id and a.medication <> b.medication and b.end_date  between a.end_date and a.start_date)
or
(a.id=b.id and a.medication <> b.medication and a.end_date=b.end_date and a.start_date=b.start_date)
or
(a.id=b.id and a.medication <> b.medication and b.end_date=a.end_date and b.start_date=a.start_date);
quit;

Thanks

🙂

Patrick · Posted 07-14-2019 06:59 AM

@serena13lee

Below code for how I understood your requirement.

data have;
    input ID $ medication	start_date :mmddyy10. end_date :mmddyy10.;
    format start_date end_date date9.;
    datalines;
A001 1 8/26/2015 9/3/2015
A001 1 9/1/2015 9/17/2017
A002 1 10/2/2015 10/2/2015
A003 1 10/30/2015 10/30/2015
A003 2 11/24/2015 11/24/2015
A004 1 3/17/2016 3/25/2016
A004 2 3/24/2016 3/27/2016
A004 3 3/31/2016 3/31/2016
;

proc sort data=have;
  by id start_date;
run;

data 
  inter(drop=cluster_start_date cluster_end_date)
  clusters(keep=id cluster_:);
  set have;
  by id start_date;
  format cluster_id 16. cluster_start_date cluster_end_date date9.;
  retain cluster_start_date cluster_end_date;

  if first.id then 
    do;
      cluster_id+1;
      cluster_start_date=start_date;
      cluster_end_date=end_date;
    end;
  else
  if start_date<=cluster_end_date then 
    do;
      cluster_end_date=max(cluster_end_date, end_date);
    end;
  else
    do;
      output clusters;
      cluster_id+1;
      cluster_start_date=start_date;
      cluster_end_date=end_date;
    end;
  if last.id then output clusters;

  output inter;
run;
  
data want;
  merge inter clusters;
  by id cluster_id;
  retain days_of_med_in_cluster days_of_med_cum;
  if first.cluster_id then 
    do;
      days_of_med_in_cluster= cluster_end_date-cluster_start_date+1;
      if first.id then days_of_med_cum=days_of_med_in_cluster;
      else days_of_med_cum=sum(days_of_med_cum,days_of_med_in_cluster);
    end;
run;

proc print data=want;
run;

FreelanceReinh · Posted 07-14-2019 07:31 AM

Hi @serena13lee,

If you just need the cumulative number of days on any drug, you can keep the code simple (at the expense of several seconds of run time for an input dataset with, say, 500,000 observations from 50,000 patients with a mean duration of 10 years on drug):

/* Determine earliest start date and latest end date */

proc sql noprint;
select min(start_date), max(end_date)
into :firstdate, :lastdate
from have;
quit;

/* Compute cumulative number of days on drug */

data want(drop=_:);
array _d[&firstdate:&lastdate];
do until(last.id);
  set have;
  by id;
  do _i=start_date to end_date;
    _d[_i]=1;
  end;
  days=sum(of _d[*]);
  output;
end;
label days='Cumulative number of days on any drug';
run;

hashman · Posted 07-14-2019 03:59 PM

@FreelanceReinh:

I more than second the approach (which I personally call "paint brushing" and routinely use in similar situations, particularly for calculating what is called "continuous enrollment" in the insurance industry). Some things I'd do differently, though:

If I understand your intent correctly, the reason you didn't make the array temporary is to let it be auto-reinitialized to missing values at the top of the implied loop each time before the DoW-loop starts iterating for the next BY group. I'd still make it temporary and add CALL MISSING before of after the DoW, which would be the more efficient compared to the non-temp array, the wider the date range should happen to be. In fact, since a temp array can easily swallow millions of items, it can be just made "large enough" without the need for the preliminary data pass determining the array bounds.
Instead of _d[_i]=1, _d[_i]+1 can be coded. This way, if the OP wants to count 1 day on N medications as N days, the SUM function will do it; otherwise if it to be counted as 1 regardless of N, the N function will work (in your code, N works just as well as SUM with the same result).
In version 9.3 (many orgs still run it), the lowest negative array bound is limited to -128. That is, you can compile an array with the lower bound -129 and lower error-free, yet at run time an array reference with the index lower than -128 will bomb with the error message "ERROR: Array subscript out of range"; this is true for both temp and non-temp arrays. To reliably guard against such an occurrence (i.e. dealing with dates 129 or more days before 01jan1960), the array bounds can be shifted up by &firstdate and the reference _d[_i] can be replaced with _d[_i+&firstdate] accordingly. (Interestingly, this curious 9.3 bug didn't exist before the transition from version 8 to version 9 and was fixed in 9.4.)

Kind regards

Paul D.

FreelanceReinh · Posted 07-15-2019 04:09 AM

@hashman:

Thank you very much, Paul, for your detailed comments.

I had tested both types of arrays using an input dataset with about 400,000 obs. and 50,000 patients who were 4,000 days on drug (where the run times were between 5 and 6 seconds). To my surprise, the solution using the temporary array (but not a DoW loop) seemed to be a bit slower (by only 0.4 seconds, so possibly not a significant difference) and because it also required slightly more code I settled for the ordinary array. The possibility of saving the preliminary PROC SQL step is a good point, though. I had included it to avoid any assumptions on &firstdate and &lastdate.
Yes, I should have used the N function, just because it's the simpler operation, hence likely more efficient. The implied RETAIN of the sum statement (+1) would have required the explicit initialization of the array (for what I assumed was the intended result), which I wanted to avoid.
This is interesting. Thanks for making me aware of this bug.

hashman · Posted 07-15-2019 11:25 AM

@FreelanceReinh: Thank you for continuing the interesting conversation.

"The implied RETAIN of the sum statement (+1) would have required the explicit initialization of the array (for what I assumed was the intended result), which I wanted to avoid."

Very true. Though it can also be avoided by using _d[_i]=sum(_d[_i],1) instead - a usual subterfuge in this kind of situation.

On a different note, a temp array can be reinitialized extremely rapidly to any value using CALL FILLMATRIX compiled into a custom call routine via FCMP - with the limitations that (a) the array is 1-based and (b) numeric. I've posted a way of doing it somewhere in this space not a very long time ago (though, as if often happens with pros, given the hint, it would take less time for you to code it yourself than to find my code).

Kind regards

Paul D.

FreelanceReinh · Posted 07-15-2019 01:37 PM

@hashman wrote:

@FreelanceReinh: Thank you for continuing the interesting conversation.

My pleasure!

... the explicit initialization of the array (...) can also be avoided by using _d[_i]=sum(_d[_i],1) instead - a usual subterfuge in this kind of situation.

Yes, sure. I think I've learned this technique about ten years ago from your (and Koen Vyverman's) inspiring 2009 paper "The DOW-Loop Unrolled" and I've used it many times.

@hashman wrote:
On a different note, a temp array can be reinitialized extremely rapidly to any value using CALL FILLMATRIX ....

I've posted a way of doing it somewhere in this space not a very long time ago ....

Correct. This was when we discussed my CALL STDIZE approach in Set All Array Values to 0. I'm looking forward to using this and other non-IML matrix functions and CALL routines. Thanks again.

hashman · Posted 07-16-2019 01:26 AM

@FreelanceReinh:

Ah, yeah. In 2008 Koen and I had reckoned that the DoW deserved its own dedicated paper after having been merely a section of "The Magnificent DO" since SESUG 2000; and so we conspired to write a paper and present it using the DATA step debugger rather than PP slides, with Koen at the keyboard and me talking and pointing at the screen. Perhaps it would've been better the other way around since Koen, fluent in 7 languages, speaks English better, not to mention that his Flemish accent is less intrusive than my Ukrainian. At any rate, I'm glad we did it to encapsulate the DoW's functionality in a paper devoted to it and nothing else.

Thank you for reminding of the CALL STDIZE thread. Your associative memory is obviously much better than mine. Another thing I must mention is that it was Roger DeAngelis who brought your method to my attention on SAS-L, which in turn spurned me to compare it with the APP and CALL FILLMATRIX. It's pretty incredible how much we can learn from each other here.

Kind regards

Paul D.

FreelanceReinh · Posted 07-16-2019 03:20 AM

@hashman wrote:

we ... present it using the DATA step debugger rather than PP slides, with Koen at the keyboard and me talking and pointing at the screen.

Nice idea!

I must mention ... that it was Roger DeAngelis who brought your method to my attention on SAS-L, ...

I didn't know that. Thanks. A quick Google search ("De Angelis" "CALL STDIZE") brought up this 13 days old GitHub entry: https://github.com/rogerjdeangelis/utl-set-each-element-of-an-array-to-zero-or-any-constant. I'm amazed how quickly content from this forum is being spread.

hashman · Posted 07-16-2019 03:48 AM

@FreelanceReinh:

Roger has been doing an amazing job over recent years compiling best solutions of data processing problems from both SAS community and SAS-L and comparing their SAS/R/Python solutions (and their hybrids thereof) on github. A truly wondrous wealth of info there, not to mention that for every problem, Roger goes to great lengths to reformulate it in strikingly lucid terms, replete with HAVE, WANT, and clearly stated processing rules in his own inimitable style.

Kind regards

Paul D.

serena13lee · Posted 07-22-2019 04:28 AM

Hi! Thank you for your solution. I am currently fitting you code to my data and it looks great. Unfortunately, I have a criteria which I'm not sure how to adapt your code to and was wondering if there might be a simple fix or if I should ask the community. So the criteria I had was performing the cumulative number of days on medication for patients that took medications for >=15 consecutive days . So in the case of patient A004, they took medication for 8 days, then 3 days, then 1 day (11 days total). Since they didn't take medication for >=15 consecutive days, I won't calculate how many days they took medication for. Sorry if this is confusing. Please let me know if you would like an example. Thanks.

FreelanceReinh · Posted 07-22-2019 10:36 AM

Hi @serena13lee,

Thanks for accepting my solution.

@serena13lee wrote:

(...) So in the case of patient A004, they took medication for 8 days, then 3 days, then 1 day (11 days total).

These counts do not exactly match those produced by my solution: 9 days, then 4 (of which 2 are overlapping), then 1 day (12 days total). If your counting rule was days=end_date-start_date (which is not uncommon) rather than days=end_date-start_date+1, then the last record of your sample data would be inconsistent (as resulting in "0 days").

That said, my code creates an array _d whose elements correspond to the calendar days between the earliest start date and the latest end date. Assuming that the array is populated appropriately, the array element for the i-th date in this time period indicates that the patient took some medication on that day if _d[i]=1. Otherwise, the array element contains a numeric missing value: _d[i]=.. So, a period of 15 consecutive days on drug means a sequence of 15 consecutive 1s somewhere in the array. You may want to create a flag variable, say, consec15d which is set to 1 if such a 15-day sequence exists and which is set to 0 otherwise. To create this flag variable, insert the definition between days=...; and the output statement:

if ~consec15d then consec15d=find(cats(of _d[*]),repeat('1',14))>0;

(Note that repeat('1',14)='111111111111111' with 15 digits.) I would assign a label to this variable such as '0-1 flag for 15 consecutive days on drug'.

The flag will be set to 1 in the first record where the 15-days condition is met and then left unchanged (due to the IF condition, which I've included for performance reasons) because the maximum number of consecutive days on drug can never decrease within an ID, assuming that the records are sorted chronologically. Since CATS creates a character string of the length &lastdate-&firstdate+1, I would be hesitant to apply this technique if that length exceeded 32767 characters, but even a time span of 89 years wouldn't reach this value.

I leave it to you to alter the value of variable DAYS (if necessary) in cases where CONSEC15D=0.

Register Today!

SAS Training: Just a Click Away