Re: Average of observations between multiple specific dates

Alireza_Boloori · Posted 10-10-2017 06:55 PM

Hello everyone,

I have a dataset where each patient (represented by ID) is prescribed a medication (represented by dose) over different time periods. Here is an example of my data, where for particular purposes each patient does not have a fixed dose on each day:

id	date	dose
1	1/17/2008	7.5
…	…	…
1	2/5/2008	8
1	5/7/2008	5
…	…	…
1	6/15/2008	7
2	2/2/2008	10
…	…	…
2	3/17/2008	10
…	…	…

I also have another dataset of patients experiencing events (i.e., hospital readmission) at different times. Here is an example:

id	date1	date2	date3	date4	date5	…	date20
1	2/20/2008	4/17/2008	…	…	…	…	…
2	4/27/2008	6/7/2008	9/11/2008	…	…	…	…
…

where

each patient might experience different number of events, and
the maximum number of events occurred for any patient in my database is 20 (that's why I have "data20" as my last variable in the second table).

Now, for each patient, I want to measure the average dose used between each two consecutive events. Here are a couple of points:

The average is measured by taking the cumulative dose in that period and dividing it by the total number of days in that period.
If an event occurs for the first time, then two conditions may happen: whether the event occurs before or after the time of the first dose. For the former, the average dose in that period will be 0. For the latter, the average will be measured from the time of the first dose until that event.
It's possible that, for a patient, no event occurs. In this case, I take the average for the period between the first and last dose.

Finally, I'd be very thankful if the code can accommodate reading two separate data files (like the tables above).

Thank you very much in advance!

Reeza · Posted 10-10-2017 08:00 PM

Please post data that we can use and the expected output that lines up with that data.

Basically an replicable example that we can use to write the code. I'm not making up data AND answering your question 😉

Alireza_Boloori · Posted 10-10-2017 08:38 PM

@Reeza Thanks! The dataset is very large (50+ million observations). So, I was wondering how much of it would be required for you? Please let me know.

Reeza · Posted 10-10-2017 08:40 PM

The bare minimum needed to replicate your issue. You know your data, I don't.

It might be a single record for one person - make sure you de-identify it and don't include variables that aren't relevant to the problem.

See this post on how to post sample data as well:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

Alireza_Boloori · Posted 10-10-2017 09:17 PM

Here is a part of the data (I changed it for more clarification). The data that includes the time along with dose for each patient looks like this:

data have1;
input id date drug_dose;
datalines;
1 2008-01-17 7.5
1 2008-01-18 14.7
1 2008-01-19 21.5
1 2008-01-20 27.3
1 2008-01-21 32.1
1 2008-01-22 35.9
1 2008-01-23 38.6
1 2008-01-24 40.6
1 2008-01-25 41.8
1 2008-01-26 42.6
1 2008-03-01 129.4
1 2008-03-02 107.7
1 2008-03-03 86.2
1 2008-03-04 34.4
1 2008-03-05 9.1
2 2008-07-07 30
2 2008-07-08 59.1
2 2008-07-09 86
2 2008-07-10 109.4
2 2008-07-11 128.6
2 2008-07-12 143.6
2 2008-10-23 15
2 2008-10-24 29.5
2 2008-10-25 43
2 2008-10-26 54.7
2 2008-10-27 64.3
;

Also, the second dataset which contains the information about time of different events looks like this:

data have2;
input id date1 date2 date3 date3 date4 date5;
datalines;
1 2008-2-6 2008-3-19
2 2008-5-9 2008-8-1 2008-9-5
3 2008-4-3 2008-6-5
;

Now, the data I want to have looks like this:

data want;
input id avg_dose time_start time_stop;
datalines;
1 15.13 2008-1-17 2008-2-6 
1 13.585 2008-2-7 2008-3-5
1 0 2008-3-6 2008-3-19
2 0 2008-5-9 2008-7-6
2 12.268 2008-7-7 2008-8-1
2 0 2008-8-2 2008-9-5
2 4.05 2008-9-6 2008-10-27
;

For example, 15.13 in the first row is obtained as 302.6/20, where

302.6 is the sum of dose values for patient #1 from 2008-1-17 until 2008-1-26, and
20 is the total number of days from 2008-1-17 until 2008-2-6.

I hope that this would be helpful. Once again, thank you for your help!

Alireza_Boloori · Posted 10-12-2017 09:41 PM

Since there hasn't been any solution for this question, can I repost it?

Reeza · Posted 10-12-2017 10:38 PM

Feel free. Your question isn't clear, which is why I didn't try and answer it.

302.6 is the sum of dose values for patient #1 from 2008-1-17 until 2008-1-26, <- where do the dates come from?

What are the rules for determining the dates?

Alireza_Boloori · Posted 10-13-2017 12:32 AM

For patient #1, the first event happened on 2008-2-6. Also, for this patient, the medication was prescribed for the first time on 2008-1-17 and lasted until 2008-1-26. So, the cumulative dose in these 10 days was 302.6. However, as I said in my original post, I want to return the average dose used between each two consecutive event dates. That's why I divide 302.6 by 20 (which is the time period from 2008-1-17 until 2008-2-6).

Reeza · Posted 10-13-2017 10:54 AM

Ok, that clarifies things. The explanation made it seem more complex.

1. Create an episode variable that defines an episode for each ID. For example, 2008-1-17 to 2008-1-26 would be a single episode.

You can use the DIF() function for this to calculate the EPISODE.

2. Use PROC MEANS with ID and EPISODE (from #1)

3. Merge it with your second data set

If you can't figure out a step, post your code and log and we can help from there.

Alireza_Boloori · Posted 10-13-2017 03:28 PM

Thanks Reeza! I came up with this approach (which may be slightly different with what you recommended).

I merged two data sets like this (I'm showing the results for only patient #1):

data have;
input id date drug_dose indicator;
datalines;
1 2008-01-17 7.5 .
1 2008-01-18 14.7 .
1 2008-01-19 21.5 .
1 2008-01-20 27.3 .
1 2008-01-21 32.1 .
1 2008-01-22 35.9 .
1 2008-01-23 38.6 .
1 2008-01-24 40.6 .
1 2008-01-25 41.8 .
1 2008-01-26 42.6 .
1 2008-02-06 . event
1 2008-03-01 129.4 .
1 2008-03-02 107.7 .
1 2008-03-03 86.2 .
1 2008-03-04 34.4 .
1 2008-03-05 9.1 .
1 2008-03-19 . event
;

where I added a new variable (i.e., indicator) that puts "event" whenever an event occurs.

Now, my questions is that how can I get an output like this:

data want;
input id avg_dose time_start time_stop;
datalines;
1 15.13 2008-1-17 2008-2-6 
1 13.585 2008-2-7 2008-3-5
1 0 2008-3-6 2008-3-19
;

where "avg_dose" is obtained as the ratio of total cumulative dose between each two events over the total number of days spanned between these two events.

Reeza · Posted 10-13-2017 03:36 PM

Your data structure doesn't lend itself very well to that type of summary, ergo my recommendation.

With your approach you'll need a data step and to calculate the averages and such manually - this is prone to error and not very flexible for adding further statistics. If you decide you want the standard deviation and max/min dose that's all a manual calculation.

Here's an example of this with the sample data you provided above.

Alireza_Boloori · Posted 10-13-2017 04:22 PM

I don't understand. Why should I merge "episodes" with the second data? What is the purpose of "proc means"?

Reeza · Posted 10-13-2017 05:54 PM

Run it and see.

SAS Innovate 2025: Call for Content

Classroom Training Available!