Hello everyone,
I have a dataset where each patient (represented by ID) is prescribed a medication (represented by dose) over different time periods. Here is an example of my data, where for particular purposes each patient does not have a fixed dose on each day:
id | date | dose |
1 | 1/17/2008 | 7.5 |
… | … | … |
1 | 2/5/2008 | 8 |
1 | 5/7/2008 | 5 |
… | … | … |
1 | 6/15/2008 | 7 |
2 | 2/2/2008 | 10 |
… | … | … |
2 | 3/17/2008 | 10 |
… | … | … |
I also have another dataset of patients experiencing events (i.e., hospital readmission) at different times. Here is an example:
id | date1 | date2 | date3 | date4 | date5 | … | date20 |
1 | 2/20/2008 | 4/17/2008 | … | … | … | … | … |
2 | 4/27/2008 | 6/7/2008 | 9/11/2008 | … | … | … | … |
… |
where
Now, for each patient, I want to measure the average dose used between each two consecutive events. Here are a couple of points:
Finally, I'd be very thankful if the code can accommodate reading two separate data files (like the tables above).
Thank you very much in advance!
Please post data that we can use and the expected output that lines up with that data.
Basically an replicable example that we can use to write the code. I'm not making up data AND answering your question 😉
The bare minimum needed to replicate your issue. You know your data, I don't.
It might be a single record for one person - make sure you de-identify it and don't include variables that aren't relevant to the problem.
See this post on how to post sample data as well:
Here is a part of the data (I changed it for more clarification). The data that includes the time along with dose for each patient looks like this:
data have1;
input id date drug_dose;
datalines;
1 2008-01-17 7.5
1 2008-01-18 14.7
1 2008-01-19 21.5
1 2008-01-20 27.3
1 2008-01-21 32.1
1 2008-01-22 35.9
1 2008-01-23 38.6
1 2008-01-24 40.6
1 2008-01-25 41.8
1 2008-01-26 42.6
1 2008-03-01 129.4
1 2008-03-02 107.7
1 2008-03-03 86.2
1 2008-03-04 34.4
1 2008-03-05 9.1
2 2008-07-07 30
2 2008-07-08 59.1
2 2008-07-09 86
2 2008-07-10 109.4
2 2008-07-11 128.6
2 2008-07-12 143.6
2 2008-10-23 15
2 2008-10-24 29.5
2 2008-10-25 43
2 2008-10-26 54.7
2 2008-10-27 64.3
;
Also, the second dataset which contains the information about time of different events looks like this:
data have2;
input id date1 date2 date3 date3 date4 date5;
datalines;
1 2008-2-6 2008-3-19
2 2008-5-9 2008-8-1 2008-9-5
3 2008-4-3 2008-6-5
;
Now, the data I want to have looks like this:
data want;
input id avg_dose time_start time_stop;
datalines;
1 15.13 2008-1-17 2008-2-6
1 13.585 2008-2-7 2008-3-5
1 0 2008-3-6 2008-3-19
2 0 2008-5-9 2008-7-6
2 12.268 2008-7-7 2008-8-1
2 0 2008-8-2 2008-9-5
2 4.05 2008-9-6 2008-10-27
;
For example, 15.13 in the first row is obtained as 302.6/20, where
I hope that this would be helpful. Once again, thank you for your help!
Since there hasn't been any solution for this question, can I repost it?
Feel free. Your question isn't clear, which is why I didn't try and answer it.
What are the rules for determining the dates?
For patient #1, the first event happened on 2008-2-6. Also, for this patient, the medication was prescribed for the first time on 2008-1-17 and lasted until 2008-1-26. So, the cumulative dose in these 10 days was 302.6. However, as I said in my original post, I want to return the average dose used between each two consecutive event dates. That's why I divide 302.6 by 20 (which is the time period from 2008-1-17 until 2008-2-6).
Ok, that clarifies things. The explanation made it seem more complex.
1. Create an episode variable that defines an episode for each ID. For example, 2008-1-17 to 2008-1-26 would be a single episode.
You can use the DIF() function for this to calculate the EPISODE.
2. Use PROC MEANS with ID and EPISODE (from #1)
3. Merge it with your second data set
If you can't figure out a step, post your code and log and we can help from there.
Thanks Reeza! I came up with this approach (which may be slightly different with what you recommended).
I merged two data sets like this (I'm showing the results for only patient #1):
data have;
input id date drug_dose indicator;
datalines;
1 2008-01-17 7.5 .
1 2008-01-18 14.7 .
1 2008-01-19 21.5 .
1 2008-01-20 27.3 .
1 2008-01-21 32.1 .
1 2008-01-22 35.9 .
1 2008-01-23 38.6 .
1 2008-01-24 40.6 .
1 2008-01-25 41.8 .
1 2008-01-26 42.6 .
1 2008-02-06 . event
1 2008-03-01 129.4 .
1 2008-03-02 107.7 .
1 2008-03-03 86.2 .
1 2008-03-04 34.4 .
1 2008-03-05 9.1 .
1 2008-03-19 . event
;
where I added a new variable (i.e., indicator) that puts "event" whenever an event occurs.
Now, my questions is that how can I get an output like this:
data want;
input id avg_dose time_start time_stop;
datalines;
1 15.13 2008-1-17 2008-2-6
1 13.585 2008-2-7 2008-3-5
1 0 2008-3-6 2008-3-19
;
where "avg_dose" is obtained as the ratio of total cumulative dose between each two events over the total number of days spanned between these two events.
Your data structure doesn't lend itself very well to that type of summary, ergo my recommendation.
With your approach you'll need a data step and to calculate the averages and such manually - this is prone to error and not very flexible for adding further statistics. If you decide you want the standard deviation and max/min dose that's all a manual calculation.
Here's an example of this with the sample data you provided above.
I don't understand. Why should I merge "episodes" with the second data? What is the purpose of "proc means"?
Run it and see.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.