Hello all,
So I have a dataset of clients with one or more social service episodes and within each episode there can be one or more events. Each event has a begin date and an end date. I want to make a wide dataset so that there is a column for each day for a given time period, for example, calendar year 2019. And for each event I want to flag the days that occur during the given time period. There is more that I want to do but this would be a great start.
Here is an example data set:
data have;
infile datalines; input Client_ID $ Episode_ID $ Event_ID $ Begin_Date :date9. End_Date :date9. Event_Plcmnt $; format Begin_date End_Date date9.; datalines; 001 001 001 20MAY2019 23MAY2019 GC 001 001 002 22MAY2019 23MAY2019 OTHER 001 001 003 23MAY2019 23OCT2019 GC 001 001 004 29JUN2019 30JUN2019 OTHER 001 001 006 14JUL2019 15JUL2019 OTHER 001 001 007 19JUL2019 23JUL2019 OTHER 001 001 005 25JUL2019 08AUG2019 OTHER 001 001 008 09AUG2019 10AUG2019 OTHER 001 001 009 11AUG2019 12AUG2019 OTHER 001 002 010 01NOV2019 16APR2020 URC
And here is what I want it to look like:
Client_ID | Episode_ID | Event_ID | Begin_Date | End_Date | Event_Plcmnt | 01JAN2019 | 02JAN2019 | ... | 20MAY2019 | 21MAY2019 | 22MAY2019 | 23MAY2019 | 24MAY2019 | 25MAY2019 | ... | 30DEC2019 | 31DEC2019 |
001 |
001 | 001 | 20May2019 | 23May2019 | GC | ... | 1 | 1 | 1 | 1 | ... | ||||||
001 | 001 | 002 | 22May2019 | 23May2019 |
Other |
... | 1 | 1 | ... | ||||||||
001 | 001 | 003 | 23May2019 | 23Oct2019 | GC | ... | 1 | 1 | 1 | ... | |||||||
001 | 001 | 004 | 29Jun2019 | 30Jun2019 | Other | ... | ... | ||||||||||
001 | 001 | 005 | 14Jul2019 | 15Jul2019 | Other | ... | ... | ||||||||||
001 | 001 | 006 | 19Jul2019 | 23Jul2019 | Other | ... | ... | ||||||||||
001 | 001 | 007 | 25Jul2019 | 8Aug2019 | Other | ... | ... | ||||||||||
001 | 001 | 008 | 09Aug2019 | 10Aug2019 | Other | ... | ... | ||||||||||
001 | 001 | 009 | 11Aug2019 | 12Aug2019 | Other | ... | ... | ||||||||||
001 | 002 | 010 | 01Nov2019 | 16Apr2020 | URC | ... | ... | 1 | 1 |
I've truncated, of course, to save space. Also, there are a number of other columns that I did not include in the sample dataset that I would need to carry over.
Let me know what you all think.
Best,
Marc
If all of your client ids were involved with the same date ranges there might be some limited sense in a "wide" format but one suspects that your actual data doesn't meet that requirement.
Making a long set is easy:
data want; set have; do ddate = begin_date to End_date; output; end; format ddate date9.; run;
DDate has each date in the given intervals. it would be very easy to add a flag variable to the above.
How to use the above depends on what you are actually trying to accomplish. Having a data set with hundreds of "flag" variables is seldom easy to work with.
So I have a dataset of clients with one or more social service episodes and within each episode there can be one or more events. Each event has a begin date and an end date. I want to make a wide dataset ...
Stop right there. You should not make this a wide dataset. There are almost no cases where you really need to do this, and it makes the rest of your programming much more difficult. Of course, to be 100% sure, we need to know what analysis or report you plan to do with this wide data set; most of the time, the same analysis or report can be obtain with much less work by NOT making a wide data set.
See Maxim 19: Long beats wide.
Thanks @PaigeMiller ,
Thanks for the response. Interesting, I was told to try arrays by someone at my work.
Here is the issue. I am trying to calculate the number of occurrences of maltreatment that occur per 100,000 days in care for clients who are in one of five placement settings (the event_plcmnt column). To do this I need to total the number of days that clients spend in one of these settings. It would be nice if the data were clean and straight forward but it isn't. The dates between events often overlap or have gaps and sometimes the dates for events are nested inside the dates for another event in the same episode. What I need to do is find a way to fix the gaps and overlaps and fix them and then join with table of maltreatment occurrence dates to see where the client was when the maltreatment occurred. Finally I total the number of days for each placement setting for all the clients and the number of maltreatments in each of the five setting and divide the latter by the former to get the rates. Can I do this with the long format?
@ModeratelyWise wrote:
Thanks @PaigeMiller ,
Thanks for the response. Interesting, I was told to try arrays by someone at my work.
Here is the issue. I am trying to calculate the number of occurrences of maltreatment that occur per 100,000 days in care for clients who are in one of five placement settings (the event_plcmnt column). To do this I need to total the number of days that clients spend in one of these settings. It would be nice if the data were clean and straight forward but it isn't. The dates between events often overlap or have gaps and sometimes the dates for events are nested inside the dates for another event in the same episode. What I need to do is find a way to fix the gaps and overlaps and fix them and then join with table of maltreatment occurrence dates to see where the client was when the maltreatment occurred. Finally I total the number of days for each placement setting for all the clients and the number of maltreatments in each of the five setting and divide the latter by the former to get the rates. Can I do this with the long format?
All can be done with a long data set, code provided by @ballardw . The method of cleaning up depends on your knowledge of the data, and so you have to decide what gaps to fix, what overlaps to fix, and what is the proper method of fixing them, but the coding which we can help you with is much simpler with a long data set.
If all of your client ids were involved with the same date ranges there might be some limited sense in a "wide" format but one suspects that your actual data doesn't meet that requirement.
Making a long set is easy:
data want; set have; do ddate = begin_date to End_date; output; end; format ddate date9.; run;
DDate has each date in the given intervals. it would be very easy to add a flag variable to the above.
How to use the above depends on what you are actually trying to accomplish. Having a data set with hundreds of "flag" variables is seldom easy to work with.
Thanks it works just like you said. Do you think I can attain the output I talk about in my above comment with a long data set?
@ModeratelyWise wrote:
Thanks it works just like you said. Do you think I can attain the output I talk about in my above comment with a long data set?
If you add a "flag" numeric value you can make a report that reads like that very easily.
data want; set have; do ddate = begin_date to End_date; flag=1; output; end; format ddate date9.; run; proc tabulate data=want; class Client_ID Episode_ID Event_ID Event_Plcmnt ddate; var flag; table Client_ID*Episode_ID* Event_ID*Event_Plcmnt, ddate=' '*flag=' '*sum=' '*f=best5. /misstext=' ' ; run;
Though with the extreme number of dates likely to result I might suggest making that the row variable and the other variables the column variables (unless you have a lot of Client_id values in which case the result will be ugly pretty much whatever you do).
The =' ' are suppressing variable names or statistic names from the report. The f=best5. is showing the "sum" without decimals which is the default for the procedure. The /misstext option sets what to show for missing values of the sum, in this case a simple blank, so you do not see mostly cells with a dot for value.
More variables might make Proc Report more attractive with the DDATE as an across variable. Depends a lot on actual data.
Without some serious justification I won't even attempt to create a data set.
Thanks @ballardw! I have thousands of IDs each with tens of events...so this is going to be ugly any way I look at it. But if I do the long method you say I should be able to trim overlaps and fill in gaps without going to wide? So I will use your original code without the tabular format. Would you object, if I have problems trying to fix gaps and overlaps, if I ask for further help? Should I just post it here or create a new thread?
Best,
Marc
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.