BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ModeratelyWise
Obsidian | Level 7

Hello all,

So I have a dataset of clients with one or more social service episodes and within each episode there can be one or more events. Each event has a begin date and an end date. I want to make a wide dataset so that there is a column for each day for a given time period, for example, calendar year 2019. And for each event I want to flag the days that occur during the given time period. There is more that I want to do but this would be a great start.

 

Here is an example data set:

 

data have;
infile datalines; input Client_ID $ Episode_ID $ Event_ID $ Begin_Date :date9. End_Date :date9. Event_Plcmnt $; format Begin_date End_Date date9.; datalines; 001 001 001 20MAY2019 23MAY2019 GC 001 001 002 22MAY2019 23MAY2019 OTHER 001 001 003 23MAY2019 23OCT2019 GC 001 001 004 29JUN2019 30JUN2019 OTHER 001 001 006 14JUL2019 15JUL2019 OTHER 001 001 007 19JUL2019 23JUL2019 OTHER 001 001 005 25JUL2019 08AUG2019 OTHER 001 001 008 09AUG2019 10AUG2019 OTHER 001 001 009 11AUG2019 12AUG2019 OTHER 001 002 010 01NOV2019 16APR2020 URC

And here is what I want it to look like:

Client_ID Episode_ID Event_ID Begin_Date End_Date Event_Plcmnt 01JAN2019 02JAN2019 ... 20MAY2019 21MAY2019 22MAY2019 23MAY2019 24MAY2019 25MAY2019 ... 30DEC2019 31DEC2019

001

001 001 20May2019 23May2019 GC     ... 1 1 1 1     ...    
001 001 002 22May2019 23May2019

Other

    ...     1 1     ...    
001 001 003 23May2019 23Oct2019 GC     ...       1 1 1 ...    
001 001 004 29Jun2019 30Jun2019 Other     ...             ...    
001 001 005 14Jul2019 15Jul2019 Other     ...             ...    
001 001 006 19Jul2019 23Jul2019 Other     ...             ...    
001 001 007 25Jul2019 8Aug2019 Other     ...             ...    
001 001 008 09Aug2019 10Aug2019 Other     ...             ...    
001 001 009 11Aug2019 12Aug2019 Other     ...             ...    
001 002 010 01Nov2019 16Apr2020 URC     ...             ... 1 1

I've truncated, of course, to save space.  Also, there are a number of other columns that I did not include in the sample dataset that I would need to carry over. 

 

Let me know what you all think.

 

Best,

 

Marc

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

If all of your client ids were involved with the same date ranges there might be some limited sense in a "wide" format but one suspects that your actual data doesn't meet that requirement.

Making a long set is easy:

 

data want;
   set have;
   do ddate = begin_date to End_date;
      output;
   end;
   format ddate date9.;
run;

DDate has each date in the given intervals. it would be very easy to add a flag variable to the above.

 

How to use the above depends on what you are actually trying to accomplish. Having a data set with hundreds of "flag" variables is seldom easy to work with.

View solution in original post

11 REPLIES 11
PaigeMiller
Diamond | Level 26

So I have a dataset of clients with one or more social service episodes and within each episode there can be one or more events. Each event has a begin date and an end date. I want to make a wide dataset ...

 

Stop right there. You should not make this a wide dataset. There are almost no cases where you really need to do this, and it makes the rest of your programming much more difficult. Of course, to be 100% sure, we need to know what analysis or report you plan to do with this wide data set; most of the time, the same analysis or report can be obtain with much less work by NOT making a wide data set.

 

See Maxim 19: Long beats wide.

--
Paige Miller
ModeratelyWise
Obsidian | Level 7

Thanks @PaigeMiller ,

Thanks for the response.  Interesting, I was told to try arrays by someone at my work.

 

Here is the issue.  I am trying to calculate the number of occurrences of maltreatment that occur per 100,000 days in care for clients who are in one of five placement settings (the event_plcmnt column).  To do this I need to total the number of days that clients spend in one of these settings.  It would be nice if the data were clean and straight forward but it isn't.  The dates between events often overlap or have gaps and sometimes the dates for events are nested inside the dates for another event in the same episode. What I need to do is find a way to fix the gaps and overlaps and fix them and then join with table of maltreatment occurrence dates to see where the client was when the maltreatment occurred.  Finally I total the number of days for each placement setting  for all the clients and the number of maltreatments in each of the five setting and divide the latter by the former to get the rates.  Can I do this with the long format?

 

 

PaigeMiller
Diamond | Level 26

@ModeratelyWise wrote:

Thanks @PaigeMiller ,

Thanks for the response.  Interesting, I was told to try arrays by someone at my work.

 

Here is the issue.  I am trying to calculate the number of occurrences of maltreatment that occur per 100,000 days in care for clients who are in one of five placement settings (the event_plcmnt column).  To do this I need to total the number of days that clients spend in one of these settings.  It would be nice if the data were clean and straight forward but it isn't.  The dates between events often overlap or have gaps and sometimes the dates for events are nested inside the dates for another event in the same episode. What I need to do is find a way to fix the gaps and overlaps and fix them and then join with table of maltreatment occurrence dates to see where the client was when the maltreatment occurred.  Finally I total the number of days for each placement setting  for all the clients and the number of maltreatments in each of the five setting and divide the latter by the former to get the rates.  Can I do this with the long format?

 

 


All can be done with a long data set, code provided by @ballardw . The method of cleaning up depends on your knowledge of the data, and so you have to decide what gaps to fix, what overlaps to fix, and what is the proper method of fixing them, but the coding which we can help you with is much simpler with a long data set.

--
Paige Miller
ballardw
Super User

If all of your client ids were involved with the same date ranges there might be some limited sense in a "wide" format but one suspects that your actual data doesn't meet that requirement.

Making a long set is easy:

 

data want;
   set have;
   do ddate = begin_date to End_date;
      output;
   end;
   format ddate date9.;
run;

DDate has each date in the given intervals. it would be very easy to add a flag variable to the above.

 

How to use the above depends on what you are actually trying to accomplish. Having a data set with hundreds of "flag" variables is seldom easy to work with.

ModeratelyWise
Obsidian | Level 7

@ballardw 

 

Thanks it works just like you said. Do you think I can attain the output I talk about in my above comment with a long data set?   

Reeza
Super User
Yes, you can 100% do what you want with the long format.
If you search on here for overlaps with date intervals you'll find many examples of exactly how this can be done.
If you want it specific for your exact use case, perhaps ask that question specifically.

https://communities.sas.com/t5/SAS-Programming/Duration-in-Business-days-and-overlapping-days/m-p/47...

https://communities.sas.com/t5/SAS-Programming/Finding-overlapping-durations/m-p/298268?search-actio...

https://communities.sas.com/t5/New-SAS-User/Collapsing-rows-by-overlapping-dates/m-p/622561?search-a...

Reeza
Super User
FYI - have worked with both research data and administrative data that's similar - kids in care or homeless shelters or health care drug analysis where prescription data needs to be mapped to adverse events. Very similar to your use cases and data was always long. The initial wide approach is very common with social scientists though 😉
ModeratelyWise
Obsidian | Level 7

Thanks @Reeza.  I will probably have to create a new thread as I plan on using @ballardw 's code to make the dataset long and have a row for each day for each client event.   Which I imaging is going to be very long indeed. 

Reeza
Super User
Yeah, I used to explode it out to a day each as well but have found some interesting ways around this that are more efficient.

One is to mark each entry/exit and then add/subtract as you go through events instead which was a technique I learned from R to do this but haven't had a chance to try it out in SAS. It's significantly more efficient than creating a record for each day.

Instead, I would suggest starting a new thread with what you're trying to accomplish and letting some of the people here suggest approaches that will work? Create some small datasets that you can share that illustrate your problem and what you want to achieve and post those to get a really good answer quickly. Coming up with that REPREX (reproduceable example) step is key to this process. 80% of the time when I'm doing this, I usually figure out the solution before I post it.
ballardw
Super User

@ModeratelyWise wrote:

@ballardw 

 

Thanks it works just like you said. Do you think I can attain the output I talk about in my above comment with a long data set?   


If you add a "flag" numeric value you can make a report that reads like that very easily.

data want;
   set have;
   do ddate = begin_date to End_date;
      flag=1;
      output;
   end;
   format ddate date9.;
run;

proc tabulate data=want;
  class  Client_ID  Episode_ID  Event_ID Event_Plcmnt  ddate;
  var flag;
  table Client_ID*Episode_ID*  Event_ID*Event_Plcmnt,
      ddate=' '*flag=' '*sum=' '*f=best5.
      /misstext=' ' 
   ;
run;

Though with the extreme number of dates likely to result I might suggest making that the row variable and the other variables the column variables (unless you have a lot of Client_id values in which case the result will be ugly pretty much whatever you do).

The =' ' are suppressing variable names or statistic names from the report. The f=best5. is showing the "sum" without decimals which is the default for the procedure. The /misstext option sets what to show for missing values of the sum, in this case a simple blank, so you do not see mostly cells with a dot for value.

 

More variables might make Proc Report more attractive with the DDATE as an across variable. Depends a lot on actual data.

 

Without some serious justification I won't even attempt to create a data set.

ModeratelyWise
Obsidian | Level 7

Thanks @ballardw!  I have thousands of IDs each with tens of events...so this is going to be ugly any way I look at it. But if I do the long method you say I should be able to trim overlaps and fill in gaps without going to wide?  So I will use your original code without the tabular format.  Would you object, if I have problems trying to fix gaps and overlaps, if I ask for further help?  Should I just post it here or create a new thread?

 

Best,

 

Marc 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1979 views
  • 4 likes
  • 4 in conversation