Solved: Re: Creating date-based arrays

ModeratelyWise · Posted 10-12-2021 03:19 PM

Hello all,

So I have a dataset of clients with one or more social service episodes and within each episode there can be one or more events. Each event has a begin date and an end date. I want to make a wide dataset so that there is a column for each day for a given time period, for example, calendar year 2019. And for each event I want to flag the days that occur during the given time period. There is more that I want to do but this would be a great start.

Here is an example data set:

data have;
infile datalines;
input Client_ID $ Episode_ID $ Event_ID $ Begin_Date :date9. End_Date :date9. Event_Plcmnt $;
format Begin_date End_Date date9.;
datalines;
001	001	001	20MAY2019	23MAY2019	GC
001	001	002	22MAY2019	23MAY2019	OTHER
001	001	003	23MAY2019	23OCT2019	GC
001	001	004	29JUN2019	30JUN2019	OTHER
001	001	006	14JUL2019	15JUL2019	OTHER
001	001	007	19JUL2019	23JUL2019	OTHER
001	001	005	25JUL2019	08AUG2019	OTHER
001	001	008	09AUG2019	10AUG2019	OTHER
001	001	009	11AUG2019	12AUG2019	OTHER
001	002	010	01NOV2019	16APR2020	URC

And here is what I want it to look like:

Client_ID	Episode_ID	Event_ID	Begin_Date	End_Date	Event_Plcmnt	01JAN2019	02JAN2019	...	20MAY2019	21MAY2019	22MAY2019	23MAY2019	24MAY2019	25MAY2019	...	30DEC2019	31DEC2019
001	001	001	20May2019	23May2019	GC			...	1	1	1	1			...
001	001	002	22May2019	23May2019	Other			...			1	1			...
001	001	003	23May2019	23Oct2019	GC			...				1	1	1	...
001	001	004	29Jun2019	30Jun2019	Other			...							...
001	001	005	14Jul2019	15Jul2019	Other			...							...
001	001	006	19Jul2019	23Jul2019	Other			...							...
001	001	007	25Jul2019	8Aug2019	Other			...							...
001	001	008	09Aug2019	10Aug2019	Other			...							...
001	001	009	11Aug2019	12Aug2019	Other			...							...
001	002	010	01Nov2019	16Apr2020	URC			...							...	1	1

I've truncated, of course, to save space. Also, there are a number of other columns that I did not include in the sample dataset that I would need to carry over.

Let me know what you all think.

Best,

Marc

ballardw · Posted 10-12-2021 03:44 PM

If all of your client ids were involved with the same date ranges there might be some limited sense in a "wide" format but one suspects that your actual data doesn't meet that requirement.

Making a long set is easy:

data want;
   set have;
   do ddate = begin_date to End_date;
      output;
   end;
   format ddate date9.;
run;

DDate has each date in the given intervals. it would be very easy to add a flag variable to the above.

How to use the above depends on what you are actually trying to accomplish. Having a data set with hundreds of "flag" variables is seldom easy to work with.

View solution in original post

PaigeMiller · Posted 10-12-2021 03:24 PM

So I have a dataset of clients with one or more social service episodes and within each episode there can be one or more events. Each event has a begin date and an end date. I want to make a wide dataset ...

Stop right there. You should not make this a wide dataset. There are almost no cases where you really need to do this, and it makes the rest of your programming much more difficult. Of course, to be 100% sure, we need to know what analysis or report you plan to do with this wide data set; most of the time, the same analysis or report can be obtain with much less work by NOT making a wide data set.

See Maxim 19: Long beats wide.

--
Paige Miller

ModeratelyWise · Posted 10-12-2021 04:20 PM

Thanks @PaigeMiller ,

Thanks for the response. Interesting, I was told to try arrays by someone at my work.

Here is the issue. I am trying to calculate the number of occurrences of maltreatment that occur per 100,000 days in care for clients who are in one of five placement settings (the event_plcmnt column). To do this I need to total the number of days that clients spend in one of these settings. It would be nice if the data were clean and straight forward but it isn't. The dates between events often overlap or have gaps and sometimes the dates for events are nested inside the dates for another event in the same episode. What I need to do is find a way to fix the gaps and overlaps and fix them and then join with table of maltreatment occurrence dates to see where the client was when the maltreatment occurred. Finally I total the number of days for each placement setting for all the clients and the number of maltreatments in each of the five setting and divide the latter by the former to get the rates. Can I do this with the long format?

PaigeMiller · Posted 10-12-2021 05:46 PM

@ModeratelyWise wrote:

Thanks @PaigeMiller ,

Thanks for the response. Interesting, I was told to try arrays by someone at my work.

Here is the issue. I am trying to calculate the number of occurrences of maltreatment that occur per 100,000 days in care for clients who are in one of five placement settings (the event_plcmnt column). To do this I need to total the number of days that clients spend in one of these settings. It would be nice if the data were clean and straight forward but it isn't. The dates between events often overlap or have gaps and sometimes the dates for events are nested inside the dates for another event in the same episode. What I need to do is find a way to fix the gaps and overlaps and fix them and then join with table of maltreatment occurrence dates to see where the client was when the maltreatment occurred. Finally I total the number of days for each placement setting for all the clients and the number of maltreatments in each of the five setting and divide the latter by the former to get the rates. Can I do this with the long format?

All can be done with a long data set, code provided by @ballardw . The method of cleaning up depends on your knowledge of the data, and so you have to decide what gaps to fix, what overlaps to fix, and what is the proper method of fixing them, but the coding which we can help you with is much simpler with a long data set.

--
Paige Miller

ballardw · Posted 10-12-2021 03:44 PM

If all of your client ids were involved with the same date ranges there might be some limited sense in a "wide" format but one suspects that your actual data doesn't meet that requirement.

Making a long set is easy:

data want;
   set have;
   do ddate = begin_date to End_date;
      output;
   end;
   format ddate date9.;
run;

DDate has each date in the given intervals. it would be very easy to add a flag variable to the above.

How to use the above depends on what you are actually trying to accomplish. Having a data set with hundreds of "flag" variables is seldom easy to work with.

ModeratelyWise · Posted 10-12-2021 04:22 PM

@ballardw

Thanks it works just like you said. Do you think I can attain the output I talk about in my above comment with a long data set?

Reeza · Posted 10-12-2021 04:49 PM

Yes, you can 100% do what you want with the long format.
If you search on here for overlaps with date intervals you'll find many examples of exactly how this can be done.
If you want it specific for your exact use case, perhaps ask that question specifically.

https://communities.sas.com/t5/SAS-Programming/Duration-in-Business-days-and-overlapping-days/m-p/47...

https://communities.sas.com/t5/SAS-Programming/Finding-overlapping-durations/m-p/298268?search-actio...

https://communities.sas.com/t5/New-SAS-User/Collapsing-rows-by-overlapping-dates/m-p/622561?search-a...

Reeza · Posted 10-12-2021 04:52 PM

FYI - have worked with both research data and administrative data that's similar - kids in care or homeless shelters or health care drug analysis where prescription data needs to be mapped to adverse events. Very similar to your use cases and data was always long. The initial wide approach is very common with social scientists though 😉

ModeratelyWise · Posted 10-12-2021 07:07 PM

Thanks @Reeza. I will probably have to create a new thread as I plan on using @ballardw 's code to make the dataset long and have a row for each day for each client event. Which I imaging is going to be very long indeed.

Reeza · Posted 10-12-2021 07:14 PM

Yeah, I used to explode it out to a day each as well but have found some interesting ways around this that are more efficient.

One is to mark each entry/exit and then add/subtract as you go through events instead which was a technique I learned from R to do this but haven't had a chance to try it out in SAS. It's significantly more efficient than creating a record for each day.

Instead, I would suggest starting a new thread with what you're trying to accomplish and letting some of the people here suggest approaches that will work? Create some small datasets that you can share that illustrate your problem and what you want to achieve and post those to get a really good answer quickly. Coming up with that REPREX (reproduceable example) step is key to this process. 80% of the time when I'm doing this, I usually figure out the solution before I post it.

ballardw · Posted 10-12-2021 04:51 PM

@ModeratelyWise wrote:

@ballardw

Thanks it works just like you said. Do you think I can attain the output I talk about in my above comment with a long data set?

If you add a "flag" numeric value you can make a report that reads like that very easily.

data want;
   set have;
   do ddate = begin_date to End_date;
      flag=1;
      output;
   end;
   format ddate date9.;
run;

proc tabulate data=want;
  class  Client_ID  Episode_ID  Event_ID Event_Plcmnt  ddate;
  var flag;
  table Client_ID*Episode_ID*  Event_ID*Event_Plcmnt,
      ddate=' '*flag=' '*sum=' '*f=best5.
      /misstext=' ' 
   ;
run;

Though with the extreme number of dates likely to result I might suggest making that the row variable and the other variables the column variables (unless you have a lot of Client_id values in which case the result will be ugly pretty much whatever you do).

The =' ' are suppressing variable names or statistic names from the report. The f=best5. is showing the "sum" without decimals which is the default for the procedure. The /misstext option sets what to show for missing values of the sum, in this case a simple blank, so you do not see mostly cells with a dot for value.

More variables might make Proc Report more attractive with the DDATE as an across variable. Depends a lot on actual data.

Without some serious justification I won't even attempt to create a data set.

ModeratelyWise · Posted 10-12-2021 07:04 PM

Thanks @ballardw! I have thousands of IDs each with tens of events...so this is going to be ugly any way I look at it. But if I do the long method you say I should be able to trim overlaps and fill in gaps without going to wide? So I will use your original code without the tabular format. Would you object, if I have problems trying to fix gaps and overlaps, if I ask for further help? Should I just post it here or create a new thread?

Best,

Marc

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!