How do you typically aggregate event data for analysis? Event data -- for example, a death, heart attack, or sale of vacation package -- is often aggregated by frequencies and frequency ratios by one or more dimensions. This tip shows how to derive these for analysis in SAS Enterprise Miner. Both SAS Data Step and SAS SQL Procedure code are presented.
My data for this tip is the deaths in Mexico in 2008 (DEATHS Dataset). Each row in this data contains information about a death: Cause, place, demographics, time, etc.
I also have a lookup dataset which contains the cause of death descriptions (DISEASE Dataset).
I want to derive the aggregated frequencies of COD (Cause of Death) by HOD (Hour of Death), and the ratio of those frequencies to the overall respective COD frequencies. First, I summarize the data by COD and HOD, and at higher-level aggregations:
This results in output table CODSUM, which contains the aggregated frequencies at four levels: Overall count, HOD counts, COD counts, and counts of COD by HOD. Below is a subset of the counts by COD (_type_=2), and counts by COD and HOD (_type_=3). The overall count is not shown, and the counts by HOD are not shown.
Now I join CODSUM with itself so that the frequency ratio column PROP can be calculated. This is done by joining the counts of COD by HOD with the counts of COD, and calculating the ratio between the two. The COD descriptions are also brought in by making a join to DISEASE.
The output table HOD2 contains the COD by HOD counts (FREQ) along with the calculated PROP ratio values, and the COD descriptions. This is a small subset of that table.
Another useful frequency ratio is the ratio of the overall HOD frequencies to the overall data count (PROPALL_HOD). Further statistical analysis can be accomplished in the correlation of this ratio and the PROP ratio. This additional ratio is derived by joining the overall HOD counts with the total count, and calculating the ratio between the two.
The output table HOD2 now contains the additional overall HOD counts (FREQALL_HOD) along with the calculated PROPALL_HOD ratio values.
The code shown in this tip so far has used the SQL Procedure for joins and calculations of the ratios. Below I show Data Step code that can be used to accomplish the same result.
I’ve taken individual Mexico deaths data and used SAS to derive the frequency counts of COD by HOD and two associated COD and HOD frequency ratios. These ratios are useful for further analysis in SAS Enterprise Miner. My colleague Ray Wright, as presented in the “Getting Started with Time Series Clustering” SAS Communities Tip, takes the hourly PROP values by COD and uses SAS Enterprise Miner to cluster the COD Time Series plots into five groups. The assignment of each death to one of these Time Series groups can then be used as input in a predictive model.
For more about the mortality data used in this tip, see:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, Vol 59.