We’re smarter together. Learn from this collection of community knowledge and add your expertise.

How to aggregate event data for SAS Enterprise Miner

by SAS Employee chmedi on ‎05-11-2016 03:45 PM (1,421 Views)

How do you typically aggregate event data for analysis? Event data -- for example, a death, heart attack, or sale of vacation package -- is often aggregated by frequencies and frequency ratios by one or more dimensions. This tip shows how to derive these for analysis in SAS Enterprise Miner. Both SAS Data Step and SAS SQL Procedure code are presented.

 

My data for this tip is the deaths in Mexico in 2008 (DEATHS Dataset). Each row in this data contains information about a death: Cause, place, demographics, time, etc.

 

mexicoDeaths.png

 

 

 

 

 

I also have a lookup dataset which contains the cause of death descriptions (DISEASE Dataset).

 

 

codCodes.png

 

 

 

 

 

I want to derive the aggregated frequencies of COD (Cause of Death) by HOD (Hour of Death), and the ratio of those frequencies to the overall respective COD frequencies. First, I summarize the data by COD and HOD, and at higher-level aggregations:

 

procSummary.png

 

 

 

 

This results in output table CODSUM, which contains the aggregated frequencies at four levels: Overall count, HOD counts, COD counts, and counts of COD by HOD.  Below is a subset of the counts by COD (_type_=2), and counts by COD and HOD (_type_=3).  The overall count is not shown, and the counts by HOD are not shown.

 

codsum.png

 

 

 

 

 

Now I join CODSUM with itself so that the frequency ratio column PROP can be calculated. This is done by joining the counts of COD by HOD with the counts of COD, and calculating the ratio between the two.  The COD descriptions are also brought in by making a join to DISEASE.

 

codJoin.png

 

 

 

 

The output table HOD2 contains the COD by HOD counts (FREQ) along with the calculated PROP ratio values, and the COD descriptions. This is a small subset of that table.

 

codout.png

 

 

 

 

 

Another useful frequency ratio is the ratio of the overall HOD frequencies to the overall data count (PROPALL_HOD). Further statistical analysis can be accomplished in the correlation of this ratio and the PROP ratio.  This additional ratio is derived by joining the overall HOD counts with the total count, and calculating the ratio between the two.

 

codJoin2.png

 

 

 

 

The output table HOD2 now contains the additional overall HOD counts (FREQALL_HOD) along with the calculated PROPALL_HOD ratio values.

 

 

codout2.png

 

 

 

 

 

The code shown in this tip so far has used the SQL Procedure for joins and calculations of the ratios. Below I show Data Step code that can be used to accomplish the same result.

 

coddatastep.png

 

 

 

Conclusion

 

I’ve taken individual Mexico deaths data and used SAS to derive the frequency counts of COD by HOD and two associated COD and HOD frequency ratios. These ratios are useful for further analysis in SAS Enterprise Miner.  My colleague Ray Wright, as presented in the “Getting Started with Time Series Clustering” SAS Communities Tip, takes the hourly PROP values by COD and uses SAS Enterprise Miner to cluster the COD Time Series plots into five groups.  The assignment of each death to one of these Time Series groups can then be used as input in a predictive model.

 

 

Further reading

 

For more about the mortality data used in this tip, see:

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, Vol 59.

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.