When you work with data measured over time, it is sometimes useful to group the time series. Time Series Clustering (TSC) can be used to find stocks that behave in a similar way, products with similar sales cycles, or regions with similar temperature profiles.
TSC can also help you incorporate time series in traditional data mining applications such as customer churn prediction and fraud identification. For example, suppose you have a hunch that a customer’s behavior over time would help predict churn or fraud. How would you incorporate the temporal pattern as a predictor in your model, where the unit of analysis is the customer? You can achieve this by categorizing each of the original series and using the category labels as inputs in your predictive model.
Of course you could eyeball the series’ shapes and categorize them manually into, say, flat series, shifty series, trending series, and so on. But that would be tedious, if not impractical, if you have more than a few series. This tip shows how to automate series labeling using clustering techniques in SAS® Enterprise Miner™.
The data for this tip are 2008 individual mortality data from Mexico (Wickham, 2014). The dataset includes the following variables:
The raw dataset consists of time profiles for 1194 diseases, stacked one on top of the other:
Here is a plot of the time profile for the first disease, Acute Bronchitis:
This simple process flow prepares the mortality data for time series analysis, then clusters the diseases based on their time profiles.
Input Data Source Node (HOD2)
I defined variable metadata as follows:
SAS Code Node
I used SAS code to filter out rare diseases; 214 diseases remained after filtering.
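The tip does not show the filtering code or state the rarity criterion. As a rough illustration only, here is a minimal Python sketch that assumes "rare" means a low total death count per disease. The records, the `MIN_TOTAL_DEATHS` cutoff, and the `filter_rare` helper are all made up for the example.

```python
from collections import defaultdict

# Toy long-format records: (disease, hour, deaths). Values are invented.
records = [
    ("A", 0, 30), ("A", 1, 25), ("A", 2, 50),
    ("B", 0, 1), ("B", 1, 0), ("B", 2, 2),
    ("C", 0, 40), ("C", 1, 45), ("C", 2, 20),
]

MIN_TOTAL_DEATHS = 100  # hypothetical cutoff, not the one used in this tip


def filter_rare(rows, min_total):
    """Keep only rows whose disease has at least `min_total` deaths overall."""
    totals = defaultdict(int)
    for disease, _hour, deaths in rows:
        totals[disease] += deaths
    return [row for row in rows if totals[row[0]] >= min_total]


kept = filter_rare(records, MIN_TOTAL_DEATHS)
kept_diseases = sorted({row[0] for row in kept})
print(kept_diseases)
```

Filtering on the total keeps each remaining profile complete: a disease is either kept with all of its time points or dropped entirely.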
TS Data Preparation Node
This node transforms the input data to a proper time series format for similarity analysis. In the transposed dataset, each disease profile time series becomes an input variable. I set node properties as follows:
This is a snippet of the exported data:
Transposed Time Series (Partial)
Notice that we now have a time series dataset: the rows represent ordered, equally spaced time points; there is one column per disease; and each value is the proportion of deaths for a given disease at a given time of day.
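To make the reshaping concrete, here is a small Python sketch of the same idea outside Enterprise Miner. It assumes the proportion is computed within each disease (so each column sums to 1). The toy records and the `to_wide_proportions` helper are hypothetical, not part of the node.

```python
from collections import defaultdict

# Toy stacked (long-format) records: (disease, hour, deaths). Values are invented.
long_rows = [
    ("A", 0, 30), ("A", 1, 20), ("A", 2, 50),
    ("C", 0, 40), ("C", 1, 40), ("C", 2, 20),
]


def to_wide_proportions(rows):
    """Return one dict per time point, with one proportion column per disease."""
    totals = defaultdict(int)
    for disease, _hour, deaths in rows:
        totals[disease] += deaths
    diseases = sorted(totals)
    hours = sorted({hour for _disease, hour, _deaths in rows})
    lookup = {(disease, hour): deaths for disease, hour, deaths in rows}
    wide = []
    for hour in hours:
        row = {"hour": hour}
        for disease in diseases:
            # Missing (disease, hour) combinations count as zero deaths.
            row[disease] = lookup.get((disease, hour), 0) / totals[disease]
        wide.append(row)
    return wide


wide = to_wide_proportions(long_rows)
for row in wide:
    print(row)
```

Working with proportions rather than raw counts puts common and uncommon diseases on the same scale, so the comparison is about the shape of the daily profile rather than its volume.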
TS Similarity Node
I used the following node properties:
The node first creates a similarity matrix, a representation of the similarity/distance of each pair of individual series. Then it uses hierarchical clustering to group similar diseases. After some initial exploration, I settled on a 5-cluster solution.
The node produces familiar clustering output such as a constellation plot and dendrogram. But in addition to the graphical output, the node reports the cluster assignment for each series. I merged this series-to-cluster map to my raw input data for the plots below.
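The two steps the node performs (build a similarity matrix, then cluster hierarchically) can be sketched in plain Python. This is not the node's implementation. It uses the sum of squared differences as the distance and assumes average linkage, since the linkage method is not stated here. The series and function names are invented for illustration.

```python
from itertools import combinations

# Toy proportion profiles over four time points. Values are invented.
series = {
    "night_a": [0.5, 0.1, 0.1, 0.3],
    "night_b": [0.45, 0.1, 0.15, 0.3],
    "day_a": [0.1, 0.4, 0.4, 0.1],
    "day_b": [0.1, 0.45, 0.35, 0.1],
}


def ssd(x, y):
    """Sum of squared differences across time points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def distance_matrix(data):
    """Distance for every unordered pair of series, keyed by the pair of names."""
    return {frozenset((p, q)): ssd(data[p], data[q])
            for p, q in combinations(data, 2)}


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between two clusters (assumed linkage)."""
    pairs = [(p, q) for p in c1 for q in c2]
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def hierarchical_clusters(data, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    dist = distance_matrix(data)
    clusters = [[name] for name in data]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], dist),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index j is still valid
        del clusters[j]
    return clusters


clusters = hierarchical_clusters(series, n_clusters=2)
# Series-to-cluster map, analogous to the assignment the node reports.
labels = {name: k + 1 for k, members in enumerate(clusters) for name in members}
print(labels)
```

The `labels` dictionary plays the role of the series-to-cluster map: it can be merged back onto the stacked data by series name.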
Before drilling into the clusters, let’s examine some of the elements in the similarity matrix computed by the TS Similarity node. I created all of the plots with SAS and PROC SGPANEL.
Here are the two series (diseases) with the greatest similarity (i.e., the smallest sum of squared differences across time points).
Most similar series
Both series represent types of drowning. Since they have the greatest similarity of any pair of diseases, these are the first two series joined by the hierarchical clustering algorithm.
For comparison, here are the two most dissimilar series (i.e., diseases with the largest sum of squared differences across time):
Most dissimilar series
Visually, these series are certainly easy to distinguish. One is U-shaped, with big variation over time. The other is flatter with an inverted U-shape.
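Picking out the most similar and most dissimilar pairs amounts to finding the smallest and largest entries of the pairwise distance matrix. Here is a small Python sketch under that reading, with invented profiles and names. It uses the same sum-of-squared-differences measure described above.

```python
from itertools import combinations

# Toy proportion profiles over four time points. Values are invented.
profiles = {
    "u_shape": [0.4, 0.1, 0.1, 0.4],
    "u_shape_2": [0.38, 0.12, 0.1, 0.4],
    "inverted_u": [0.1, 0.4, 0.4, 0.1],
    "flat": [0.25, 0.25, 0.25, 0.25],
}


def ssd(x, y):
    """Sum of squared differences across time points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# One entry per unordered pair of series.
pair_dist = {
    (p, q): ssd(profiles[p], profiles[q])
    for p, q in combinations(sorted(profiles), 2)
}

most_similar = min(pair_dist, key=pair_dist.get)
most_dissimilar = max(pair_dist, key=pair_dist.get)
print("most similar:", most_similar)
print("most dissimilar:", most_dissimilar)
```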
To explore the five clusters, I plotted the overall time profile by cluster (in TSC, these time profiles represent the cluster centers):
Here Cluster 2 stands out, not only because mortality varied greatly with time compared to the other clusters, but also because it was the only cluster with a U-shaped pattern. In other words, Cluster 2 seems to represent deaths that spike late at night into the early morning, and rarely occur during the day.
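One simple way to compute a per-cluster time profile is the pointwise mean of the member series. The Python sketch below assumes that definition. The plot above may have been built differently (for example, from pooled counts), and the data and names are invented.

```python
from collections import defaultdict

# Toy proportion profiles and a series-to-cluster map. Values are invented.
profiles = {
    "s1": [0.4, 0.1, 0.5],
    "s2": [0.6, 0.1, 0.3],
    "s3": [0.2, 0.6, 0.2],
}
cluster_of = {"s1": 1, "s2": 1, "s3": 2}


def cluster_centers(series, assignment):
    """Average the member series of each cluster, time point by time point."""
    members = defaultdict(list)
    for name, values in series.items():
        members[assignment[name]].append(values)
    return {
        cluster: [sum(column) / len(column) for column in zip(*rows)]
        for cluster, rows in members.items()
    }


centers = cluster_centers(profiles, cluster_of)
for cluster, center in sorted(centers.items()):
    print(cluster, center)
```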
Drilling into Cluster 2, here are the first six diseases in the cluster:
In general, these diseases reflect the aforementioned U pattern, with high incidence at night. Homicides and suicides dominate this cluster.
In contrast, series in Cluster 1 tend to have an inverted U pattern: they occur primarily during daytime hours (drownings, electrocutions) or in the early morning (SIDS).
As you can see, getting started with Time Series Clustering is easy with SAS Enterprise Miner. If the ultimate goal were a predictive model, such as a model to predict the economic impact of each disease, we could now use the derived clusters as inputs.
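As a rough illustration of using cluster membership as a model input, the Python sketch below turns each series' cluster label into 0/1 indicator columns and appends them to other predictors. The unit names, the single base feature, and the `one_hot` helper are hypothetical.

```python
# Series-to-cluster map, as reported by a clustering step (toy values).
cluster_of = {"disease_a": 2, "disease_b": 5, "disease_c": 2}
N_CLUSTERS = 5  # matches the 5-cluster solution used in this tip


def one_hot(label, n_clusters):
    """Encode a 1-based cluster label as a list of 0/1 indicator inputs."""
    return [1 if label == k else 0 for k in range(1, n_clusters + 1)]


# Other (hypothetical) predictors for each unit of analysis.
base_features = {"disease_a": [34.0], "disease_b": [51.0], "disease_c": [27.0]}

# One input row per unit: base predictors followed by cluster indicators.
model_inputs = {
    unit: base_features[unit] + one_hot(cluster_of[unit], N_CLUSTERS)
    for unit in base_features
}
for unit, row in model_inputs.items():
    print(unit, row)
```

Indicator columns treat the clusters as unordered categories, which fits here because the cluster numbers carry no ranking.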
Because the TS Similarity node gives you the option to export a distance matrix, you can use any EM clustering node for clustering time series. Just choose the Output Data Set: Distance Matrix option in node properties before running the TS Similarity node. I obtained the distance matrix the same way for two of the above plots.
The TS Similarity node can also be used to identify series that are similar to a target/reference series. For example, you could identify financial behavior that is similar to known abusive or fraudulent behavior. While not an issue with the mortality data used in this tip, the node can also handle series of varying lengths via time warping.
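To give a feel for how time warping lets series of different lengths be compared against a reference, here is a textbook dynamic time warping (DTW) sketch in Python. It is not the node's algorithm or its settings. The squared-difference step cost, the reference series, and the candidate names are assumptions made for the example.

```python
def dtw_distance(x, y):
    """Classic dynamic time warping cost with squared-difference steps."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = cheapest alignment of the first i points of x with the first j of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = step + min(
                cost[i - 1][j],      # x advances, y point is reused
                cost[i][j - 1],      # y advances, x point is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical reference pattern and candidates of varying lengths.
reference = [0, 1, 3, 1, 0]
candidates = {
    "same_shape_stretched": [0, 0, 1, 3, 3, 1, 0],
    "flat": [1, 1, 1, 1, 1],
    "rising": [0, 1, 2, 3, 4, 5],
}

# Rank candidates from most to least similar to the reference.
ranked = sorted(candidates, key=lambda name: dtw_distance(reference, candidates[name]))
print(ranked)
```

Because the alignment may reuse points, the stretched copy of the reference aligns at zero cost even though it has seven points instead of five.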
For more about the mortality data used in this tip, see:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, Vol 59.
And here is a great resource on Time Series Data Mining in SAS Enterprise Miner: