Data visualization is a useful tool for pattern discovery, but visually analyzing thousands of sequential datasets is both time-consuming and challenging for clients. Applying clustering methods from the machine learning toolkit is the next step in analytics maturity. These methods enable automation and thereby save time.
More sophisticated data mining techniques, such as regression models or Hidden Markov Models (HMMs), are viable options; however, such approaches require a large sample size and more computational power. These requirements can be problematic for clients at the early stages of the analytics maturity curve. For many clients, a prior understanding of data visualization methods, combined with this clustering technique, provides results that are transparent and interpretable.
The case study below illustrates the approach. The dataset covers various types of crime, and the client is interested in crime types with an increasing trend. The method also applies to other pattern searches, such as decreasing trends or no trend.
Suppose we have been given a dataset of incident counts for 30 different crime types, recorded over the last 12 months. The business question is to identify which crime types are trending upwards.
From a visual analysis of the crimes shown in Table 1 and Table 2, it is hard to tell which crime types are trending.
Table 1. Crime Data Table
Table 2. 30 crime time series in a Line Chart
First, we add to the crime dataset an observation with a predefined upward-trending pattern (see Table 3). This synthetic observation is included so that the cluster it falls into has a meaningful description.
It must have a numerical value for each of the twelve months (1-12). Zero is allowed.
Table 3. A synthetic observation ‘Upward trend pattern’ added to Crime dataset
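The article does not state how the values of the synthetic observation were chosen. As a minimal sketch, assuming a simple linear ramp is an acceptable upward pattern, the row could be generated as follows. The function name `upward_trend_pattern` and its `start` and `step` parameters are hypothetical and chosen only for illustration.

```python
def upward_trend_pattern(start=0.0, step=10.0, months=12):
    """Return one value per month, rising by `step` each month from `start`.

    Assumption: a straight-line ramp; the original post does not say
    which values were used for the synthetic observation.
    """
    return [start + step * m for m in range(months)]


# One synthetic row covering months 1-12; zero is allowed as a value.
synthetic = upward_trend_pattern()
print(synthetic)
```

In practice the `start` and `step` values would likely be picked so the ramp sits within the range of the real incident counts, since the row is standardized together with them.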
In this example, the input variables for the clustering analysis are the twelve months. These input variables must be standardized before carrying out the clustering analysis. Standardization helps to prevent variables with a larger scale from dominating how the clusters are defined. For this, the Z-score, a popular standardization method, was used.
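To make the standardization step concrete, here is a small Python sketch of a Z-score applied to each month column, that is, (value - column mean) / column standard deviation. It is an illustration rather than the SAS step used in the article; the sample standard deviation is an assumption, and the counts are made-up numbers, not the crime data.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column (month) to mean 0 and sample std 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [stdev(c) for c in columns]  # assumption: sample (n-1) std
    return [
        # A constant column has std 0; map it to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up counts: 3 series x 4 months (illustrative only).
counts = [
    [10, 12, 15, 20],
    [200, 190, 210, 205],
    [5, 5, 6, 4],
]
z = zscore_columns(counts)
for row in z:
    print([round(v, 2) for v in row])
```

After this step every month column has mean 0 and standard deviation 1, so no single month dominates the distance calculation.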
Run several cluster models using the PROC CLUSTER and PROC TREE procedures until an optimal model is found (see the link in References below).
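For readers without SAS, the idea behind hierarchical clustering can be sketched in plain Python. This is not the PROC CLUSTER implementation; it assumes average linkage with Euclidean distance (the article does not say which method was used), and the function name `average_linkage` and the toy rows are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(points, n_clusters):
    """Agglomerative clustering: merge the two closest clusters
    until `n_clusters` remain. Returns lists of row indices."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Average Euclidean distance over all cross-cluster pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy rows (illustrative only): two rising and two falling series.
points = [
    [0, 1, 2, 3],
    [0, 1, 2, 4],
    [3, 2, 1, 0],
    [4, 2, 1, 0],
]
clusters = average_linkage(points, n_clusters=2)
print(clusters)
```

Running the function for several values of `n_clusters` and comparing the groupings mirrors the "run several models" advice, although choosing the optimal model still requires judgment.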
By association, we can reason that crimes belonging to the same cluster as the synthetic observation share a similar time series pattern; hence, these crime activities are also trending upwards.
Flag a cluster as ‘Trending Up’ if it contains the synthetic observation.
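The flagging rule can be sketched as a simple lookup: find the cluster that holds the synthetic observation, then list every other series in that cluster. The function name `flag_trending_up`, the series names, and the cluster numbers below are hypothetical placeholders, not output from the crime dataset.

```python
def flag_trending_up(cluster_of, synthetic_name):
    """Return the names that share a cluster with the synthetic row."""
    target = cluster_of[synthetic_name]
    return sorted(
        name
        for name, cluster in cluster_of.items()
        if cluster == target and name != synthetic_name
    )


# Hypothetical cluster assignments (series name -> cluster number).
cluster_of = {
    "Upward trend pattern": 2,
    "Series A": 2,
    "Series B": 1,
    "Series C": 2,
}
trending_up = flag_trending_up(cluster_of, "Upward trend pattern")
print(trending_up)
```

The synthetic row itself is excluded from the result, since it exists only to label the cluster.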
See the Results.
The clustering methods categorize crime types into groups (clusters) based on pattern similarities.
The synthetic ‘Upward trend pattern’ observation is in the same cluster as other crime types with upward trends, such as gambling, offenses involving children, and public indecency.
Table 4. A list of crimes with upwards trends
Clustering is a powerful technique for detecting emerging trends in large sets of sequential time series data.
Applying it can fast-track the trend discovery process and save time compared with analyzing thousands of sequential datasets visually and manually.
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473724.htm
Fantastic work @SooS. Great and easy to follow method!
This is very clever, and has a lot of applied uses. How do you calculate the Upward Trend Pattern Observation?