A Practical Application of Clustering Methods in Detecting Emerging Patterns in Sequential Data


Background

Data visualization is a useful tool for pattern discovery, but analyzing thousands of sequential datasets visually is both time-consuming and challenging for clients. Applying clustering methods from the machine learning toolkit is the next step in analytics maturity. These methods enable automation and thereby save time.

More sophisticated data mining techniques, such as regression models or Hidden Markov Models (HMMs), are viable options; however, such approaches require a large sample size and more computational power. These requirements can be problematic for clients at the early stages of the analytics maturity curve. For many clients, a prior understanding of data visualization methods, combined with this clustering technique, provides results that are transparent and interpretable.

The case study below illustrates the approach. The dataset covers various types of crime, and the client is interested in which crime types show an increasing trend. The method is also applicable to other pattern searches, such as decreasing trends or no trend.

A Use Case: Detecting Trends amongst 30 Crime Types

Suppose we have been given a dataset of 30 different crime types, with the incident count recorded for each of the last 12 months. The business question is to identify which crime types are trending upwards.

When visually analyzing the crimes shown in Table 1 and Table 2, it is hard to tell which crime types are trending.

Table 1. Crime Data Table

Table 2. 30 crime time series in a Line Chart

Methodology

(1) Include a synthetic observation with a predefined pattern

First, we add to the crime dataset an observation with a predefined upward-trending pattern (see Table 3). This synthetic observation is included so that the cluster it falls into can be given a meaningful description.

The synthetic observation must have a numerical value for each of the twelve months (1-12). Zero is allowed.

Table 3. A synthetic observation ‘Upward trend pattern’ added to Crime dataset
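The article does not say how the values of the synthetic observation were chosen. As a minimal sketch, assuming a simple linear ramp is an acceptable "upward" shape, the Python snippet below builds one and appends it to a small dataset. The function name `make_upward_pattern`, the `start` and `step` parameters, and the two crime series are all hypothetical and made up for illustration; they are not the article's data.

```python
def make_upward_pattern(n_months=12, start=0.0, step=1.0):
    """Return a strictly increasing series with one value per month.

    Assumption: a straight-line ramp stands in for the 'upward trend
    pattern'; the article does not specify the actual values used.
    """
    if n_months < 2:
        raise ValueError("need at least two months to express a trend")
    if step <= 0:
        raise ValueError("step must be positive for an upward trend")
    return [start + step * month for month in range(n_months)]


# Hypothetical monthly incident counts (illustrative only).
crime_counts = {
    "Crime type A": [12, 15, 11, 14, 13, 12, 15, 14, 13, 12, 14, 13],
    "Crime type B": [5, 6, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21],
}

# Add the synthetic observation as one more row; zero is a valid value.
crime_counts["Upward trend pattern"] = make_upward_pattern(step=2.0)
```

Because the months are standardized in the next step, the overall shape of the ramp is likely to matter more than the exact `start` and `step` chosen here.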

(2) Data Standardization

In this example, the input variables for the clustering analysis are the twelve months. These input variables must be standardized before carrying out the clustering analysis. Standardization helps to prevent variables with a larger scale from dominating how the clusters are defined. Here the Z-score, a popular standardization method, was used: each value has the variable's mean subtracted and is then divided by the variable's standard deviation.
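To make the Z-score step concrete, here is a minimal Python sketch that standardizes each month (column) across all observations (rows). It is only an illustration: the function name `zscore_columns` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the article does not state which variant was used.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column: z = (x - column mean) / column std dev.

    rows: list of equal-length lists, one list per observation.
    Assumption: sample standard deviation (n - 1 denominator).
    A constant column has zero spread, so it is mapped to 0.0.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [
        [(x - mu) / sd if sd > 0 else 0.0
         for x, mu, sd in zip(row, means, spreads)]
        for row in rows
    ]


# Tiny example with two "months" on very different scales.
raw = [[1, 10], [2, 20], [3, 30]]
standardized = zscore_columns(raw)
```

After standardization both columns are on the same scale, so neither month dominates the distance calculation used by the clustering step.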

(3) Generate Clustering Analysis

Run several cluster models using the PROC CLUSTER and PROC TREE procedures until an optimal model is found. (See link in References below.)
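The article uses SAS procedures for this step. For readers without SAS, the idea can be sketched in plain Python as a naive agglomerative (hierarchical) clustering that is cut at a chosen number of clusters. This is a stand-in, not the article's code: the average-linkage rule, the Euclidean distance, and the names `agglomerative` and `n_clusters` are assumptions, and in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would likely be preferable.

```python
from math import dist


def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering.

    points: list of equal-length numeric lists (standardized rows).
    Returns one integer cluster label per point.
    Assumption: Euclidean distance with average linkage.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]

    def average_link(a, b):
        # Mean pairwise distance between members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        _, i, j = min(
            (average_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Two obvious groups in a toy two-dimensional example.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
toy_labels = agglomerative(toy_points, n_clusters=2)
```

"Running several models" then amounts to calling the function with different `n_clusters` values (or linkage rules) and comparing how cleanly the resulting groups separate.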

(4) Tag cluster as “Trending Upwards” based on Synthetic observation

By association, we can reason that the crimes belonging to the same cluster as the synthetic observation share a similar time series pattern; hence these crime types are also trending upwards.

Flag a cluster as ‘Trending Up’ if it contains the synthetic observation.
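The tagging rule can be expressed in a few lines. The sketch below assumes the clustering step produced one label per observation, in the same order as the observation names; the function name `tag_trending_up` and the example labels are hypothetical and chosen only for illustration.

```python
def tag_trending_up(names, labels, synthetic_name="Upward trend pattern"):
    """Return the real observations that share a cluster with the
    synthetic observation, i.e. the ones to flag as 'Trending Up'.

    names:  observation names, including the synthetic one.
    labels: cluster label for each name, in the same order.
    """
    if len(names) != len(labels):
        raise ValueError("names and labels must have the same length")
    target = labels[names.index(synthetic_name)]
    return [
        name
        for name, label in zip(names, labels)
        if label == target and name != synthetic_name
    ]


# Illustrative labels only; not output from the article's model.
example_names = ["Gambling", "Burglary", "Public indecency",
                 "Upward trend pattern"]
example_labels = [2, 1, 2, 2]
trending = tag_trending_up(example_names, example_labels)
```

The synthetic observation itself is excluded from the result, since it exists only to give the cluster its description.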

(5) Upload to Visual Analytics for Visualization

See the Results.

Results

The clustering methods categorize crime types into groups (clusters) based on pattern similarities.

The synthetic ‘Upward trend pattern’ observation is in the same cluster as other crime types with upward trends, such as gambling, offenses involving children, and public indecency.

Table 4. A list of crimes with upward trends

Conclusion

Clustering is a powerful technique for detecting emerging trends in large sets of sequential time series data.

Applying it can fast-track the trend discovery process and save time compared to visually and manually analyzing thousands of sequential datasets.

References

lily_clarke · 09-27-2023

Fantastic work @SooS. Great and easy to follow method!

rogerward · 09-27-2023

This is very clever, and has a lot of applied uses. How do you calculate the Upward Trend Pattern Observation?