
A Practical Application of Clustering methods in detecting emerging patterns in sequential data

Started ‎07-25-2023 by
Modified ‎07-25-2023 by

Background

 

Data visualization is a useful tool for pattern discovery, but visually analyzing thousands of sequential datasets is both time-consuming and challenging for clients. Applying clustering methods from the machine learning toolkit is the next step in analytics maturity. These methods enable automation, which saves time.

 

More sophisticated data mining techniques, such as regression models or Hidden Markov Models (HMMs), are viable options. However, such approaches require a large sample size and more computational power, which can be problematic for clients at the early stages of the analytics maturity curve. For many clients, a prior understanding of data visualization methods combined with this clustering technique provides results that are transparent and interpretable.

 

The case study below illustrates the approach. The dataset covers various types of crime, and the client is interested in which crime types show an increasing trend. The method also applies to other pattern searches, such as decreasing trends or no trend.

 

A Use Case: Detecting Trends Amongst 30 Crime Types

 

Suppose we have been given a dataset of monthly incident counts for 30 different crime types, recorded over the last 12 months. The business question is to identify which crime types are trending upwards.

 

Visually analyzing the crimes shown in Table 1 and Table 2, it is hard to tell which crime type is trending.

 

Table 1. Crime Data Table

SooS_0-1690264422799.png

 

 Table 2. 30 crime time series in a Line Chart

 

SooS_0-1689821874494.png

 

Methodology

(1) Include a synthetic observation with predefined pattern

 

First, we add to the crime dataset an observation with a predefined upward-trending pattern (see Table 3). This synthetic observation is included so that the cluster it falls into can be given a meaningful description.

 

The synthetic observation must have a numeric value for each of the twelve months (1-12). Zero is allowed.
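The article does not say how the values of the synthetic observation were chosen. As a minimal sketch, one simple assumption is a straight-line ramp across the twelve months. The helper name `make_upward_pattern`, its `start` and `step` values, and the toy crime counts are all hypothetical and used only for illustration.

```python
# Hypothetical helper: a straight-line ramp over 12 months.
# The start and step values are illustrative assumptions, not from the article.
def make_upward_pattern(start=0.0, step=10.0, months=12):
    return [start + step * m for m in range(months)]


# Toy dataset with made-up counts: crime type -> 12 monthly incident counts.
crimes = {
    "Crime A": [12, 15, 11, 14, 13, 12, 16, 15, 13, 14, 12, 15],
    "Crime B": [3, 4, 6, 7, 9, 10, 12, 13, 15, 16, 18, 19],
}

# Add the synthetic observation as one more row of the dataset.
crimes["Upward trend pattern"] = make_upward_pattern()
pattern = crimes["Upward trend pattern"]
```

Because the synthetic row is treated like any other observation, it passes through the same standardization and clustering steps as the real crime types.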

 

Table 3. A synthetic observation ‘Upward trend pattern’ added to Crime dataset

SooS_1-1690264549048.png

 

(2) Data Standardization

 

In this example, the input variables for the clustering analysis are the twelve months. These input variables must be standardized before carrying out the clustering analysis. Standardization helps to prevent variables with a larger scale from dominating how the clusters are defined. For this, the Z-score, a popular standardization method, was used.
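A Z-score rescales each input variable to mean 0 and standard deviation 1 by computing (value - mean) / standard deviation. The sketch below applies this to each month column using only the Python standard library. The function name `zscore_columns`, the use of the sample standard deviation, and the toy rows are assumptions for illustration, not the article's actual code.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column (month) to mean 0 and sample std 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [stdev(c) for c in columns]  # sample standard deviation (assumption)
    return [
        # A constant column has std 0; map it to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy rows with made-up counts: one row per crime type, one column per month.
rows = [
    [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    [200, 190, 210, 205, 195, 200, 198, 202, 207, 193, 201, 199],
    [50, 48, 47, 45, 44, 42, 40, 39, 37, 36, 34, 33],
]
standardized = zscore_columns(rows)
```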


 (3) Generate Clustering Analysis

 

Using the “Proc cluster” and “Proc tree” procedures, run several cluster models until an optimal model is found. (See the link in References below.)
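For readers without SAS at hand, the idea of hierarchical clustering can be sketched in plain Python. This is not the Proc cluster procedure: it is a naive agglomerative clustering with average linkage and Euclidean distance, both of which are assumptions, since the article does not state which clustering method was used. The function names and the short three-value vectors (standing in for twelve standardized months) are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of row indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i (j > i, so deleting j keeps i valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy standardized series (made-up values): two rising, two falling, one flat.
points = [
    [-1.0, 0.0, 1.0],
    [-0.9, 0.1, 1.1],
    [1.0, 0.0, -1.0],
    [1.1, -0.1, -0.9],
    [0.0, 0.1, 0.0],
]
clusters = average_linkage(points, n_clusters=3)
```

Trying several values of `n_clusters` and inspecting the resulting groups mirrors the "run several models" step, although the criterion for an optimal model is left to the analyst here.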

 

(4) Tag cluster as “Trending Upwards” based on Synthetic observation

 

By association, we can reason that crimes in the same cluster as the synthetic observation share a similar time series pattern, and so are also trending upwards.

Flag a cluster as ‘Trending Up’ if it contains the synthetic observation.
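The tagging step can be sketched as a lookup: find the cluster that holds the synthetic observation, then flag every crime type assigned to that same cluster. The function name `flag_trending_up`, the 'Other' label for the remaining crime types, and the cluster assignments below are hypothetical and made up for illustration.

```python
def flag_trending_up(labels, synthetic_name="Upward trend pattern"):
    """Flag crime types that share a cluster with the synthetic observation."""
    target = labels[synthetic_name]
    return {
        name: ("Trending Up" if cluster == target else "Other")
        for name, cluster in labels.items()
        if name != synthetic_name  # drop the synthetic row from the output
    }


# Made-up cluster assignments: observation name -> cluster number.
labels = {
    "Crime A": 2,
    "Crime B": 1,
    "Crime C": 2,
    "Upward trend pattern": 2,
}
flags = flag_trending_up(labels)
```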

 

(5) Upload to Visual Analytics for Visualization

 

See the Results.

 

Results

The clustering methods categorize crime types into groups (clusters) based on pattern similarities.

 

The synthetic ‘Upward trend pattern’ observation is in the same cluster as other crime types with upward trends, such as gambling, offenses involving children, and public indecency.

 

Table 4. A list of crimes with upwards trends

SooS_2-1690264714709.png

 

Conclusion

Clustering is a powerful technique for detecting emerging trends in large sets of sequential time series data.

 

Applying it can fast-track the trend discovery process and save time compared with manually and visually analyzing thousands of sequential datasets.

 

References

Comments

Fantastic work @SooS. Great and easy to follow method!

This is very clever, and has a lot of applied uses. How do you calculate the Upward Trend Pattern Observation?

Version history
Last update:
‎07-25-2023 02:22 AM