## Model-Based Clustering (Part 1): Exploring Its Significance

The traditional goal of clustering is to identify groups of observations that are similar by some measure. Partitive clustering, also known as optimization clustering, achieves this primarily through heuristic methods that use a measurement such as the distance to a center or a boundary. These methods divide a data set into clusters by trying to optimize a specified criterion. For example, k-means clustering minimizes the within-cluster sum of squares so that the members of each cluster are homogeneous. Because the total sum of squares is fixed, this is equivalent to maximizing the between-cluster sum of squares, which keeps the clusters well separated. Partitive clustering methods such as k-means are typically considered "hard" clustering algorithms because each data point, including any outlier, is assigned exclusively to one cluster. These methods scale roughly linearly with the number of observations, which often makes them the practical choice for large data sets.
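As a small illustration of the k-means criterion, the following sketch splits the total sum of squares into its within-cluster and between-cluster parts for two candidate partitions of the same data. The function name and the six data points are made up for this example, and the data is one-dimensional to keep the arithmetic visible.

```python
from statistics import fmean

def sum_of_squares_decomposition(points, labels):
    """Return (total, within, between) sums of squares for 1-D data."""
    grand_mean = fmean(points)
    total = sum((x - grand_mean) ** 2 for x in points)

    # Group the observations by their assigned cluster label.
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)

    within = 0.0
    between = 0.0
    for members in clusters.values():
        center = fmean(members)
        within += sum((x - center) ** 2 for x in members)
        between += len(members) * (center - grand_mean) ** 2
    return total, within, between

# Illustrative data: two obvious groups on a line.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = sum_of_squares_decomposition(points, [0, 0, 0, 1, 1, 1])
poor = sum_of_squares_decomposition(points, [0, 1, 0, 1, 0, 1])

print("good partition (total, within, between):", good)
print("poor partition (total, within, between):", poor)
```

In both partitions the within and between parts add up to the same total, so lowering one necessarily raises the other. That is why minimizing the within-cluster sum of squares is enough.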

Because partitive clustering methods are heuristic rather than grounded in a formal statistical model, they have significant limitations. You need to guess the number of clusters. The results are sensitive to the initial locations of the reference vectors (seeds), to outliers, and even to the order in which the observations are read. The methods also make assumptions about cluster shape (often spherical). Additionally, finding the best partition by exhaustive search is computationally infeasible because the number of possible partitions grows exponentially with the number of observations.
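To get a feel for how quickly the number of partitions grows, here is a short sketch that counts the ways to split n labeled observations into k non-empty clusters. It uses the standard recurrence for Stirling numbers of the second kind. The function name `stirling2` is chosen for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Count the partitions of n labeled observations into k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # The n-th observation either starts a new cluster on its own,
    # or joins one of the k clusters formed by the other n - 1 observations.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

# Only 10 observations and 3 clusters already give 9330 candidate partitions.
for n in (10, 20, 40):
    print(f"n={n}, k=3: {stirling2(n, 3)} partitions")
```

Even for a few dozen observations the count is far too large to enumerate, which is why practical methods search heuristically instead of checking every partition.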

Model-Based Clustering and Why to Use It

Instead of taking a heuristic approach to building clusters, model-based clustering uses a probability-based approach. It assumes that the data is generated by an underlying probability distribution and tries to recover that distribution from the data. In other words, similarity is determined by likelihood, not distance. This likelihood is derived from a finite mixture model, whose mixture components represent the clusters.
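In symbols, a finite mixture density is usually written as a weighted sum of component densities. The notation here is an assumption made for illustration and is not taken from the SAS documentation: G is the number of components, the \pi_g are the mixing proportions, and f_g is the density of component g with parameters \theta_g.

```latex
f(x \mid \theta) = \sum_{g=1}^{G} \pi_g \, f_g(x \mid \theta_g),
\qquad \pi_g \ge 0, \qquad \sum_{g=1}^{G} \pi_g = 1
```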

Each cluster is modeled by a distinct probability distribution (such as multivariate Gaussian or uniform distributions), and the parameters of these distributions (like means and covariances) are estimated from the data using techniques such as the Expectation-Maximization (EM) algorithm to maximize the likelihood.
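The following is a minimal sketch of the EM idea for the simplest possible case: a one-dimensional mixture of two Gaussians. It is not how PROC MBC or PROC GMM is implemented. The function names, the starting values, and the simulated data are all assumptions made for this example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Crude starting values: means at the extremes, shared overall variance.
    means = [min(data), max(data)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: re-estimate the parameters as responsibility-weighted averages.
        for g in range(2):
            n_g = sum(r[g] for r in resp)
            weights[g] = n_g / len(data)
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var_g = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var_g, 1e-6)  # guard against a collapsing component
    return weights, means, variances

# Simulated data: two groups centered at 0 and 6 (illustrative values).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(6.0, 1.0) for _ in range(200)]
weights, means, variances = em_two_gaussians(data)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

Each iteration alternates between computing membership probabilities with the current parameters and updating the parameters with those probabilities. Real implementations add convergence checks, multiple starts, and full covariance matrices for multivariate data.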

This all sounds interesting, but why use model-based clustering?

Model-based clustering rests on a formal, likelihood-based model, so you have well-defined statistical properties for formal inference, and you can use well-established statistical methods for model selection.

Also, the clustering can be "soft": each cluster assignment has a measure of strength rather than being a rigid classification. This typically involves estimating, for each observation, fit probabilities that reflect its degree of membership in each cluster. Observations can therefore belong to multiple clusters simultaneously. This soft assignment provides more nuanced insight into the structure of the data, especially when observations exhibit characteristics that align with more than one cluster.

Observations that fit the distribution well have high fit probabilities, whereas anomalies have very low fit probabilities. This allows you to identify outliers that might not be members of any of the clusters found.
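To make soft assignment and outlier detection concrete, the sketch below computes membership weights for a few points under a mixture that is assumed to be already fitted. The component parameters, the mixing proportions, and the uniform "noise" component on [-20, 20] are hypothetical values chosen for illustration, not output from any SAS procedure.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical fitted mixture: two Gaussian clusters plus a uniform noise component.
components = {
    "cluster1": lambda x: normal_pdf(x, 0.0, 1.0),
    "cluster2": lambda x: normal_pdf(x, 4.0, 1.0),
    "noise": lambda x: 1.0 / 40.0 if -20.0 <= x <= 20.0 else 0.0,
}
mixing = {"cluster1": 0.45, "cluster2": 0.45, "noise": 0.10}

def membership_weights(x):
    """Posterior probability of each component for x (x assumed within [-20, 20])."""
    dens = {name: mixing[name] * pdf(x) for name, pdf in components.items()}
    total = sum(dens.values())
    return {name: d / total for name, d in dens.items()}

# A point at a cluster center, a point halfway between clusters, and a far-away point.
for x in (0.0, 2.0, 12.0):
    rounded = {name: round(w, 3) for name, w in membership_weights(x).items()}
    print(x, rounded)
```

The point at a cluster center is assigned almost entirely to that cluster, the point halfway between the clusters splits its weight evenly between them, and the distant point is absorbed by the noise component.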

Another key advantage is that the model-based approach employs statistical criteria to suggest the optimal number of clusters and the most likely model.
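One commonly used criterion of this kind is the Bayesian information criterion (BIC). The source does not say which criteria it has in mind, so treat this as one possible illustration. The sketch fits one-dimensional Gaussian mixtures with 1, 2, and 3 components by a compact EM loop and compares their BIC values. The function names, starting values, and simulated data are assumptions made for this example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_mixture(data, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture by EM and return its log-likelihood."""
    ordered = sorted(data)
    n = len(data)
    # Deterministic start: initial means at evenly spaced quantiles of the data.
    means = [ordered[(2 * g + 1) * n // (2 * k)] for g in range(k)]
    grand = sum(data) / n
    variances = [sum((x - grand) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: membership probabilities under the current parameters.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted updates of proportions, means, and variances.
        for g in range(k):
            n_g = sum(r[g] for r in resp)
            weights[g] = n_g / n
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var_g = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var_g, 1e-6)  # guard against a collapsing component
    log_lik = 0.0
    for x in data:
        log_lik += math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
    return log_lik

def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'smaller is better' form; some software reports the negated value."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Simulated data with two groups (illustrative values).
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(150)] + [random.gauss(5.0, 1.0) for _ in range(150)]

scores = {}
for k in (1, 2, 3):
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free mixing proportions
    scores[k] = bic(fit_mixture(data, k), n_params, len(data))
    print(f"k={k}: BIC={scores[k]:.1f}")
best_k = min(scores, key=scores.get)
print("lowest BIC at k =", best_k)
```

The penalty term grows with the number of parameters, so an extra component is only preferred when it improves the likelihood enough to pay for itself. On data like this, one would expect the two-component model to score far better than a single Gaussian.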

Finally, this approach allows for flexible clustering that can accommodate different shapes and sizes of clusters. The form of each mixture component can vary in shape from the others, meaning that you do not need to be restricted to one cluster shape.

A Simple Example

What does this all look like? Here's a simple example sourced from Dave Kessler's video, Model-Based Clustering with PROC MBC, on the SAS Video Portal.

In the analysis data, you see two clusters, each with strong within-cluster correlation, that cross to form an "X" shape. You also see some points outside the clusters, which represent background noise.

If you use a traditional clustering method like k-means, what happens? k-means quickly does what it was built to do, and here it produces a 5-cluster solution. The shape and color of each marker indicate the cluster to which k-means assigned that point, and the dark markers indicate the cluster centers. The challenge is that k-means is designed to find the best circular clusters in the data, and the clusters you see are clearly not circular. Now see how model-based clustering approaches this problem.

As in the k-means clustering plot, the marker symbol and color indicate which cluster each point was assigned to. There are clearly two multivariate Gaussian clusters. These results conform to your impression of two clusters, each with strong correlation, and some background noise.

The output of model-based clustering can include soft clustering information in addition to hard assignments. The output contains weights of association between each observation and each cluster. In the above plot, the color of a point is darker for stronger associations. The noise weights, which indicate the strength of association with the noise cluster, are highest outside the "X" shape. For the cluster1 and cluster2 weights, the strength of association with either cluster is lower in the area where the two clusters cross, which produces gray areas. You can also store your fitted models and apply them to new data sets.

Model-Based Clustering in SAS Viya

There are various algorithms available for conducting model-based clustering, with Gaussian mixture models being among the most widely used probabilistic models for this purpose. In Gaussian mixture models, it is assumed that the data is generated from a mixture of several Gaussian distributions.

In SAS Viya, Gaussian mixture models can be fitted using either the MBC or the GMM procedure. Both procedures perform model-based clustering, but they differ in approach and capabilities.

PROC MBC supports two types of Gaussian mixtures distinguished by their use of latent factors in the model. The models that do not use latent factors are referred to as Gaussian mixture models (GMMs), and the models that use latent factors are referred to as Parsimonious Gaussian mixture models (PGMMs).

On the other hand, PROC GMM fits a nonparametric Bayesian Gaussian mixture model, which can be regarded as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

The MBC procedure is part of SAS® Visual Statistics and the GMM procedure is part of SAS® Machine Learning.

GMMs with the EM algorithm require specifying the number of clusters in advance and fit a fixed number of Gaussian components, whereas nonparametric Bayesian Gaussian mixture models allow for a more flexible and data-driven determination of the number of mixture components, accommodating varying complexities in the underlying data distribution.
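One way to build intuition for a data-driven number of components is the Chinese restaurant process, a standard way to describe the partitions implied by a Dirichlet process prior. The sketch below only simulates that prior. It is not the PROC GMM algorithm and it fits no data. The function name and the concentration values `alpha` are assumptions made for this example.

```python
import random

def chinese_restaurant_process(n_obs, alpha, rng):
    """Simulate cluster sizes under a Chinese restaurant process prior."""
    sizes = []
    for i in range(n_obs):
        # Observation i joins existing cluster j with probability sizes[j] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        draw = rng.uniform(0.0, i + alpha)
        cumulative = 0.0
        for j, size in enumerate(sizes):
            cumulative += size
            if draw < cumulative:
                sizes[j] += 1
                break
        else:
            sizes.append(1)
    return sizes

rng = random.Random(42)
for alpha in (0.5, 2.0, 10.0):
    sizes = chinese_restaurant_process(500, alpha, rng)
    print(f"alpha={alpha}: {len(sizes)} clusters, largest has {max(sizes)} observations")
```

The number of clusters is not fixed in advance. It tends to grow slowly with the number of observations, and larger values of `alpha` favor more clusters. A nonparametric Bayesian mixture combines a prior of this kind with the likelihood of the data to arrive at the number of components.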

Concluding Remarks

Model-based clustering, for example with a Gaussian mixture model or a nonparametric Bayesian Gaussian mixture model, is a soft clustering method that assumes the data is generated from a mixture of distributions, often normal distributions. It can be regarded as generalizing k-means clustering to include information about the covariance structure of the data as well as the centers of the latent Gaussians. In the nonparametric Bayesian variant, the number of clusters is inferred from the data by using a prior such as the Dirichlet process. This method is particularly useful in data exploration. Note that the results of model-based clustering can be influenced by the scale of the variables: variables with larger variances tend to have more influence on cluster formation.

The subsequent articles in this series, Model-Based Clustering (Part 2): A Detailed Look at the MBC Procedure and Model-Based Clustering (Part 3): The GMM Procedure Demystified, cover the implementation of PROC MBC and PROC GMM.
