
Model-Based Clustering (Part 3): The GMM Procedure Demystified


In the first installment of this series, Model-Based Clustering (Part 1): Exploring Its Significance, we explored the concept of model-based clustering, highlighting its benefits and surveying the models offered by SAS Viya. Then, in Model-Based Clustering (Part 2): A Detailed Look at the MBC Procedure, we walked through the practical steps of implementing model-based clustering using the MBC procedure.

 

We will now apply PROC GMM to analyze the Census2000 data set, which summarizes the 2000 United States Census at the postal code level. Recall that the goal of the analysis is to group geographical regions of the United States into distinct subsets based on three demographic factors: average household size in the region (MeanHHSz), median household income in the region (MedHHInc), and region population density percentile (RegDens; 1 = lowest density, 100 = highest density). These factors are common to commercial lifestyle and life-stage segmentation products.

 

Although model-based clustering in SAS Viya is executed solely through code, that doesn't prevent us from leveraging it within the Model Studio pipeline interface. Even though there is no dedicated node for GMM, just as there is none for the MBC procedure, a SAS Code node integrates seamlessly into a pipeline, as illustrated below.

[Figure: Model Studio pipeline with the GMM SAS Code node]


 

The importance of incorporating a Data Exploration node, a Filtering node, and a Clustering node was discussed in the preceding installment of this series.

 

Fitting a Nonparametric Bayesian Gaussian Mixture Model

 

Now, let's fit a nonparametric Bayesian Gaussian mixture model. The following code is used in the GMM SAS Code node:

 

/* Fitting a nonparametric Bayesian Gaussian mixture model with the GMM procedure */
proc gmm data=&dm_data
         seed=12345
         maxClusters=50
         alpha=1
         inference=VB (maxVbIter=1000 covariance=full threshold=0.001)
         clusterSumOut=casuser.clustersum
         clusterCovOut=casuser.clustercov;
   input %dm_interval_input;
   score out=casuser.Score_GMM copyvars=(_ALL_);
   ods select nObs descStats modelInfo clusterInfo;
run;

/* Printing the scored and output data sets */
proc print data=casuser.Score_GMM (obs=10); run;
proc print noobs data=casuser.clustersum; run;
proc print noobs data=casuser.clustercov; run;

 

The above statements run the GMM procedure and write the results to CAS tables and ODS tables. In the GMM procedure, the Gaussian mixture model is Bayesian: the model parameters (means, covariances, and mixture proportions) are drawn from prior distributions, observations are generated from these parameters through their likelihood distributions, and the parameters are estimated by their posterior distributions.
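For reference, the Gaussian mixture density has the standard form (generic notation, not PROC GMM's internal parameterization)

$$p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,$$

where the mixture proportions $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$ are exactly the parameters that receive prior distributions in the Bayesian formulation.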

 

In PROC GMM, the number of clusters is controlled by the Dirichlet process. You can specify the mass parameter of the Dirichlet process using the ALPHA option. A larger value for ALPHA tends to discover more clusters in the input data. By default, ALPHA is set to 1. Additionally, you can set the maximum number of possible clusters using the MAXCLUSTERS option. This option specifies the upper limit for the number of clusters that the algorithm can generate.

 

In this example, COVARIANCE=FULL is specified so that each cluster gets its own unrestricted covariance matrix, capturing possible correlations among the input variables within clusters. Also, the MAXVBITER= option is set to a large value and the THRESHOLD= option to a small value to allow the variational Bayesian estimation sufficient opportunity to converge. The VB inference that PROC GMM uses offers a choice of two covariance matrix types for the Gaussian distributions in the mixture: diagonal and full.
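As a quick sketch of how these options interact, the call below reruns the model with a smaller mass parameter and the diagonal covariance type. The option names mirror the node code above, but the value ALPHA=0.1 is purely illustrative, and the number of clusters discovered will differ by data set and settings.

/* Illustrative rerun (sketch): a smaller ALPHA tends to yield fewer clusters,
   and COVARIANCE=DIAGONAL assumes uncorrelated inputs within each cluster */
proc gmm data=&dm_data
         seed=12345
         maxClusters=50
         alpha=0.1
         inference=VB (maxVbIter=1000 covariance=diagonal threshold=0.001);
   input %dm_interval_input;
   ods select modelInfo clusterInfo;
run;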

 

The output is stored in the casuser.Score_GMM data set, where casuser is a CAS library reference. The first ten observations of the scored data are printed for review. The GMM procedure also generates the cluster covariance table casuser.clusterCov and the cluster summary table casuser.clusterSum.

 

Finally, cluster visualizations are generated using two-dimensional scatter plots of the MeanHHSz, MedHHInc, and RegDens variables, with each point colored by its cluster number.

 

/* Creating cluster visualizations */
ods graphics / reset width=4in height=4in imagename='Clusters';

proc sgplot data=casuser.Score_GMM;
   scatter x=MeanHHSz y=MedHHInc / group=_predicted_cluster_
           markerattrs=(symbol=CircleFilled size=9);
run;

proc sgplot data=casuser.Score_GMM;
   scatter x=MedHHInc y=RegDens / group=_predicted_cluster_
           markerattrs=(symbol=CircleFilled size=9);
run;

proc sgplot data=casuser.Score_GMM;
   scatter x=RegDens y=MeanHHSz / group=_predicted_cluster_
           markerattrs=(symbol=CircleFilled size=9);
run;

 

 

Now that the model has been fitted, let's review the results of the GMM SAS Code node.

 

[Figure: Cluster Information table]

 

The Cluster Information table provides basic information about the clustering results. This information includes the cluster weights, cluster means, and cluster variances (the diagonal elements of the cluster covariances).

 

[Figure: Model Information table]

 

The Model Information table displays basic information about the parameters that are used in the Gaussian mixture model. The GMM procedure used the Dirichlet process in the Gaussian mixture model and identified 19 as the optimal number of clusters for the input data. It used a parallel variational Bayesian (VB) method of model inference. Variational Bayesian methods are a family of techniques for approximating the intractable integrals that arise in the posterior distributions of Bayesian models with computable alternative distributions.
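In generic terms, VB replaces the intractable posterior over the cluster assignments $Z$ and parameters $\theta$ with a tractable approximating distribution $q$ and maximizes the evidence lower bound (ELBO) on the marginal log likelihood:

$$\log p(X) \;\ge\; \mathbb{E}_{q}\big[\log p(X, Z, \theta)\big] \;-\; \mathbb{E}_{q}\big[\log q(Z, \theta)\big].$$

This iterative optimization is what the MAXVBITER= and THRESHOLD= options in the code above control.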

   

[Figure: Scored data set]

 

[Figure: Scored data set (continued)]

 

In the scored data set, the _PREDICTED_CLUSTER_ variable contains the cluster assignments, and the _CLUSTER_k_ variables hold the posterior weights, which are the soft clustering values.
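Because the posterior weights are available per observation, you can inspect how confident each assignment is. The following DATA step is a minimal sketch: it assumes the posterior-weight columns follow the _CLUSTER_k_ naming shown above, and the 0.6 cutoff is an arbitrary illustration, not a recommended value.

/* Flag observations whose strongest posterior weight is below 0.6 (sketch) */
data casuser.ambiguous;
   set casuser.Score_GMM;
   array post{*} _CLUSTER_: ;      /* all posterior-weight columns */
   max_post = max(of post{*});     /* weight of the best-fitting cluster */
   if max_post < 0.6;              /* keep only the ambiguous observations */
run;

proc print data=casuser.ambiguous (obs=10); run;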

 

[Figure: Cluster Summary table]

 

In the Cluster Summary table, you can see how PROC GMM assigned the observations across the 19 clusters, along with the cluster centers it found.

 

In contrast to the MBC solution, where outliers were not assigned to any cluster, the GMM solution incorporates these outliers into a more flexible cluster framework that allows a higher number of clusters. Some of the smallest clusters contain as few as 2 or 6 observations, while the largest encompass 7,849 and 6,908 observations. The GMM procedure assumes that each data point is generated from a mixture of Normal distributions, so the noise is inherently modeled as part of these distributions: it can be thought of as the variability within each of the Gaussian components that make up the mixture.
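To verify these cluster sizes yourself, a one-way frequency table of the predicted cluster variable is enough; this is a minimal sketch using the scored table produced above.

/* Tabulating how many observations fall into each cluster */
proc freq data=casuser.Score_GMM;
   tables _predicted_cluster_ / nocum;
run;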

[Figure: Cluster scatter plots for the three variable pairs]

 

The marker symbol and color indicate which cluster each point was assigned to. These results depict nineteen clusters, exhibiting varying degrees of correlation among the variables. The marker group is determined by the _PREDICTED_CLUSTER_ variable in the output data set.

 

Concluding Remarks

 

PROC GMM is a soft clustering method that assumes each datum is generated from a mixture of Normal distributions, generalizing k-means clustering to include information about the data's covariance and the centers of the latent Gaussian models. It uses a Dirichlet process to determine the best number of clusters and is particularly useful for data exploration.

 

If you have a general idea of the number of clusters and prefer to proceed without imputing missing values and/or identifying outliers, then PROC MBC is suitable. On the other hand, if you are uncertain about the number of clusters and want a more flexible solution in which every observation is assigned to one of a set of mutually exclusive clusters, rather than set aside as noise, then PROC GMM is the preferred choice.

 

Unlike conventional clustering methods, both PROC GMM and PROC MBC can generate "soft" clusterings through posterior probabilities. This means you need not rely solely on hard cluster memberships; the soft clustering values allow for more nuanced exploration.

 

 

Find more articles from SAS Global Enablement and Learning here.

