
Model-Based Clustering (Part 3): The GMM Procedure Demystified


In the first installment of this series, Model-Based Clustering (Part 1): Exploring Its Significance, we explored the concept of model-based clustering, highlighting its benefits and surveying the models offered by SAS Viya. Then, in Model-Based Clustering (Part 2): A Detailed Look at the MBC Procedure, we walked through the practical steps of implementing model-based clustering using the MBC procedure.

 

We will now apply PROC GMM to analyze the Census2000 data set, which summarizes the 2000 United States Census at the postal code level. Recall that the goal of the analysis is to group geographical regions of the United States into distinct subsets based on three demographic factors: average household size in the region (MeanHHSz), median household income in the region (MedHHInc), and region population density percentile (RegDens; 1 = lowest density, 100 = highest density). These factors are common to commercial lifestyle and life-stage segmentation products.

 

Although model-based clustering in SAS Viya is executed solely through code, that doesn't prevent us from leveraging it within the Model Studio pipeline interface. Even though there is no dedicated node for GMM, just as there is none for the MBC procedure, a SAS Code node integrates seamlessly into a pipeline, as illustrated below.

[Figure: Model Studio pipeline with the GMM SAS Code node]


 

The importance of incorporating a Data Exploration node, a Filtering node, and a Clustering node was discussed in the preceding installment of this series.

 

Fitting a Nonparametric Bayesian Gaussian Mixture Model

 

Now, let's fit a nonparametric Bayesian Gaussian mixture model. The following code is used in the GMM SAS Code node:

 

/* Fitting a nonparametric Bayesian Gaussian mixture model with the GMM procedure */
proc gmm data=&dm_data
         seed=12345
         maxClusters=50
         alpha=1
         inference=VB (maxVbIter=1000 covariance=full threshold=0.001)
         clusterSumOut=casuser.clustersum
         clusterCovOut=casuser.clustercov;
   input %dm_interval_input;
   score out=casuser.Score_GMM copyvars=(_ALL_);
   ods select nObs descStats modelInfo clusterInfo;
run;

/* Printing the scored and output data sets */
proc print data=casuser.Score_GMM (obs=10); run;
proc print noobs data=casuser.clustersum; run;
proc print noobs data=casuser.clustercov; run;

 

The above statements run the GMM procedure and write the results to CAS tables and ODS tables. In the GMM procedure, the Gaussian mixture model is Bayesian: the model parameters (means, covariances, and mixture proportions) are drawn from prior distributions, observations are generated from these parameters through their likelihood distributions, and the parameters are estimated by their posterior distributions.
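For reference, the Gaussian mixture density has the standard form (generic notation, not PROC GMM's internal parameterization)

$$p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,$$

where the mixture proportions $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$ are exactly the parameters that receive prior distributions in the Bayesian formulation.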

 

In PROC GMM, the number of clusters is controlled by the Dirichlet process. You can specify the mass parameter of the Dirichlet process using the ALPHA option. A larger value for ALPHA tends to discover more clusters in the input data. By default, ALPHA is set to 1. Additionally, you can set the maximum number of possible clusters using the MAXCLUSTERS option. This option specifies the upper limit for the number of clusters that the algorithm can generate.

 

In this example, COVARIANCE=FULL is specified so that each cluster gets its own unrestricted covariance matrix, capturing possible correlations among the input variables within clusters. Also, the MAXVBITER= option is set to a large value and the THRESHOLD= option to a small value to allow the variational Bayesian estimation sufficient opportunity to converge. The VB inference that PROC GMM uses offers a choice of two covariance matrix types for the Gaussian distributions in the mixture: diagonal and full.
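As a quick sketch of how these options interact, the call below reruns the model with a smaller mass parameter and the diagonal covariance type. The option names mirror the node code above, but the value ALPHA=0.1 is purely illustrative, and the number of clusters discovered will differ by data set and settings.

/* Illustrative rerun (sketch): a smaller ALPHA tends to yield fewer clusters,
   and COVARIANCE=DIAGONAL assumes uncorrelated inputs within each cluster */
proc gmm data=&dm_data
         seed=12345
         maxClusters=50
         alpha=0.1
         inference=VB (maxVbIter=1000 covariance=diagonal threshold=0.001);
   input %dm_interval_input;
   ods select modelInfo clusterInfo;
run;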

 

The output is stored in the casuser.Score_GMM data set, where casuser is a CAS library reference. The first ten observations of the scored data are printed for review. The GMM procedure also generates the cluster covariance table casuser.clusterCov and the cluster summary table casuser.clusterSum.

 

Finally, cluster visualizations are generated using two-dimensional scatter plots of the MeanHHSz, MedHHInc, and RegDens variables, with each point colored by its cluster number.

 

/* Creating cluster visualizations */
ods graphics / reset width=4in height=4in imagename='Clusters';

proc sgplot data=casuser.Score_GMM;
   scatter x=MeanHHSz y=MedHHInc / group=_predicted_cluster_
           markerattrs=(symbol=CircleFilled size=9);
run;

proc sgplot data=casuser.Score_GMM;
   scatter x=MedHHInc y=RegDens / group=_predicted_cluster_
           markerattrs=(symbol=CircleFilled size=9);
run;

proc sgplot data=casuser.Score_GMM;
   scatter x=RegDens y=MeanHHSz / group=_predicted_cluster_
           markerattrs=(symbol=CircleFilled size=9);
run;

 

 

Now that the model has been fitted, let's review the results of the GMM SAS Code node.

 

[Figure: Cluster Information table]

 

The Cluster Information table provides basic information about the clustering results. This information includes the cluster weights, cluster means, and cluster variances (the diagonal elements of the cluster covariances).

 

[Figure: Model Information table]

 

The Model Information table displays basic information about the parameters that are used in the Gaussian mixture model. The GMM procedure used the Dirichlet process in the Gaussian mixture model and identified 19 as the optimal number of clusters for the input data. It used a parallel variational Bayesian (VB) method of model inference. Variational Bayesian methods are a family of techniques for approximating the intractable integrals that arise in the posterior distributions of Bayesian models with computable alternative distributions.
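In generic terms, VB replaces the intractable posterior over the cluster assignments $Z$ and parameters $\theta$ with a tractable approximating distribution $q$ and maximizes the evidence lower bound (ELBO) on the marginal log likelihood:

$$\log p(X) \;\ge\; \mathbb{E}_{q}\big[\log p(X, Z, \theta)\big] \;-\; \mathbb{E}_{q}\big[\log q(Z, \theta)\big].$$

This iterative optimization is what the MAXVBITER= and THRESHOLD= options in the code above control.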

   

[Figure: Scored data set]

 

[Figure: Scored data set (continued)]

 

In the scored data set, the _PREDICTED_CLUSTER_ variable contains the cluster assignments, and the _CLUSTER_k_ variables hold the posterior weights, which are the soft clustering values.
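Because the posterior weights are available per observation, you can inspect how confident each assignment is. The following DATA step is a minimal sketch: it assumes the posterior-weight columns follow the _CLUSTER_k_ naming shown above, and the 0.6 cutoff is an arbitrary illustration, not a recommended value.

/* Flag observations whose strongest posterior weight is below 0.6 (sketch) */
data casuser.ambiguous;
   set casuser.Score_GMM;
   array post{*} _CLUSTER_: ;      /* all posterior-weight columns */
   max_post = max(of post{*});     /* weight of the best-fitting cluster */
   if max_post < 0.6;              /* keep only the ambiguous observations */
run;

proc print data=casuser.ambiguous (obs=10); run;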

 

[Figure: Cluster Summary table]

 

In the Cluster Summary table, you can see how PROC GMM assigned the observations across the 19 clusters, along with the cluster centers it found.

 

In contrast to the MBC solution, where outliers were not assigned to any cluster, the GMM solution incorporates these outliers into a more flexible cluster framework that allows a higher number of clusters. Some of the smallest clusters contain as few as 2 or 6 observations, while the largest encompass 7,849 and 6,908 observations. The GMM procedure assumes that each data point is generated from a mixture of Normal distributions, so the noise is inherently modeled as part of these distributions: it can be thought of as the variability within each of the Gaussian components that make up the mixture.
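To verify these cluster sizes yourself, a one-way frequency table of the predicted cluster variable is enough; this is a minimal sketch using the scored table produced above.

/* Tabulating how many observations fall into each cluster */
proc freq data=casuser.Score_GMM;
   tables _predicted_cluster_ / nocum;
run;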

[Figure: Cluster scatter plots for the three variable pairs]

 

The marker symbol and color indicate which cluster each point was assigned to. These results depict nineteen clusters, exhibiting varying degrees of correlation among the variables. The marker group is determined by the _PREDICTED_CLUSTER_ variable in the output data set.

 

Concluding Remarks

 

PROC GMM is a soft clustering method that assumes each datum is generated from a mixture of Normal distributions, generalizing k-means clustering to include information about the data's covariance and the centers of the latent Gaussian models. It uses a Dirichlet process to determine the best number of clusters and is particularly useful for data exploration.

 

If you have a general idea of the number of clusters and prefer to proceed without imputing missing values and/or identifying outliers, then PROC MBC is suitable. On the other hand, if you are uncertain about the number of clusters and want a more flexible solution in which every observation is assigned to one of a set of mutually exclusive clusters, rather than set aside as noise, then PROC GMM is the preferred choice.

 

Unlike conventional clustering methods, both PROC GMM and PROC MBC can generate "soft" clusterings through posterior probabilities. This means you need not rely solely on hard cluster memberships; the soft clustering values allow for more nuanced exploration.

 

 

Find more articles from SAS Global Enablement and Learning here.

