Model-Based Clustering (Part 2): A Detailed Look at the MBC Procedure

In the previous installment of this blog series, Model-Based Clustering (Part 1): Exploring Its Significance, you were introduced to model-based clustering, its advantages, and the available models within SAS Viya.

We will now apply PROC MBC to analyze the Census2000 dataset, which provides a summary of the 2000 United States Census at the postal code level. The goal of the analysis is to group geographical regions in the United States into distinct subsets based on three demographic factors: average household size in the region (MeanHHSz), median household income in the region (MedHHInc), and region population density percentile (RegDens, where 1=lowest density and 100=highest density). These factors are common to commercial lifestyle and life-stage segmentation products, so the data is well suited to creating life-stage and lifestyle segments with clustering techniques.

Although model-based clustering is available in SAS Viya only through code, we can still use it within the Model Studio pipelines interface. Because there is no dedicated node for MBC, we added a SAS Code node to a pipeline, as demonstrated below.


The results from the Data Exploration node showed a notable spike in the histogram of MeanHHSz at or near zero.

A household size of zero is not plausible in census data. Consequently, a lower limit of 0.1 is specified in the metadata, and all observations with an average household size below 0.1 are filtered out before further analysis. The Filtering node excluded 1081 cases with a household size of zero. Next, a Clustering node is added downstream to run traditional k-means clustering for comparison. k-means clustering suggests a five-cluster solution. If you're interested in the nitty-gritty of k-means clustering, I recommend the course Advanced Machine Learning Using SAS® Viya®.

A SAS Code node was inserted below the Filtering node to execute the MBC procedure for fitting a Gaussian mixture model.

Fitting a Gaussian Mixture Model

First, let's fit a Gaussian mixture model. The following code is used in the MBC SAS Code node:

/* Executing Gaussian mixture model using MBC procedure */
proc mbc data=&dm_data nclusters=(2 3 4) noise=(n y) covstruct=(vii vvi vvv eev);
   var %dm_interval_input;
   output out=casuser.Score_MBC copyvars=(ID %dm_interval_input) maxpost nextclus=wt;
run;

/* Printing scored data */
proc print data=casuser.Score_MBC (obs=10);
run;

&dm_data is a macro variable that identifies the CAS training table from the preceding node.

There are three options that control the models that will be evaluated:

The NCLUSTERS option specifies a list of values for the number of multivariate Gaussian clusters: here two, three, or four. PROC MBC fits a separate model for each value in the list.
The NOISE option specifies whether the model should include a uniform noise component. The existence of outliers is a good motivation to include a noise component in the model, because the outliers can otherwise distort the cluster structure in the model. Specifying yes (or y) and no (or n) will generate models both with and without a noise component.
The COVSTRUCT option specifies the covariance structure to use in the model. Nine covariance structures are available for fitting Gaussian mixture models: COVSTRUCT=EEE | EEI | EEV | EII | EVI | EVV | VII | VVI | VVV. See the Details: Covariance Structure section of SAS Documentation.

Each element in these lists is combined with the elements of the other lists, giving 24 models in all (3 cluster counts × 2 noise options × 4 covariance structures): 12 with a noise component and 12 without.
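To make the count concrete, the following DATA step is a minimal sketch that enumerates the same option combinations. The table name work.mbc_grid and its column names are illustrative assumptions; PROC MBC neither requires nor produces this table.

```sas
/* Sketch: list every candidate model implied by the three option lists. */
/* work.mbc_grid and its column names are hypothetical, for illustration. */
data work.mbc_grid;
   length covstruct $3 noise $1;
   do nclusters = 2 to 4;                            /* NCLUSTERS=(2 3 4)           */
      do noise = 'n', 'y';                           /* NOISE=(n y)                 */
         do covstruct = 'vii', 'vvi', 'vvv', 'eev';  /* COVSTRUCT=(vii vvi vvv eev) */
            output;                                  /* one row per candidate model */
         end;
      end;
   end;
run;

proc print data=work.mbc_grid;
run;
```

The resulting table has 24 rows, one per candidate model.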

The output is stored in the casuser.Score_MBC data set, where casuser refers to a CAS library reference. The first ten observations of the scored data are printed for review.

Finally, cluster visualizations are generated using two-dimensional and three-dimensional scatter plots of the MeanHHSz, MedHHInc and RegDens variables, labeling each point with a cluster number.

/* Creating cluster visualizations */
ods graphics / reset width=4in height=4in imagename='Clusters';

proc sgplot data=casuser.Score_MBC;
   scatter x=MeanHHSz y=MedHHInc / group=maxpost markerattrs=(symbol=CircleFilled size=9);
run;

proc sgplot data=casuser.Score_MBC;
   scatter x=MedHHInc y=RegDens / group=maxpost markerattrs=(symbol=CircleFilled size=9);
run;

proc sgplot data=casuser.Score_MBC;
   scatter x=RegDens y=MeanHHSz / group=maxpost markerattrs=(symbol=CircleFilled size=9);
run;

/* Create Shape and Color columns based on cluster membership in MaxPost column */
proc fedsql sessref=&dm_cassessref;
   create table casuser.Score_MBC_new {options replace=true} as
   select *,
      case
         when maxpost=0 then 'balloon'
         when maxpost=1 then 'club'
         when maxpost=2 then 'diamond'
         when maxpost=3 then 'spade'
         when maxpost=4 then 'heart'
         when maxpost is null then 'point'   /* missing assignment */
      end as Shape,
      case
         when maxpost=0 then 'Burlywood'
         when maxpost=1 then 'DodgerBlue'
         when maxpost=2 then 'Crimson'
         when maxpost=3 then 'LimeGreen'
         when maxpost=4 then 'MediumOrchid'
         when maxpost is null then 'Grey'    /* missing assignment */
      end as Color
   from casuser.Score_MBC;
quit;

proc g3d data=casuser.Score_MBC_new;
   note j=r "Clusters:" c=Burlywood "Outliers" j=r c=DodgerBlue "Cluster 1" j=r c=Crimson "Cluster 2" j=r c=LimeGreen "Cluster 3" j=r c=MediumOrchid "Cluster 4" j=r c=Grey "Missing";
   scatter RegDens*MedHHInc = MeanHHSz / color=Color shape=Shape size=1 noneedle;
run;

Now that we have fit the model, let's examine the results of the MBC SAS Code node.

The Model Information table displays basic information about the model. The selected model has four Gaussian clusters plus a noise component. Estimation is carried out using the expectation-maximization (EM) algorithm, and the best model is selected based on the Bayesian information criterion (BIC).

The parameter estimates and mixing probability estimates describe the mixture model that was selected. The parameter estimates give the mean and covariance of each Gaussian cluster. In the Mixing Probability Estimates table, the numbering starts at zero because this model includes a noise component; component zero is the noise component. These values correspond to the prior weights in the mixture model likelihood that is presented in the Log-Likelihood Definitions section of the SAS documentation.
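For orientation, the log-likelihood of a Gaussian mixture with a uniform noise component is commonly written as shown below. This is the standard textbook form rather than a quotation from the SAS documentation, and the notation is assumed here: G is the number of Gaussian clusters, \pi_k are the mixing probabilities, and V is the volume of the region covered by the uniform noise component.

```latex
\log L(\theta) = \sum_{i=1}^{n} \log\!\left( \frac{\pi_0}{V}
      + \sum_{k=1}^{G} \pi_k \, \phi(x_i \mid \mu_k, \Sigma_k) \right),
\qquad \sum_{k=0}^{G} \pi_k = 1, \quad \pi_k \ge 0
```

Here \phi is the multivariate Gaussian density with mean \mu_k and covariance \Sigma_k, and \pi_0 is the weight of the noise component, which is why the component numbering starts at zero when noise is included.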

The Model Selection Summary table displays several likelihood-based measures of fit for the specified models. The table also includes the number of parameters that are used in the computation of the fit statistics. The top model is the selected one because it has the lowest BIC value.
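As a reminder, BIC is conventionally defined as shown below, where \hat{L} is the maximized likelihood, p is the number of estimated parameters, and n is the number of observations. The sign convention shown is the one under which smaller values are better, which matches the selection rule described above; it is an assumption here, so check the procedure documentation for the exact definition that PROC MBC reports.

```latex
\mathrm{BIC} = -2 \log \hat{L} + p \log n
```

The p \log n term penalizes models with more parameters, so a flexible structure such as VVV has to improve the likelihood enough to justify its additional covariance parameters.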

In the scored data set, the MAXPOST variable contains the cluster assignments, and the WTk variables hold the posterior weights. You can examine these soft clustering values, which are stored in the wt0, wt1, wt2, wt3, and wt4 variables in the output data set. The wt0 values indicate strength of association with the noise cluster.
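One way to use the soft clustering values is to look for observations that no single component claims strongly. The following is a minimal sketch under stated assumptions: the 0.6 cutoff and the names work.low_conf, top_wt, and wt_sum are illustrative choices, not part of the PROC MBC output.

```sas
/* Sketch: flag observations whose cluster assignment is ambiguous.   */
/* The 0.6 cutoff and the names work.low_conf, top_wt, and wt_sum are */
/* hypothetical, chosen only for illustration.                        */
data work.low_conf;
   set casuser.Score_MBC;
   top_wt = max(of wt0-wt4);   /* largest posterior weight for the row  */
   wt_sum = sum(of wt0-wt4);   /* posterior weights should sum to about 1 */
   /* keep rows with non-missing weights and no dominant component */
   if not missing(top_wt) and top_wt < 0.6;
run;

proc print data=work.low_conf (obs=10);
run;
```

Rows that survive the filter sit between clusters, or between a cluster and the noise component, and are worth inspecting before the segments are used downstream.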

The marker symbol and color indicate the cluster to which each point was assigned, as determined by the MAXPOST variable in the output data set. These results show four clusters, each with strong correlation, and some background noise represented by blue dots. Clusters 1 and 2 are not clearly visible in these two-dimensional scatter plots. A three-dimensional scatter plot might provide better clarity.

All clusters are clearly visible in the three-dimensional scatter plot. Outliers are indicated by beige balloon markers. These outliers typically have extreme values in one or more inputs, placing them far from or outside the dense cloud of each cluster.
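To complement the plots with counts, a frequency table of the assignments can be produced as sketched below. This step is an addition for illustration and is not part of the original SAS Code node.

```sas
/* Sketch: count observations per assigned cluster (0 = noise component). */
proc freq data=casuser.Score_MBC;
   tables maxpost / missing;   /* MISSING keeps rows with no assignment in the table */
run;
```

The size of group 0 shows how many observations the model treated as outliers rather than as members of a Gaussian cluster.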

Concluding Remarks

PROC MBC fits mixtures of multivariate Gaussian and uniform distributions for unsupervised and semisupervised clustering of data. It treats cluster memberships as missing data and uses the expectation-maximization (EM) algorithm to maximize the likelihood. Unlike other clustering procedures, PROC MBC can produce "soft" clusterings based on posterior probabilities, indicating relative strengths of affinity that each observation has for each cluster in the model.

The next installment of this series, Model-Based Clustering (Part 3): The GMM Procedure Demystified, will cover the implementation of PROC GMM.

Find more articles from SAS Global Enablement and Learning here.
