Cluster Profiling is Right, Part 2: Graphics

In this post, I follow up on my previous post about cluster profiling, “Don't Listen to Ron White. Cluster Profiling is Right!” There, I argued that profiling after clustering or segmentation is a necessary part of the clustering process. Here, I add some graphical techniques that make differences between clusters more apparent, which helps with profiling.

In the last post, I showed how PROC FASTCLUS in SAS® reports tables of cluster means and cluster standard deviations. These tables can be used as a rudimentary start to cluster profiling. We simply look at the mean vectors for each of the clusters and see how the clusters differ from one another.
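As a reminder of where those tables come from, here is a minimal sketch of a PROC FASTCLUS call. The data set names work.countries and work.clusout are hypothetical, and the VAR list is an assumption based on the profiling variables discussed later in this post. MAXCLUSTERS=3 matches the three clusters profiled here.

```sas
/* Hypothetical input (work.countries) and output (work.clusout) data sets. */
/* The VAR list is assumed, not taken from the original analysis.          */
proc fastclus data=work.countries maxclusters=3 out=work.clusout;
    var AdultLiteracypct FemaleSchoolpct MaleSchoolpct totalFR;
run;
```

The OUT= data set adds a CLUSTER variable holding each observation's cluster assignment, which is the variable the plots later in this post are stratified by.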

As I mentioned there, this technique can be misleading because it summarizes each profiling variable with a single point and ignores its distribution. Tables of numbers also become hard to scan as the number of profiling variables or clusters grows.

An intuitive graphical technique for analyzing cluster differences is the comparative histogram. The code for this example uses PROC SGPANEL. I’ll use some simple SAS® macro code to reduce redundancy.

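/* Create a macro to generate a profile plot for each profiling variable. */
%macro profilepanel(dsn=,clusvars=);
    %local k dep;

    /* Start with the first variable in the space-delimited list. */
    %let k=1;
    %let dep = %scan(&clusvars, &k);

    /* Loop until %SCAN returns an empty string (no more variables). */
    %do %while(&dep NE);
        /* One panel plot per variable, with one row per cluster. */
        proc sgpanel data=&dsn;
            panelby cluster / columns=1 onepanel;
            histogram &dep / scale=percent;
            density &dep / type=kernel;
            rowaxis max=100;
        run;

        /* Advance to the next variable in the list. */
        %let k = %eval(&k + 1);
        %let dep = %scan(&clusvars, &k);
    %end;
%mend profilepanel;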

The critical part of the code is in the PROC SGPANEL step.

The PANELBY statement names the stratification variable. The options COLUMNS=1 and ONEPANEL ensure that all cluster histograms appear in one plot, with the clusters stacked in a single column.

The HISTOGRAM statement requests histograms, using percent rather than frequency on the Y axis.

The DENSITY statement requests overlaid kernel density curves on the histograms, easing visual comparisons.

The ROWAXIS statement specifies that the axis values for the Y Axis (Rows) display a maximum of 100 (for 100 percent).
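Calling the macro might look like the sketch below. The data set name work.clusout is hypothetical; it stands for any data set that contains the CLUSTER variable and the profiling variables. The variable names are the ones that appear in the results later in this post.

```sas
/* work.clusout is a hypothetical data set containing CLUSTER */
/* and the profiling variables listed in CLUSVARS.            */
%profilepanel(dsn=work.clusout,
              clusvars=AdultLiteracypct FemaleSchoolpct MaleSchoolpct totalFR);
```

Each variable named in CLUSVARS= produces its own panel plot, so this call generates four plots with three stacked histograms apiece.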


Cluster 1 (top of each panel plot) seems to profile as low literacy rates, low primary school enrollment (both male and female), and high fertility rate.

Cluster 3 (bottom of each panel plot) profiles as high literacy rates, high primary school enrollment, and low fertility rate.

Cluster 2 (middle of each panel plot) profiles as moderate in all measures of literacy, primary school enrollment, and fertility rate.

When there are many clusters, you would need to compare many panels. When there are many profiling variables, you would need to generate many panel plots. Trying to simultaneously profile several clusters on several variables becomes a bit unwieldy.

Another approach to illustrating differences between clusters and profiling them uses four steps:

1. Create k binary indicator variables, one for each of the k clusters. Each is coded 1 for members of that cluster and 0 otherwise.
2. Run k separate logistic regression analyses, using the profiling variables in the x role and the binary cluster indicator in the y role.
3. Retain all profiling variables that reach a predetermined alpha level and order them by p-value (smallest first).
4. For each of the k clusters, create comparative histogram plots that compare members of cluster i with members of all other clusters.

There is no single procedure for doing this. I am leaving the actual code out here, but I use PROC SQL for step 1, PROC LOGISTIC for steps 2 and 3, and PROC SGPLOT for step 4. I run the code for the various clusters and profiling variables using SAS Macro coding, employing %DO … %UNTIL loops. The results are below.

I am adding a variable, popUrban (percent of population in urban areas) to the list of profiling variables, even though it was not used in clustering. You are not limited to profiling using the clustering variables.
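The four steps can be sketched roughly as follows. This is not the code behind the results in this post; it is a simplified illustration under several assumptions. The macro name profilelogit, the data set names work.clusout and work.clus_ind, and the indicator names clus1, clus2, and so on are all hypothetical. The sketch uses forward selection with the DETAILS option to print the score tests for each candidate variable, and it plots every variable rather than only the retained ones, so the ordering by p-value in step 3 is left to the reader. The GROUP= option on the HISTOGRAM and DENSITY statements in PROC SGPLOT requires a reasonably recent SAS 9.4 maintenance release.

```sas
/* A sketch only: names below are hypothetical, not from the original analysis. */
%macro profilelogit(dsn=, k=, vars=);
    %local i j var;

    /* Step 1: one 0/1 indicator per cluster (clus1, clus2, ...). */
    proc sql;
        create table work.clus_ind as
        select *
            %do i = 1 %to &k;
                , (cluster = &i) as clus&i
            %end;
        from &dsn;
    quit;

    %do i = 1 %to &k;
        /* Steps 2 and 3: score tests for each candidate profiling variable. */
        /* SLENTRY=0.05 is an assumed alpha level.                           */
        title "Cluster &i vs. Not Cluster &i";
        proc logistic data=work.clus_ind;
            model clus&i(event='1') = &vars
                  / selection=forward slentry=0.05 details;
        run;

        /* Step 4: overlaid histograms, cluster i versus all other clusters. */
        %let j = 1;
        %let var = %scan(&vars, &j);
        %do %while(&var NE);
            proc sgplot data=work.clus_ind;
                histogram &var / group=clus&i scale=percent transparency=0.5;
                density &var / type=kernel group=clus&i;
            run;
            %let j = %eval(&j + 1);
            %let var = %scan(&vars, &j);
        %end;
    %end;
    title;
%mend profilelogit;

%profilelogit(dsn=work.clusout, k=3,
              vars=AdultLiteracypct FemaleSchoolpct MaleSchoolpct totalFR popUrban);
```

Because the clustering variables tend to separate the clusters well, PROC LOGISTIC may issue separation warnings when it fits the selected models. The score tests computed before any variable enters the model do not depend on those fits.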

Cluster 1 vs. Not Cluster 1

Analysis of Effects Eligible for Entry
Effect	DF	Score Chi-Square	Pr > ChiSq
AdultLiteracypct	1	47.0500	<.0001
FemaleSchoolpct	1	58.8651	<.0001
MaleSchoolpct	1	51.2717	<.0001
totalFR	1	33.0478	<.0001
popUrban	1	13.7502	0.0002

These plots give the same information as the previous plots, but organized differently. Also, notice that Cluster 1 seems to have, on average, a smaller percentage of its population in urban areas than the other clusters.

Cluster 2 vs. Not Cluster 2

Analysis of Effects Eligible for Entry
Effect	DF	Score Chi-Square	Pr > ChiSq
AdultLiteracypct	1	20.2576	<.0001
FemaleSchoolpct	1	21.4569	<.0001
MaleSchoolpct	1	18.8542	<.0001
totalFR	1	35.4597	<.0001
popUrban	1	10.6113	0.0011

Cluster 2 seems to have, on average, a slightly smaller percentage of its population in urban areas than the other clusters.

Cluster 3 vs. Not Cluster 3

Analysis of Effects Eligible for Entry
Effect	DF	Score Chi-Square	Pr > ChiSq
AdultLiteracypct	1	67.2468	<.0001
FemaleSchoolpct	1	77.6231	<.0001
MaleSchoolpct	1	67.8925	<.0001
totalFR	1	78.1009	<.0001
popUrban	1	26.7824	<.0001

Cluster 3 seems to have, on average, a slightly greater percentage of its population in urban areas than the other clusters.

If you want to learn more about clustering, take a look at our Course: Applied Clustering Techniques (sas.com).

I hope that I have convinced you in this post that using graphs to aid in profiling generated clusters is more informative than analyzing tables of summary statistics. I have introduced two methods, but they are by no means the only ways to obtain information for profiling.

Find more articles from SAS Global Enablement and Learning here.
