In this post, I follow up on my previous post about cluster profiling. In that post, “Don't Listen to Ron White. Cluster Profiling is Right!” I provided an argument that profiling after clustering or segmentation is a necessary part of the clustering process. Here, I will add some graphical techniques to help make differences between clusters more apparent, helping us to profile.
In the last post, I showed how PROC FASTCLUS in SAS® reports tables of cluster means and cluster standard deviations. These tables can serve as a rudimentary start to cluster profiling. We simply look at the mean vector for each cluster and see how the clusters differ from one another.
I had mentioned that this technique can be misleading because it doesn’t take into account the distributions of the profiling variables, but only a single point from each. A table of numbers also becomes hard to read as it grows, such as when there are more profiling variables or more clusters.
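To make that point concrete, here is a minimal sketch of a numeric profile that goes beyond the mean. The data set name work.clusout is an assumption (for example, the OUT= data set from PROC FASTCLUS, which carries the CLUSTER variable); the variable names are the profiling variables used later in this post.

```sas
/* Sketch only: work.clusout is an assumed data set name. */
/* Adding quartiles gives a rough sense of each distribution, */
/* but the table grows with every extra variable and cluster. */
proc means data=work.clusout mean std min q1 median q3 max maxdec=2;
   class cluster;
   var AdultLiteracypct FemaleSchoolpct MaleSchoolpct totalFR;
run;
```

Even with three clusters and four variables, this produces twelve rows of seven statistics each, which is part of why the graphical approach is easier to read.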
An intuitive graphical technique for analyzing cluster differences is through comparative histograms. The coding for this example uses PROC SGPANEL. I’ll use some simple SAS® macro coding to reduce redundancy of code.
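/* Create a macro to generate a profile plot for each profiling variable. */
%macro profilepanel(dsn=, clusvars=);
   /* Keep the loop counter and current variable name local to the macro. */
   %local k dep;
   %let k = 1;
   %let dep = %scan(&clusvars, &k);
   /* Loop until %SCAN returns an empty string (no more variables). */
   %do %while(%length(&dep) > 0);
      proc sgpanel data=&dsn;
         /* One row per cluster, all stacked in a single panel. */
         panelby cluster / columns=1 onepanel;
         histogram &dep / scale=percent;
         density &dep / type=kernel;
         /* Cap the Y axis at 100 percent. */
         rowaxis max=100;
      run;
      %let k = %eval(&k + 1);
      %let dep = %scan(&clusvars, &k);
   %end;
%mend profilepanel;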
The critical part of the code is in the PROC SGPANEL step.
The PANELBY statement names the stratification variable. The options COLUMNS=1 and ONEPANEL ensure that all cluster histograms are in one plot, with the clusters organized in one column.
The HISTOGRAM statement requests histograms, using percent rather than frequency on the Y axis.
The DENSITY statement requests overlaid kernel density curves on the histograms, easing visual comparisons.
The ROWAXIS statement specifies that the axis values for the Y Axis (Rows) display a maximum of 100 (for 100 percent).
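As a usage sketch, the macro could be called as shown here. The data set name work.clusout is an assumption (any data set that contains the CLUSTER variable and the profiling variables would do); the variable names are the profiling variables that appear in the tables later in this post.

```sas
/* Sketch only: work.clusout is an assumed data set name.     */
/* Each variable in CLUSVARS= produces one panel plot with    */
/* the clusters stacked in a single column.                   */
%profilepanel(dsn=work.clusout,
              clusvars=AdultLiteracypct FemaleSchoolpct MaleSchoolpct totalFR);
```

The CLUSVARS= list is space-delimited because the macro walks through it with %SCAN, one variable name at a time.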
Cluster 1 (top of each panel plot) seems to profile as low literacy rates, low primary school enrollment (both male and female), and high fertility rate.
Cluster 3 (bottom of each panel plot) profiles as high literacy rates, high primary school enrollment, and low fertility rate.
Cluster 2 (middle of each panel plot) profiles as moderate in all measures of literacy, primary school enrollment, and fertility rate.
When there are many clusters, you would need to compare many panels. When there are many profiling variables, you would need to generate many panel plots. Trying to profile several clusters on several variables simultaneously becomes a bit unwieldy.
Another way to illustrate differences between clusters and profile them uses a four-step strategy:
There is no single procedure for doing this. I am leaving the actual code out here, but I use PROC SQL for step 1, PROC LOGISTIC for steps 2 and 3, and PROC SGPLOT for step 4. I run the code for the various clusters and profiling variables using SAS Macro coding, employing %DO … %UNTIL loops. The results are below.
I am adding a variable, popUrban (percent of population in urban areas) to the list of profiling variables, even though it was not used in clustering. You are not limited to profiling using the clustering variables.
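Since the actual code is not shown, what follows is only a guess at how a "cluster vs. not cluster" comparison could be set up, not the author's program. It assumes a data set named work.clusout with a numeric CLUSTER variable, and the macro name onevsrest, the temporary table work._flag, and the indicator inCluster are all hypothetical. PROC SQL builds a 0/1 membership flag, and PROC LOGISTIC with SELECTION=FORWARD and DETAILS prints an "Analysis of Effects Eligible for Entry" table; the first such table, produced while only the intercept is in the model, lists a score chi-square for each candidate variable.

```sas
/* Sketch only: data set, macro, and indicator names are assumptions. */
%macro onevsrest(dsn=, nclus=, vars=);
   %local c;
   %do c = 1 %to &nclus;
      /* Flag membership: 1 if the observation is in cluster &c, else 0. */
      proc sql;
         create table work._flag as
         select *, (cluster = &c) as inCluster
         from &dsn;
      quit;

      title "Cluster &c vs. Not Cluster &c";
      /* DETAILS requests the eligible-for-entry table at each step;  */
      /* the step-0 table holds the one-variable score tests.         */
      proc logistic data=work._flag;
         model inCluster(event='1') = &vars / selection=forward details;
      run;
      title;
   %end;
%mend onevsrest;

%onevsrest(dsn=work.clusout, nclus=3,
           vars=AdultLiteracypct FemaleSchoolpct MaleSchoolpct totalFR popUrban);
```

Only the step-0 table matters for profiling here. Later selection steps may issue separation warnings when a cluster is cleanly divided from the rest on some variable, and those steps can be ignored for this purpose.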
Cluster 1 vs. Not Cluster 1
Analysis of Effects Eligible for Entry
| Effect | DF | Score Chi-Square | Pr > ChiSq |
| AdultLiteracypct | 1 | 47.0500 | <.0001 |
| FemaleSchoolpct | 1 | 58.8651 | <.0001 |
| MaleSchoolpct | 1 | 51.2717 | <.0001 |
| totalFR | 1 | 33.0478 | <.0001 |
| popUrban | 1 | 13.7502 | 0.0002 |
These plots give the same information as the previous plots, but organized differently. Also, notice that countries in Cluster 1 seem to have, on average, a lower percentage of their population in urban areas than those in the other clusters.
Cluster 2 vs. Not Cluster 2
Analysis of Effects Eligible for Entry
| Effect | DF | Score Chi-Square | Pr > ChiSq |
| AdultLiteracypct | 1 | 20.2576 | <.0001 |
| FemaleSchoolpct | 1 | 21.4569 | <.0001 |
| MaleSchoolpct | 1 | 18.8542 | <.0001 |
| totalFR | 1 | 35.4597 | <.0001 |
| popUrban | 1 | 10.6113 | 0.0011 |
Cluster 2 seems to have, on average, a slightly lower percentage of its population in urban areas than the other clusters.
Cluster 3 vs. Not Cluster 3
Analysis of Effects Eligible for Entry
| Effect | DF | Score Chi-Square | Pr > ChiSq |
| AdultLiteracypct | 1 | 67.2468 | <.0001 |
| FemaleSchoolpct | 1 | 77.6231 | <.0001 |
| MaleSchoolpct | 1 | 67.8925 | <.0001 |
| totalFR | 1 | 78.1009 | <.0001 |
| popUrban | 1 | 26.7824 | <.0001 |
Cluster 3 seems to have, on average, a slightly higher percentage of its population in urban areas than the other clusters.
If you want to learn more about clustering, take a look at our Course: Applied Clustering Techniques (sas.com).
I hope that I have convinced you in this post that using graphs to aid in profiling generated clusters is more informative than analyzing tables of summary statistics. I have introduced two methods, but they are by no means the only ways to obtain information for profiling.