
Don’t Listen to Ron White. Cluster Profiling is Right!


Comedian Ron White is quoted as saying, “I had one DWI (Driving While Impaired charge), which was a bogus charge, because it turns out they were stopping every vehicle driving down that particular sidewalk. That’s profiling. And profiling is wrong.”

 

I must disagree with Mr. White.  When you are clustering or segmenting data, then profiling is right.

 

In this post, I will present a counterargument to Mr. White’s thesis.  I will illustrate my point using a basic example of cluster analysis and then explain why profiling is not only right, but also necessary in many applications.  Don't worry.  I'll remind you what cluster profiling is.

 

I am going to assume some basic familiarity with cluster analysis.  To briefly review, clustering (cluster analysis or segmentation) is a process of grouping units into homogeneous groups.  Group membership isn’t defined prior to clustering.  The goal is to find optimal groupings.

 

Therein lies the problem.  How do we define optimal?  If there were a target (or dependent) variable, we could use it as a “supervisor” to group the units.  Optimality would mean that the algorithm exactly replicates the groupings in the target variable.  That type of analysis is called supervised analysis.

 

We cluster or segment because we have no supervisor.  It’s like when kids are in school and the teacher hasn't yet arrived.  The unsupervised kids will group themselves however they want, and nobody can tell them what to do!  (Does this sound familiar to anyone?)  Sounds a bit like anarchy, doesn't it?  The difference is that in unsupervised clustering, the analyst would have some idea about which groupings make sense.

 

One of the most common methods for clustering is Ward’s Method – one of the hierarchical algorithms designed to minimize within-cluster variability (increase homogeneity among members) and maximize between-cluster variability (increase heterogeneity between clusters).  This process doesn’t always produce clusters that make sense to the analyst, but it works well enough to be very popular.
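To make Ward’s criterion concrete, here is a minimal pure-Python sketch (my own illustration, not a production algorithm): starting from singleton clusters, each step merges the pair of clusters whose union causes the smallest increase in within-cluster sum of squares.

```python
from itertools import combinations

def ward_merge_cost(ci, cj):
    """Increase in total within-cluster sum of squares if ci and cj are merged:
    (ni*nj)/(ni+nj) times the squared distance between the two cluster means."""
    ni, nj = len(ci), len(cj)
    mi = [sum(p[d] for p in ci) / ni for d in range(len(ci[0]))]
    mj = [sum(p[d] for p in cj) / nj for d in range(len(cj[0]))]
    return ni * nj / (ni + nj) * sum((a - b) ** 2 for a, b in zip(mi, mj))

def ward_cluster(points, k):
    """Agglomerate: repeatedly take the cheapest merge until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Run on two well-separated groups of points, the cheap merges all happen within a group, so the two groups are recovered intact.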

 

For larger data sets, hierarchical clustering methods become too computationally burdensome. In these cases we often eschew the term “clustering” in favor of “segmentation”. For segmentation we often turn to the k-means algorithm, which finds clusters much faster, with far less processing burden.

 

For simplicity, I’m going to show an example of k-means clustering.  I’ll employ SAS® PROC FASTCLUS for this purpose.

 

I'll use the Demographics data set in the SASHELP library.  The data set contains summary demographic information about various geographic regions. Each observation represents a different region.

 

In order to profile, we first need to tell PROC FASTCLUS how many clusters to form, using the MAXCLUSTERS= option in the PROC FASTCLUS statement. In this example, the clustering variables have already been standardized.  For simplicity, I limited the number of clusters to 3.

 

proc fastclus data=work.std_demographics
      maxiter=25
      maxclusters=3
      out=work.clus_demographics;
   var &inputs;
run;
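
For readers without SAS at hand, the same idea can be sketched in pure Python.  This is a minimal Lloyd’s-style k-means (my own illustrative code with made-up data, not the FASTCLUS implementation, which has its own seed-selection rules):

```python
import random

def kmeans(points, k, max_iter=25, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial seeds drawn from the data
    labels = [0] * len(points)
    for _ in range(max_iter):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                        for d in range(len(p))))
                  for p in points]
        # update step: each centroid becomes the mean of its assigned points
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append(tuple(sum(vals) / len(members) for vals in zip(*members))
                       if members else centroids[c])
        if new == centroids:                   # converged early
            break
        centroids = new
    return labels, centroids
```

As with MAXITER=25 in the PROC FASTCLUS call above, the max_iter argument caps the number of assign-and-update passes.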

 

One of the first tables presented shows the relative frequencies of subjects assigned to each of the three clusters.

 

Note: Since clustering was done on standardized variables, distances and summary statistics are also calculated on standardized variables.  The relative ordering of the clusters is maintained, but the actual values should not be interpreted.  You can un-standardize the cluster results and calculate summary statistics on the original values.  I will not do that here.
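
The back-transformation mentioned in the note is simple arithmetic: if a variable was standardized as z = (x − mean) / std, the original units are recovered as x = z × std + mean, using the means and standard deviations saved from the standardization step.  A tiny illustrative Python sketch:

```python
def standardize(values):
    """Center to mean 0 and scale by the (population) standard deviation."""
    m = sum(values) / len(values)
    s = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - m) / s for v in values], m, s

def unstandardize(zvalues, m, s):
    """Map standardized scores back to the original units: x = z*s + m."""
    return [z * s + m for z in zvalues]
```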

 

Cluster Summary

Cluster | Frequency | RMS Std Deviation | Max Distance from Seed to Observation | Radius Exceeded | Nearest Cluster | Distance Between Cluster Centroids
1       | 17        | 0.1618            | 0.4688                                |                 | 2               | 0.6536
2       | 42        | 0.1564            | 0.5283                                |                 | 1               | 0.6536
3       | 138       | 0.1060            | 0.4696                                |                 | 2               | 0.6714

 

It wouldn’t take very long to figure out that we still don’t really know much more than we did before clustering.  For instance, 42 subjects were assigned to Cluster 2, but how does that tell me why those regions were all clustered together?

 

Another table might help.  The Cluster Means table shows us how each cluster differs from others based on the average levels of each of the input variables used for clustering.  Without even trying, we’re starting to profile!

 

Cluster Means

Cluster | AdultLiteracypct | FemaleSchoolpct | MaleSchoolpct | totalFR
1       | 0.2677115987     | 0.1661092531    | 0.2358974359  | 0.7655838455
2       | 0.5766163793     | 0.5670874240    | 0.5946236559  | 0.5600568586
3       | 0.8879247190     | 0.8960560703    | 0.8923497268  | 0.1637464850
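
The Cluster Means table is just a per-cluster average of each input variable.  In Python terms, this is a group-by-and-average (an illustrative sketch with made-up values, not the SASHELP data):

```python
from collections import defaultdict

def cluster_means(rows, labels):
    """Average each variable within each cluster -- the raw material for a profile."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    return {lab: tuple(sum(col) / len(col) for col in zip(*members))
            for lab, members in sorted(groups.items())}

# Hypothetical standardized rows: (AdultLiteracypct, totalFR) for five regions
rows = [(0.27, 0.77), (0.26, 0.76), (0.89, 0.16), (0.88, 0.17), (0.57, 0.56)]
labels = [1, 1, 3, 3, 2]      # cluster assignment per region
profile = cluster_means(rows, labels)
```

Each entry of `profile` is one row of a means table like the one above, keyed by cluster number.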

 

From this table, it seems that Cluster 3 has the highest average levels of adult literacy and of male and female primary school enrollment, as well as the lowest average fertility rate.

 

To truly profile, we would want to describe a typical member of the cluster.  Here is where we can become imaginative, hopefully based on a healthy familiarity with the population and measures used in the study.  For example, I might say that Cluster 3 represents regions with well-educated populations and low rates of reproduction.  The profile for cluster 1 seems opposite to the profile for Cluster 3.  The residents of its regions seem to be less well educated, with higher fertility rates.  Cluster 2 is between Cluster 3 and Cluster 1 on every measure, so its profile is that of an average-educated, average fertility rate region.

 

Of course, things are not always so simple.  The mean is a single representative value for a cluster, but clusters with higher variability might make you more cautious about your profiles.  A quick glance at the standard deviations table might help.

 

Cluster Standard Deviations

Cluster | AdultLiteracypct | FemaleSchoolpct | MaleSchoolpct | totalFR
1       | 0.2008249959     | 0.1151262643    | 0.1619965354  | 0.1579345609
2       | 0.1458292396     | 0.1631853140    | 0.1636044856  | 0.1520751919
3       | 0.1154077906     | 0.0866904666    | 0.0988721415  | 0.1197421649

 

It is difficult to envision what these differences mean.  Also, this example is quite rudimentary, with only 3 clusters to compare on 4 profiling variables.  What if you were faced with 10 clusters and wanted to base your profiling on 8 variables?  Some graphics might help. I’ll write about a few graphical aids to cluster profiling in a future post.

 

If you want to learn more about clustering, take a look at our Course: Applied Clustering Techniques (sas.com).

 

In this post I hope that I have effectively contradicted Ron White’s assertion that “profiling is wrong”.  In many applications, profiling is not only the right thing to do, but essential to the entire clustering process.
