ycenycute
Obsidian | Level 7

In SAS EM, does the clustering node have an elbow method to select the optimal number of clusters, like the figure below?

 

Screen Shot 2021-10-24 at 23.19.03.png

 

I am aware of the CCC plot. What does the cubic clustering cutoff mean? How does it determine the optimal number of clusters, and why?

MelodieRush
SAS Employee
HPCluster has the ABC method to determine the optimal number of clusters. It's on the HPDM tab.




ycenycute
Obsidian | Level 7
Oh, thanks, good to know. Can you explain more about the ABC method? I got this figure. How do I interpret it, and what is the Gap on the vertical axis?

sbxkoenk
SAS Super FREQ

Hello,

 

Some more info on ABC:

 

SAS Enterprise Miner: High-Performance procedures
The HPCLUS Procedure
Finding the Number of Clusters

https://go.documentation.sas.com/doc/en/emhpprcref/14.2/emhpprcref_hpclus_details05.htm

 

SAS Video Portal:

The ABCs of Selecting Clusters

Brett Wujek talks about clustering, specifically about a relatively new methodology developed at SAS for determining a good or appropriate number of clusters for data called the Aligned Box Criterion, or ABC method.


https://video.sas.com/detail/video/4572850292001/the-abcs-of-selecting-clusters
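
On the "Gap" axis asked about above: the sketch below assumes the ABC plot follows the standard gap-statistic idea, where the within-cluster dispersion on the real data is compared with the dispersion obtained by clustering structureless reference data into the same number of clusters. It is plain Python rather than SAS, the dispersion numbers are made up for illustration, and the "first peak" rule shown is a simplification (the actual selection rule in HPCLUS may differ, so check the documentation linked above).

```python
import math

def gap_values(observed_w, reference_w):
    """Gap for each k: mean(log W_reference) - log W_observed.

    observed_w:  {k: within-cluster dispersion on the real data}
    reference_w: {k: [dispersion on each structureless reference data set]}
    """
    gaps = {}
    for k, w_obs in observed_w.items():
        ref_logs = [math.log(w) for w in reference_w[k]]
        gaps[k] = sum(ref_logs) / len(ref_logs) - math.log(w_obs)
    return gaps

def first_peak(gaps):
    """Smallest k whose gap is higher than the gap at the next k."""
    ks = sorted(gaps)
    for k, k_next in zip(ks, ks[1:]):
        if gaps[k] > gaps[k_next]:
            return k
    return ks[-1]

# Hypothetical dispersions, invented for this example only.
observed = {1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}
reference = {1: [110.0, 105.0], 2: [60.0, 58.0], 3: [40.0, 42.0],
             4: [30.0, 31.0], 5: [24.0, 25.0]}

gaps = gap_values(observed, reference)
best_k = first_peak(gaps)
```

A large gap at some k means the real data is much tighter at that k than structureless data would be. Preferring the first peak favors the simplest solution at which that advantage stops growing.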

 

Good luck,

Koen

ycenycute
Obsidian | Level 7
Thanks for the link. This is informative. But can you explain the intuition behind all these measures (CCC and ABC) without the math? Why do we want to pick the first peak as our optimal number of clusters?

I need to understand why, instead of just clicking and getting the results.
DougWielenga
SAS Employee

@ycenycute -- The thing to understand about any cluster selection approach ("elbow", CCC, ABC, etc.) is that there is no "right" answer. All of these approaches try to identify the point where adding more clusters yields diminishing returns, that is, where each additional cluster improves the fit by less and less. Since there is no "correct" number of clusters, it is common to generate several cluster solutions and evaluate the usefulness of each in light of your business/research questions of interest. Understanding the nuances of how each approach identifies good candidate solutions requires an understanding of the mathematics behind every statistic used both in the clustering and in the assessment of the clusters. For example, a distance metric based on squared deviations might give a very different clustering than one based on absolute deviations, which does not penalize large deviations as heavily. Even if you have a good understanding of those metrics, you must consider any candidate solution in light of the original research/business question.
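
To make the diminishing-returns idea concrete, here is a minimal sketch in plain Python (not SAS EM, and not what the Cluster node does internally). It runs a basic k-means on made-up one-dimensional data for several values of k, records the within-cluster sum of squares (WSS), and stops at the first k where adding another cluster barely helps. The data, the deterministic starting centers, and the 10% threshold are all assumptions chosen for illustration.

```python
def kmeans_wss_1d(values, k, iterations=50):
    """Basic Lloyd's k-means on 1-D data; returns the within-cluster sum of squares."""
    data = sorted(values)
    n = len(data)
    # Deterministic start (an assumption): spread the centers across the sorted data.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def elbow(wss, min_gain=0.10):
    """First k after which one more cluster cuts WSS by less than min_gain of WSS at the smallest k."""
    ks = sorted(wss)
    for prev, k in zip(ks, ks[1:]):
        if wss[prev] - wss[k] < min_gain * wss[ks[0]]:
            return prev
    return ks[-1]

# Made-up data with three visibly separate groups.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1, 9.0, 9.1, 8.8, 9.2]
wss = {k: kmeans_wss_1d(data, k) for k in range(1, 7)}
elbow_k = elbow(wss)
```

WSS always shrinks as k grows, so the smallest WSS is never the goal. The candidate is the k where the curve bends, and changing the threshold moves that bend, which is one reason nearby solutions are worth a look too.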

 

It would be entirely expected for two people with the same data set but different business needs to settle on completely different cluster solutions as ideal. For example, someone who wants non-trivial group sizes for marketing purposes might tend toward a smaller number of clusters, and might even ignore outliers in order to better separate the people in the middle of the pack and keep each market segment non-trivial. However, someone looking at the same data to understand new market opportunities might be willing to create a larger number of clusters so they could look at the small clusters at the fringes which, though small, are emerging over time and may point to new areas of opportunity. In either case, there might not be a particular metric that chooses the ultimate cluster solution for the business problem. The metrics get us closer to identifying good candidates, but it is always good to look at a range of nearby solutions in order to better identify the best cluster solution for a particular business problem.

 

Another thing to consider is that cluster solutions depend on the variables that are included, so adding or removing a variable changes the potential solution. If you put a large number of variables into a single cluster solution, chances are that only a small subset of those variables is really driving the clustering. In many cases, it makes more sense to create several cluster solutions for different subsets of variables that are reasonably considered together. For example, suppose you had information related to recency of purchases, frequency of purchases, and amount of purchases over various time windows (e.g. over the last 30, 60, 90, 180, 360 days). Rather than cramming all of the variables into one cluster solution, it might be far more effective to cluster each of the subgroups of variables separately. You could then build a profile for each potential buyer based on the cluster prediction from each of the three cluster solutions (Recency, Frequency, Monetary), which would build a clearer picture of your candidates. Again, since there is no "correct" cluster solution, you can build any such candidate cluster solutions based on your particular business need. The choice among them in the end is more likely to be driven by the business/research question than by any particular metric.
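
As a rough illustration of clustering each subgroup separately and then combining the results into a profile, here is a small sketch in plain Python (not SAS EM). To stay short it assumes one variable per subgroup and only two clusters per subgroup. The customers, the variable names, and the two-cluster choice are all hypothetical.

```python
def two_means_labels(values, iterations=20):
    """Split 1-D values into two clusters: 0 = lower center, 1 = higher center."""
    low, high = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        lows = [v for v, lab in zip(values, labels) if lab == 0]
        highs = [v for v, lab in zip(values, labels) if lab == 1]
        low = sum(lows) / len(lows)
        if highs:  # all values identical -> everything stays in cluster 0
            high = sum(highs) / len(highs)
    return labels

# Hypothetical customers, one made-up variable per subgroup.
customers = {
    "A": {"recency_days": 5, "orders": 12, "spend": 900.0},
    "B": {"recency_days": 7, "orders": 2, "spend": 80.0},
    "C": {"recency_days": 120, "orders": 11, "spend": 60.0},
    "D": {"recency_days": 150, "orders": 1, "spend": 1000.0},
}

names = list(customers)
# One separate cluster solution per subgroup of variables.
solutions = {
    var: two_means_labels([customers[n][var] for n in names])
    for var in ("recency_days", "orders", "spend")
}
# Profile = (recency cluster, frequency cluster, monetary cluster) per customer.
profiles = {
    n: tuple(solutions[var][i] for var in ("recency_days", "orders", "spend"))
    for i, n in enumerate(names)
}
```

Each customer ends up with a three-part profile such as (0, 1, 1), which here reads as recent, frequent, and high-spending. Every component comes from a solution in which only one kind of variable drove the clustering.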

 

I hope this helps!

Cordially,
Doug
