prooney2
Fluorite | Level 6
I have developed 8-cluster and 6-cluster solutions with PROC FASTCLUS. My manager claims that the ratio of the average between-cluster distance to the average within-cluster distance might be a measure of the "best number of clusters" to consider:

Ratio = Mean Between-cluster distance / Mean within-cluster distance

Using PROC FASTCLUS and PROC DISTANCE I can calculate the distance from each object to each cluster centroid, and the distance from each cluster centroid to every other cluster centroid, but does this measure even make sense? My intuition says that an 8-cluster and a 6-cluster solution are inherently incomparable, because the number of clusters by itself changes the variability of one solution relative to another.
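
To make the ratio concrete, here is a minimal sketch in plain Python (not SAS) of one way to read the definition above: the mean distance over all centroid pairs, divided by the mean distance from each observation to its own centroid. The function names and the eight toy points are assumptions made up for illustration; they do not come from FASTCLUS output, and the manager may intend a different averaging scheme.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def distance_ratio(clusters):
    """Mean between-centroid distance divided by mean within-cluster distance.

    clusters: list of clusters, each a list of points (tuples).
    """
    centroids = [centroid(c) for c in clusters]
    # Distance from every observation to the centroid of its own cluster.
    within = [math.dist(p, m) for c, m in zip(clusters, centroids) for p in c]
    # Distance between every distinct pair of centroids.
    between = [
        math.dist(centroids[i], centroids[j])
        for i in range(len(centroids))
        for j in range(i + 1, len(centroids))
    ]
    return (sum(between) / len(between)) / (sum(within) / len(within))

# Hypothetical toy data: two well-separated groups of four points each.
two_clusters = [
    [(0, 0), (0, 2), (2, 0), (2, 2)],
    [(10, 0), (10, 2), (12, 0), (12, 2)],
]
# The same eight points, with each group split in half.
four_clusters = [
    [(0, 0), (0, 2)],
    [(2, 0), (2, 2)],
    [(10, 0), (10, 2)],
    [(12, 0), (12, 2)],
]

ratio_2 = distance_ratio(two_clusters)
ratio_4 = distance_ratio(four_clusters)
print(round(ratio_2, 2), round(ratio_4, 2))  # 7.07 7.33
```

In this constructed example the ratio is slightly higher for the four-cluster split even though the points form two obvious groups. That is one illustration of the concern that the ratio shifts with the number of clusters; a single toy case does not show how it behaves in general.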

Wouldn't I be better off with hierarchical clustering and using the pseudo-F statistic and the other measures found in the SAS documentation for identifying the number of clusters?
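
For reference, the pseudo-F statistic is usually defined as the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k), for n observations and k clusters. The sketch below computes that definition in plain Python (not SAS) on hypothetical toy data. The helper names are assumptions for illustration, and the code is not meant to reproduce SAS output exactly.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pseudo_f(clusters):
    """Pseudo-F: [BSS / (k - 1)] / [WSS / (n - k)].

    clusters: list of clusters, each a list of points (tuples).
    """
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    grand = mean_point(points)
    # Between-cluster sum of squares, weighted by cluster size.
    bss = sum(len(c) * sq_dist(mean_point(c), grand) for c in clusters)
    # Within-cluster sum of squares around each cluster's own centroid.
    wss = sum(sq_dist(p, mean_point(c)) for c in clusters for p in c)
    return (bss / (k - 1)) / (wss / (n - k))

# Hypothetical toy data: two well-separated groups of four points each.
two_clusters = [
    [(0, 0), (0, 2), (2, 0), (2, 2)],
    [(10, 0), (10, 2), (12, 0), (12, 2)],
]
# The same eight points, with each group split in half.
four_clusters = [
    [(0, 0), (0, 2)],
    [(2, 0), (2, 2)],
    [(10, 0), (10, 2)],
    [(12, 0), (12, 2)],
]

f_2 = pseudo_f(two_clusters)
f_4 = pseudo_f(four_clusters)
print(round(f_2, 2), round(f_4, 2))  # 75.0 34.67
```

Because the statistic divides by the degrees of freedom (k - 1 and n - k), it penalizes extra clusters. On this toy data it favors the two-cluster solution.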
2 REPLIES
mjbstats
Calcite | Level 5
Hello prooney2,

I am working on a similar problem and am a newbie to Cluster Analysis. I too have been told to calculate Between/Within cluster variance measures and use those to choose the best number of clusters. So, although your question sounds legitimate to me, I don't have an answer. I'm seeking help myself!

I am wondering if FASTCLUS makes the most sense for my application. I am doing a very simple clustering of one dependent variable, nonzero values, ranging from 221 to 595, n=900 observations. I'm looking for disjoint clusters in that each observation should belong to only one cluster in the end.

For this most simple application, does FASTCLUS sound like the correct procedure to use? If not, why not, and what other procedures would you recommend?
EyalGonen
Lapis Lazuli | Level 10

Hello prooney2,

I am by no means a statistician or a mathematician, but I am aware of a sample program that ships with IML Studio called FishClusters.sx. This code attempts to find the best number of clusters using several different criteria. Maybe it can help you.

Eyal
