I built 4 clustering models: 3 manually, stepping down from K15 -> K6 -> K4, and 1 using automatic selection with the Cluster node in SAS Enterprise Miner.  The cluster statistics for the 4 models are:




The results are exactly the same for Clustering K4 and Clustering Auto.  I have concluded that a 4-cluster model is optimal.

  1. Are these the correct metrics for evaluating clusters and determining the optimal K?  I also used cluster distance plots to check visually.
  2. Pseudo_F:  Is higher better?
  3. RSQ and RSQ_Ratio:  Is lower better?
  4. If these 4 metrics are not the best ones for determining the optimal number of clusters, which of the metrics generated by the Clustering node in SAS EM are appropriate?





Re: Evaluating clusters for optimal K

Hi Lobbie, see below for some comments around these.



  1. I think these are fine as a guide, but I'd suggest a little trial and error here: you also want the clusters to fit the purpose, not just be the best in a statistical sense. So you can also experiment with which variables to use, and profile the clusters to get a sense of the solution (the Segment Profile node is useful here).  This gives some more detail on approaches to selecting the number of clusters:
  2. Yes, it measures the separation of the clusters, so higher is better.
  3. Higher is better for both.  RSQ is the proportion of variance in the data accounted for by the clusters, and RSQ_Ratio is similar but compares between-cluster with within-cluster variance.  Both keep increasing to a maximum where the number of clusters equals the number of cases, so you're not looking for the highest value but for an inflection point beyond which the rate of increase is small.
  4. Also try looking at the CCC plot and see if there's some levelling off.
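To make points 2 and 3 a bit more concrete, here is a minimal Python sketch of how these statistics relate to the within-cluster and between-cluster sums of squares. It uses the textbook definitions (pseudo-F as the Calinski-Harabasz ratio, RSQ_Ratio as RSQ / (1 - RSQ)); I'm assuming the Cluster node's output follows the same formulas, so treat it as an illustration rather than a reproduction of SAS EM. The function name `cluster_fit_stats` and the toy data are made up for the example.

```python
def cluster_fit_stats(points, labels):
    """Return (r_squared, rsq_ratio, pseudo_f) for one cluster assignment.

    Textbook definitions, assumed (not confirmed) to match SAS EM output.
    Undefined when there is only one cluster or when every point sits
    exactly on its cluster mean (within-cluster sum of squares of zero).
    """
    n = len(points)
    dims = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    # Total, within-cluster and between-cluster sums of squares.
    grand_mean = mean(points)
    total_ss = sum(sq_dist(p, grand_mean) for p in points)
    within_ss = 0.0
    for rows in clusters.values():
        centre = mean(rows)
        within_ss += sum(sq_dist(p, centre) for p in rows)
    between_ss = total_ss - within_ss

    r_squared = between_ss / total_ss   # share of variance explained
    rsq_ratio = between_ss / within_ss  # same as r_squared / (1 - r_squared)
    pseudo_f = (between_ss / (k - 1)) / (within_ss / (n - k))
    return r_squared, rsq_ratio, pseudo_f


# Toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
r_squared, rsq_ratio, pseudo_f = cluster_fit_stats(points, labels)
print(round(r_squared, 3), rsq_ratio, pseudo_f)  # 0.99 100.0 200.0
```

All three go up as the clusters get tighter and further apart. RSQ and RSQ_Ratio also go up whenever you add clusters, which is why you look for the point where they level off rather than the maximum, whereas the pseudo-F divides by k - 1 and n - k and so penalises extra clusters to some extent.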



