Evaluating clusters for optimal K

04-26-2017 05:58 PM

Hi,

I built 4 clustering models: 3 manually, stepping down from K15 -> K6 -> K4, and 1 using automatic selection with the Cluster node in SAS Enterprise Miner. The cluster statistics for the 4 models are:

The results are exactly the same for Clustering K4 and Clustering Auto. I have concluded that a 4-cluster model is optimal.

- Are these the correct metrics to evaluate clusters and to determine the optimal K? I also used cluster distance plots to check visually.
- Pseudo_F: Is higher better?
- RSQ and RSQ_Ratio: Is lower better?
- If these 4 metrics are not the best ones for determining the optimal number of clusters, which of the metrics generated by the Cluster node in SAS EM are appropriate?

Thanks,

Lobbie

05-02-2017 07:39 PM

Hi Lobbie, see below for some comments around these.

- I think these are fine as a guide, but I'd suggest a little trial and error here. You also want the clusters to fit the purpose, not just be the best in a statistical sense. So you can also play around with which variables to use, and with profiling to get a sense of the solution (you can use the Segment Profile node for this). This gives some more detail on approaches to selecting the number of clusters: https://v8doc.sas.com/sashtml/stat/chap8/sect10.htm
- Yes, it measures the separation of the clusters, so higher is better
- Higher is better for both. RSQ is the proportion of variance in the data accounted for by the clusters, and RSQ_Ratio is similar but takes into account within vs. between cluster variance. Both keep increasing up to a maximum where the number of clusters equals the number of cases, so you're not looking for the highest value but for an inflection point beyond which the rate of increase is small.
- Also try looking at the CCC plot and see if there's some levelling here.
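If it helps to see how these statistics relate to each other, here is a rough sketch in plain Python (not SAS EM code). It assumes the textbook definitions: RSQ = 1 - within SS / total SS, RSQ_Ratio = RSQ / (1 - RSQ), and Pseudo_F = (RSQ / (k - 1)) / ((1 - RSQ) / (n - k)). The function name `cluster_fit_stats` and the toy data are made up for illustration, and the Cluster node may standardize or weight variables differently, so don't expect the numbers to match its output exactly.

```python
def cluster_fit_stats(points, labels):
    """Return RSQ, RSQ ratio and pseudo-F for one clustering solution.

    Assumes textbook definitions; requires 1 < k < n and at least some
    within-cluster variation (otherwise the ratios are undefined).
    """
    n = len(points)
    dims = len(points[0])

    # Group the rows by cluster label.
    clusters = {}
    for row, label in zip(points, labels):
        clusters.setdefault(label, []).append(row)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need more than 1 cluster and fewer clusters than cases")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sum_sq(rows, centre):
        return sum((r[d] - centre[d]) ** 2 for r in rows for d in range(dims))

    total_ss = sum_sq(points, mean(points))
    within_ss = sum(sum_sq(rows, mean(rows)) for rows in clusters.values())

    rsq = 1 - within_ss / total_ss                      # share of variance explained
    rsq_ratio = rsq / (1 - rsq)                         # between SS / within SS
    pseudo_f = (rsq / (k - 1)) / ((1 - rsq) / (n - k))  # penalises extra clusters
    return {"k": k, "rsq": rsq, "rsq_ratio": rsq_ratio, "pseudo_f": pseudo_f}


# Hypothetical toy data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # last two groups merged

stats_k3 = cluster_fit_stats(points, labels_k3)
stats_k2 = cluster_fit_stats(points, labels_k2)
for stats in (stats_k2, stats_k3):
    print(stats)
```

On this toy data RSQ goes up when you move from 2 to 3 clusters, as it always will when a cluster is split, which is why you compare the size of the gain between successive K values rather than chasing the maximum.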

Cheers,

Troy