Hi,
How can we determine the optimal number of clusters in a cluster analysis?
Thanks,
Nikhil
Accepted Solutions
I don't think there are strict rules for the optimal number of clusters; as with cluster analysis in general, there is a lot of room for variation and interpretation.
Maybe someone can give more specific criteria, but here are the ones I would consider:
* Use graphical analysis to check whether your clusters are well separated; some may be very close and could be merged. A tree diagram (PROC TREE) is also a very useful tool: it shows how many groups (well-separated branches) you have.
* You most likely don't want clusters with just one or a few observations.
* In some cases your data or task hints at the number of clusters (e.g. you may want to separate items with a high, middle and low level of something).
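
As a rough sketch of the first two points, you could build the tree, cut it at a candidate number of clusters, and check the cluster sizes. The data set (sashelp.iris), the variable list, METHOD=WARD and the cut at 3 clusters are only illustrative assumptions here, not a recommendation; substitute your own data and try several cuts.

```sas
/* Illustrative only: sashelp.iris stands in for your own data. */
proc cluster data=sashelp.iris method=ward outtree=work.tree noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Draw the tree and cut it at 3 clusters (3 is a guess to revise). */
proc tree data=work.tree nclusters=3 out=work.clusout;
run;

/* Cluster sizes: look for clusters with only one or a few observations. */
proc freq data=work.clusout;
   tables cluster;
run;
```

If the frequency table shows very small clusters, or the tree shows two branches joining at a low height, that may be a sign to try a smaller number of clusters.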
For hierarchical clustering, try Sarle's Cubic Clustering Criterion in PROC CLUSTER:
plot _CCC_ against the number of clusters and look for peaks where _CCC_ > 3, or look for local peaks of the pseudo-F statistic (_PSF_) combined with a small value of the pseudo-t^2 statistic (_PST2_) and a larger pseudo-t^2 at the next cluster fusion.
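
A minimal sketch of how those plots could be produced: the CCC and PSEUDO options ask PROC CLUSTER for the statistics, and the OUTTREE= data set then carries them as _CCC_, _PSF_ and _PST2_ alongside _NCL_ (the number of clusters). The data set, variables, METHOD=WARD and the cutoff of 15 clusters are assumptions for illustration; note that pseudo-t^2 is only available for METHOD=AVERAGE, CENTROID or WARD.

```sas
/* Illustrative only: sashelp.iris stands in for your own data. */
proc cluster data=sashelp.iris method=ward ccc pseudo print=15
             outtree=work.tree;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Keep the last 15 fusions and order them by number of clusters. */
proc sort data=work.tree out=work.stats;
   by _ncl_;
   where _ncl_ <= 15;
run;

/* CCC against the number of clusters, with the CCC = 3 guide line. */
proc sgplot data=work.stats;
   series x=_ncl_ y=_ccc_ / markers;
   refline 3 / axis=y label="CCC = 3";
run;

/* Pseudo-F (left axis) and pseudo-t^2 (right axis) on one plot. */
proc sgplot data=work.stats;
   series x=_ncl_ y=_psf_ / markers;
   series x=_ncl_ y=_pst2_ / markers y2axis;
run;
```

On the second plot, a candidate number of clusters is one where pseudo-F peaks, pseudo-t^2 is small, and pseudo-t^2 is clearly larger at the next fusion (one cluster fewer).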
For k-means clustering, use this approach on a sample of your data to determine an upper limit for k, then assign it to the MAXC= option in PROC FASTCLUS on the complete data.
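
One way this two-step workflow might look is sketched below. Everything specific in it is an assumption for illustration: sashelp.iris stands in for the complete data, the sample size of 75 and the seed are arbitrary, and the value 5 for the macro variable maxk is a placeholder that you would replace after inspecting _CCC_, _PSF_ and _PST2_ in the sample's OUTTREE= data set.

```sas
/* Step 1: draw a simple random sample (size and seed are arbitrary). */
proc surveyselect data=sashelp.iris out=work.sample
                  method=srs sampsize=75 seed=12345;
run;

/* Step 2: hierarchical clustering on the sample, keeping the statistics. */
proc cluster data=work.sample method=ward ccc pseudo
             outtree=work.sampletree noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Step 3: after inspecting work.sampletree, set the upper limit for k. */
/* The value 5 is a placeholder, not a result.                          */
%let maxk = 5;

/* Step 4: k-means on the complete data with that limit. */
proc fastclus data=sashelp.iris maxc=&maxk maxiter=50 out=work.kmeans;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

The OUT= data set from PROC FASTCLUS contains a CLUSTER variable with each observation's assignment, so the same cluster-size check with PROC FREQ applies here too.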