HJain
Calcite | Level 5

Hi Guys,

 

I am applying a clustering technique to a data set using the Cluster module under the Explore tab.

The number of clusters is user-specified so that I can learn which k performs best. I have chosen k = 5, 6, 7, 8, 9 for the same data set.

 

How can I actually deduce which k is the best? I have tried to calculate the intra-cluster similarity for each cluster of each k using the inter-cluster distance given in the results, but I do not know how to deduce anything useful from it.

 

Any help would be really appreciated.

 

Thank you.

 

3 REPLIES
HJain
Calcite | Level 5

Hi @WendyCzika, thank you for your response. What I understood from the post you referred to is that I should analyse the CCC plot. I don't know whether I am doing something wrong, but when I specify the number of clusters myself rather than using the Automatic method, I do not get the CCC plot.

 

Also, now that I have calculated the intra-cluster similarity, how can I use it to deduce the best k?

 

Thank you so much!

RalphAbbey
SAS Employee

One of the difficulties in determining the correct number of clusters is that intra-cluster similarity usually increases as you increase k. This is because if you split a cluster into two smaller clusters, those smaller clusters will generally have a higher intra-cluster similarity than the one cluster they were derived from.
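To make that concrete, here is a tiny sketch in Python (not SAS code). It assumes the within-cluster sum of squares as an inverse stand-in for intra-cluster similarity (lower sum of squares means more similar members), and the six one-dimensional values are made up for illustration.

```python
def within_cluster_ss(points):
    """Sum of squared distances from each point to the cluster mean (1-D)."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

# Hypothetical cluster that is really two groups lumped together.
cluster = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
left, right = cluster[:3], cluster[3:]

ss_single = within_cluster_ss(cluster)                          # one cluster
ss_split = within_cluster_ss(left) + within_cluster_ss(right)   # after the split

print("one cluster :", ss_single)
print("two clusters:", ss_split)
```

The total after the split is much smaller, which is why a raw similarity score on its own tends to favour the largest k you try.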

 

It's possible that a plot of the number of clusters versus intra-cluster similarity will show a change in steepness once you've reached the right number of clusters and start splitting good clusters. You would expect the increase in intra-cluster similarity to be smaller when you split a good cluster than when you split a large bad cluster into two smaller clusters. However, this is just a heuristic, and the change can be difficult to spot just by looking at the plot.
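A rough sketch of that heuristic in Python, under several assumptions: the within-cluster sum of squared errors (SSE) stands in for (inverse) intra-cluster similarity, the k-means routine is a bare-bones Lloyd's algorithm with a deterministic farthest-point start rather than whatever the Cluster module does internally, and the 15 points are invented so that three groups are obvious. The "elbow" is taken as the k with the largest second difference of the SSE curve, which is only one of several possible ways to pick it.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, iters=50):
    """Run a simple Lloyd's k-means and return the within-cluster SSE."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical data: three small, well-separated groups of five points each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
blob_centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

ks = list(range(1, 7))
sse = {k: kmeans_sse(points, k) for k in ks}

# Second difference of the SSE curve: large where the curve bends sharply.
curvature = {k: sse[k - 1] - 2 * sse[k] + sse[k + 1] for k in ks[1:-1]}
elbow_k = max(curvature, key=curvature.get)

for k in ks:
    print(k, round(sse[k], 2))
print("elbow at k =", elbow_k)
```

On real data the bend is rarely this clean, so treat the result as a starting point for closer inspection rather than an answer.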

 

Determining the number of clusters is one of the large, still actively explored areas in clustering research. It's also why people really like methods such as DBSCAN, spectral clustering, or consensus clustering, which seek to determine the number of clusters during the clustering process.
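As an illustration of how such a method arrives at the number of clusters on its own, here is a minimal, unoptimised DBSCAN sketch in Python. The parameter names `eps` and `min_pts`, their values, and the data (three tight groups plus one stray point) are assumptions chosen for the example; note that you trade choosing k for choosing these two parameters.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical data: three tight groups and one isolated point.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
blob_centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]
points.append((20, 20))

labels = dbscan(points, eps=1.5, min_pts=3)
n_clusters = len({lab for lab in labels if lab != -1})
print("clusters found:", n_clusters)
print("noise points  :", labels.count(-1))
```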

 

From what you've mentioned in this post, I'd recommend a few possibilities:

1) If you have certain business rules that can help you narrow your search for the number of clusters, try to limit the search space this way first.

2) You can plot the intra-cluster similarity on the y-axis against the number of clusters on the x-axis. Look to see whether the gains in intra-cluster similarity taper off as you increase the number of clusters. This is a heuristic and not guaranteed to happen, but if it does, it could give you an easy-to-see answer.

3) If you have a specific end use for the clusters, you can perform that analysis on each set of clusters and pick the set that seems to give you the best results. (This could overfit to your clusters, though, so you might want a hold-out test set to help avoid overfitting.)

4) While SAS does not have DBSCAN or some of the other methods I mentioned, you can replicate some of them using other procs and DATA step code. This is a bit more technical (it requires a deep understanding of the underlying clustering algorithms) and by far the most time-consuming approach, but it could provide useful insights if you have the time for it.

 

Hopefully this helps you get started.

 
