What are the characteristics of a good cluster?

🔒 This topic is **solved** and **locked**.

Posted 05-03-2018 09:30 PM
(2495 views)

I am analysing a data set using the clustering algorithm in SAS Enterprise Miner. I have tried changing various settings to get the best results, but I don't really know how to decide whether the resulting model is optimal.

I have tried to compare the results using various metrics: number of clusters, frequency of segments, and CCC (Cubic Clustering Criterion).

```
Internal Standardization   Number of clusters   CCC        Cluster frequency range
None                       43                   20.6302    2 to 45
Standardization            20                   30.55503   3 to 86
Range standardization      3                    -7.68585   52 to 289
```

The CCC cutoff is 3.

Any help would be really appreciated. Thanks.

1 ACCEPTED SOLUTION

It seems that you actually have two questions here: 1) How do I compare two clustering results to determine which is optimal? 2) How do I determine the optimal number of clusters?

Although 1) can be related to 2) (for example, when you compare a result with 3 clusters against a result with 20 clusters), I will mostly address them separately. My answer mixes specific details with more general points. I hope both help!

1) "How do I compare two clustering results to determine which is optimal"

As Ksharp mentioned, analysis of variance is a useful way to compare clustering results. You can run PROC GLM in an Enterprise Miner code node to do this.

Ultimately though, because clustering is an unsupervised task (i.e., no target variable is used), I find that the meaning of "optimal" can be problem dependent, even when the data are the same.
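To make the ANOVA idea concrete, here is a minimal sketch in plain Python (not SAS) of the quantity that analysis is built on: for one input variable, it compares between-cluster variation with within-cluster variation. The function name `anova_f` and the toy data are my own assumptions for illustration. It returns only the F statistic and R-squared; it does not compute the p-value that PROC GLM would report.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic and R-squared for one variable grouped by cluster."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more observations than clusters")
    grand = mean(values)
    means = {lab: mean(g) for lab, g in groups.items()}
    # Variation explained by cluster membership.
    ss_between = sum(len(g) * (means[lab] - grand) ** 2 for lab, g in groups.items())
    # Variation left inside the clusters.
    ss_within = sum((v - means[lab]) ** 2 for lab, g in groups.items() for v in g)
    ss_total = ss_between + ss_within
    if ss_within == 0:
        f_stat = float("inf")
    else:
        f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r_squared = ss_between / ss_total if ss_total > 0 else 0.0
    return f_stat, r_squared

# Toy data (hypothetical): the first labeling separates the values, the second does not.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
print(anova_f(values, [0, 0, 0, 1, 1, 1]))
print(anova_f(values, [0, 1, 0, 1, 0, 1]))
```

A larger F (and R-squared) on the variables you care about suggests the clusters separate those variables better. Keep in mind that F tends to grow as you add clusters, so this is most useful for comparing results with a similar number of clusters.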

The way I like to approach the question is by first asking "what is the goal of clustering" for the context of the problem you're working on (what do you want the clusters to help you do?). For example, if it's a predictive modeling problem in which you want to develop models on each cluster separately, then the overall accuracy of your models across all the data will let you know how good the clustering is.

2) "How do I determine the number of clusters when using clustering"

One way, if you have SAS Enterprise Miner 13.1 or later, is the HP Cluster node under the HPDM tab. This node offers the "Aligned Box Criterion", which automatically estimates the number of clusters for you.

Another method is spectral clustering, which looks at the eigenvalues of a similarity matrix to try to determine the number of clusters. Although this is not implemented in Enterprise Miner, SAS does have the procedures you need to implement it yourself in a SAS Code node: a DATA step and a procedure to get the principal components, followed by k-means.
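As a rough illustration of the eigenvalue idea, here is a plain Python sketch (not SAS) of one common variant, the eigengap heuristic: build a Gaussian similarity matrix, form the normalized graph Laplacian, sort its eigenvalues, and pick the number of clusters at the largest gap. Everything here is an assumption for illustration: the names `jacobi_eigenvalues` and `estimate_num_clusters`, the `sigma` bandwidth, and the choice of the normalized Laplacian. The small Jacobi routine only keeps the example free of external libraries; in practice you would use a library or SAS procedure for the eigenvalues.

```python
import math

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, sorted ascending."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for i in range(n):  # rotate columns p and q
                    aip, aiq = a[i][p], a[i][q]
                    a[i][p], a[i][q] = c * aip - s * aiq, s * aip + c * aiq
                for j in range(n):  # rotate rows p and q
                    apj, aqj = a[p][j], a[q][j]
                    a[p][j], a[q][j] = c * apj - s * aqj, s * apj + c * aqj
    return sorted(a[i][i] for i in range(n))

def estimate_num_clusters(points, sigma=1.0):
    """Eigengap heuristic on the normalized Laplacian of a Gaussian similarity graph."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((x - y) ** 2 for x, y in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    deg = [sum(row) for row in w]
    if min(deg) <= 0.0:
        raise ValueError("a point has zero similarity to all others; increase sigma")
    lap = [[(1.0 if i == j else 0.0) - w[i][j] / math.sqrt(deg[i] * deg[j])
            for j in range(n)] for i in range(n)]
    eig = jacobi_eigenvalues(lap)
    # gaps[g] is the jump from the (g+1)-th to the (g+2)-th smallest eigenvalue.
    gaps = [eig[k] - eig[k - 1] for k in range(1, n)]
    return 1 + max(range(len(gaps)), key=gaps.__getitem__)

# Toy data (hypothetical): three tight, well-separated groups in 2-D.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (10, 0), (10.1, 0), (10, 0.1)]
print(estimate_num_clusters(pts))
```

The result depends heavily on `sigma` and on how well separated the groups are, so treat the estimate as a suggestion to check against other criteria rather than a definitive answer.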

----

Finally, a much more involved idea that addresses both questions is consensus clustering (which can be combined with the two previous ideas for determining the number of clusters). The goal of consensus clustering is to ensemble multiple clustering results into one, including results with different numbers of clusters. The reasoning is that if multiple clustering results agree on a region, you can be more confident that the grouping there is "correct" / "optimal." Again, this is not implemented in Enterprise Miner and is quite involved. That said, it is possible to do using SAS DATA step code and the procedures / nodes in Enterprise Miner.
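One simple way to picture the ensembling step is a co-association matrix: for every pair of observations, record the fraction of clustering runs that put them in the same cluster, then group observations that agree often enough. The sketch below is plain Python (not SAS), and it is only one of several ways to do consensus clustering. The function names, the `threshold` parameter, and the toy labelings are assumptions for illustration.

```python
def co_association(labelings):
    """Fraction of runs in which each pair of observations shares a cluster."""
    n = len(labelings[0])
    if any(len(lab) != n for lab in labelings):
        raise ValueError("every labeling must cover the same observations")
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus_clusters(labelings, threshold=0.5):
    """Link pairs whose co-association exceeds the threshold; return connected components."""
    co = co_association(labelings)
    n = len(co)
    assigned = [-1] * n
    current = 0
    for start in range(n):
        if assigned[start] != -1:
            continue
        assigned[start] = current
        stack = [start]
        while stack:  # depth-first search over strongly co-associated pairs
            i = stack.pop()
            for j in range(n):
                if assigned[j] == -1 and co[i][j] > threshold:
                    assigned[j] = current
                    stack.append(j)
        current += 1
    return assigned

# Three hypothetical runs over six observations. The runs use different label
# names and different numbers of clusters; only co-membership matters.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
print(consensus_clusters(runs))  # [0, 0, 0, 1, 1, 1]
```

Pairs with co-association near 1 are the "areas of overlap" you can trust most, while pairs with values near the threshold are the ones the individual results disagree about.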

Hopefully some of this helps, either immediately, or by giving you things to think about.

3 REPLIES

CCC assumes that the data follow a uniform distribution, so it does not apply to every scenario.

Why not use analysis of variance (PROC GLM) and check the p-value?

Thank you for the suggestion 🙂
