inna1808
Calcite | Level 5

Hello, SAS Community!

 

My questions are all about clustering in SAS Enterprise Miner, version 12.1:

 

- When running k-means clustering in SAS Enterprise Miner, can we determine the optimal number of clusters up front?

- When we set the Specification Method to Automatic and perform hierarchical clustering, how do we decide which clustering method (Average, Centroid, or Ward) produced the best output?

Generally, which statistics should I be looking at, and how do I interpret them, for both types of clustering?

 

Also:

- Is there really a "best" number of clusters? Perhaps there could be a mathematically perfect partition, say into 7 clusters (all observations within a cluster are closest to each other and the distances between clusters are the largest), that nevertheless provides no useful information with regard to the analysis objectives. Then, say, we start trying user-specified numbers of clusters (4, 5, and so on), and one of them shows genuinely interesting results. Should such guessing be disregarded completely?

- How will the number of clusters, and the clusters themselves, change when we normalize the data (Transform Variables node => Formulas => using a log transformation for, say, yearly revenue in thousands)?

- When is it necessary to remove outliers from our input variables?

 

I will be very grateful for all answers!

If someone could share the title of a thorough textbook on cluster analysis in SAS Enterprise Miner, that would be of great help, too!

 


2 REPLIES
YingjianWang
SAS Employee

Hi, welcome to SAS.

 

This is a very interesting question: how to determine the number of clusters for a specific data set. The 'proper' number of clusters is determined not only by the data itself, but also by the emphasis we place on different aspects of the data during the analysis.

 

- For k-means, one of its most distinctive limitations is the requirement to set the number of clusters before the clustering process. The k-means algorithm does not provide an estimate of the number of clusters as an output; on the contrary, we need to feed the number of clusters (the 'k') into k-means for it to work.
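
For illustration, outside Enterprise Miner, a minimal Base SAS sketch with PROC FASTCLUS (SAS's k-means procedure) makes this explicit; the data set WORK.CUSTOMERS and its variables are hypothetical placeholders:

/* k-means in Base SAS: the number of clusters (k) must be supplied up front.  */
/* WORK.CUSTOMERS and its variables are placeholders for your own input data.  */
proc fastclus data=work.customers maxclusters=5 maxiter=100 out=work.clus5;
   var recency frequency monetary;   /* interval inputs to cluster on */
run;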

 

- The clustering criteria (Average, Centroid, Ward, and so on) each define the distance between clusters differently. Which criterion to choose depends on the user's specific perspective in analyzing the data. For example, average linkage takes the average of the distances between every pair of observations in two clusters as the distance between those clusters. It is a commonly chosen method and may also fit your case. Detailed descriptions of these 11 methods (definitions, focuses, and references) can be found in the SAS documentation below. You may want to read it, compare the advantages and disadvantages, and choose the best one for your case.

 

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_cluster_sec...
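
To make that comparison concrete, here is a minimal Base SAS sketch (hypothetical data set and variables again) that fits the same data with each of the three methods; the CCC and PSEUDO options request the fit statistics discussed further down the thread:

/* Fit the same (hypothetical) data with three linkage methods and keep each tree. */
/* CCC and PSEUDO request the cubic clustering criterion and pseudo F / t**2;      */
/* STANDARD standardizes the inputs so no single variable dominates the distances. */
%macro try_method(m);
   proc cluster data=work.customers method=&m ccc pseudo standard
                outtree=work.tree_&m;
      var recency frequency monetary;
   run;
%mend try_method;

%try_method(average)
%try_method(centroid)
%try_method(ward)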

 

- Also, in SAS there is a procedure, PROC KCLUS, which provides the Aligned Box Criterion (the 'ABC' algorithm) to heuristically determine the number of clusters before launching a k-means clustering. (See the PROC KCLUS SAS documentation.)

 

- Yes, it is very difficult to say which number of clusters is the best for the data. Indeed, it is sometimes difficult to say there are any clusters in the data at all; few data sets present perfect, uncontroversial clustering structures by themselves. Every clustering corresponds, in some sense, to dimension reduction and information loss. At the same time, 'soft clustering' such as a Gaussian mixture model, in contrast to 'hard clustering' such as k-means, yields a probability distribution over all the clusters for each observation, which in some situations may give more insight than the single cluster index returned by hard clustering.

 

- Preprocessing the data before clustering will usually change the cluster structure, because the transformation may distort the defined distances between observations in the space where they live. For example, any non-isometric transformation of Euclidean space changes the cluster structure in the data.
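
As a concrete sketch of that point, applied to the log-transformation example in the original question (variable names hypothetical): a log transform compresses large revenue values, so distances computed on the transformed variable, and therefore the clusters, generally differ from those on the raw scale.

/* Hypothetical example: log-transform a skewed revenue variable before clustering. */
data work.customers_xform;
   set work.customers;
   if yearly_revenue > 0 then log_revenue = log(yearly_revenue);
   else log_revenue = .;   /* log is undefined for zero or negative revenue */
run;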

 

RalphAbbey
SAS Employee

Yingjian's post has some very good points. I wanted to add a bit more, and also say something about clustering in Enterprise Miner.

 

In Enterprise Miner there is the "Cluster" node, under the Explore tab. This node uses PROC CLUSTER to compute the clustering. In this node, the Cubic Clustering Criterion (CCC) attempts to determine the number of clusters while performing the analysis. In general, there are not many ways to get an accurate view of how many clusters there should be a priori, unless information is known about the data beforehand.
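
If you want to look at those statistics directly rather than through the node, one option (a minimal sketch, assuming a PROC CLUSTER run with the CCC and PSEUDO options and an OUTTREE= data set, such as the hypothetical one sketched earlier in the thread) is to print them by number of clusters and look for a local peak in the CCC:

/* The OUTTREE= data set from PROC CLUSTER (run with CCC and PSEUDO) carries     */
/* _NCL_, _CCC_, _PSF_, and _PST2_; a local peak in _CCC_ is a common heuristic. */
proc print data=work.tree_ward noobs;
   where 2 <= _ncl_ <= 15;
   var _ncl_ _ccc_ _psf_ _pst2_;
run;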

 

Which result (Average, Centroid, or Ward) is best requires your definition of best. In centroid-based methods, many people will try to define best by looking at the total sum of distances from points to their respective centroids, but in non-centroid-based methods this is no longer a useful measure. Ultimately, I think the results of your later analysis may be how you want to determine which of the clustering results was "best."
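
For a centroid-based run, one hedged way to compute that total distance in Base SAS: PROC FASTCLUS writes a CLUSTER and a DISTANCE variable (the distance from each observation to its cluster seed) to its OUT= data set, so the sketch below, which assumes the hypothetical WORK.CLUS5 output from earlier in the thread, simply sums and averages those distances by cluster.

/* Total and average distance from observations to their cluster seeds,         */
/* using the CLUSTER and DISTANCE variables that PROC FASTCLUS writes to OUT=.  */
proc means data=work.clus5 n sum mean maxdec=2;
   class cluster;
   var distance;
run;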

 

Also, available in Enterprise Miner 12.3 and later, there is the "HP Cluster" node, under the HPDM tab. This node uses PROC HPCLUS to run k-means clustering. PROC HPCLUS does have the Aligned Box Criterion (ABC) that Yingjian mentioned to determine the number of clusters. If you have the chance to try the HP Cluster node, you may find that it has some capabilities you would find useful.
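
A minimal sketch of a corresponding PROC HPCLUS call, with a hypothetical data set and inputs; NOC=ABC asks the procedure to estimate the number of clusters (up to MAXCLUSTERS=) with the aligned box criterion rather than taking k as fixed:

/* k-means where the number of clusters is estimated by the aligned box criterion. */
proc hpclus data=work.customers maxclusters=10 maxiter=100 seed=12345 noc=abc;
   input recency frequency monetary / level=interval;
   score out=work.hpclus_scored;   /* cluster assignment for each observation */
run;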

 

---

 

Also, to address the "best" number of clusters question, Yingjian is correct in that it is very difficult to say what number is best. Even defining what best means in the clustering context can be difficult.

 

If you are unsatisfied with the results of a single clustering, there is an approach you can try called consensus clustering. The idea is to cluster the data multiple times and then ensemble the results of all the clustering runs into one final clustering. Enterprise Miner has no node that will do this for you automatically, but you can do it with multiple PROC CLUSTER calls and some additional DATA step code. This would require a SAS Code node, but it is an interesting approach if you are looking to do something more (it might require a bit of research to get started, though).
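
A very rough sketch of the first step only (gathering memberships from several runs so they can be compared or ensembled), assuming the hypothetical OUTTREE= data sets from the PROC CLUSTER sketch earlier in the thread; building and re-clustering a co-occurrence matrix, the actual consensus step, is left as the research part:

/* Cut two of the (hypothetical) trees at the same number of clusters.            */
proc tree data=work.tree_average nclusters=4 noprint
          out=work.m_average(rename=(cluster=c_average));
run;
proc tree data=work.tree_ward nclusters=4 noprint
          out=work.m_ward(rename=(cluster=c_ward));
run;

/* Match the memberships by the _NAME_ identifier that PROC TREE carries through. */
proc sort data=work.m_average; by _name_; run;
proc sort data=work.m_ward;    by _name_; run;

data work.memberships;
   merge work.m_average(keep=_name_ c_average)
         work.m_ward(keep=_name_ c_ward);
   by _name_;
run;

/* Cross-tabulate how often the two runs place observations in the same groups.   */
proc freq data=work.memberships;
   tables c_average * c_ward / norow nocol nopercent;
run;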


Discussion stats
  • 2 replies
  • 4514 views
  • 0 likes
  • 3 in conversation