04-18-2017 08:02 AM - edited 04-18-2017 08:04 AM
Hello, SAS Community!
My questions are all about clustering in SAS Enterprise Miner, version 12.1:
- When running k-means clustering in SAS Enterprise Miner, can we determine the optimal number of clusters up front?
- When we set the Specification Method to Automatic and perform hierarchical clustering, how do we decide which clustering method (Average, Centroid, or Ward) produced the best outputs?
Generally, which statistics should I be looking at, and how do I interpret them, for both types of clustering?
- Is there really "the best" number of clusters? Perhaps there could be a perfectly divided set of clusters from a mathematical perspective, say 7 (all observations within a cluster are closest to each other and the distances between clusters are the largest), but they do not provide any useful information with regard to the analysis objectives. Then, say, we start trying user-specified numbers of clusters (4, 5, and so on), and one of them shows really interesting results. Should such guessing be disregarded completely?
- How will the number of clusters, and the clusters themselves, change when we normalize the data (Transform node => Formulas => using a log transformation for, say, yearly revenue in thousands)?
- When is it necessary to remove outliers from our input variables?
I will be very grateful for all answers!
If someone could share the title of a thorough textbook on cluster analysis in SAS Enterprise Miner, that would be of great help, too!
04-19-2017 01:03 PM - edited 04-19-2017 01:05 PM
Hi, welcome to SAS.
This is a very interesting question: how do we determine the number of clusters for a given dataset? The 'proper' number of clusters is determined not only by the data itself, but also by the emphasis we place on different aspects of the data during the analysis.
- For k-means, one of its most distinctive limitations is that the number of clusters must be set before the clustering process begins. The k-means algorithm does not provide an estimate of the number of clusters as an output; on the contrary, we need to feed the number of clusters (the 'k') into k-means for it to work.
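To make that concrete, here is a minimal pure-Python sketch of k-means (this is not SAS Enterprise Miner's implementation, and the data points are made up): notice that k is an argument the analyst must supply, not something the algorithm discovers.

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch; note that k must be supplied up front."""
    # Deterministic seeding for the sketch: the first k points.
    centers = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, group in enumerate(clusters):
            if group:
                centers[i] = tuple(sum(v) / len(group) for v in zip(*group))
    return centers, clusters

# Two obvious groups of three points each; k = 2 must still be chosen by us.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (8.2, 7.9), (7.8, 8.0)]
centers, clusters = kmeans(points, k=2)
print([len(c) for c in clusters])
```

Even on data with an obvious two-group structure, the sketch recovers the groups only because we told it k = 2; with k = 3 it would dutifully split one of the groups in half.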
- All the clustering criteria (Average, Centroid, Ward, etc.) have distinct definitions of the distance between clusters. Which criterion to choose depends on the user's specific perspective in analyzing the data. For example, average linkage takes the average of the distances between all pairs of observations in two clusters as the distance between those clusters. It is a commonly chosen method and may also fit your case. Detailed descriptions of all 11 methods (the definitions, emphases, and references) can be found in the SAS documentation for the CLUSTER procedure. You may want to read it, compare the advantages and disadvantages, and choose the best one for your case.
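As a concrete illustration of one of those definitions, here is a short pure-Python sketch of average linkage on toy data (not SAS code): the distance between two clusters is the mean of all pairwise distances between their members.

```python
from itertools import product
from math import dist  # Python 3.8+

def average_linkage(cluster_a, cluster_b):
    """Average linkage: mean of all pairwise point distances between two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
print(round(average_linkage(a, b), 4))  # mean of the 2 x 2 = 4 pairwise distances
```

Single linkage would instead take the minimum of those four distances, and complete linkage the maximum; that is the whole difference between the criteria.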
- Also, in SAS there is a procedure, PROC KCLUS, which provides the Aligned Box Criterion (the 'ABC' algorithm) to heuristically estimate the number of clusters before launching a k-means clustering. (See the PROC KCLUS documentation.)
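I don't have a runnable ABC example at hand, but a simpler, widely used heuristic in the same spirit is the 'elbow' method: run k-means for several values of k and watch where the total within-cluster sum of squares (WSS) stops dropping sharply. A pure-Python sketch on toy data (this is not the ABC algorithm):

```python
def d2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_wss(points, k, iters=25):
    """Run a simple k-means and return the total within-cluster sum of squares."""
    # Deterministic farthest-first seeding keeps the sketch reproducible.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: d2(p, centers[i]))].append(p)
        centers = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                   if g else centers[i] for i, g in enumerate(groups)]
    return sum(min(d2(p, c) for c in centers) for p in points)

# Three well-separated blobs of five points each.
pts = [(cx + dx, cy + dy) for (cx, cy) in [(0, 0), (10, 0), (5, 9)]
       for (dx, dy) in [(0, 0), (-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5)]]
for k in (1, 2, 3, 4):
    print(k, kmeans_wss(pts, k))  # the drop in WSS flattens sharply after k = 3
```

On this data the WSS collapses going from k = 1 to k = 3 and barely improves afterwards, which is the 'elbow' at the true number of blobs.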
- Yes, it is very difficult to say which number of clusters is the best for a dataset. Indeed, it is sometimes difficult to say there are any clusters in the data at all; few datasets present perfect, uncontroversial clustering structures by themselves, and every clustering process involves dimension reduction and some information loss. At the same time, 'soft clustering' methods such as the Gaussian mixture model, in contrast to 'hard clustering' methods such as k-means, yield a probability distribution over all the clusters for each observation, which in some situations gives more insight than the single cluster index produced by hard clustering.
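To illustrate the soft-clustering idea, here is a toy pure-Python sketch of Gaussian-mixture 'responsibilities' with hand-picked parameters (no model fitting is performed; the mixture and the numbers are made up): each observation gets a probability for every cluster rather than a single label.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def responsibilities(x, components):
    """Posterior probability of each mixture component for observation x."""
    weighted = [w * normal_pdf(x, mu, sigma) for (w, mu, sigma) in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Two equally weighted clusters centred at 0 and 5 (hand-picked, not fitted).
mix = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
for x in (0.0, 2.5, 5.0):
    print(x, [round(p, 3) for p in responsibilities(x, mix)])
```

A point sitting exactly between the two centers gets probability 0.5 for each cluster, while a point at a center is assigned almost entirely to it; a hard clustering would report only the winning index and discard that uncertainty.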
- Preprocessing the data before clustering will usually change the cluster structure, because the transformation may distort the distances between observations in the space where they live. For example, any non-isometric transformation of Euclidean space changes the cluster structure of the data.
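Using the yearly-revenue example from the original question, here is a small pure-Python sketch (toy numbers) of how a log transform reshapes distances, and hence potential clusters: on the raw scale absolute differences drive the distance, while on the log scale ratios do.

```python
from math import log10

revenues = [1.0, 2.0, 100.0, 200.0]  # yearly revenue in thousands (toy values)

# Raw scale: 100 and 200 are 100 units apart, while 1 and 2 are only 1 apart.
raw_gaps = (revenues[1] - revenues[0], revenues[3] - revenues[2])

# Log scale: both pairs differ by a factor of 2, so they end up equally far
# apart, and a distance-based clustering may now group the points differently.
logged = [log10(r) for r in revenues]
log_gaps = (logged[1] - logged[0], logged[3] - logged[2])

print(raw_gaps)  # (1.0, 100.0)
print(log_gaps)
```

A clustering on the raw values would likely put 1 and 2 together and treat 100 and 200 as far apart; after the log transform the two pairs are symmetric, so the resulting clusters (and possibly their number) can change.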