Hi, welcome to SAS.
This is a very interesting question of how to determine the number of clusters for a specific data. The 'proper' number of clusters is not only determined by the data itself, but also influenced by the emphasis we put on during the analysis on the data.
- For k-means, one of the most distinct issues of it is the requirement to set the number of clusters before the clustering process. k-means algorithm does not provide an estimate of the number of clusters as an output. On the contrary, we need to feed the number of clusters (the 'k') into the k-means for it to work.
- All the clustering criterion, Average, Centroid, or Ward etc., have distinct definitions on the distance between clusters. To choose which criterion depends on uses' specific perspective in analyzing the data. For example, 'average linkage' take the distances of each pair of data in 2 clusters as the distance between them. It is a commonly-chosen method and it may also fit your case. The detailed descriptions about these 11 methods, the definitions, focuses, and references, can be found in the SAS document as below. You may want to read it and compare the advantages and disadvantages and choose the best one for your case.
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_cluster_sect012.htm
- While in SAS, there is a procedure 'kClus' which provides a Aligned Box Criterion, the 'ABC' algorithm, to heuristically determine the number of clusters before launching a k-means clustering. (See 'proc kclus' SAS document.)
- Yes it is very difficult to say which number is the best number of clusters in the data. Indeed, it sometimes difficult to say there are any clusters in the data at all. Few datasets present perfect/uncontroversial clustering structures by themselves. All clustering processes correspond to dimension reduction and information loss to some sense. At the same time, 'soft clustering' like Gaussian mixture model, in a contrast to 'hard clustering' like k-means, yields a probability distribution on all the clusters for each data, which may give more insights than a only cluster index given by the hard clustering in some situations.
- The preprocessing of the data before clustering usually will change the clusters structure in them. Since the transformation may distort the defined distance between the data in the space where they live. For example, any non-isometric transformation in the Euclidean space change the clusters structure in the data.
... View more