06-03-2012 07:54 PM
I understand the idea of the CCC is to compare the R2 you get for a given set of clusters with the R2 you would get by clustering a uniformly distributed set of points in a p-dimensional space. However, what if I get negative values in the CCC plot, but the peaks in the plot still indicate a number of clusters that explains a good deal of the variation (as evidenced by the corresponding R2 value for that number of clusters in the Cluster History table)? Please advise. Thanks!
06-18-2012 06:45 AM
The CCC is a statistic created by Warren Sarle of SAS nearly 30 years ago. It is documented in SAS Technical Report A-108. On page 48 he writes, "If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed." He goes on to say that very negative values may be due to outliers, which he recommends removing (not my recommended best practice).

In my experience, the CCC is a heuristic that needs to be triangulated with the approximate R2 as well as the distribution of the cluster frequencies. For the CCC and R2, you want to look at how they vary across a set of solutions (e.g., wrap FASTCLUS in a macro and run solutions from 3 to 30) and examine the solutions where those statistics peak, even when the CCC is negative. Solutions whose cluster frequencies are highly uneven, for example 1 or 2 large clusters alongside several small ones, are not appropriate and do not lead to good solutions. In addition, it's important to note that FASTCLUS is a k-means algorithm, meaning that the clusters it produces tend to be compact and roughly spherical in shape. If the shape of your clusters is irregular, you may want to consider a different algorithm, e.g., a nonparametric approach.
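A minimal sketch of the macro wrapper mentioned above might look like the following. The macro name, its parameters, and the data set names (sweep_fastclus, ccc_sweep, _stat, _one) are hypothetical, as is the example call. I am also assuming, from memory of the FASTCLUS documentation, that the OUTSTAT= data set stores the overall statistics in the OVER_ALL variable on rows with _TYPE_ equal to 'RSQ', 'CCC', and 'FREQ'; run a PROC PRINT on one OUTSTAT= data set to confirm this before relying on it.

```sas
/* Hypothetical helper: run FASTCLUS for k = kmin..kmax and stack the  */
/* overall R2, the CCC, and the cluster frequencies into one data set. */
%macro sweep_fastclus(data=, vars=, kmin=3, kmax=30);
  %local k;

  /* Start from an empty results table (NOWARN: it may not exist yet). */
  proc datasets lib=work nolist nowarn;
    delete ccc_sweep;
  quit;

  %do k = &kmin %to &kmax;
    /* One k-means solution; statistics are written to OUTSTAT=. */
    proc fastclus data=&data maxclusters=&k maxiter=100
                  outstat=work._stat noprint;
      var &vars;
    run;

    /* Assumption: RSQ and CCC rows carry the overall value in OVER_ALL, */
    /* and FREQ rows carry one frequency per cluster in OVER_ALL.        */
    data work._one;
      set work._stat;
      where upcase(_type_) in ('RSQ', 'CCC', 'FREQ');
      k = &k;
      keep k _type_ cluster over_all;
    run;

    proc append base=work.ccc_sweep data=work._one force;
    run;
  %end;
%mend sweep_fastclus;

/* Example call with placeholder data set and variable names. */
%sweep_fastclus(data=work.mydata, vars=x1 x2 x3, kmin=3, kmax=30);
```

Plotting OVER_ALL against k separately for the RSQ and CCC rows shows where each statistic peaks, and the FREQ rows let you screen out solutions dominated by one or two large clusters.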