GarlandJaeger
Calcite | Level 5

I understand that the idea of the CCC is to compare the R² you get for a given set of clusters with the R² you would get by clustering a uniformly distributed set of points in a p-dimensional space. However, what if I get negative values in the CCC plot, but the peaks in the plot still indicate a number of clusters that explains a good deal of the variation (as evidenced by the R² value for that number of clusters in the Cluster History table)? Please advise. Thanks!

1 ACCEPTED SOLUTION
xtc283
Fluorite | Level 6

The CCC is a statistic created by Warren Sarle of SAS nearly 30 years ago. It is documented in Technical Report A-108. On page 48 he writes, "If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed." He goes on to say that very negative values may be due to outliers, which he recommends removing (not a practice I would recommend). In my experience, the CCC is a heuristic that needs to be triangulated with the approximate R² as well as the distribution of the cluster frequencies. For the CCC and R², look at how they vary across a set of solutions (e.g., wrap FASTCLUS in a macro and run solutions from 3 to 30 clusters) and examine the solutions where those statistics peak, even when the CCC is negative. Solutions whose cluster frequencies are highly uneven, for example 1 or 2 large clusters alongside several small ones, are not appropriate and do not lead to good results. In addition, note that FASTCLUS is a k-means algorithm, so the clusters it produces tend to be compact and roughly spherical. If the shape of your clusters is irregular, you may want to consider a different algorithm, e.g., a nonparametric approach.
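The "wrap FASTCLUS in a macro" suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: the macro name `%scan_k`, the data set `work.mydata`, the variables `x1-x5`, and the working data sets `_stat`, `_row`, and `ccc_scan` are all hypothetical. It also assumes that the OUTSTAT= data set carries the overall CCC and R² in the `OVER_ALL` variable on the rows where `_TYPE_` is `CCC` and `RSQ`; check that layout against the PROC FASTCLUS documentation for your release before relying on it.

```sas
/* Sketch only: %scan_k, work.mydata, x1-x5, _stat, _row and ccc_scan
   are hypothetical names chosen for illustration. */
%macro scan_k(data=, vars=, kmin=3, kmax=30);

  /* Start from an empty results table (NOWARN: ignore if absent). */
  proc datasets library=work nolist nowarn;
    delete ccc_scan;
  quit;

  %do k = &kmin %to &kmax;

    /* One k-means solution per value of k; statistics go to _stat. */
    proc fastclus data=&data maxclusters=&k maxiter=100
                  outstat=_stat noprint;
      var &vars;
    run;

    /* Assumption: overall CCC and R-square are held in OVER_ALL
       on the rows with _TYPE_ = 'CCC' and _TYPE_ = 'RSQ'. */
    data _row;
      set _stat;
      where _type_ in ('CCC', 'RSQ');
      k = &k;
      keep k _type_ over_all;
    run;

    proc append base=ccc_scan data=_row force;
    run;

  %end;

  /* One line per k and statistic, for eyeballing the peaks. */
  proc print data=ccc_scan noobs;
    var k _type_ over_all;
  run;

%mend scan_k;

%scan_k(data=work.mydata, vars=x1-x5, kmin=3, kmax=30);
```

The listing gives CCC and R² side by side for each number of clusters, so local peaks can be compared even when every CCC value is negative. For the candidate solutions, you would still rerun FASTCLUS without NOPRINT (or with OUT=) to inspect the cluster frequencies.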

Rick_SAS
SAS Super FREQ

Small correction: The CCC statistic is based on research by Warren Sarle, not Warren Kuhfeld.

WarrenKuhfeld
Rhodochrosite | Level 12

I always confuse those two myself.
