GarlandJaeger
Calcite | Level 5

I understand that the idea of the CCC is to compare the R2 you get for a given set of clusters with the R2 you would get by clustering a uniformly distributed set of points in a p-dimensional space. However, what if I get negative values in the CCC plot, but the peaks in the plot still indicate a number of clusters that explains a good deal of the variation (as evidenced by the corresponding R2 value for that number of clusters in the Cluster History table)? Please advise. Thanks!

1 ACCEPTED SOLUTION
xtc283
Fluorite | Level 6

The CCC (cubic clustering criterion) is a statistic created by Warren Sarle of SAS nearly 30 years ago. It is documented in SAS Technical Report A-108. On page 48 he writes, "If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed." He goes on to say that very negative values may be due to outliers, which he recommends removing (not a practice I recommend).

In my experience, the CCC is a heuristic that needs to be triangulated with the approximate R2 as well as the distribution of the cluster frequencies. For the CCC and R2, look at how they vary across a set of solutions (e.g., wrap FASTCLUS in a macro and run solutions from 3 to 30 clusters) and examine the solutions where those statistics peak, even when the CCC is negative. Solutions whose cluster sizes are highly uneven, such as one or two large clusters plus several small ones, are not appropriate and do not lead to good results. In addition, note that FASTCLUS is a k-means algorithm, so the clusters it produces tend to be compact and roughly spherical. If your clusters are irregularly shaped, you may want to consider a different algorithm, e.g., a nonparametric approach.
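The macro idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions, not the poster's actual code: the data set WORK.MYDATA, the variables X1-X5, and the macro name CCC_SCAN are placeholders, and the code relies on the OUTSTAT= data set storing the overall statistics in the OVER_ALL variable on rows with _TYPE_ equal to 'RSQ', 'ERSQ', and 'CCC', which is worth checking against the FASTCLUS documentation for your release.

```sas
/* Sketch only: WORK.MYDATA, X1-X5 and CCC_SCAN are placeholder names. */
%macro ccc_scan(data=work.mydata, vars=x1-x5, kmin=3, kmax=30);

  /* Start from a clean summary table (a note is logged if it is absent). */
  proc datasets lib=work nolist;
    delete ccc_summary;
  quit;

  %do k = &kmin %to &kmax;

    /* Fit one k-means solution; OUTSTAT= receives the fit statistics. */
    proc fastclus data=&data maxclusters=&k maxiter=100
                  out=_clus outstat=_stat noprint;
      var &vars;
    run;

    /* Keep the overall R-square, expected R-square and CCC rows.      */
    /* Assumption: these are the _TYPE_ values written by OUTSTAT=.    */
    data _fit;
      set _stat(keep=_type_ over_all);
      where _type_ in ('RSQ', 'ERSQ', 'CCC');
      k = &k;
    run;

    proc append base=ccc_summary data=_fit force;
    run;

    /* Cluster frequencies for this k, to spot lopsided solutions. */
    proc freq data=_clus noprint;
      tables cluster / out=_freq_&k;
    run;

  %end;

  /* Reshape to one row per k with RSQ, ERSQ and CCC as columns. */
  proc sort data=ccc_summary;
    by k;
  run;

  proc transpose data=ccc_summary out=ccc_wide(drop=_name_);
    by k;
    id _type_;
    var over_all;
  run;

%mend ccc_scan;

%ccc_scan(data=work.mydata, vars=x1-x5, kmin=3, kmax=30);
```

With something like this, CCC_WIDE can be plotted or printed to see where the CCC and R2 peak across k, and the _FREQ_k data sets can be inspected for solutions dominated by one or two large clusters.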

3 REPLIES
Rick_SAS
SAS Super FREQ

Small correction: The CCC statistic is based on research by Warren Sarle, not Warren Kuhfeld.

WarrenKuhfeld
Rhodochrosite | Level 12

I always confuse those two myself.
