GSRodney
Calcite | Level 5
I want to use either PROC CLUSTER or PROC FASTCLUS to determine whether my data can be grouped and, if so, what the best grouping is. A colleague ran this for me in a different stat package using k-means dynamic for 10, 8, 6, 4, 3, 2 groups and so on. He took the output and plotted the number of groups against the RMSE for each run; the point where the line inflected represented the optimal grouping. When I run FASTCLUS or CLUSTER, I don't see an RMSE to do a similar check. What do I use in the SAS output for these PROCs to determine the best number of clusters? Is there a metric to gauge this with?

Thanks.
9 REPLIES 9
mjbstats
Calcite | Level 5
Hello GSRodney,

Are you still uncertain about these procedures? I am too. I am new to clustering and am also trying to "match" results from another software program. In the example I am attempting to match, several scenarios were run and the between/within cluster variance ratio for each was calculated. Where those ratios hit a point of diminishing returns (additional clusters no longer separate the groups well enough relative to the within-cluster variance), an optimal number of clusters begins to appear. Choosing the actual number of clusters is a somewhat subjective process.

BTW, the between/within ratios seem to have been calculated offline with Excel--my application involves fewer than 1,000 clustered values and only 1 dependent variable.

Anyway, if you have any additional insight on cluster analysis, measures for choosing the number of clusters, and SAS procs, please share!

Thanks.
Ksharp
Super User
Hi. I remember there is a statistical estimator (but I forgot which 😞) for deciding how many clusters to use.
Before using PROC CLUSTER or PROC FASTCLUS, I recommend using PROC PRINCOMP and PROC GPLOT to plot Prin1 against Prin2 and judge how many clusters you want.
Also, there is no single best criterion for deciding the number of clusters; different methods can yield different clusters.


Ksharp
mjbstats
Calcite | Level 5
OK, now to show my ignorance (if I haven't already). I have no experience with PRINCOMP. Why run it, and what do the "1" and "2" you referenced estimate?
Ksharp
Super User
Hi.
Don't say that. I am also a beginner with SAS statistical methods.
PROC PRINCOMP does principal component analysis, one of the oldest multivariate techniques. It lets two components (Prin1 and Prin2) stand in for the multivariate data, based on the correlation matrix by default or on the covariance matrix if you specify the COV option.
Then use these two components as the x-axis and y-axis and plot the observations in those coordinates.
You will find some observations very close together and some very far apart.
I recommend looking up PROC PRINCOMP in the SAS documentation.

P.S. Prin1 and Prin2 are the two components that explain the most variance in the data.
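
The plotting step described above might look like the following minimal sketch. The data set name mydata and the variables x1-x5 are placeholders, not names from this thread, and PROC SGPLOT is swapped in for PROC GPLOT only because it needs no extra setup.

```sas
/* Assumption: WORK.MYDATA holds the numeric variables x1-x5 (hypothetical names). */
proc princomp data=mydata out=pcs n=2;   /* keep the first two components */
   var x1-x5;
run;

/* OUT= appends the component scores Prin1 and Prin2 to each observation. */
proc sgplot data=pcs;
   scatter x=Prin1 y=Prin2;
run;
```

Visible clumps in the scatter plot hint at a plausible number of clusters, but only to the extent that two components capture the structure; the eigenvalue table in the PRINCOMP output shows how much of the variance they explain.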


Ksharp
mjbstats
Calcite | Level 5
Thank you very much for your insights, KSharp. I will look at the SAS doc'n for PRINCOMP.
goladin
Calcite | Level 5
Hi,

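The statistic you want is the CCC, which stands for cubic clustering criterion. PROC CLUSTER does hierarchical clustering based on the distances between points and, with the CCC and PSEUDO options, prints the CCC along with the pseudo F and pseudo t-squared statistics; the RSQUARE option adds R-square. PROC FASTCLUS basically implements the k-means algorithm.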
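
As a rough sketch of how those statistics can be requested (mydata and x1-x5 are placeholder names, Ward's method is only one possible choice, and nclusters=3 is an arbitrary illustration):

```sas
/* Assumption: WORK.MYDATA holds coordinate data in x1-x5 (hypothetical names). */
proc cluster data=mydata method=ward ccc pseudo print=15 outtree=tree;
   var x1-x5;
run;

/* Cut the tree at a chosen number of clusters; OUT= gets a CLUSTER variable. */
proc tree data=tree nclusters=3 out=clus noprint;
run;
```

In the cluster history table, candidate numbers of clusters are usually those where local peaks in the CCC and pseudo F coincide with a small pseudo t-squared that is followed by a larger value at the next fusion.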

Regards,
Murphy
mjbstats
Calcite | Level 5
Hello,

Can you elaborate on the CCC, and what it means? Also the Pseudo R-Square...

I thought K-means was OK for my application, but admit to some fogginess re: hierarchical vs. disjoint clustering methods.

(I chose FASTCLUS because I thought I wanted disjoint and the ease of specifying number of clusters--but better understanding doesn't mean best procedure for my simple data.)

Thank you!
mjbstats
Calcite | Level 5
BTW, I have found SAS Technical Report A-108, Cubic Clustering Criterion, and Usage Note 22540: "How can I tell how many clusters...?" to be very useful.
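
For the FASTCLUS route, one way to mimic the groups-versus-RMSE plot from the original question is to run the procedure for several values of k and collect the overall statistics. The sketch below is an illustration under assumptions, not the code from the usage note: the macro name kscan, the data set mydata, and the variables x1-x5 are hypothetical, and it assumes the OUTSTAT= data set carries an OVER_ALL column with _TYPE_ rows such as WITHIN_STD, PSEUDO_F, and CCC (worth confirming with PROC PRINT on your release).

```sas
%macro kscan(data=, vars=, kmax=10);
   %local k;
   %do k = 2 %to &kmax;
      /* MAXITER= is raised so the seeds are refined over several passes. */
      proc fastclus data=&data maxclusters=&k maxiter=100
                    outstat=stat&k noprint;
         var &vars;
      run;

      /* Keep only the overall summary rows and tag them with k. */
      data stat&k;
         set stat&k;
         where _type_ in ('WITHIN_STD', 'PSEUDO_F', 'CCC');
         k = &k;
         keep k _type_ over_all;
      run;
   %end;

   /* Stack the per-k summaries into one data set. */
   data kscan;
      set stat2-stat&kmax;
   run;
%mend kscan;

%kscan(data=mydata, vars=x1-x5, kmax=10)

/* Pooled within-cluster standard deviation against k: look for the elbow. */
proc sgplot data=kscan;
   where _type_ = 'WITHIN_STD';
   series x=k y=over_all / markers;
run;
```

The pooled within-cluster standard deviation plays roughly the role of the RMSE in the colleague's plot; changing the WHERE clause to 'PSEUDO_F' or 'CCC' plots those criteria against k instead.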
