Re: Cluster or FastClus

GSRodney · Posted 02-05-2009 03:50 PM

I want to use either PROC CLUSTER or PROC FASTCLUS to determine whether my data can be grouped and, if so, what the best grouping is. A colleague ran this for me in a different stat package using k-means dynamic for 10, 8, 6, 4, 3, 2 groups and so on. He took the output and plotted the number of groups against the RMSE for each; the point where the line inflected represented the optimal grouping. When I run FASTCLUS or CLUSTER, I don't see an RMSE that would let me do a similar check. What should I use in the SAS output for these PROCs to determine when the number of clusters is the best it can be? Is there a metric to gauge this with?

Thanks.

mjbstats · Posted 11-08-2010 10:42 AM

Hello GSRodney,

Are you still uncertain about these procedures? I am too. I am new to clustering and am also trying to "match" results from another software program. In the example I am attempting to match, several scenarios were run and the between/within cluster variance ratio was calculated for each. Where those ratios seem to hit a point of diminishing returns (in that additional clusters no longer differentiate the groups well enough relative to the within-cluster variance), an optimal number of clusters begins to appear. Choosing the actual number of clusters is a somewhat subjective process.

BTW, the between/within ratios seem to have been calculated offline with Excel--my application involves fewer than 1,000 clustered values and only 1 dependent variable.
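For anyone who wants to reproduce that offline calculation without Excel, here is a minimal Python sketch for a single variable. It is only an illustration under stated assumptions: the function names (`kmeans_1d`, `sums_of_squares`) and the sample values are made up, the k-means is a plain Lloyd iteration with a simple deterministic start, and it will not necessarily match what PROC FASTCLUS or the other package does internally.

```python
from statistics import fmean

def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's k-means on one variable; returns a cluster index per value."""
    xs = sorted(values)
    # Deterministic start (an assumption): spread the centers across the sorted data.
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2)
                      for x in values]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = fmean(members)
    return labels

def sums_of_squares(values, labels):
    """Return (between, within) sums of squares for a labelled sample."""
    grand = fmean(values)
    between = within = 0.0
    for j in set(labels):
        members = [x for x, lab in zip(values, labels) if lab == j]
        m = fmean(members)
        within += sum((x - m) ** 2 for x in members)
        between += len(members) * (m - grand) ** 2
    return between, within

# Made-up sample with three obvious groups (illustration only).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.2, 8.8]

ratios = {}
for k in (2, 3, 4):
    between, within = sums_of_squares(values, kmeans_1d(values, k))
    ratios[k] = between / within  # within is non-zero for this sample
    print(f"k={k}  within SS={within:.3f}  between/within={ratios[k]:.1f}")
```

The raw ratio tends to keep rising as clusters are added, so the thing to look for is the number of clusters after which the gain flattens out, not the largest value.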

Anyway, if you have any additional insight on cluster analysis, measures for choosing the number of clusters, and SAS procs, please share!

Thanks.

Ksharp · Posted 11-09-2010 03:13 AM

Hi. I remember there is a statistical estimator for deciding how many clusters to use, but I forget which one. 😞
Before using PROC CLUSTER or PROC FASTCLUS, I recommend running PROC PRINCOMP and then PROC GPLOT to plot Prin1 against Prin2, which can help you decide how many clusters you want.
Also, there is no single best criterion for deciding the number of clusters; different methods can yield different clusters.

Ksharp

mjbstats · Posted 11-10-2010 10:41 AM

OK, now to show my ignorance (if I haven't already). I have no experience with PRINCOMP. Why run it, and what do the "1" and "2" you referenced estimate?

Ksharp · Posted 11-10-2010 08:45 PM

Hi.
Don't say that. I am also a beginner with SAS statistical methods.
PROC PRINCOMP performs principal component analysis, one of the oldest multivariate methods. It summarizes multivariate data with a few new variables, the principal components, computed from the correlation matrix by default (or from the covariance matrix with the COV option). The first two are named Prin1 and Prin2.
Then use these two components as the x-axis and y-axis and plot the observations in that coordinate system.
You will find that some observations lie very close together and some lie very far apart.
I recommend looking up PROC PRINCOMP in the SAS documentation.

P.S. Prin1 and Prin2 are the two directions that explain the largest share of the variance in the data.
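As a rough illustration of what the first two principal components (the two "prin" variables) are, here is a small Python sketch. It is not what PROC PRINCOMP runs internally: the helper names and the sample rows are made up, it works from the covariance matrix of the centered data (standardize each column first to get the correlation-based version), and it finds the components by simple power iteration with deflation.

```python
from statistics import fmean

def center(rows):
    """Subtract each column's mean."""
    means = [fmean(col) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_component(mat, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    p = len(mat)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

def first_two_components(rows):
    """Return per-row (prin1, prin2) scores and the two variances they explain."""
    x = center(rows)
    cov = covariance(x)
    p = len(cov)
    l1, v1 = top_component(cov)
    # Deflation: remove the first component, then find the next largest.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(p)]
                for i in range(p)]
    l2, v2 = top_component(deflated)
    scores = [(sum(a * b for a, b in zip(r, v1)),
               sum(a * b for a, b in zip(r, v2))) for r in x]
    return scores, (l1, l2)

# Made-up data: six observations on three variables, in two loose groups.
rows = [(1.0, 2.0, 0.5), (1.2, 1.9, 0.4), (0.9, 2.2, 0.6),
        (5.0, 6.1, 3.0), (5.2, 5.8, 3.1), (4.9, 6.0, 2.9)]

scores, eigenvalues = first_two_components(rows)
for prin1, prin2 in scores:
    print(f"{prin1:8.3f} {prin2:8.3f}")
```

Plotting the first column against the second gives the kind of scatter plot described above; observations from the same group should land near each other along the first axis.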

Ksharp

mjbstats · Posted 11-11-2010 08:43 AM

Thank you very much for your insights, KSharp. I will look at the SAS doc'n for PRINCOMP.

goladin · Posted 11-09-2010 06:47 AM

Hi,

The stat that you want is the CCC, which stands for cubic clustering criterion. PROC CLUSTER measures the distances between the various points and can report the CCC along with the R-squared and pseudo F statistics. PROC FASTCLUS basically implements the k-means algorithm.

Regards,
Murphy

mjbstats · Posted 11-10-2010 10:45 AM

Hello,

Can you elaborate on the CCC, and what it means? Also the Pseudo R-Square...

I thought K-means was OK for my application, but admit to some fogginess re: hierarchical vs. disjoint clustering methods.

(I chose FASTCLUS because I thought I wanted disjoint clusters and liked the ease of specifying the number of clusters, but understanding a procedure better doesn't make it the best one for my simple data.)

Thank you!

mjbstats · Posted 11-11-2010 10:49 AM

BTW, I have found SAS Technical Report A-108, Cubic Clustering Criterion, and Usage Note 22540: "How can I tell how many clusters...?" to be very useful.
