Hi everybody,
I am trying to cluster raw return data into 10 clusters. Is it possible in Proc cluster to exogenously identify the number of clusters?
Thank you in advance for your help.
Ozzy
Not sure what you mean by exogenously identify the number of clusters. Proc cluster builds a binary tree, starting with every observation in its own cluster and joining two clusters at every step until there is only one cluster. You can pick the level of clustering that you want by trimming that tree at the appropriate level. The trimming can be done with Proc tree. For example:
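/* Build the full cluster tree and save it in the data set "tree" */
proc cluster data=test outtree=tree method=centroid;
   var x y z;
   id id;
run;
/* Cut the tree at the level that leaves 10 clusters */
proc tree data=tree out=clusters nclusters=10;
run;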
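A quick check on the result, as a minimal sketch: the OUT= data set written by Proc tree holds a CLUSTER variable (and a CLUSNAME variable) for each observation, so a frequency table shows how many observations landed in each of the 10 clusters. The data set name clusters is the one used in the example above.

```sas
/* Count the observations assigned to each cluster.
   CLUSTER is the cluster number that Proc tree writes to its OUT= data set. */
proc freq data=clusters;
   tables cluster;
run;
```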
PG
This is exactly what I needed; thank you very much.
Hi,
I have a question here: how did you land upon the number "10" even before running proc cluster, i.e. how did you decide to create 10 clusters? What was the motivation behind that; was it a business requirement? Because otherwise pre-deciding the number of clusters is impossible and scientifically incorrect.
Now, I am in a situation where I have to use hierarchical cluster analysis but I am not able to decide the number of clusters. I see Proc ACECLUS, whose documentation says
"Neither cluster membership nor the number of clusters needs to be known. PROC ACECLUS is useful for preprocessing data to be subsequently clustered by the CLUSTER or FASTCLUS procedure"
But the example provided in the documentation (the LONE example) uses the "MAXC=3" option (which is of course a mandatory requirement of the FASTCLUS procedure and is like providing the number of clusters explicitly; see SAS/STAT(R) 9.2 User's Guide, Second Edition). If it is to be that way, then what is the use of running ACECLUS when we are giving the number of clusters explicitly, and why does the sentence quoted above say that the number of clusters need not be known? I am confused.
Nevertheless, the main question is: can we use the FASTCLUS or CLUSTER procedure without first running ACECLUS? (I think the answer is yes.) But ACECLUS has its own importance for calculating canonical variables if our dataset has variables measured on different scales. And if we use ACECLUS first, then how do we arrive at the desired number of clusters, given that the user is a novice and is not aware of the different algorithms, methods, business needs, etc.?
Thanks.
Harshad M.
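One common way to approach the question above, sketched under assumptions rather than as a definitive recipe: Proc cluster can print the cubic clustering criterion and the pseudo F and pseudo t-squared statistics (the CCC and PSEUDO options), and the cluster history is then read for a level where these statistics suggest a natural cut. The data set name test and the variables x, y, z, and id are borrowed from the earlier example in this thread; the PROPORTION= value of 0.03 and the macro variable k are placeholders for illustration, not recommendations.

```sas
/* Optional preprocessing: ACECLUS itself does not need the number of clusters.
   PROPORTION= is a tuning value; 0.03 is only a placeholder. */
proc aceclus data=test out=ace proportion=0.03 noprint;
   var x y z;
run;

/* Hierarchical clustering on the canonical variables (default names Can1-Can3).
   CCC and PSEUDO add diagnostic statistics to the cluster history;
   PRINT=15 limits the history to the last 15 merges. */
proc cluster data=ace method=ward ccc pseudo print=15 outtree=tree;
   var can1 can2 can3;
   id id;
run;

/* Placeholder: set k after reading the cluster history above. */
%let k = 3;

/* Cut the tree at the chosen number of clusters */
proc tree data=tree out=clusters nclusters=&k;
run;
```

ACECLUS is not required here: the same Proc cluster step can be run directly on the original variables (var x y z), in which case the STD option of Proc cluster is one way to put variables with different scales on a common footing.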