04-29-2012 06:23 PM
I am trying to cluster raw return data into 10 clusters. Is it possible in PROC CLUSTER to exogenously identify the number of clusters?
Thank you in advance for your help.
04-29-2012 08:19 PM
Not sure what you mean by exogenously identify the number of clusters. PROC CLUSTER builds a binary tree, starting with every observation in its own cluster and joining two clusters together at every step until only one cluster remains. You can pick the level of clustering that you want by cutting that tree at the appropriate level. The cutting can be done with PROC TREE. For example:
/* Build the full hierarchy, then cut it at 10 clusters */
proc cluster data=test outtree=tree method=centroid;
   var x y z;
run;
proc tree data=tree out=clusters nclusters=10;
run;
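To check how the observations ended up distributed across the ten clusters, something like the following could be run afterwards. This is only a minimal sketch: it assumes the clusters data set written by the PROC TREE step above, and it relies on the CLUSTER variable that PROC TREE stores in its OUT= data set.

```sas
/* Assumes the CLUSTERS data set produced by the PROC TREE step above. */
/* PROC TREE's OUT= data set holds the cluster number in the variable CLUSTER. */
proc freq data=clusters;
   tables cluster;   /* count of observations in each cluster */
run;
```

Very uneven counts (for example, several clusters with a single observation) may be a hint that a different METHOD= or some preprocessing is worth trying.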
01-27-2015 09:07 AM
I have a question here: how did you arrive at the number 10 before even running PROC CLUSTER, i.e. how did you decide to create 10 clusters? What was the motivation behind that, was it a business requirement? Because otherwise, deciding the number of clusters in advance is impossible and scientifically incorrect.
Now, I am in a situation where I have to use hierarchical cluster analysis but I am unable to decide on the number of clusters. The documentation for PROC ACECLUS says:
"Neither cluster membership nor the number of clusters needs to be known. PROC ACECLUS is useful for preprocessing data to be subsequently clustered by the CLUSTER or FASTCLUS procedure"
But the only example in the documentation (SAS/STAT(R) 9.2 User's Guide, Second Edition) uses the MAXC=3 option in the FASTCLUS step that follows. FASTCLUS requires either MAXCLUSTERS= or RADIUS=, so MAXC=3 amounts to providing the number of clusters explicitly. If that is how it works, what is the use of running ACECLUS when we are giving the number of clusters explicitly anyway, and why does the sentence quoted above say that the number of clusters need not be known? I am confused.
Nevertheless, my main question is: can we use the FASTCLUS or CLUSTER procedure without first running ACECLUS? (I think the answer is yes.) Still, ACECLUS has its own importance for computing canonical variables when the variables in the data set are measured on different scales. And if we do run ACECLUS first, how do we arrive at a suitable number of clusters, given that the user is a novice and is not aware of the different algorithms, methods, business needs, and so on?
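For what it is worth, one commonly suggested way to approach that last question is to have PROC CLUSTER print the cubic clustering criterion and the pseudo F and pseudo t-squared statistics, then look for a cluster count on which they roughly agree before cutting the tree. The following is only a sketch under assumptions: the data set test and the variables x, y, z are borrowed from the earlier reply, P=0.02 is an arbitrary illustrative value, METHOD=WARD is just one possible choice, and NCLUSTERS=3 is a placeholder to be replaced after reading the output.

```sas
/* Optional preprocessing: ACECLUS writes canonical variables
   Can1-Can3 to the OUT= data set. P=0.02 is illustrative only. */
proc aceclus data=test out=ace p=0.02 noprint;
   var x y z;
run;

/* Hierarchical clustering on the canonical variables.
   CCC and PSEUDO request statistics for judging the number of clusters;
   PRINT=15 limits the history table to the last 15 merges. */
proc cluster data=ace method=ward ccc pseudo print=15 outtree=tree;
   var can1-can3;
run;

/* Cut the tree once a cluster count has been chosen from the output.
   The 3 here is a placeholder, not a result. */
proc tree data=tree out=clusters nclusters=3 noprint;
run;
```

The ACECLUS step is optional here; the same PROC CLUSTER call could be pointed at the original data set and variables instead. In the cluster history, candidate cluster counts are usually those where the CCC and pseudo F show a local peak and the pseudo t-squared is small just before a jump at the next merge, though these are heuristics and not a definitive rule.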