ozzy
Calcite | Level 5

Hi everybody,

I am trying to cluster raw return data into 10 clusters. Is it possible in Proc cluster to exogenously identify the number of clusters?

Thank you in advance for your help.

Ozzy

PGStats
Opal | Level 21

Not sure what you mean by exogenously identifying the number of clusters. PROC CLUSTER builds a binary tree, starting with every observation in its own cluster and, at every step, joining two clusters together until only one cluster remains. You can pick the level of clustering that you want by trimming that tree at the appropriate level. The trimming can be done with PROC TREE. For example:


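/* Build the full cluster hierarchy and save it in the data set TREE */
proc cluster data=test outtree=tree method=centroid;
var x y z;
id id;
run;

/* Cut the tree at 10 clusters; CLUSTERS holds the cluster assignments */
proc tree data=tree out=clusters nclusters=10;
run;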

PG

ozzy
Calcite | Level 5

This is exactly what I needed; thank you very much.

HarshadMadhamshettiwar
Obsidian | Level 7

Hi,

I have a question here: how did you arrive at the number "10" even before running PROC CLUSTER, i.e. how did you decide to create 10 clusters? What was the motivation behind that? Was it a business requirement? Because otherwise, pre-deciding the number of clusters is impossible and scientifically incorrect.


Now, I am in a situation where I have to use hierarchical cluster analysis, but I am not able to decide the number of clusters. I see that the documentation for PROC ACECLUS says:


"Neither cluster membership nor the number of clusters needs to be known. PROC ACECLUS is useful for preprocessing data to be subsequently clustered by the CLUSTER or FASTCLUS procedure"

But the example provided in the documentation (the LONE example) uses the "MAXC=3" option, which is of course a mandatory requirement of the FASTCLUS procedure and amounts to providing the number of clusters explicitly (SAS/STAT(R) 9.2 User's Guide, Second Edition). If that is the case, what is the use of running ACECLUS when we are giving the number of clusters explicitly, and why does the sentence quoted above say that the number of clusters need not be known? I am confused.

Nevertheless, the main question is: can we use the FASTCLUS or CLUSTER procedure without first running ACECLUS? (I think the answer is yes.) But ACECLUS has its own importance for calculating canonical variables when the dataset has variables measured on different scales. And if we use ACECLUS first, how do we arrive at the desired number of clusters, given that the user is a novice and is not aware of the different algorithms, methods, business needs, and so on?

Thanks.

Harshad M.
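
A minimal sketch of one common way to approach the questions above, not a definitive recipe. It assumes the data set test with variables x, y, z and an id variable, as in the earlier reply; the data set names test_std, tree, and clusters are placeholders. PROC STDIZE is used here as a simpler alternative to ACECLUS for putting the variables on a common scale, and the CCC and PSEUDO options of PROC CLUSTER print statistics that are often used as a guide to the number of clusters.

```sas
/* Standardize each variable to mean 0 and standard deviation 1
   so that no single variable dominates the distances. */
proc stdize data=test out=test_std method=std;
   var x y z;
run;

/* Hierarchical clustering. CCC and PSEUDO request the cubic clustering
   criterion and the pseudo F and pseudo t-squared statistics.
   PRINT=15 limits the cluster history to the last 15 merges. */
proc cluster data=test_std method=ward outtree=tree ccc pseudo print=15;
   var x y z;
   id id;
run;

/* Cut the tree at the number of clusters suggested by the statistics.
   The value 10 is only a placeholder taken from the original question. */
proc tree data=tree out=clusters nclusters=10 noprint;
run;

/* Check how many observations fall in each cluster. */
proc freq data=clusters;
   tables cluster;
run;
```

In the cluster history, local peaks of the CCC and pseudo F statistics, combined with a small pseudo t-squared value followed by a larger one at the next merge, are commonly read as candidate cluster counts. These are heuristics, so the final choice still depends on the data and the purpose of the analysis.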
