The Cluster Node in SAS® Enterprise Miner™ gives you two options for specifying the number of clusters. You can specify a fixed number of clusters or let Enterprise Miner find the number of clusters for you by selecting the “Automatic” option.
If you use the “Automatic” option, the options under “Selection Criterion” let you set the parameters and methods used in the automatic selection process.
Figure 1: Screenshot of the options
In this tip I will explain the available clustering methods used in the automatic selection process. But before we begin, let me summarize how the automatic selection process works in the Cluster Node:
If none of these criteria are met, then the number of clusters is set to the first local peak.
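As a rough illustration of the "first local peak" fallback, here is a minimal Python sketch (Python rather than SAS, so it can be run on its own). It assumes the peak is taken over cubic clustering criterion (CCC) values indexed by the number of clusters, and that a local peak means a value larger than both of its neighbors. The function name and the CCC values are made up for illustration; this is not the Cluster Node's actual code.

```python
def first_local_peak(values_by_k):
    """Return the smallest k whose value exceeds both neighbours, or None."""
    ks = sorted(values_by_k)
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if values_by_k[k] > values_by_k[prev_k] and values_by_k[k] > values_by_k[next_k]:
            return k
    return None


# Hypothetical criterion values per number of clusters (illustration only)
ccc_by_k = {2: -1.2, 3: 0.4, 4: 0.1, 5: 0.9, 6: 0.3}
chosen = first_local_peak(ccc_by_k)
print(chosen)
```

With these made-up values the larger peak at 5 is ignored, because the first peak (at 3) is reached earlier. That is why it is worth checking the CCC plot after the node runs.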
Agglomerative hierarchical clustering starts with each observation in preliminary clusters and, at each stage, combines the clusters that are most similar in terms of the distance between them. This inter-cluster distance can be calculated in many ways. In the Cluster Node, three methods are available: Average, Centroid, and Ward.
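To make the three inter-cluster distance measures compared in this tip (Average, Centroid, and Ward) more concrete, here is a minimal Python sketch of their textbook definitions (Python rather than SAS, so it can be run on its own). The function names and the two small clusters are made up for illustration, and the Cluster Node's exact scaling (for example, whether distances are squared) may differ.

```python
from itertools import product
from math import dist


def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def average_distance(a, b):
    """Average linkage: mean Euclidean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    """Centroid linkage: Euclidean distance between the two cluster means."""
    return dist(centroid(a), centroid(b))


def ward_distance(a, b):
    """Ward: increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return (na * nb) / (na + nb) * dist(centroid(a), centroid(b)) ** 2


# Two hypothetical clusters of two points each
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]

print(average_distance(a, b))   # about 4.236
print(centroid_distance(a, b))  # 4.0
print(ward_distance(a, b))      # 16.0
```

Ward's measure weights the centroid distance by cluster sizes, so it tends to favor merging small clusters. That is one reason the three methods can settle on different numbers of clusters for the same data.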
The graphs below show how these methods perform for some simulated data sets with different characteristics. Each data set has 100 points, and I used 50 (the default) as the initial number of cluster seeds in the first step, so the number of clusters was estimated from 50 preliminary clusters rather than from all 100 points.
Figure 2: Input data set with 3 clusters
Figure 2: Input data set with 3 poorly separated clusters
For data sets containing clusters with these characteristics, different methods with different parameters need to be investigated. In this case, Ward’s method found 7 clusters:
Figure 3: Input data set grouped into 7 clusters using Ward’s method
By comparison, the Average method found 6 clusters, while the Centroid method found 9 clusters.
When we increase the “Preliminary Maximum” to 100 (basically using all points to estimate the number of clusters), the Ward and Average methods both found 3 as the number of clusters, while the Centroid method found 5.
Figure 4: Input data set grouped into 3 clusters using Ward’s method
Figure 5: Input data set grouped into 5 clusters using Centroid method
Figure 6: Input data set with three multinormal clusters that differ in size and dispersion.
For this data set, the Ward and Average methods both found 5 clusters, while the Centroid method found 4 clusters.
Figure 7: Input data set grouped into 5 clusters using Ward’s method
Figure 8: Input data set grouped into 5 clusters using Centroid method
When we increase the “Preliminary Maximum” to 100, all three methods find 5 clusters.
Conclusions
We’ve seen how to use the Automatic selection method in the Cluster Node. Each of the methods discussed here has its own advantages. Here are some rules of thumb that can help you choose the right method for your data:
When the clusters have elongated or irregular shapes, consider transforming the input variables before clustering. Some transformations on variables can generate more spherical clusters, which can be more easily detected by the Cluster Node.
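As one simple example of such a transformation, the Python sketch below (Python rather than SAS, so it can be run on its own) standardizes two variables to z-scores. It assumes the cluster looks elongated only because one variable is measured on a much larger scale. The data and the function name are made up for illustration, and clusters that are elongated along a diagonal or irregularly shaped would need a different transformation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical cluster: x is on a much larger scale than y
x = [1000.0, 1200.0, 1400.0, 1600.0, 1800.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

# Ratio of spreads along the two axes before and after standardizing
spread_ratio_before = pstdev(x) / pstdev(y)
zx, zy = standardize(x), standardize(y)
spread_ratio_after = pstdev(zx) / pstdev(zy)

print(spread_ratio_before)  # x spreads about 200 times wider than y
print(spread_ratio_after)   # about 1: comparable spread on both axes
```

After standardizing, distances are no longer dominated by the variable with the larger scale, which is what distance-based methods such as k-means implicitly assume.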
It is a great guide, thanks
Glad you found this useful. Finding the number of clusters in a data set is a very challenging problem, and understanding these options is important for deciding how to set them. I highly encourage you to check the CCC plot in the results section after you run the node. As a side note, there is also a tip that briefly explains another way to find the number of clusters, using the NOC=ABC option in PROC HPCLUS.
Question: how did you create the scatter plots (variables used)?
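I used %em_report to create the scatter plots, with _SEGMENT_ for the GROUP option. Below is an example. Please let me know if you have any questions.
/* Register a data set under the key MyKey2; it is then available as &em_user_MyKey2 */
%em_register(type=Data, key=MyKey2);
/* Copy the data imported by this node into the registered data set */
data &em_user_MyKey2;
   set &em_import_Data;
run;
/* Scatter plot of x versus y, grouped by the cluster assignment in _SEGMENT_ */
%em_report(KEY=MyKey2, VIEWTYPE=scatter, X=x, Y=y, GROUP=_SEGMENT_, DESCRIPTION=AutoAverage, AUTODISPLAY=y);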
hello,
You said: "After the number of clusters is determined, the clusters are obtained using a k-means algorithm."
I would like to know which clustering method the Cluster Node actually performs: a hierarchical or a non-hierarchical algorithm?
I ask because your colleague told me that the Cluster Node performs a hierarchical clustering algorithm.
How can I create a scatter plot for data with five variables in SAS Enterprise Miner?
We don't have a canonical node.
How can I create a dendrogram from the cluster output of SAS EM? I checked the tree, but I want to look at a dendrogram (tree diagram) or a line-printer plot of clusters to variables. Could you please tell me which of the exported cluster data sets to use to produce the tree diagrams?
@ilknurkabul Hey, thanks for sharing the code. I'm currently trying to implement this in a project, however would like some clarity on the code. Could you explain the data you have passed as MyKey2 in the first line of your snippet?