The Cluster Node in SAS® Enterprise Miner™ gives you two options for specifying the number of clusters. You can specify a fixed number of clusters or let Enterprise Miner find the number of clusters for you by selecting the “Automatic” option.
If you use the “Automatic” option, the settings under “Selection Criterion” control the parameters and methods used in the automatic selection process.
Figure 1: Screenshot of the options
In this tip I will explain the available clustering methods used in the automatic selection process. But before we begin, let me summarize how the automatic selection process works in the Cluster Node:
If none of these criteria are met, then the number of clusters is set to the first local peak.
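To make the fallback rule concrete, here is a minimal Python sketch of picking the "first local peak" from a series of criterion values. It is only an illustration: the function name `first_local_peak`, the exact definition of a peak (strictly above the left neighbor, at least as large as the right one), and the criterion values are all assumptions of mine, not the Cluster Node's internals.

```python
def first_local_peak(values):
    """Return the index of the first local peak in a sequence, or None.

    Assumed definition: an entry that is strictly greater than its left
    neighbor and at least as large as its right neighbor.
    """
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] >= values[i + 1]:
            return i
    return None


# Hypothetical criterion values for 2, 3, 4, ... clusters (made up for illustration).
ks = [2, 3, 4, 5, 6, 7]
criterion = [1.2, 2.5, 2.1, 2.8, 3.0, 2.9]

peak = first_local_peak(criterion)
chosen_k = ks[peak] if peak is not None else None
print(chosen_k)  # 3
```

Note that the first local peak (3 clusters here) is not necessarily the global maximum of the series (6 clusters in this made-up example).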
Agglomerative hierarchical clustering starts with each observation assigned to a preliminary cluster and, at each stage, merges the two clusters that are most similar, that is, the two separated by the smallest inter-cluster distance. This distance can be calculated in many ways. In the Cluster Node, three methods are available: Average, Centroid, and Ward.
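For readers who want to see the idea outside Enterprise Miner, the following Python sketch implements textbook versions of the inter-cluster distances behind the Average, Centroid, and Ward methods compared in this tip, plus a naive merge loop. It is a sketch under assumptions: the function names are mine, the Average and Centroid distances use plain Euclidean distance, and conventions such as squared versus unsquared distances vary between implementations, so the results are not guaranteed to match the Cluster Node.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def average_distance(a, b):
    """Average linkage: mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_distance(a, b):
    """Centroid linkage: distance between the two cluster means."""
    return math.dist(centroid(a), centroid(b))


def ward_distance(a, b):
    """Ward: increase in within-cluster sum of squares caused by merging."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2


def agglomerate(points, distance, n_clusters):
    """Naive agglomeration: repeatedly merge the closest pair of clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

methods = {"average": average_distance,
           "centroid": centroid_distance,
           "ward": ward_distance}
sizes = {name: sorted(len(c) for c in agglomerate(points, dist, 2))
         for name, dist in methods.items()}
print(sizes)
```

On well-separated toy data like this, all three distances recover the same two groups. The methods tend to disagree when clusters overlap or differ in size and dispersion, which is what the simulated examples in this tip explore.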
The graphs below show how these methods perform on simulated data sets with different characteristics. Each data set has 100 points, and I used 50 (the default) as the initial number of cluster seeds in the first step. The other half of the points were used for estimating the number of clusters.
Figure 2: Input data set with 3 clusters
Figure 2: Input data set with 3 poorly separated clusters
For data sets containing clusters with these characteristics, different methods with different parameters need to be investigated. In this case, Ward’s method found 7 clusters:
Figure 3: Input data set grouped into 7 clusters using Ward's method
By comparison, the Average method found 6 clusters, while the Centroid method found 9 clusters.
When we increase the “Preliminary Maximum” to 100 (basically using all points to estimate the number of clusters), the Ward and Average methods both found 3 as the number of clusters, while the Centroid method found 5.
Figure 4: Input data set grouped into 3 clusters using Ward's method
Figure 5: Input data set grouped into 5 clusters using the Centroid method
Figure 6: Input data set with three multinormal clusters that differ in size and dispersion.
For this data set, the Ward and Average methods both found 5 clusters, while the Centroid method found 4 clusters.
Figure 7: Input data set grouped into 5 clusters using Ward's method
Figure 8: Input data set grouped into 5 clusters using the Centroid method
When we increase the “Preliminary Maximum” to 100, all three methods find 5 clusters.
We've seen how to use the Automatic selection method in the Cluster Node. Each of the methods discussed here has its own advantages. Here are some rules of thumb that can help you choose the right method for your data:
When the clusters have elongated or irregular shapes, consider transforming the input variables before clustering. Some transformations on variables can generate more spherical clusters, which can be more easily detected by the Cluster Node.
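As one illustration of such a transformation, the Python sketch below standardizes each input variable to mean 0 and standard deviation 1, so that a variable measured on a much larger scale no longer stretches the clusters along its axis. The choice of standardization, the function name `standardize`, and the data are my own assumptions for illustration. Other transformations (for example, a log transform for skewed variables) may suit your data better.

```python
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1.

    Assumes no column is constant (a zero standard deviation would
    cause a division by zero).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, sds))
        for row in rows
    ]


# Hypothetical data: the second variable is on a much larger scale than the first.
rows = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0), (4.0, 400.0)]

scaled = standardize(rows)
spreads = [statistics.pstdev(c) for c in zip(*scaled)]
print([round(s, 6) for s in spreads])  # [1.0, 1.0]
```

After standardization, both variables contribute comparably to Euclidean distances, which is the sense in which the clusters become "more spherical" for distance-based methods.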