
Tip: Guidelines for Choosing a Clustering Method in the Cluster Node

by SAS Employee ilknurkabul on ‎01-06-2015 02:43 PM - edited on ‎10-06-2015 01:25 PM by Community Manager (4,461 Views)

The Cluster Node in SAS® Enterprise Miner™ gives you two options for specifying the number of clusters. You can specify a fixed number of clusters or let Enterprise Miner find the number of clusters for you by selecting the “Automatic” option.

 

If you use the “Automatic” option, the properties under “Selection Criterion” let you set the parameters and the clustering method used in the automatic selection process.

clusterNodeOptions.PNG

     Figure 1: Screenshot of the options

 

In this tip I will explain the available clustering methods used in the automatic selection process. But before we begin, let me summarize how the automatic selection process works in the Cluster Node:

 

  1. A large number of preliminary cluster seeds are selected and observations in the training data are assigned to the closest cluster seed. Means of the input variables in each of these preliminary clusters are computed. Note that you can specify the initial number of seeds using the “Preliminary Maximum” option.
  2. An agglomerative hierarchical algorithm is used to consolidate the preliminary clusters. The Cubic Clustering Criterion (CCC) is calculated at each step of the consolidation and the number of clusters is estimated using these CCC values.
  3. The smallest number of clusters is chosen that meets all four of the following criteria:
    1. The number of clusters is greater than or equal to the Minimum specified in Selection Criterion properties.
    2. The number of clusters has CCC values that are greater than the CCC Cutoff specified in the Selection Criterion properties.
    3. The number of clusters is less than or equal to the Final Maximum value.
    4. The CCC values show a peak at that number of clusters.

 

If no number of clusters meets all four criteria, then the number of clusters is set to the first local peak.
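
The selection rule in step 3 can be sketched in a few lines of Python. This is only an illustration, not the Cluster Node's implementation: the function name `select_number_of_clusters`, the dictionary of CCC values, and the definition of a "peak" as a CCC value at least as large as both of its neighbors are assumptions made here, and the CCC values in the example are made up.

```python
def select_number_of_clusters(ccc, minimum, final_maximum, ccc_cutoff):
    """Pick a number of clusters from CCC values (hypothetical helper).

    ccc maps each candidate number of clusters to its CCC value.
    """
    def is_local_peak(k):
        # Assumption: a peak is a CCC value not below either neighbor;
        # a missing neighbor is treated as minus infinity.
        left = ccc.get(k - 1, float("-inf"))
        right = ccc.get(k + 1, float("-inf"))
        return ccc[k] >= left and ccc[k] >= right

    peaks = [k for k in sorted(ccc) if is_local_peak(k)]

    # Peaks are in ascending order, so the first hit is the smallest one.
    for k in peaks:
        if minimum <= k <= final_maximum and ccc[k] > ccc_cutoff:
            return k

    # Fallback described in the text: the first local peak.
    return peaks[0] if peaks else None


# Made-up CCC values for 2 to 6 clusters.
example_ccc = {2: 1.0, 3: 4.5, 4: 2.0, 5: 5.0, 6: 3.0}
print(select_number_of_clusters(example_ccc, 2, 20, 3.0))   # 3
print(select_number_of_clusters(example_ccc, 4, 20, 3.0))   # 5
print(select_number_of_clusters(example_ccc, 2, 20, 10.0))  # 3 (fallback)
```

In the last call no peak exceeds the cutoff, so the sketch falls back to the first local peak.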

 

Agglomerative hierarchical clustering starts from the preliminary clusters and, at each stage, combines the two clusters that are most similar, that is, closest according to an inter-cluster distance measure. This distance can be calculated in many ways. In the Cluster Node, three methods are available:

 

  1. Average linkage method: The distance between two clusters is the average distance between pairs of observations, one from each cluster. This method:
    1. Tends to join clusters with small variances.
    2. Is slightly biased toward finding clusters with equal variance.
    3. Avoids the extremes of either large clusters or tight, compact clusters.
  2. Centroid method: The distance between two clusters is the squared Euclidean distance between their means. This method is more robust to outliers than most of the other hierarchical methods, but does not generally perform as well as Ward’s method or the Average method.
  3. Ward’s method: This method does not use cluster distances to combine clusters. Instead, it joins the clusters whose merger increases the variation inside the clusters the least. This method:
    1. Tends to join clusters with few observations.
    2. Minimizes the variance within each cluster. Therefore, it tends to produce homogeneous clusters and a symmetric hierarchy.
    3. Is biased toward finding clusters of equal size (similar to k-means) and approximately spherical shape.  It can be considered as the hierarchical analogue of k-means.
    4. Is poor at recovering elongated clusters.
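
To make the three criteria concrete, here is a minimal Python sketch that evaluates each of them for two small clusters. It is an illustration under stated assumptions rather than what the Cluster Node computes internally: the function names are hypothetical, the average linkage uses plain (unsquared) Euclidean distances, and Ward's criterion is written in its usual form as the increase in the within-cluster sum of squares caused by a merge.

```python
from itertools import product
from math import dist


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def average_linkage(a, b):
    # Average Euclidean distance over all pairs with one point from each cluster.
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    # Squared Euclidean distance between the two cluster means.
    return dist(centroid(a), centroid(b)) ** 2


def ward_cost(a, b):
    # Increase in the within-cluster sum of squares if a and b are merged.
    return len(a) * len(b) / (len(a) + len(b)) * centroid_distance(a, b)


cluster_a = [(0, 0), (0, 2)]
cluster_b = [(4, 0), (4, 2)]
print(average_linkage(cluster_a, cluster_b))
print(centroid_distance(cluster_a, cluster_b))
print(ward_cost(cluster_a, cluster_b))
```

The size factor in `ward_cost` grows with the number of observations in the two clusters, so for the same distance between means a merge of two small clusters is cheaper than a merge of two large ones. This is one way to see why Ward's method tends to join clusters with few observations.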

 

The graphs below show how these methods perform for some simulated data sets having different characteristics. Each dataset has 100 points and I used 50 (the default) as the initial number of cluster seeds in the first step.  The other half of the points were used for estimating the number of clusters.

 

  1. Well separated data: In this example, there are three well separated and compact clusters. For such data sets, any of the methods will perform very well. All three methods selected 3 as the number of clusters.

     InputData.png

                      Figure 2: Input data set with 3 clusters

 

  2. Poorly separated data: In this example, there are three poorly separated and compact clusters.

    InputData.png

            Figure 2: Input data set with 3 poorly separated clusters

 

          For data sets containing clusters with these characteristics, different methods with different parameters need to be investigated. In this case, Ward’s method found 7 clusters:

     AutoWard_50_2_20_3-----.png

     Figure 3: Input data set grouped into 7 clusters using Ward’s method

 

     By comparison, the Average method found 6 clusters, while the Centroid method found 9 clusters.

 

     When we increase the “Preliminary Maximum” to 100 (basically using all points to estimate the number of clusters), the Ward and Average methods both found 3 as the number of clusters, while the Centroid method found 5.

     AutoWard_100_2_20_3-----.png

     Figure 4: Input data set grouped into 3 clusters using Ward’s method

     AutoCentroid_100_2_20_3.png

     Figure 5: Input data set grouped into 5 clusters using the Centroid method

 

  3. Multinormal clusters of unequal size and dispersion:

     InputData.png

     Figure 6: Input data set with three multinormal clusters that differ in size and dispersion.

 

     For this data set, the Ward and Average methods both found 5 clusters, while the Centroid method found 4 clusters.

 

     AutoWard_50_2_20_3.png

     Figure 7: Input data set grouped into 5 clusters using Ward’s method

 

     AutoCentroid_50_2_20_3.png

     Figure 8: Input data set grouped into 4 clusters using the Centroid method

 

When we increase the “Preliminary Maximum” to 100, all three methods find 5 clusters.

 

Conclusions

 

We’ve seen how to use the Automatic selection method in the Cluster Node. Each of the methods discussed here has its own advantages. Here are some rules of thumb that can help you choose the right method for your data:

 

  1. The Cluster Node uses the Ward, Average and Centroid methods for finding the number of clusters. After the number of clusters is determined, the clusters are obtained using a k-means algorithm.
  2. If the natural clusters are well separated from each other, any of the above algorithms will perform very well.
  3. If the clusters overlap, the methods will find different numbers of clusters. To choose the best number of clusters, you can view the CCC plot from the results section and check how the values change depending on the number of clusters. (For details on the CCC method, see the SAS Technical Report.) You may also want to check the Cluster node log to see whether there are any warnings regarding the number of clusters.
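
Regarding the first rule above, the k-means step that follows once the number of clusters is fixed can be sketched as below. This is a bare-bones Lloyd-style iteration written for illustration only: the function name `k_means`, the fixed iteration count, and the choice of starting centers are assumptions made here, and the Cluster Node's actual procedure has its own initialization and convergence rules.

```python
from math import dist


def k_means(points, centers, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its closest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(column) / len(group) for column in zip(*group)) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers, groups


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, groups = k_means(points, centers=[(0, 0), (1, 1)])
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
print(groups)   # [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```

The hierarchical methods only decide how many clusters to look for; in this sketch that number is simply the length of the `centers` list passed in.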

 

When the clusters have elongated or irregular shapes, consider transforming the input variables before clustering.  Some transformations on variables can generate more spherical clusters, which can be more easily detected by the Cluster Node.
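
As one simple example of such a transformation, the sketch below standardizes each input variable to mean 0 and standard deviation 1. Standardization is my choice for illustration and is not named in the tip; it only helps when the elongation comes from variables measured on very different scales, and the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev


def standardize(points):
    """Rescale every variable to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*points))
    centers = [mean(column) for column in columns]
    # A constant column has standard deviation 0; divide by 1 instead.
    scales = [pstdev(column) or 1.0 for column in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(point, centers, scales))
        for point in points
    ]


# x spans 0 to 200 while y spans 0 to 2, so the cloud is stretched along x.
stretched = [(0, 0), (100, 1), (200, 2)]
for point in standardize(stretched):
    print(point)
```

After standardization both variables cover the same range, so neither one dominates the Euclidean distances used by the clustering methods.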

Comments
by Contributor husseinmazaar
on ‎01-08-2015 12:36 PM

It is a great guide, thanks.

by SAS Employee ilknurkabul
on ‎01-08-2015 01:22 PM

Glad you found this useful. Finding the number of clusters in a data set is a very challenging problem, so I think it is important to understand these options in order to set them well. I highly encourage you to check the CCC plot in the results section after you run the node. As a side note, there is also a tip that briefly explains another way to find the number of clusters, using the NOC=ABC option in PROC HPCLUS.

by SAS Employee ksouthall
on ‎03-31-2015 01:06 PM

Question: how did you create the scatter plots (variables used)?

by SAS Employee ilknurkabul
on ‎04-02-2015 02:39 PM

I used %em_report to create the scatter plots, with _SEGMENT_ as the GROUP option. Below is an example. Please let me know if you have any questions.

/* Register a data set key for the report */
%em_register(type=Data, key=MyKey2);

/* Copy the imported data into the registered data set */
data &em_user_MyKey2;
   set &em_import_Data;
run;

/* Scatter plot of x versus y, grouped by cluster segment */
%em_report(KEY=MyKey2, VIEWTYPE=scatter, X=x, Y=y, GROUP=_SEGMENT_, DESCRIPTION=AutoAverage, AUTODISPLAY=y);

by Occasional Contributor Noelblanc
on ‎10-22-2015 06:05 PM

Hello,

You said: "After the number of clusters is determined, the clusters are obtained using a k-means algorithm."

I want to know which clustering method the Cluster Node actually performs: a hierarchical clustering algorithm or a non-hierarchical one? I ask because your colleague told me that the Cluster Node performs a hierarchical clustering algorithm.

by Occasional Contributor Noelblanc
on ‎10-22-2015 06:22 PM

How can I create a scatter plot for data with five variables in SAS Enterprise Miner?

We don't have a canonical node.
