Hello,
When doing k-means clustering, one of the difficult questions is:
----> What value should you choose for k (the number of clusters)?
That question is answered in the Enterprise Miner clustering node by using an intermediate hierarchical clustering step.
For that (intermediate) hierarchical clustering step, the methods WARD and CENTROID are relevant.
[ ... Ward's linkage is thus a method for hierarchical cluster analysis (nothing to do with k-means!).
The idea has much in common with analysis of variance (ANOVA). The WARD linkage distance between two clusters is computed as the increase in the "error sum of squares" (ESS) that results from fusing the two clusters into a single cluster. ]
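To make the ESS idea concrete, here is a minimal pure-Python sketch (toy data, not SAS code): the Ward "distance" between two clusters is simply the growth in ESS caused by merging them.

```python
# Minimal sketch: Ward linkage distance = increase in the error sum of
# squares (ESS) when two clusters are fused into one.
def ess(cluster):
    """Error sum of squares: squared distances of points to the cluster mean."""
    n = len(cluster)
    dim = len(cluster[0])
    mean = [sum(p[d] for p in cluster) / n for d in range(dim)]
    return sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in cluster)

def ward_distance(a, b):
    """Increase in ESS when clusters a and b are fused."""
    return ess(a + b) - ess(a) - ess(b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (5.0, 0.0)]
print(ward_distance(a, b))  # 16.0
```

Tight, well-separated clusters give a large ESS increase when merged, which is why Ward tends to join compact, nearby clusters first.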
Once the number of clusters is determined, WARD and CENTROID are no longer relevant.
Because once k is set to, say, 11, an 11-means clustering is done using the k-means algorithm (with k = 11).
WARD is generally believed to be better than CENTROID, so go for WARD!
Again, the cluster node uses k-means for the (preliminary and) final clustering; WARD and CENTROID only serve to determine the number of clusters.
Koen
Hello,
If I remember correctly, this is how the cluster node in Enterprise Miner works:
The procedures used are PROC FASTCLUS and PROC CLUSTER.
SAS® Enterprise Miner™ 15.1: Reference Help
Cluster Node
https://go.documentation.sas.com/doc/en/emref/15.1/p042mbykzcvpoln1m14cycem6m4a.htm
I think there are also High-Performance nodes in Enterprise Miner 15.1 and 15.2.
The High-Performance nodes also have clustering.
Using the High-Performance clustering node, PROC HPCLUS is used.
That is k-means clustering only.
To estimate the number of clusters (NOC), specify NOC=ABC in the PROC HPCLUS statement.
This option uses the aligned box criterion (ABC) method to find the "best" number of clusters.
BR,
Koen
In Enterprise Miner, there is a selection criterion property. What is the difference between Ward and Centroid? Do they both use the K-means algorithm? Centroid seems like K-means, because K-means is based on calculating distances between centroids and the other data points.
Hello,
That property is for the PROC CLUSTER (agglomerative hierarchical clustering) part of the algorithm!
See here:
SAS/STAT® 15.2 User's Guide
The CLUSTER Procedure
Clustering Methods
https://documentation.sas.com/doc/en/statug/15.2/statug_cluster_details01.htm
For k-means you do not have that choice (distances in k-means are always distances to the centroids).
But k-means starts with k clusters and ends with k clusters (the way the clusters are constituted is completely different from hierarchical clustering).
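A minimal pure-Python sketch (assumed toy data, not the Enterprise Miner code) of why there is no distance choice in k-means: the assignment step always measures point-to-centroid distance.

```python
# Minimal sketch: in k-means, every point is assigned to its nearest
# centroid -- the distance is always point-to-centroid, never cluster-to-cluster.
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

centroids = [(0.0, 0.0), (10.0, 10.0)]
print(nearest_centroid((1.0, 2.0), centroids))  # 0
print(nearest_centroid((9.0, 8.0), centroids))  # 1
```

Linkage choices like Ward or centroid only arise in hierarchical clustering, where you need a distance between two whole clusters.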
Koen
Okay. Are you suggesting that if I drag a cluster node into the diagram, it does not matter if I choose Ward or Centroid in the property panel on the left? Because I am able to choose Ward or Centroid if I select the cluster node (I don't think it is hierarchical clustering node). Are you suggesting these two methods will give the same results?
Hello,
The Ward and centroid methods will probably not give the same end result, unless the derived number of clusters happens to be the same with both methods.
Remember my first reply:
Good luck with your analyses !
Koen
Hi, I am not familiar with SAS code, so I don't know what the difference between PROC FASTCLUS and PROC CLUSTER is.
I use SAS EM. I drag a Cluster node from the Explore tab onto the diagram and connect it to my data node. Then, if I select the cluster node, in the property panel on the left there are Ward, Centroid, and other options under the selection criterion. My question is: if I would like to use K-means, should I pick Centroid as the selection criterion? Because I don't think Ward is related to the K-means algorithm. Or do they both apply to the K-means algorithm?
Sorry, my question was moved from the New Users forum to here. I am not sure if I can get help here.
I see. This is super helpful. One more question, if I select centroid, how is optimal K selected?
Hello @ycenycute ,
How is "optimal" k selected?
Suppose you have 100 000 observations in a 20-dimensional input space.
First, k-means is run to cluster the 100 000 observations into 50 disjoint clusters.
The 50 mean vectors (multivariate means) of these 50 disjoint clusters are then clustered hierarchically, from 50 down to 1.
In each agglomerative step, the distance between clusters is calculated using the centroid method, and the two clusters that are closest together (by centroid linkage) are merged. You start with 50 single-element clusters and end up with 1.
Then, using the CCC (Cubic Clustering Criterion), the "best" k is selected: the k-cluster solution believed to be "optimal" (i.e., the most heterogeneity among the clusters and the most homogeneity within the clusters).
Suppose k is selected to be 8.
Then a new k-means clustering on the full 100 000 observations is done with k = 8 (to make 8 disjoint clusters).
In data mining, the data sets are mostly too big to do hierarchical clustering alone.
Doing hierarchical clustering on 100 000 observations may take a full day and lots of resources.
That is because you start with 100 000 single-element clusters and in each step merge two clusters (until you eventually reach one cluster containing all observations).
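The three-phase flow above can be sketched in pure Python on toy data (this is an illustration of the idea, not the actual Enterprise Miner implementation; the CCC computation is SAS-specific and is omitted here, so the final k is simply assumed to be 2):

```python
# Illustrative sketch: 1) preliminary k-means to reduce the data to a few
# cluster means, 2) centroid-linkage agglomeration on those means (50 -> 1
# in Enterprise Miner; 10 -> 1 here), 3) final k-means with the chosen k.
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[i].append(p)
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

def centroid_linkage_merge_order(means):
    """Agglomerate: repeatedly merge the two clusters whose centroids are
    closest (centroid linkage), until one cluster remains."""
    clusters = [[m] for m in means]
    sizes = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: sq_dist(mean(clusters[ij[0]]), mean(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters.pop(j)
        sizes.append(len(clusters))
    return sizes  # cluster counts after each merge: 9, 8, ..., 1

# Toy data: two well-separated blobs instead of 100 000 observations.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]

prelim_means = kmeans(data, 10)                           # phase 1
merge_sizes = centroid_linkage_merge_order(prelim_means)  # phase 2
final_centroids = kmeans(data, 2)                         # phase 3, with k = 2
print(len(prelim_means), merge_sizes[-1], len(final_centroids))  # 10 1 2
```

The cheap part (k-means over all observations) runs twice, while the expensive part (hierarchical clustering) only ever sees the small set of preliminary cluster means.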
Koen