topic Re: SAS EM: is cluster node using K means? in SAS Data Science

SAS EM: is cluster node using K means?

ycenycute — Wed, 26 Oct 2022 11:26:45 GMT

I am wondering whether the Cluster node in EM uses the k-means algorithm. As I understand it, k-means relies on the distance between the centroids of two clusters. So should I change the clustering method to Centroid to enable k-means?

Re: SAS EM: is cluster node using K means?

sbxkoenk — Thu, 27 Oct 2022 20:11:48 GMT

Hello,

If I remember correctly, this is how the Cluster node in Enterprise Miner works:

First, k-means clustering is run to get k preliminary clusters (by default k = 50, I believe; it can also be another large number such as 40 or 60).
Then the k mean vectors (multivariate means) of those clusters are clustered with agglomerative hierarchical clustering (from k clusters down to one cluster).
The CCC (Cubic Clustering Criterion) is then used to select the "best" number of clusters. Let's say it is m (m <= k).
Finally, k-means with k = m is run on all original observations to get the final clusters.

Procedures used are PROC FASTCLUS and PROC CLUSTER.

SAS® Enterprise Miner™ 15.1: Reference Help
Cluster Node

https://go.documentation.sas.com/doc/en/emref/15.1/p042mbykzcvpoln1m14cycem6m4a.htm

I think there are also High-Performance nodes in Enterprise Miner 15.1 and 15.2.
The High-Performance nodes also include clustering.
The High-Performance clustering node uses PROC HPCLUS.
That is k-means clustering only.
To estimate the number of clusters (NOC), NOC=ABC is specified in the PROC HPCLUS statement.
This option uses the aligned box criterion (ABC) method to find the "best" number of clusters.

BR,

Koen

Re: SAS EM: is cluster node using K means?

ycenycute — Fri, 28 Oct 2022 04:14:05 GMT

In Enterprise Miner there are several selection criteria. What is the difference between Ward and Centroid? Do they both use the k-means algorithm? Centroid seems like k-means, because k-means is based on calculating the distance between a centroid and the other data points.

Re: SAS EM: is cluster node using K means?

sbxkoenk — Fri, 28 Oct 2022 06:23:56 GMT

Hello,

That property is for the PROC CLUSTER (agglomerative hierarchical clustering) part of the algorithm!

See here :

SAS/STAT® 15.2 User's Guide
The CLUSTER Procedure
Clustering Methods

https://documentation.sas.com/doc/en/statug/15.2/statug_cluster_details01.htm

For k-means you do not have that choice (distances in k-means are always distances to the centroid).

Also, k-means starts with k clusters and ends with k clusters; the way the clusters are formed is completely different from hierarchical clustering.
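To make the point about k-means concrete, here is a minimal Python sketch of the textbook (Lloyd) k-means loop. It is only an illustration, not the code PROC FASTCLUS runs; the function name `kmeans`, the toy points, and the choice of seeds are assumptions made for this example. Note that the only distance ever computed is from an observation to a centroid, and that the number of clusters never changes.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Textbook Lloyd k-means. k = len(seeds) stays fixed from start to end."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no observation changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical toy data: two obvious groups, seeded with one point from each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, seeds=[points[0], points[3]])
print(labels)
print(centroids)
```

There is no Ward or Centroid choice anywhere in this loop, which is why that property cannot apply to the k-means part.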

Koen

Re: SAS EM: is cluster node using K means?

ycenycute — Fri, 28 Oct 2022 06:24:20 GMT

Okay. Are you suggesting that if I drag a cluster node into the diagram, it does not matter if I choose Ward or Centroid in the property panel on the left? Because I am able to choose Ward or Centroid if I select the cluster node (I don't think it is hierarchical clustering node). Are you suggesting these two methods will give the same results?

Re: SAS EM: is cluster node using K means?

sbxkoenk — Fri, 28 Oct 2022 08:43:12 GMT

Hello,

The Ward and centroid methods will probably not give the same end result, unless both methods arrive at the same number of clusters.

Remember my first reply:

First a PROC FASTCLUS is done with k=50 (or another BIG number).
Then the 50 mean vectors are hierarchically clustered (PROC CLUSTER) using the WARD or centroid method to estimate the best number of clusters. Let's say that is m (m <= k).
Then k-means on the original data is done again with PROC FASTCLUS and k = m.
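The three steps above can be sketched in plain Python. This is a rough illustration under several assumptions, not what Enterprise Miner executes: the preliminary k is 6 instead of 50, the data are a made-up toy set, the final k-means is seeded with the centroids from the hierarchy, and, most importantly, the CCC is replaced by a much simpler stand-in (stop just before the merge with the largest relative jump in Ward merge cost). All function names are hypothetical.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd k-means. Returns (centroids, labels)."""
    centroids, labels = [tuple(s) for s in seeds], None
    for _ in range(max_iter):
        new = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

def ward_agglomerate(clusters):
    """Merge (centroid, size) pairs down to one cluster using Ward's criterion.
    Returns the merge heights and a snapshot of the clusters at every level."""
    clusters = list(clusters)
    heights, snapshots = [], {len(clusters): list(clusters)}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ca, na), (cb, nb) = clusters[i], clusters[j]
                h = na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        (ca, na), (cb, nb) = clusters[i], clusters[j]
        merged = (tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)), na + nb)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        heights.append(h)
        snapshots[len(clusters)] = list(clusters)
    return heights, snapshots

def choose_m(heights):
    """Stand-in for the CCC (NOT the CCC itself): pick the number of clusters
    that exists just before the merge with the largest relative cost jump."""
    k = len(heights) + 1
    best_i = max(range(1, len(heights)),
                 key=lambda i: heights[i] / max(heights[i - 1], 1e-12))
    return k - best_i

# Hypothetical toy data: three small groups of four points each.
points = ([(x, y) for x in (0.0, 1.0) for y in (0.0, 1.0)]
          + [(x, y) for x in (10.0, 11.0) for y in (0.0, 1.0)]
          + [(x, y) for x in (5.0, 6.0) for y in (9.0, 10.0)])

# Step 1: preliminary k-means with a deliberately large k (6 here, 50 in the post).
prelim_centroids, prelim_labels = kmeans(points, seeds=points[::2])
prelim = [(c, prelim_labels.count(j)) for j, c in enumerate(prelim_centroids)]
prelim = [(c, n) for c, n in prelim if n > 0]  # drop empty clusters, if any

# Step 2: hierarchical clustering of the preliminary means, then pick m.
heights, snapshots = ward_agglomerate(prelim)
m = choose_m(heights)

# Step 3: final k-means on the original data with k = m.
final_centroids, final_labels = kmeans(points, seeds=[c for c, _ in snapshots[m]])
print(m, final_centroids)
```

On this toy data the sketch settles on m = 3. The hierarchical step only ever sees the 6 preliminary means, never the 12 original points, which is the whole reason for the two-stage design.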

Good luck with your analyses !

Koen

Re: SAS EM: is cluster node using K means?

ycenycute — Fri, 28 Oct 2022 08:50:48 GMT

Hi, I am not familiar with SAS code, so I don't know what the difference between PROC FASTCLUS and PROC CLUSTER is.

I use SAS EM. I drag a Cluster node from the Explore tab onto the diagram and connect it to my data node. When I select the Cluster node, the property panel on the left offers Ward, Centroid, and other options under the selection criterion. My question is: if I would like to use k-means, should I pick Centroid as the selection criterion? I don't think Ward is related to the k-means algorithm. Or do both apply to k-means?

Sorry, my question was moved here from the new users forum, and I am not sure whether I can get help here.

Re: SAS EM: is cluster node using K means?

sbxkoenk — Fri, 28 Oct 2022 10:11:37 GMT

Hello,

When doing k-means clustering, one of the difficult questions is :
----> What to choose as the value of k (number of clusters)?

That question is answered in the Enterprise Miner clustering node by using an intermediate hierarchical clustering step.

For that (intermediate) hierarchical clustering step, the methods WARD and CENTROID are relevant.

[ ... Ward's linkage is thus a method for hierarchical cluster analysis (nothing to do with k-means!!).
The idea has much in common with analysis of variance (ANOVA). The WARD linkage function specifying the distance between two clusters is computed as the increase in the "error sum of squares" (ESS) after fusing two clusters into a single cluster. ]
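For readers who want to see the difference between the two linkages in numbers, here is a small Python sketch. It is an illustration with made-up clusters `a` and `b` and hypothetical function names, not the PROC CLUSTER implementation. It computes the Ward distance literally as the increase in ESS after fusing two clusters, and compares it with the squared distance between the two centroids.

```python
def mean(cluster):
    """Centroid (mean vector) of a list of points."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def ess(cluster):
    """Error sum of squares: squared distances of the members to their mean."""
    c = mean(cluster)
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(p, c)) for p in cluster)

def centroid_distance(a, b):
    """Centroid linkage: squared Euclidean distance between the two means."""
    return sum((x - y) ** 2 for x, y in zip(mean(a), mean(b)))

def ward_distance(a, b):
    """Ward linkage: increase in ESS caused by fusing a and b."""
    return ess(a + b) - ess(a) - ess(b)

# Hypothetical clusters of different sizes.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0), (5.0, 2.0), (5.0, 4.0)]

d_centroid = centroid_distance(a, b)
d_ward = ward_distance(a, b)
# Known identity: the ESS increase equals the centroid distance scaled by
# n_a * n_b / (n_a + n_b), so Ward also takes the cluster sizes into account.
scaled = len(a) * len(b) / (len(a) + len(b)) * d_centroid
print(d_centroid, d_ward, scaled)
```

Both linkages start from the distance between centroids; Ward additionally weights it by the cluster sizes, so merging two large clusters is penalized more than merging two small ones at the same distance.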

Once the number of clusters is determined, WARD and CENTROID are no longer relevant.

Because once k is set equal to 11 for example, an 11-means clustering is done using the k-means algorithm (with k=11).

WARD is generally believed to be better than CENTROID, so go for WARD !

Again, the Cluster node uses k-means for the (preliminary and) final clustering; WARD and CENTROID only influence how the number of clusters is determined.

Koen

Re: SAS EM: is cluster node using K means?

ycenycute — Fri, 28 Oct 2022 13:19:45 GMT

I see. This is super helpful. One more question: if I select Centroid, how is the optimal k selected?

Re: SAS EM: is cluster node using K means?

sbxkoenk — Sat, 29 Oct 2022 14:27:02 GMT

Hello @ycenycute ,

How is "optimal" k selected?

Suppose you have 100 000 observations in a 20-dimensional input space.

First there's a k-means to cluster the 100 000 observations into 50 disjoint clusters.

The 50 mean vectors (multivariate means) of these 50 disjoint clusters are then hierarchically clustered. From 50 to 1.

In each agglomerative step, the distance between clusters is calculated with the centroid method, and the two clusters that are closest together (using centroid linkage) are merged. You start with 50 single-element clusters and end up with 1.
Then, using the CCC (Cubic Clustering Criterion), the "best" k is selected, i.e. the number of clusters for which the solution is believed to be "optimal" (the most heterogeneity among the clusters and the most homogeneity within them).

Suppose k is selected to be 8.

Then a new k-means clustering on the full 100 000 observations is done with k = 8 (to make 8 disjoint clusters).

In data mining, the data sets are usually too big to do only hierarchical clustering.
Doing hierarchical clustering on 100 000 observations may take a full day and lots of resources.

That is because you start with 100 000 single element clusters and in each step you merge two clusters (to eventually reach one cluster containing all observations).
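A quick back-of-the-envelope calculation in Python shows the scale of the problem. It assumes a naive implementation that computes (and stores as 8-byte floats) every pairwise distance up front; actual procedures may be smarter, so take the numbers as an order of magnitude only.

```python
def pairwise_distances(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def merge_steps(n):
    """Agglomerative clustering merges two clusters per step until one is left."""
    return n - 1

full = pairwise_distances(100_000)  # hierarchical clustering on all observations
prelim = pairwise_distances(50)     # hierarchical clustering on 50 cluster means

print(full, merge_steps(100_000))   # 4999950000 pairs, 99999 merge steps
print(prelim, merge_steps(50))      # 1225 pairs, 49 merge steps
print(f"{full * 8 / 1e9:.1f} GB for the full distance matrix at 8 bytes each")
```

With only 50 mean vectors, the hierarchical step has about 1 200 distances to consider instead of about 5 billion.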

Koen