KJazem
Obsidian | Level 7

I want to implement a segmentation methodology for a bank for their business banking clients - SMEs, MSBs, etc. The type of data they have includes: client-level data (client industry, current status (active/inactive), the branch where they opened their accounts, etc.), product holding information (products held, product activation date/tenure, interest and fee income in the last 2 years, etc.), bank-to-bank transactions, POS billing, and more. The types of products include: POS, payment gateways, credit, debit and prepaid cards, fixed deposit accounts, interest-bearing accounts, insurance accounts, trade finance (letters of credit and letters of guarantee), etc.

 

The client has both SAS EG and SAS EM. I wanted to know, from anyone's experience here, what the best clustering technique for this use case would be. I have very little experience with SAS EM, but am I correct in assuming it supports the most common clustering algorithms - k-means, SOMs, hierarchical, etc.? Note that retail customers are completely excluded from this use case.

 

Any guidance would be appreciated. Please move this accordingly if it doesn't fit here.

6 REPLIES
GuyTreepwood
Obsidian | Level 7
Hello,

For SAS EM, the Cluster node should do k-means and hierarchical clustering, using the Centroid and Ward options, respectively, under the Clustering Method menu. For SOM, there is the SOM/Kohonen node.

You can find the Cluster node documentation here: https://documentation.sas.com/doc/en/emref/14.3/n1vjatb74dundbn12d2ecb09juak.htm

and the SOM/Kohonen here: https://documentation.sas.com/doc/en/emref/14.3/n0978xngiafo2ln1mpj80trq36qk.htm

You can perform hierarchical clustering in EG as well using PROC CLUSTER, setting the METHOD= option to either CENTROID or WARD. For k-means in EG, PROC FASTCLUS is the usual procedure.

Hope this helps.
sbxkoenk
SAS Super FREQ

Hello @KJazem ,

 

I must unfortunately contradict @GuyTreepwood .
The CLUSTER node in SAS Enterprise Miner does NOT do full-fledged hierarchical clustering on all observations (for big data, that would be an extremely challenging task). Hierarchical clustering in EM CLUSTER node is only an intermediate step to estimate the "best" number of clusters.

 

The Cluster node in Enterprise Miner (latest version is 15.2) is doing K-MEANS clustering!!

Hierarchical clustering is just an intermediate step to determine the best number of clusters.

 

This is how the CLUSTER node (in the Explore Group) works ... when you do not change the defaults :

  1. k-means is done with k=50 (preliminary maximum)
  2. Then the 50 multivariate mean vectors are clustered with WARD (agglomerative) hierarchical clustering method
  3. Then the best number of clusters is determined (minimum=2 , final maximum=20). Let's say best = 8 !
  4. Then a k-means is done again on the full dataset with k=8.
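
The four steps above can be sketched outside SAS to show the idea. This is a minimal pure-Python illustration, not the Cluster node's actual implementation. All function names (`farthest_point_seeds`, `kmeans`, `ward_merge`, `two_stage`) are hypothetical, the seeding rule is an assumption, and the final number of clusters is passed in as `k_final` rather than selected automatically, so step 3 is not reproduced here.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def farthest_point_seeds(data, k):
    """Deterministic seeding (an assumption, not SAS's own seed selection)."""
    seeds = [data[0]]
    while len(seeds) < k:
        seeds.append(max(data, key=lambda p: min(sqdist(p, s) for s in seeds)))
    return seeds

def kmeans(data, centers, iters=50):
    """Plain Lloyd's k-means starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
                  for p in data]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(data, labels) if lab == j]
            updated.append(centroid(members) if members else old)
        if updated == centers:
            break
        centers = updated
    return centers, labels

def ward_merge(centers, sizes, k_final):
    """Merge weighted centroids with Ward's criterion until k_final remain."""
    groups = [(list(c), n) for c, n in zip(centers, sizes) if n > 0]
    while len(groups) > k_final:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                (ci, ni), (cj, nj) = groups[i], groups[j]
                # Increase in within-cluster sum of squares if i and j merge
                cost = ni * nj / (ni + nj) * sqdist(ci, cj)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = groups[i], groups[j]
        merged = ([(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)], ni + nj)
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [merged]
    return [c for c, _ in groups]

def two_stage(data, k_prelim, k_final):
    """Steps 1, 2 and 4: preliminary k-means, Ward on the means, final k-means."""
    prelim_centers, prelim_labels = kmeans(data, farthest_point_seeds(data, k_prelim))
    sizes = [prelim_labels.count(j) for j in range(k_prelim)]
    seeds = ward_merge(prelim_centers, sizes, k_final)
    return kmeans(data, seeds)

# Toy data: three well-separated groups of 40 points each (made up for illustration)
rng = random.Random(42)
blobs = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
data = [[cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)]
        for cx, cy in blobs for _ in range(40)]

centers, labels = two_stage(data, k_prelim=12, k_final=3)
print(len(centers), sorted(labels.count(j) for j in range(3)))  # 3 [40, 40, 40]
```

The point of the design is that the expensive hierarchical step only ever sees the preliminary cluster means (12 here, 50 in the node's defaults), never the full dataset.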

You can also use the "HP Cluster" node in the HPDM group of nodes (HPDM = High-Performance Data Mining).

The "HP Cluster" node is running PROC HPCLUS in the background. The HPCLUS procedure is a high-performance procedure that performs k-means clustering.
And that "HP Cluster" node (PROC HPCLUS) is finding the number of clusters (the k) using the aligned box criterion (ABC) method (and NOT via that foray into hierarchical clustering).

In SAS Viya, PROC HPCLUS evolved into PROC KCLUS.

 

Via the "Open Source Integration Node" in SAS EM, you can also apply "Spectral Clustering" to your data!

Via the "SAS Code Node" in SAS EM, you can also apply PROC MODECLUS to your data!

 

PROC MODECLUS finds disjoint clusters of observations with coordinate or distance data by using nonparametric density estimation. It can also perform approximate nonparametric significance tests for the number of clusters.

Good luck,

Koen

sbxkoenk
SAS Super FREQ

On top of my previous reply, I would add this note:

 

The best way, in my opinion, to assess the quality of your clustering solution is the Silhouette Coefficient.

(you do ultimately want heterogeneity between clusters and homogeneity within clusters)

 

Here are 3 useful articles / blogs :

If you do not have SAS/IML (PROC IML) in your license, then you should calculate the Silhouette coefficient with a macro that uses PROC DISTANCE, PROC MEANS and some DATA steps.
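
To make the quantity itself concrete, here is a minimal pure-Python sketch of the Silhouette coefficient. It is not a SAS macro and not taken from the articles mentioned above. The function name `silhouette` and the toy points are made up for illustration, and Euclidean distance is an assumption.

```python
import math

def silhouette(data, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # each cluster straddles both groups
print(round(good, 3), round(bad, 3))
```

Values near 1 mean points sit much closer to their own cluster than to the next one, and negative values mean many points would fit better elsewhere. Every pair of points is compared, so with tens of thousands of customers you would normally compute it on a sample.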

 

Good luck,

Koen

KJazem
Obsidian | Level 7
These are very helpful, thank you for the references. A couple of follow-up questions: 1) Would you say K-means clustering works best with customer segmentation? We have many features so just want to see which works best - K-means, SOM, etc. and 2) Is the Silhouette coefficient the best metric to evaluate any clustering algorithm or specifically K-means?

Thanks for the help!
sbxkoenk
SAS Super FREQ

Hello,


@KJazem wrote:
1) Would you say K-means clustering works best with customer segmentation? We have many features so just want to see which works best - K-means, SOM, etc. and
2) Is the Silhouette coefficient the best metric to evaluate any clustering algorithm or specifically K-means?

1) Hierarchical clustering (as done with PROC CLUSTER) is superior to k-means disjoint clustering in general, but with tens of thousands of customers and many features, the calculations can take many hours to finish.
Also, you might need to transform the data before clustering (same for k-means by the way).

For example, you can use the ACECLUS procedure to obtain approximate estimates of the pooled within-cluster covariance matrix and to compute canonical variables for subsequent analysis. You use PROC ACECLUS to preprocess data before you cluster it by using the CLUSTER procedure.
PROC CLUSTER has many Clustering Methods (ultrametric and others) you can try out.
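
On the point about transforming the data before clustering: the simplest such transformation is standardizing each variable, so that a large-scale variable such as fee income does not swamp a small-scale one such as tenure in the distance calculation. The sketch below is a plain Python illustration of z-scoring only. It is not what PROC ACECLUS does, the function name `standardize` and the toy rows are made up, and in SAS itself PROC STDIZE is the usual tool for this step.

```python
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
            for row in rows]

# Hypothetical clients: (fee income over 2 years, tenure in years)
rows = [[120000.0, 2.0], [80000.0, 10.0], [100000.0, 6.0]]
scaled = standardize(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, both columns contribute on the same footing to Euclidean distances, which is what k-means and Ward's method both rely on.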

 

2) Silhouette coefficient is the best metric to evaluate any clustering solution no matter which algorithm was used to establish the clustering solution.

 

Koen

 

