I want to implement a segmentation methodology for a bank's business banking clients (SMEs, MSBs, etc.). The data they have includes: client-level data (client industry, current status (active/inactive), the branch where they opened their accounts, etc.), product holding information (which products they hold, product activation date/tenure, interest and fee income in the last 2 years, etc.), bank-to-bank transactions, POS billing, and more. The types of products include: POS, payment gateways, credit, debit and prepaid cards, fixed deposit accounts, interest-bearing accounts, insurance accounts, trade finance (letters of credit and letters of guarantee), etc.
The client has both SAS EG and SAS EM. I would like to know, from anyone's experience here, what the best clustering technique for this use case would be. I have very little experience with SAS EM, but am I correct in assuming it supports the most common clustering algorithms (k-means, SOMs, hierarchical, etc.)? Note that retail customers are completely excluded from this use case.
Any guidance would be appreciated. Please move this accordingly if it doesn't fit here.
Hello @KJazem ,
I must unfortunately contradict @GuyTreepwood .
The CLUSTER node in SAS Enterprise Miner does NOT do full-fledged hierarchical clustering on all observations (for big data, that would be an extremely challenging task).
The Cluster node in Enterprise Miner (latest version is 15.2) does K-MEANS clustering!
Hierarchical clustering in the CLUSTER node is only an intermediate step to estimate the "best" number of clusters.
This is how the CLUSTER node (in the Explore Group) works ... when you do not change the defaults :
You can also use the "HP Cluster" node in the HPDM group of nodes (HPDM = High-Performance Data Mining).
The "HP Cluster" node is running PROC HPCLUS in the background. The HPCLUS procedure is a high-performance procedure that performs k-means clustering.
And that "HP Cluster" node (PROC HPCLUS) is finding the number of clusters (the k) using the aligned box criterion (ABC) method (and NOT via that foray into hierarchical clustering).
In SAS Viya, PROC HPCLUS evolved into PROC KCLUS.
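For anyone who wants to see what k-means does under the hood, here is a minimal sketch of the basic algorithm (Lloyd's iterations) in plain Python. It is only an illustration of the technique, not how PROC HPCLUS or PROC KCLUS are implemented, and it does not include the ABC method for choosing k. The function names and the toy data are made up for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means (Lloyd's algorithm). Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments did not change, so the solution is stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: two obvious groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On real customer data you would standardize the inputs first and try several values of k (and several seeds), since k-means is sensitive to both scale and initialization.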
Via the "Open Source Integration Node" in SAS EM, you can also apply "Spectral Clustering" to your data!
Via the "SAS Code Node" in SAS EM, you can also apply PROC MODECLUS to your data!
PROC MODECLUS finds disjoint clusters of observations with coordinate or distance data by using nonparametric density estimation. It can also perform approximate nonparametric significance tests for the number of clusters.
Good luck,
Koen
On top of previous reply, I add this note :
The best way, in my opinion, to assess the quality of your clustering solution is the Silhouette Coefficient.
(you do ultimately want heterogeneity between clusters and homogeneity within clusters)
Here are 3 useful articles / blogs :
If you do not have SAS/IML (PROC IML) in your license, you can calculate the Silhouette coefficient with a macro that uses PROC DISTANCE, PROC MEANS and some data steps.
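To make clear what such a macro has to compute, here is a minimal sketch of the Silhouette coefficient in plain Python. For each observation i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b); the overall coefficient is the average over all observations. This is an illustration with Euclidean distance and made-up names and data, not the SAS macro itself.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all observations, using Euclidean distance."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    n = len(points)
    scores = []
    for i in range(n):
        # Mean distance from observation i to the members of each cluster,
        # leaving out i itself.
        mean_dist = {}
        for c in clusters:
            d = [math.dist(points[i], points[j])
                 for j in range(n) if labels[j] == c and j != i]
            mean_dist[c] = sum(d) / len(d) if d else None
        a = mean_dist[labels[i]]
        if a is None:
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        b = min(v for c, v in mean_dist.items() if c != labels[i] and v is not None)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Hypothetical toy data: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = silhouette_coefficient(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or badly assigned observations. Note that the full pairwise distance matrix grows with the square of the number of customers, so on large data you may want to compute it on a sample.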
Good luck,
Koen
Hello,
@KJazem wrote:
1) Would you say K-means clustering works best with customer segmentation? We have many features so just want to see which works best - K-means, SOM, etc. and
2) Is the Silhouette coefficient the best metric to evaluate any clustering algorithm or specifically K-means?
1) Hierarchical clustering (as done with PROC CLUSTER) is superior to k-means disjoint clustering in general, but with tens of thousands of customers and many features, it can take many hours for the calculations to finish.
Also, you might need to transform the data before clustering (same for k-means by the way).
For example, you can use the ACECLUS procedure to obtain approximate estimates of the pooled within-cluster covariance matrix and to compute canonical variables for subsequent analysis. You use PROC ACECLUS to preprocess data before you cluster it by using the CLUSTER procedure.
PROC CLUSTER has many Clustering Methods (ultrametric and others) you can try out.
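On the point about transforming the data before clustering: the simplest such transformation is plain z-score standardization, so that a variable measured in large units (fee income, say) does not dominate the distances over a variable measured in small units (tenure in years, say). The sketch below shows that idea in plain Python. It is an assumption on my part that this is the relevant transformation for your data, it is not what PROC ACECLUS does, and the names and numbers are made up.

```python
import statistics


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column carries no information; map it to 0.0.
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical rows: (tenure in years, fee income in local currency).
rows = [(2.0, 1_000_000.0), (4.0, 3_000_000.0), (6.0, 5_000_000.0)]
scaled = standardize(rows)
print(scaled)
```

After this step both columns are on the same scale, so each contributes comparably to the distances used by k-means or by hierarchical clustering.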
2) The Silhouette coefficient can be used to evaluate any clustering solution, no matter which algorithm was used to establish it, and in my opinion it is the best metric for that purpose.
Koen
Hello @KJazem ,
For inspiration, you can also look here :
https://www.lexjansen.com/search/searchresults.php?q=%22customer%20segmentation%22
SAS Tip: Learn lexjansen.com
https://communities.sas.com/t5/SAS-Tips-from-the-Community/SAS-Tip-Learn-lexjansen-com/td-p/436336
Koen
