Data_Guy
Calcite | Level 5
Hi,

I have a couple of questions about clustering with the Cluster Node in SAS Enterprise Miner and am hoping that someone can help.

I have a data set of almost 10,000 customers containing their age, tenure with the company, whether they are a High Net Worth customer (1 or 0), and a ranking of their product holdings (1 to 4, with 1 being the highest ranking).

Below are my questions:

1. For clustering on only the continuous variables (age and tenure), I am not sure which of the three Clustering Method options in the Cluster Node (Centroid, Average, and Ward) is best. Each method gives a different number of clusters: 5 optimal clusters with Centroid, 4 with Average, and 3 with Ward.

Values for each have been standardized and transformed (to eliminate right skew of data). Also, the option of CCC cutoff = 3 was used (by default).

2. For clustering on continuous and discrete variables --> age, tenure, high value status (0 or 1), and product ranking (1 to 4), customers with high value status are bucketed into their own cluster. Does it make sense to use binary/categorical variables in clustering?

In the Encoding of Class Variables option in the Cluster Node, I chose Ordinal Encoding = Rank and Nominal Encoding = GLM.


3. I am not sure whether there is a way to output the clustering results as a SAS data set, other than copying and pasting them (from Exported Data under the Train section in the Properties bar of the Cluster Node) into an Excel spreadsheet.

Thank you.
Accepted Solutions
JasonXin
SAS Employee
Hi, Data_guy,

1. Generally speaking, it is hard to say which of Average, Centroid, or Ward is best, although I often lean towards Ward. The Average method is often more susceptible to some patterns of outliers. Both the Average and Centroid methods are often used to generate guide trees. You may want to look at the resulting clusters and profile them by some "KPI". In other words, you often need to study the 'configuration' of the clusters to decide which method is best for your application. It is entirely expected that different methods yield different numbers of clusters.
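
To make the profiling idea concrete, here is a minimal sketch in Python (not Enterprise Miner). The revenue figures, the cluster labels, and the helper name profile_clusters are all made up for illustration; in practice the labels would come from each clustering run and the KPI from your own data.

```python
from collections import defaultdict


def profile_clusters(labels, kpi):
    """Return {cluster: (size, mean KPI)} for one clustering solution."""
    totals = defaultdict(lambda: [0, 0.0])  # cluster -> [count, KPI sum]
    for label, value in zip(labels, kpi):
        totals[label][0] += 1
        totals[label][1] += value
    return {c: (n, s / n) for c, (n, s) in sorted(totals.items())}


# Hypothetical KPI: annual revenue per customer.
revenue = [120.0, 130.0, 900.0, 950.0, 110.0, 1000.0]

# Hypothetical cluster labels that two methods might assign
# to the same six customers.
labels_by_method = {
    "ward": [1, 1, 2, 2, 1, 2],
    "average": [1, 1, 2, 2, 1, 3],
}

for method, labels in labels_by_method.items():
    print(method)
    for cluster, (size, mean_kpi) in profile_clusters(labels, revenue).items():
        print(f"  cluster {cluster}: n={size}, mean revenue={mean_kpi:.1f}")
```

Comparing the per-cluster sizes and KPI means side by side is one way to judge whether an extra cluster from a given method is a useful segment or just a handful of outliers.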

2. Interval variables, that is, variables that not only rank but also measure "by how much", should be used for clustering. Ordinal or nominal variables should be avoided, since clustering essentially computes distances among observations with respect to the variables you specify.
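
A small numeric sketch of the distance argument, in Python. It assumes class levels are simply coded as numbers next to standardized interval variables; the Cluster Node's encoding options may scale them differently. The three customers are made up.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (standardized age, standardized tenure, HNW flag, product rank)
a = (0.10, -0.20, 0, 1)
b = (0.15, -0.25, 1, 1)  # nearly the same age/tenure as a, HNW flag differs
c = (0.10, -0.20, 0, 4)  # same age/tenure as a, product rank 4 instead of 1

print("a-b, interval variables only:", round(euclidean(a[:2], b[:2]), 3))
print("a-b, all four variables:     ", round(euclidean(a, b), 3))
print("a-c, all four variables:     ", round(euclidean(a, c), 3))
```

On the interval variables alone, a and b are about 0.07 apart. Adding the 0/1 flag pushes them to roughly 1.0 apart, and the rank coding puts a and c 3.0 apart even though their age and tenure are identical. Under this coding the class variables, not age and tenure, decide who ends up together, which is consistent with the High Net Worth customers landing in their own cluster.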

You can use the continuous variables to build some clusters. Then use the cluster variable as the TARGET and build a decision tree (DT) using the nominal/rank variables you left out. Keep in mind that when clustering, you did not really conduct any variable selection, whereas a DT at its default settings does select variables, so you may want to 'relax' the tree settings a little bit.
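
The first step of such a tree can be sketched in plain Python as below. This is only a one-level, Gini-based split chooser, not the Enterprise Miner Decision Tree node; the rows and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, target, feature):
    """Weighted Gini impurity of target after a multiway split on feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


def best_split(rows, target, features):
    """Return the feature whose split leaves the target purest."""
    return min(features, key=lambda f: split_impurity(rows, target, f))


# Made-up rows: 'cluster' comes from clustering on age and tenure only;
# 'hnw' and 'product_rank' are the class variables that were left out.
rows = [
    {"cluster": 1, "hnw": 0, "product_rank": 3},
    {"cluster": 1, "hnw": 0, "product_rank": 4},
    {"cluster": 1, "hnw": 1, "product_rank": 4},
    {"cluster": 2, "hnw": 1, "product_rank": 1},
    {"cluster": 2, "hnw": 1, "product_rank": 2},
    {"cluster": 2, "hnw": 0, "product_rank": 1},
]

features = ["hnw", "product_rank"]
for f in features:
    print(f, round(split_impurity(rows, "cluster", f), 3))
print("first split on:", best_split(rows, "cluster", features))
```

In this toy data the product rank separates the two clusters perfectly while the HNW flag barely helps, so the tree would describe the clusters in terms of product rank. A multiway split like this one tends to favor variables with many levels, so treat it as an illustration of the idea rather than a replacement for a proper tree.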

3. "Result of clustering": could you explain what details you would like exported and in what file structure (regular SAS data set vs. special SAS data set)? What would you like to do with them in Excel? A pivot report?

Hope this helps.

Jason Xin
SAS Institute
Financial Services and Banking Unit
Boston

dk
Calcite | Level 5

Jason,

Is there an SGF paper or any other document that outlines, with an example, the approach you suggested: clustering on the interval variables first, then building a DT with the cluster variable as the target and the nominal/rank variables as inputs?

Thanks
