Clustering in SAS Enterprise Miner

04-28-2011 08:10 PM

Hi,

I have a couple of questions on clustering (using Cluster Node) using SAS Enterprise Miner and am hoping that someone can help.

I have a data set of almost 10,000 customers containing their age, tenure with the company, whether they are a High Net Worth customer (1 or 0), and a ranking of their product holdings (1 to 4, with 1 being the highest ranking).

Below are my questions:

1. For clustering on only continuous variables --> age and tenure, I am not sure which of the 3 Clustering Method options (Centroid, Average, and Ward) in the Cluster Node is best. I get a different number of clusters for each method: 5 optimal clusters using the Centroid method, 4 using the Average method, and 3 using the Ward method.

Values for each have been standardized and transformed (to eliminate right skew of data). Also, the option of CCC cutoff = 3 was used (by default).

2. For clustering on continuous and discrete variables --> age, tenure, high value status (0 or 1), and product ranking (1 to 4), customers with high value status are bucketed into their own cluster. Does it make sense to use binary/categorical variables in clustering?

In the Encoding of Class Variables option in the Cluster Node, I chose Ordinal Encoding = Rank and Nominal Encoding = GLM.

3. Not sure if there is a way to output the results of the clustering as a SAS data set other than copying and pasting the results (from Exported Data under the Train section in the Properties bar of the Cluster Node) into an Excel spreadsheet.

Thank you.


Accepted Solutions

Solution

07-07-2017
02:42 PM


Posted in reply to Data_Guy

05-13-2011 09:41 PM

Hi, Data_guy,

1. Generally speaking, it is hard to say which of Average, Centroid, or Ward is best, although I often lean towards Ward. The Average method is often more susceptible to some patterns of outliers. Both the Average and Centroid methods are often used to generate guide trees. You may want to look at the resulting clusters and profile them by some "KPI". In other words, you often need to study the 'configuration' of the clusters to decide which method is best for your application. Different methods yielding different numbers of clusters is to be expected.

2. Interval variables, that is, variables that not only rank but also measure "by how much", should be used for clustering. Ordinal or nominal variables should be avoided in clustering, since clustering essentially computes distances among observations with respect to the variables you specify.

You can use the continuous variables to build some clusters. Then use the cluster variable as the TARGET and build a decision tree (DT) using the nominal/rank variables you left out. Keep in mind that when clustering, you did not really conduct any variable selection, whereas a DT at its default settings does select variables, so you may want to 'relax' the tree settings a little bit.

3. "Result of clustering": could you explain what details you would like exported and in what file structure (regular SAS data set vs. special SAS data set)? What would you like to do with them in Excel? A pivot report?

Hope this helps.
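As a rough illustration of point 1, here is a minimal sketch in Python with SciPy, not the Enterprise Miner Cluster Node. It runs the Average, Centroid, and Ward methods on the same data and profiles the resulting clusters by size and mean. The data is synthetic (a made-up stand-in for standardized age and tenure), `cluster_profile` is a hypothetical helper, and the number of clusters is fixed at 3 by hand rather than chosen by a CCC cutoff.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Synthetic stand-in for standardized age and tenure (not the poster's data):
# three loose groups of 100 customers each in two dimensions.
centers = np.array([[-2.0, -2.0], [0.0, 2.0], [2.5, -0.5]])
X = np.vstack([c + 0.6 * rng.standard_normal((100, 2)) for c in centers])


def cluster_profile(X, method, n_clusters):
    """Cut a hierarchical tree into at most n_clusters; return labels, sizes, means."""
    Z = linkage(X, method=method)  # 'average', 'centroid', or 'ward'
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    ids = np.unique(labels)
    sizes = np.array([(labels == k).sum() for k in ids])
    means = np.array([X[labels == k].mean(axis=0) for k in ids])
    return labels, sizes, means


profiles = {}
for method in ("average", "centroid", "ward"):
    labels, sizes, means = cluster_profile(X, method, n_clusters=3)
    profiles[method] = (labels, sizes, means)
    print(method, "cluster sizes:", sizes)
    print(np.round(means, 2))  # one row per cluster: mean of each input
```

Comparing the printed sizes and means side by side is one simple way to study the 'configuration' of the clusters; in practice the profile would use whatever KPI matters for the application.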

Jason Xin

SAS Institute

Financial Services and Banking Unit

Boston


All Replies


Posted in reply to JasonXin

12-06-2012 03:56 PM

Jason,

Is there an SGF paper or any other document that outlines the approach you have suggested (with an example), that is, building a DT on the cluster variable using the nominal/rank variables after performing the clustering on the interval variables?

Thanks
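No paper is cited in this thread, but the two-step approach can be sketched outside Enterprise Miner. The following is a minimal sketch in Python with SciPy and scikit-learn, under several assumptions: the data is synthetic, the variable names (`high_net_worth`, `product_rank`) are hypothetical, Ward clustering with 3 clusters is chosen by hand, and the tree settings (`max_depth`, `min_samples_leaf`) are arbitrary illustrative values rather than Enterprise Miner defaults.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 600

# Hypothetical data: two standardized interval inputs plus two class inputs.
age = rng.standard_normal(n)
tenure = 0.5 * age + rng.standard_normal(n)
high_net_worth = (age + 0.5 * rng.standard_normal(n) > 0.8).astype(int)
product_rank = rng.integers(1, 5, size=n)  # values 1..4

# Step 1: cluster on the interval variables only.
X_interval = np.column_stack([age, tenure])
Z = linkage(X_interval, method="ward")
cluster = fcluster(Z, t=3, criterion="maxclust")

# Step 2: treat the cluster label as the target and describe it
# with the class variables that were left out of the clustering.
X_class = np.column_stack([high_net_worth, product_rank])
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
tree.fit(X_class, cluster)

# Text rendering of the splits, plus how well the class variables
# reproduce the cluster assignment on the training data.
print(export_text(tree, feature_names=["high_net_worth", "product_rank"]))
print("training accuracy:", round(tree.score(X_class, cluster), 3))
```

The tree here is descriptive rather than predictive: its splits show which levels of the left-out class variables line up with which clusters. Loosening settings such as `min_samples_leaf` is one possible reading of 'relaxing' the tree so that weaker variables still get a chance to split.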