Clustering with Too Many Variables

11-20-2015 05:05 AM

Hi,

I have 230 variables and 15,000 observations in my dataset; 30 of the variables are categorical. My goal is to find meaningful clusters in this population using the SAS EM Clustering node.

These are the steps that I apply before clustering.

- Outlier elimination

- Missing value imputation

- Encoding categorical variables (by creating binary dummy variables)
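
For readers outside SAS EM, the imputation and dummy-encoding steps above can be sketched in plain Python. This is only an illustration: the helper names `impute_median` and `one_hot` are made up here, median imputation is an assumed choice (the post does not say which method was used), and the sample values are invented.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def one_hot(values, prefix):
    """Turn one categorical column into one 0/1 column per level."""
    levels = sorted(set(values))
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical columns, not taken from the poster's data.
income = impute_median([1, None, 3, 10])
segment = one_hot(["a", "b", "a"], "seg")
```

Note that every level gets its own column here, so a categorical variable with many levels adds many inputs, which matters when the variable count is already a concern.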

I have 4 questions:

1. Do you recommend any other analyses in order to obtain better results?

2. Is incorporating the categorical variables in clustering by binarizing them the best way to use them?

3. From what I have read, I have too many variables for clustering, so as a next step I need to reduce the number of input variables.

I tried applying 'Principal Components' and 'Variable Clustering' before the 'Clustering'. I ended up with 2 different clusters, but I'm having trouble interpreting them.

When I check the output of the 'Segment Profile' node, I see the distributions of either the variable clusters or the principal components. How can I tell which components are related to which variables?

4. How do I assess the results of clustering?

Thanks in advance

Regards,

Gorkem

a week ago

You're already doing some useful preprocessing, handling missing values and taking care of collinearity. Clustering is as much art as science, so there are many different pre- and post-processing tools that can be useful.

First, I'd check that you've got variables that provide useful information about the segments you are interested in. Since you don't have a target variable in clustering, relevance of the inputs is determined based on your domain knowledge. Eliminate any that don't clearly have anything to do with your desired segments.

It sounds like you will want to interpret the clusters, which sends you down one of two different paths to handle collinearity.

One path is the way you went, by performing PCA, and using the PCs as input to the cluster analysis. Then, when you use the segment profile node, set the PC variables to not be used, but set the original input variables to be used instead. This will enable you to make sense of the clusters in terms of the original variables, even though the PCs were used for deriving clusters.
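
To illustrate the idea of profiling segments on the original variables rather than on the PCs, here is a rough Python sketch. It is not what the Segment Profile node computes internally; the function `profile_segments`, the standardized-difference measure, and the sample data are all assumptions made for illustration. The segment labels would come from whatever clustering was run on the PCs.

```python
from collections import defaultdict
from statistics import mean, pstdev

def profile_segments(rows, labels, names):
    """For each segment, express the segment mean of every original variable
    as a standardized difference from the overall mean."""
    overall = {
        n: (mean(r[n] for r in rows), pstdev(r[n] for r in rows)) for n in names
    }
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    profile = {}
    for label, members in groups.items():
        profile[label] = {}
        for n in names:
            mu, sd = overall[n]
            seg_mean = mean(m[n] for m in members)
            # A constant variable carries no information about the segments.
            profile[label][n] = (seg_mean - mu) / sd if sd else 0.0
    return profile

# Hypothetical original inputs and segment labels (labels derived elsewhere,
# for example from a clustering run on the principal components).
rows = [
    {"age": 25, "spend": 100},
    {"age": 30, "spend": 120},
    {"age": 55, "spend": 900},
    {"age": 60, "spend": 950},
]
labels = [0, 0, 1, 1]
profile = profile_segments(rows, labels, ["age", "spend"])
```

Variables whose standardized difference is far from zero in a segment are the ones that describe that segment, which gives an interpretation in terms of the original inputs.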

The other path you can take is to select exemplar variables from the variable clustering, instead of using variable cluster scores. When you do this, the cluster analysis is based on a reduced number of input variables, which are still somewhat correlated.
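
As a rough picture of what choosing exemplars means, the sketch below picks, from each variable cluster, the variable that is most correlated on average with the other members. This is a simplified stand-in, not the selection statistic the Variable Clustering node uses; `pick_exemplars`, the cluster memberships, and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pick_exemplars(columns, var_clusters):
    """Return one representative variable name per variable cluster."""
    exemplars = []
    for members in var_clusters:
        if len(members) == 1:
            exemplars.append(members[0])
            continue

        def avg_abs_corr(v):
            others = [m for m in members if m != v]
            return sum(abs(pearson(columns[v], columns[o])) for o in others) / len(others)

        exemplars.append(max(members, key=avg_abs_corr))
    return exemplars

# Hypothetical data: three related variables and one unrelated variable.
columns = {
    "income": [1, 2, 3, 4, 5],
    "spend": [1, 3, 2, 4, 5],
    "savings": [2, 1, 3, 4, 5],
    "age": [30, 25, 41, 38, 52],
}
var_clusters = [["spend", "income", "savings"], ["age"]]
exemplars = pick_exemplars(columns, var_clusters)
```

Because the exemplars are real input variables, the resulting clusters can be read directly in terms of those variables, at the cost of some remaining correlation between them.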

I hope this helps!

Cat