11-20-2015 05:05 AM
I have 230 variables and 15,000 observations in my dataset. 30 of the variables are categorical. My goal is to find meaningful clusters in this population using the SAS EM Clustering node.
These are the steps that I apply before clustering.
- Outlier elimination
- Missing value imputation
- Encoding categorical variables (by creating binary dummy variables)
I have 4 questions:
1. Do you recommend any other analyses in order to obtain better results?
2. Is incorporating the categorical variables in clustering by binarizing them the best way to use them?
3. As far as I have researched, I have too many variables for clustering. So as a next step, I need to reduce the number of input variables.
I tried applying 'Principal Components' and 'Variable Clustering' before the 'Clustering' node. I ended up with 2 different clusters, but I'm having trouble interpreting them.
When I check the output of the 'Segment Profile' node, I see the distributions of either the variable clusters or the principal components. How can I know which components are related to which variables?
4. How do I assess the results of clustering?
Thanks in advance
a week ago
You're already doing some useful preprocessing, handling missing values and taking care of collinearity. Clustering is as much art as science, so there are many different pre- and post-processing tools that can be useful.
First, I'd check that you've got variables that provide useful information about the segments you are interested in. Since you don't have a target variable in clustering, relevance of the inputs is determined based on your domain knowledge. Eliminate any that don't clearly have anything to do with your desired segments.
It sounds like you will want to interpret the clusters, which sends you down one of two different paths to handle collinearity.
One path is the way you went, by performing PCA, and using the PCs as input to the cluster analysis. Then, when you use the segment profile node, set the PC variables to not be used, but set the original input variables to be used instead. This will enable you to make sense of the clusters in terms of the original variables, even though the PCs were used for deriving clusters.
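Outside of Enterprise Miner, the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what the EM nodes do internally: the data are synthetic, the variable names `x1`..`x4` are made up, and the `pca` and `kmeans` helpers are hypothetical hand-rolled numpy functions. The point is that the clusters are derived from the PC scores, but the profile is computed on the original variables.

```python
import numpy as np

def pca(X, n_components):
    """Standardize the columns of X and return (scores, loadings)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components]          # shape: (n_components, n_variables)
    return Z @ loadings.T, loadings

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

# Synthetic stand-in data (assumption): two groups, four collinear variables.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 100)
shift = np.array([6.0, 6.0, -6.0, -6.0])
X = rng.normal(size=(200, 4)) + truth[:, None] * shift
names = ["x1", "x2", "x3", "x4"]

# Step 1: cluster on the principal component scores.
scores, loadings = pca(X, n_components=2)
labels = kmeans(scores, k=2)

# Step 2: profile the clusters on the ORIGINAL variables, not on the PCs.
for c in np.unique(labels):
    means = X[labels == c].mean(axis=0)
    print(f"cluster {c}:", {n: round(float(m), 2) for n, m in zip(names, means)})
```

Each row of `loadings` also shows how strongly every original variable contributes to that component, which is one way to relate components back to variables.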
The other path you can take is to select exemplar variables from the variable clustering, instead of using variable cluster scores. When you do this, the cluster analysis is based on a reduced number of input variables, which are still somewhat correlated.
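To make the exemplar idea concrete, here is a rough Python sketch. It is not the algorithm behind the Variable Clustering node: the greedy correlation grouping, the 0.7 threshold, the "highest mean absolute correlation" selection rule, and the synthetic data are all assumptions chosen for illustration, and `variable_groups` and `exemplars` are hypothetical helper names.

```python
import numpy as np

def variable_groups(X, threshold=0.7):
    """Greedy grouping: a variable joins the first group whose seed
    variable it correlates with at |r| >= threshold, else starts a new group."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    groups = []
    for j in range(corr.shape[0]):
        for g in groups:
            if corr[j, g[0]] >= threshold:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups, corr

def exemplars(groups, corr):
    """From each group, pick the variable with the highest mean |r|
    to the members of its own group."""
    picks = []
    for g in groups:
        sub = corr[np.ix_(g, g)]
        picks.append(g[int(sub.mean(axis=1).argmax())])
    return picks

# Synthetic stand-in data (assumption): two latent factors, five noisy proxies.
rng = np.random.default_rng(0)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    f1 + 0.1 * rng.normal(size=n),   # a1: clean proxy for f1
    f1 + 0.5 * rng.normal(size=n),   # a2: noisier proxy for f1
    f1 + 0.5 * rng.normal(size=n),   # a3: noisier proxy for f1
    f2 + 0.1 * rng.normal(size=n),   # b1: clean proxy for f2
    f2 + 0.5 * rng.normal(size=n),   # b2: noisier proxy for f2
])
names = ["a1", "a2", "a3", "b1", "b2"]

groups, corr = variable_groups(X)
picks = exemplars(groups, corr)
print("groups:   ", [[names[j] for j in g] for g in groups])
print("exemplars:", [names[j] for j in picks])

# The reduced input set for the cluster analysis: one real variable per group.
X_reduced = X[:, picks]
```

Because each exemplar is an actual input variable rather than a composite score, the resulting clusters can be read directly in terms of the original variables.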
I hope this helps!