gorkemkilic
Calcite | Level 5

Hi,

I have 230 variables and 15,000 observations in my dataset; 30 of the variables are categorical. My goal is to find meaningful clusters in this population using the SAS EM Clustering node.

These are the steps that I apply before clustering.

- Outlier elimination
- Missing value imputation
- Encoding categorical variables (by creating binary dummy variables)

I have 4 questions:

1. Do you recommend any other analyses to obtain better results?
2. Is binarizing the categorical variables the best way to incorporate them into clustering?
3. From what I have read, I have too many variables for clustering, so my next step is to reduce the number of input variables.
I tried applying 'Principal Components' and 'Variable Clustering' before 'Clustering'. I ended up with 2 different clusters, but I'm having trouble interpreting them.
When I check the output of the 'Segment Profile' node, I see the distributions of either the variable clusters or the principal components. How can I tell which components are related to which variables?

4. How do I assess the results of clustering?
Thanks in advance.

Regards,
Gorkem

 

Accepted Solution
CatTruxillo
SAS Employee

You're already doing some useful preprocessing, handling missing values and taking care of collinearity. Clustering is as much art as science, so there are many different pre- and post-processing tools that can be useful.

First, I'd check that you've got variables that provide useful information about the segments you are interested in. Since you don't have a target variable in clustering, relevance of the inputs is determined based on your domain knowledge. Eliminate any that don't clearly have anything to do with your desired segments.

It sounds like you will want to interpret the clusters, which sends you down one of two different paths to handle collinearity.

One path is the way you went, by performing PCA, and using the PCs as input to the cluster analysis. Then, when you use the segment profile node, set the PC variables to not be used, but set the original input variables to be used instead. This will enable you to make sense of the clusters in terms of the original variables, even though the PCs were used for deriving clusters.
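To make the idea concrete outside Enterprise Miner, here is a minimal sketch in Python. It assumes NumPy, uses synthetic data, and the variable names (`income`, `spend`, `age`, `visits`) are made up for illustration. The k-means routine is a plain textbook version, not what the Clustering node runs internally. Clusters are derived from the PC scores, but the profile is reported as means of the original variables, and the PC coefficients show which variables drive each component.

```python
import numpy as np

def pca_scores(X, n_components):
    """Standardize X and return (scores, weights) from an SVD-based PCA."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    # weights[j, k] is the coefficient of original variable j in component k.
    weights = Vt[:n_components].T
    return Z @ weights, weights

def kmeans(points, k, n_iter=50):
    """Textbook Lloyd's k-means with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic data (assumption): two customer groups, four made-up variables.
names = ["income", "spend", "age", "visits"]
rng = np.random.default_rng(0)
group_a = rng.normal([30, 5, 25, 2], [5, 2, 4, 1], size=(100, 4))
group_b = rng.normal([80, 20, 50, 10], [5, 2, 4, 1], size=(100, 4))
X = np.vstack([group_a, group_b])

# Cluster on the principal component scores...
scores, weights = pca_scores(X, n_components=2)
labels = kmeans(scores, k=2)

# ...but profile each cluster with the ORIGINAL variables.
profile = {
    int(c): dict(zip(names, X[labels == c].mean(axis=0).round(1)))
    for c in np.unique(labels)
}
for c, means in profile.items():
    print("cluster", c, means)

# Coefficients link each component back to the original variables.
for name, row in zip(names, weights):
    print(name, row.round(2))
```

Reading the coefficient table column by column is one way to answer "which components are related to which variables": a large absolute coefficient means that variable contributes strongly to that component.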

The other path you can take is to select exemplar variables from the variable clustering, instead of using variable cluster scores. When you do this, the cluster analysis is based on a reduced number of input variables, which are still somewhat correlated.
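As a rough illustration of picking exemplars, here is a sketch in Python (NumPy assumed). It takes the variable groupings as given and, within each group, keeps the variable with the smallest 1-R² ratio, that is (1 - R² with its own cluster component) / (1 - R² with the nearest other cluster component). Treat this as a simplification: the data, the variable names, and the groups `money` and `contact` are hypothetical, and each cluster component is approximated by the first principal component of the group's standardized variables.

```python
import numpy as np

def first_component(Z):
    """Scores on the first principal component of standardized columns Z."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[0]

def pick_exemplars(X, names, groups):
    """Return one exemplar per group: the variable with the lowest 1-R^2 ratio."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    idx = {n: i for i, n in enumerate(names)}
    comps = {
        g: first_component(Z[:, [idx[n] for n in members]])
        for g, members in groups.items()
    }
    exemplars = {}
    for g, members in groups.items():
        best_name, best_ratio = None, np.inf
        for n in members:
            z = Z[:, idx[n]]
            r2_own = np.corrcoef(z, comps[g])[0, 1] ** 2
            r2_other = max(
                (np.corrcoef(z, c)[0, 1] ** 2 for h, c in comps.items() if h != g),
                default=0.0,
            )
            ratio = (1 - r2_own) / (1 - r2_other)
            if ratio < best_ratio:
                best_name, best_ratio = n, ratio
        exemplars[g] = best_name
    return exemplars

# Synthetic data (assumption): two independent latent factors, each driving
# three observed variables with increasing amounts of noise.
rng = np.random.default_rng(1)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)
names = ["income", "spend", "savings", "visits", "calls", "emails"]
X = np.column_stack([
    f1 + 0.2 * rng.normal(size=n),
    f1 + 0.6 * rng.normal(size=n),
    f1 + 1.0 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.7 * rng.normal(size=n),
    f2 + 1.1 * rng.normal(size=n),
])
# Hypothetical grouping, standing in for the output of a variable clustering step.
groups = {
    "money": ["income", "spend", "savings"],
    "contact": ["visits", "calls", "emails"],
}

exemplars = pick_exemplars(X, names, groups)
print(exemplars)
```

The selected exemplars are real input variables, so the resulting clusters can be described directly in terms of them, at the cost of discarding the information in the other members of each group.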

I hope this helps!

Cat
