ningistine7
Calcite | Level 5

I'm performing a cluster analysis on a dataset that contains over 100 variables (after imputing, replacing, and eliminating correlated variables).

 

Before clustering, in the transformation node, should I transform all variables with LOG10 or standardize them? And for continuous variables that can be regrouped into intervals (revenue, for example), do I need to transform them with the Bucket option?

Or should I inspect every variable (after capping it at the 99th percentile to eliminate outliers), apply log10 if it is skewed, and apply the z-score to the rest?

Or should I leave every variable untransformed if I'm going to use the Minkowski distance in k-means?

DougWielenga
SAS Employee

The short answer is that there is not a right way or a wrong way to cluster, but some approaches might be more useful than others when considering your business question of interest. 

 

Here are some things to consider...

 

  * Interpretation: If you wish to understand how your clusters differ, consider building multiple cluster solutions, each using a rational subset of variables that are likely related. Clusters built on large groups of variables are still largely driven by a small subset of those variables: once a few variables with a large amount of variation have been accounted for, there is often much less variability left to separate observations on the remaining variables.

 

  * Transformation: If your goal is interpretation, some transformations do not make as much intuitive sense. For example, it is easy to think in terms of dollars as an input to clustering, but it is much less natural to think of log(dollars) or sqrt(dollars). Leaving the variables non-standardized allows the variables with high variability to drive much of the clustering, while standardizing gives each variable a better opportunity to have an impact on cluster formation. Having said that, the more variables there are, the harder it is to make sense of the clusters that form, and the more likely it is that some variables are actually more important to you than others. Creating multiple cluster solutions using different subsets of variables lets you see structure in individual clusters, while allowing individuals who are similar on one subset of variables to be different on another subset. In other situations, such as surveys, the questions with higher variability actually separate respondents while those with little variability do not, so standardizing would make all questions equally important whether or not they do a good job of separating people, which makes no sense. If you are trying to create a solution that will be monitored over time, it may make sense to standardize so that the variability in the initial training data does not bias the selection of which variables drive the solution.

 

  * Outlier Removal: Excluding outliers can cause big problems in certain clustering solutions, while keeping them can cause problems in others. When I am looking for unusual behavior (e.g. fraud), keeping the outliers lets me see the observations that fall outside the norm. Outliers can also be useful for identifying new opportunities. They are problematic, though, when I am crafting marketing campaigns aimed at larger sectors of the population and the small outlier clusters that form do not represent a large enough sector to warrant separate treatment.
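
To make the standardization point above concrete, here is a minimal Python sketch (not SAS code, and not the only way to check this). The variable names and values are hypothetical. It relies on the fact that the sum of squared pairwise Euclidean distances is proportional to the total variance, so each variable's share of the total variance is also its share of the squared distance that k-means works with.

```python
import statistics

def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def variance_share(columns):
    """Fraction of the total variance contributed by each variable."""
    variances = {name: statistics.pvariance(vals) for name, vals in columns.items()}
    total = sum(variances.values())
    return {name: var / total for name, var in variances.items()}

# Hypothetical customer data: one variable in dollars, one small count.
raw = {
    "revenue_dollars": [1200.0, 56000.0, 8300.0, 150000.0, 23000.0, 4100.0],
    "visits_per_month": [2.0, 9.0, 4.0, 12.0, 7.0, 3.0],
}
standardized = {name: zscore(vals) for name, vals in raw.items()}

raw_share = variance_share(raw)           # revenue dominates almost entirely
std_share = variance_share(standardized)  # each variable contributes equally

for name in raw:
    print(f"{name}: raw share {raw_share[name]:.4f}, "
          f"standardized share {std_share[name]:.4f}")
```

On the raw scale the dollar variable accounts for nearly all of the distance, so it would effectively decide the clusters by itself. After z-scoring, both variables contribute equally, which may or may not be what you want, for the reasons given above.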

 

In the end, there is no right or wrong cluster solution, so the metrics that exist are no substitute for evaluating the actual cluster solution(s) in light of your business questions. Build a variety of solutions based on some of these considerations, and then assess which one is most useful for your business objectives.
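
One way to set up such a comparison is to prepare several versions of the same input and cluster each one. The following is a rough Python sketch (not SAS code) using the options raised in the question: capping at the 99th percentile, log10, and z-score. The revenue values are made up, the helper names are hypothetical, and the nearest-rank percentile and `log10(1 + x)` are illustrative choices rather than recommendations.

```python
import math
import statistics

def percentile_cap(values, pct=99.0):
    """Cap values above the pct-th percentile (nearest-rank method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]

def log10_transform(values):
    """log10(1 + x) so zeros stay defined; assumes non-negative inputs."""
    return [math.log10(1.0 + v) for v in values]

def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    return statistics.fmean([z ** 3 for z in zscore(values)])

# Hypothetical revenue: 99 ordinary customers plus one extreme outlier.
revenue = [float(v) for v in range(100, 10000, 100)] + [5_000_000.0]

# Candidate versions of the same input, each feeding a separate cluster run.
variants = {
    "raw": revenue,
    "capped": percentile_cap(revenue),
    "log10": log10_transform(revenue),
    "capped_zscore": zscore(percentile_cap(revenue)),
}
skew_by_variant = {name: skewness(vals) for name, vals in variants.items()}

for name, skew in skew_by_variant.items():
    print(f"{name}: skewness {skew:.2f}")
```

Each variant answers a slightly different question. The raw version keeps the outlier visible (useful for fraud-style problems), while the capped versions keep it from forming its own tiny cluster.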

 

Hope this helps!

Doug

ningistine7
Calcite | Level 5

Thank you very much for this detailed answer 🙂
