Hi,
I am going to run a cluster analysis on some skewed data using either k-means or a hierarchical method. I wonder whether I need to transform the skewed variables to make them normal. Could someone suggest how to choose the transformation method, since there is no dependent variable?
Thanks.
You might look into PROC STDIZE to standardize the variables before running the cluster analysis.
But I would take a quick look at the data with PROC FASTCLUS before transforming, to see whether you get something interesting first. You may not need to transform at all if the units of measure are similar for all of the variables.
If your data is clustered (with more than one cluster) then it cannot be multinormal. Transformations can help homogenize cluster covariances, but shouldn't aim at normalizing the data.
I think it's more important to make sure the variables are on the same or comparable scales than to make them normal.
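To illustrate why scale matters for distance-based clustering, here is a minimal sketch in plain Python (not SAS). The two variables and their values are made up for illustration. It compares Euclidean distances before and after a z-score standardization, which is the kind of rescaling METHOD=STD in PROC STDIZE performs.

```python
import math
import statistics

# Hypothetical observations: (income in dollars, age in years). Made-up values.
rows = [
    (30000.0, 25.0),
    (32000.0, 60.0),
    (90000.0, 27.0),
]

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zscore_columns(data):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*data))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in data]

scaled = zscore_columns(rows)

# Rows 0 and 1 differ mainly in age; rows 0 and 2 differ mainly in income.
raw_d01 = euclidean(rows[0], rows[1])
raw_d02 = euclidean(rows[0], rows[2])
std_d01 = euclidean(scaled[0], scaled[1])
std_d02 = euclidean(scaled[0], scaled[2])

print("raw:         ", round(raw_d01, 1), round(raw_d02, 1))
print("standardized:", round(std_d01, 3), round(std_d02, 3))
```

On the raw data the income difference swamps the age difference, so k-means would effectively cluster on income alone. After standardization the two distances are of similar size.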
The assumptions for clustering depend on what type of clustering you intend to implement.
http://stats.stackexchange.com/questions/8148/assumptions-of-cluster-analysis
The Box-Cox transformation is used to transform data toward normality. Check PROC TRANSREG.
TRANSREG fits univariate and multivariate linear models, optionally with spline, Box-Cox, and
other nonlinear transformations. Models include regression and ANOVA, conjoint
analysis, preference mapping, redundancy analysis, canonical correlation, and penalized
B-spline regression. PROC TRANSREG supports CLASS variables.
Does the Box-Cox transformation require a dependent variable? There is no dependent variable in my cluster analysis.
No, I don't think it requires a dependent variable.
Interestingly, you can also use PROC MCMC to do it. Check its example:
Example 73.2: Box-Cox Transformation
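For what it's worth, a Box-Cox parameter can be estimated for a single variable with no dependent variable or predictors at all. The sketch below is plain Python, not SAS, and is not how PROC TRANSREG or PROC MCMC work internally. It grid-searches lambda by maximizing the usual normal profile log-likelihood. The data are made up, and the grid range and step are arbitrary choices.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_loglik(y, lam):
    """Normal profile log-likelihood of lambda (additive constants dropped)."""
    n = len(y)
    t = boxcox(y, lam)
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n  # ML variance (divide by n)
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(var) + jacobian

def skewness(x):
    """Moment-based sample skewness."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed variable: exp() of evenly spaced, symmetric values.
y = [math.exp(k / 2.0) for k in range(-4, 5)]

# Candidate lambdas: -2.0, -1.9, ..., 2.0
grid = [i / 10.0 for i in range(-20, 21)]
best_lambda = max(grid, key=lambda lam: profile_loglik(y, lam))
transformed = boxcox(y, best_lambda)

print("best lambda:", best_lambda)
print("skewness before:", round(skewness(y), 3))
print("skewness after: ", round(skewness(transformed), 3))
```

Because this toy variable was built so that log(y) is symmetric, the search settles on lambda = 0, the log transform. With real data you would run this per variable. As noted earlier in the thread, removing skewness is not a requirement for clustering, so treat this as an option rather than a necessary step.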