Hi, I am trying to perform k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) from 4 of the 20 variables in the dataset (Balance, Limit, Payment, Minimum Payment). Ex: Usage = (Balance / Limit) and Payment_ratio = (Payment / Minimum Payment). The idea is that I would then use these 2 variables in my analysis instead of the original 4.

Should I create these new variables at the very start, or should I first cap the outliers, remove/impute missing values, and only then create them?

What I tried: I first cleaned the data (with the original variables) and then created the two variables. Then I capped the outliers and treated missing values again for the 2 new variables. After doing this, the 2 new variables are skewed. I tried taking log, sqrt, etc., but the variables are still not normal, i.e. there is still skewness in them.

Should I go ahead and do Factor Analysis with the new skewed variables? If not, how should I transform these 2 variables to minimize their skewness? Can anyone please suggest a way to go about this? Thanks.
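For anyone who wants to reproduce the steps described above, here is a minimal sketch in Python with pandas. It is not the exact code used on this dataset: the column names (`Balance`, `Limit`, `Payment`, `Minimum_Payment`), the median imputation, the 1%/99% capping cutoffs, and the `log1p` transform are all assumptions chosen for illustration, and the helper names `cap_outliers` and `build_ratios` are hypothetical.

```python
import numpy as np
import pandas as pd


def cap_outliers(s, lower=0.01, upper=0.99):
    """Cap a series at the given quantiles (the cutoffs are assumed values)."""
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s.clip(lower=lo, upper=hi)


def build_ratios(df):
    """Clean the 4 source columns, build the 2 ratios, then clean those too."""
    out = df.copy()
    # Hypothetical column names; adjust to the real dataset.
    source_cols = ["Balance", "Limit", "Payment", "Minimum_Payment"]

    # Step 1: impute missing values (median, an assumption) and cap outliers
    # on the original variables.
    for c in source_cols:
        out[c] = cap_outliers(out[c].fillna(out[c].median()))

    # Step 2: create the ratios. A zero denominator becomes NaN instead of inf.
    out["Usage"] = out["Balance"] / out["Limit"].replace(0, np.nan)
    out["Payment_ratio"] = out["Payment"] / out["Minimum_Payment"].replace(0, np.nan)

    # Step 3: treat missing values and outliers again on the new variables,
    # then add a log-transformed copy. log1p assumes non-negative ratios.
    for c in ["Usage", "Payment_ratio"]:
        out[c] = cap_outliers(out[c].fillna(out[c].median()))
        out["log_" + c] = np.log1p(out[c])
    return out
```

Comparing `result["Usage"].skew()` with `result["log_Usage"].skew()` on the returned frame shows how much skewness remains after the transform.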