Hi,
I am trying to run a k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) from 4 of the 20 variables (Balance, Limit, Payment, Minimum Payment). For example: Usage = Balance / Limit and Payment_ratio = Payment / Minimum Payment. That way I would only need 2 variables in my analysis instead of the original 4. Should I create these new variables at the very start, or should I first cap the outliers, remove/impute missing values, and then create them?
I first cleaned the data (using the original variables) and then created the two variables. I then capped the outliers and treated missing values again for the 2 new variables. After doing this, the 2 new variables are still skewed. I tried log, sqrt, etc., but they are still not normal, i.e. some skewness remains. Should I go ahead and do Factor Analysis with the skewed variables? If not, how should I transform these 2 variables to minimize their skewness?
Can anyone please suggest a way to go about this? Thanks.
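For anyone wanting to reproduce the skewness check described above, here is a rough sketch in plain Python. It compares the moment-based sample skewness of a ratio variable before and after a log(1 + x) transform. The sample values are made up for illustration and are not from the poster's data; `skewness` is a hypothetical helper name, and log1p is just one of the transforms mentioned.

```python
import math

def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up, right-skewed ratio values (assumption: ratios are non-negative)
payment_ratio = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 8.0, 40.0]

# log1p(x) = log(1 + x), which stays defined when a ratio is exactly 0
logged = [math.log1p(v) for v in payment_ratio]

skew_raw = skewness(payment_ratio)
skew_log = skewness(logged)
print(f"raw skewness: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

On this toy data the transform reduces the skewness but does not remove it, which matches the situation described in the question.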
Create the variables and then clean your data.
In general, you should 'clean' your data before analysis. I'm in the camp that doesn't necessarily believe in capping outliers, because they're often the most interesting portion of the data. The reason I'm saying to create the variables first and then clean the data is that it makes it easier to see issues in your data and can help you detect irregular data points.
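To illustrate the "create first, then clean" order, here is a minimal sketch in plain Python. The records and field names (`balance`, `limit`, `payment`, `min_payment`) are hypothetical stand-ins for the four variables in the question, and `safe_ratio` and `irregularities` are made-up helper names. It derives the two ratios and then flags irregular rows for review instead of capping them.

```python
from typing import Optional

def safe_ratio(num: Optional[float], den: Optional[float]) -> Optional[float]:
    """Return num / den, or None when a value is missing or den is zero."""
    if num is None or den is None or den == 0:
        return None
    return num / den

# Hypothetical records; None stands for a missing value
accounts = [
    {"balance": 500.0, "limit": 1000.0, "payment": 200.0, "min_payment": 100.0},
    {"balance": 1500.0, "limit": 1000.0, "payment": 50.0, "min_payment": 100.0},
    {"balance": 300.0, "limit": None, "payment": 80.0, "min_payment": 0.0},
]

# Step 1: create the derived variables from the raw columns
for row in accounts:
    row["usage"] = safe_ratio(row["balance"], row["limit"])
    row["payment_ratio"] = safe_ratio(row["payment"], row["min_payment"])

# Step 2: inspect the derived variables and flag irregular rows
def irregularities(row):
    notes = []
    if row["usage"] is None:
        notes.append("usage undefined")
    elif row["usage"] > 1:
        notes.append("balance exceeds limit")
    if row["payment_ratio"] is None:
        notes.append("payment_ratio undefined")
    return notes

flags = [irregularities(row) for row in accounts]
for row, notes in zip(accounts, flags):
    print(row["usage"], row["payment_ratio"], notes)
```

A balance above the limit or a zero minimum payment only becomes obvious once the ratios exist, which is the point of deriving them before deciding how to treat outliers and missing values.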