Shashank7
Fluorite | Level 6

Hi,

 

I am trying to perform k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) from 4 of the 20 variables in the dataset (Balance, Limit, Payment, Minimum Payment). Ex: Usage = (Balance / Limit) and Payment_ratio = (Payment / Minimum Payment). That way I would use 2 variables in my analysis instead of the original 4. Now, should I create these new variables at the very start, or should I first cap the outliers, remove/impute missing values, and only then create the 2 new variables?
I tried cleaning the data first (with the original variables) and then creating these two variables. I then capped the outliers and treated missing values again for the 2 new variables. After doing this, the 2 new variables are skewed. I tried taking log, sqrt, etc., but the variables are still not normal, i.e. there is still skewness in them. Should I go ahead and do Factor Analysis with the new skewed variables? If not, how should I transform these 2 variables to minimize their skewness?

 

Can anyone please suggest a way to go about this? Thanks.

2 REPLIES
Reeza
Super User

Create the variables and then clean your data. 

 

In general, you should 'clean' your data before analysis. I'm in the camp that doesn't necessarily believe in capping outliers, because they're often the most interesting portion of the data. The reason I'm saying create the variables and then clean the data is that it makes it easier to see issues in your data and can help you detect irregular data points.
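
A minimal sketch of that order of operations, written in Python purely for illustration (it is not SAS code). The records, the field names, the `safe_ratio` helper, and the "usage above 1" check are all assumptions made for this example; they are not from the original dataset.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when it cannot be computed."""
    if numerator is None or denominator is None:
        return None
    if denominator == 0:
        return None
    return numerator / denominator

# Hypothetical records; the keys mirror the four variables named in the question.
records = [
    {"balance": 500.0, "limit": 1000.0, "payment": 200.0, "min_payment": 50.0},
    {"balance": 1200.0, "limit": 1000.0, "payment": 0.0, "min_payment": 25.0},
    {"balance": 300.0, "limit": 0.0, "payment": 100.0, "min_payment": None},
]

# Step 1: derive the two ratios from the raw, uncleaned values.
for r in records:
    r["usage"] = safe_ratio(r["balance"], r["limit"])
    r["payment_ratio"] = safe_ratio(r["payment"], r["min_payment"])

# Step 2: flag rows worth a closer look before deciding how to clean them.
# Assumed checks: a ratio that could not be computed, or usage above 1
# (balance larger than the limit).
flagged = [
    i for i, r in enumerate(records)
    if r["usage"] is None or r["usage"] > 1 or r["payment_ratio"] is None
]

print(flagged)
```

Building the ratios first means a zero limit or a missing minimum payment shows up as an uncomputable ratio, which is easy to miss if each original variable is capped or imputed on its own.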

Shashank7
Fluorite | Level 6
Hi Reeza, I created these two variables before doing anything else and then checked their skewness using proc univariate (histogram). I noticed that both of the newly created variables are extremely skewed. I tried to transform them by taking log, square root, etc., but the skewness is still there (it was reduced, but the variables did not become normal).

Should I go ahead and simply cap these new variables (along with all the other variables), treat missing values, and do FA? If not, how should I transform these 2 variables to minimize their skewness, then cap them, and then do FA? 🙂
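
For reference, the comparison described here can be sketched as follows, again in Python for illustration rather than SAS. The sample values are made up, and `skewness` is a hypothetical helper that computes the simple moment-based coefficient (third central moment divided by the 1.5 power of the second), which may differ slightly from the bias-adjusted statistic a statistics package reports.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data has no spread, so report no skew
    return m3 / m2 ** 1.5

# Hypothetical right-skewed ratio values (not from the original data).
raw = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.5, 4.0, 12.0]

# log1p(x) = log(1 + x) stays defined when a ratio is exactly 0.
logged = [math.log1p(v) for v in raw]
rooted = [math.sqrt(v) for v in raw]

for name, series in [("raw", raw), ("log1p", logged), ("sqrt", rooted)]:
    print(name, round(skewness(series), 3))
```

On data like this the transforms reduce the skewness without removing it, which matches the behavior described above.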


