Re: New Variable creation: After or before original data cleaning?

Shashank7 · Posted 01-17-2018 12:51 PM

Hi,

I am trying to perform k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) from 4 of the 20 variables in the dataset (Balance, Limit, Payment, Minimum Payment). Ex: Usage = (Balance/Limit) and Payment_ratio = (Payment/Minimum Payment). The idea is that I would then use these 2 variables in my analysis instead of the original 4. Now, should I create these new variables at the very start, or should I first cap the outliers, remove/impute missing values, and then create these 2 new variables?
I first cleaned the data (with the original variables) and then created these two variables. I then capped the outliers and treated missing values again for the 2 new variables. After doing this, the 2 new variables are skewed. I tried taking log, sqrt, etc., but the variables are still not normal, i.e., there is still skewness in them. Should I go ahead and do Factor Analysis with the new skewed variables? If not, how should I transform these 2 variables to minimize their skewness?

Can anyone please suggest a way to go about this? Thanks.
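
For illustration, the two ratios described above could be derived as in the sketch below. It is written in Python rather than SAS, and the function name `safe_ratio` and the sample rows are hypothetical, not taken from the actual dataset. It assumes that a missing input or a zero denominator should leave the ratio missing rather than raise an error.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    # Assumption: a missing input or a zero denominator yields a missing ratio.
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

# Hypothetical rows: (balance, limit, payment, minimum_payment)
rows = [
    (500.0, 1000.0, 200.0, 100.0),
    (1200.0, 1000.0, 50.0, 0.0),    # zero minimum payment: payment ratio undefined
    (None, 3000.0, 300.0, 150.0),   # missing balance: usage undefined
]

# Build the two derived variables for every row.
derived = [
    {"usage": safe_ratio(b, l), "payment_ratio": safe_ratio(p, m)}
    for b, l, p, m in rows
]

for record in derived:
    print(record)
```

Note that building the ratios can itself introduce new missing values (for example when Minimum Payment is 0), which is one reason the order of variable creation and cleaning matters.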

Reeza · Posted 01-17-2018 01:08 PM

Create the variables and then clean your data.

In general, you should 'clean' your data before analysis. I'm in the camp that doesn't necessarily believe in capping outliers, because they're often the most interesting portion of the data. The reason I'm saying to create the variables first and then clean the data is that it makes it easier to see issues in your data and can help you detect irregular data points.
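
As a rough illustration of how derived variables can expose irregular data points, the sketch below flags observations instead of capping them. It is in Python rather than SAS, and the function name `flag_irregular`, the flag labels, and the thresholds (usage above 1, payment ratio below 1) are assumptions chosen for the example, not rules stated in this thread.

```python
def flag_irregular(usage, payment_ratio):
    """Return a list of flags for one observation's derived values (None = missing)."""
    flags = []
    if usage is None:
        flags.append("usage_missing")
    elif usage < 0:
        flags.append("negative_usage")
    elif usage > 1:
        # Assumed rule: a balance above the limit deserves a closer look.
        flags.append("balance_over_limit")
    if payment_ratio is None:
        flags.append("payment_ratio_missing")
    elif payment_ratio < 1:
        # Assumed rule: paying less than the minimum is worth inspecting.
        flags.append("paid_below_minimum")
    return flags

# Hypothetical derived values for three observations.
observations = [
    {"usage": 0.5, "payment_ratio": 2.0},
    {"usage": 1.2, "payment_ratio": None},
    {"usage": -0.1, "payment_ratio": 0.4},
]

flagged = [flag_irregular(o["usage"], o["payment_ratio"]) for o in observations]

for obs, flags in zip(observations, flagged):
    print(obs, flags)
```

A usage above 1 is easy to spot on the ratio, while the same problem is hard to see when Balance and Limit are inspected separately.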

Shashank7 · Posted 01-17-2018 02:44 PM

Hi Reeza, I created these two variables before doing anything else and then checked their skewness using proc univariate (histogram). Both of the newly created variables are extremely skewed. I tried to transform them by taking log, square root, etc., but the skewness is still there (it was reduced, but the variables are still not normal).

Should I go ahead and simply cap these new variables (along with all the other variables), treat missing values, and do FA? If not, how should I transform these 2 variables to minimize their skewness, then cap them, and then do FA? 🙂
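
To make the comparison of transformations concrete, here is a minimal sketch in Python (not SAS) that measures skewness before and after a square-root and a log transform. The sample values are made up to mimic a right-skewed ratio, and `skewness` is a simple moment-based estimate, so it may differ slightly from the figure a statistics package reports. `log1p` (log of 1 + x) is used as an assumption so that zero values stay defined.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5

# Hypothetical right-skewed ratio values with a long upper tail.
usage = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.5, 4.0, 20.0]

skew_raw = skewness(usage)
skew_sqrt = skewness([math.sqrt(v) for v in usage])
skew_log = skewness([math.log1p(v) for v in usage])

print("raw  :", round(skew_raw, 2))
print("sqrt :", round(skew_sqrt, 2))
print("log1p:", round(skew_log, 2))
```

On data like this, both transforms reduce the skewness without removing it, which matches what is described above: a heavy upper tail can survive a log or square-root transform.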