Safe to include high missing percentage variables

Ujjawal · Posted 10-09-2015 04:40 PM

I'm in the process of building a logistic regression model. Some variables have more than 50% missing values. After imputing the missing values with zero, these variables improve accuracy significantly. For example, the development dataset consists of 1 million records of retail customers. The objective of the model is to decide whether the bank should offer a Certificate of Deposit (Fixed Deposit) product. We are considering historical data. Very few customers own this product, so the variables for this product have very few populated values and a high missing percentage. I used to remove all variables with more than 50% missing. Am I doing something wrong from a statistical point of view?

Ujjawal · Posted 11-10-2015 03:54 PM

Ujjawal,

First of all, thank you for your interest in the SAS community and SAS products. My name is Jason Xin, and I am a solution architect at SAS, mainly focused on the analytics area.

Imputing zeros for the missing values in what I would call spending categories, where non-zero values are sparsely populated, is proper from a purely technical standpoint. It is also true to the facts: the values are missing because those customers did not spend.
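To make that concrete, here is a minimal sketch in plain Python (not SAS code) of measuring the missing percentage per variable and imputing zero only for the spending-type variables. The column names and records are hypothetical toy data, and the choice of which columns count as "spending" is an assumption you would make from business knowledge.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"cd_balance": None, "cd_txn_count": None, "savings_balance": 1200.0},
    {"cd_balance": 5000.0, "cd_txn_count": 2, "savings_balance": 300.0},
    {"cd_balance": None, "cd_txn_count": None, "savings_balance": None},
    {"cd_balance": None, "cd_txn_count": None, "savings_balance": 50.0},
]

def missing_pct(rows, col):
    """Percentage of rows where `col` is missing (None)."""
    return 100.0 * sum(r[col] is None for r in rows) / len(rows)

def impute_zero(rows, cols):
    """Return copies of the rows with None replaced by 0 in `cols` only."""
    return [
        {k: (0 if k in cols and v is None else v) for k, v in r.items()}
        for r in rows
    ]

columns = ["cd_balance", "cd_txn_count", "savings_balance"]
pct = {c: missing_pct(records, c) for c in columns}

# Assumption: zero is imputed only where "missing" plausibly means "no activity".
spend_cols = {"cd_balance", "cd_txn_count"}
imputed = impute_zero(records, spend_cols)
```

Note that `savings_balance` is left untouched here, because a missing balance does not necessarily mean a zero balance.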

Here are several ideas I'd like to share.

1. Try creating a set of Boolean indicators: 1 if the spending is > 0, 0 otherwise. Often the flags are more predictive than the interval-scaled variables. Depending on the specific case, pay close attention to the univariate correlation of such binary flags with the target variable. Some binary flags can suddenly turn out to be so 'relevant' to the target that other variables are blocked from contributing to the model.

2. Explore the possibilities of combining the individual, sparsely populated spending categories. Sometimes the populated percentage is low because the modeler has broken the categories down too finely. Try to 'prune' the categories back a bit. You can try the same with the Boolean indicators. You can be pretty creative using AND and OR in this exercise.

3. I know you are building logistic regression models. If you have access to decision trees, test the raw (not imputed) variables with them to get a sense of how informative they are before imputation. This can be done in parallel with, or before, 1 and 2 above. Sometimes combining with the raw variables as they are makes more sense, especially if you need to explain your practice to end business users. Sometimes combining with only the 'significant' or informative ones makes more sense.
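The three ideas above can be sketched roughly as follows, in plain Python rather than SAS. All data and names (`cd_spend`, `bond_spend`, `target`) are hypothetical. The last function is only a simplified stand-in for idea 3: it measures the information gain of a single "missing vs. populated" split, whereas a real decision tree would also search value thresholds.

```python
import math

# Hypothetical toy data: None = missing spend, target = 1 if the customer bought.
cd_spend = [None, 500.0, None, None, 200.0, None, None, 50.0]
bond_spend = [None, None, 100.0, None, None, None, None, 75.0]
target = [0, 1, 0, 0, 1, 0, 1, 1]

def flag(values):
    """Idea 1: 1 if spending is populated and > 0, else 0."""
    return [1 if v is not None and v > 0 else 0 for v in values]

def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def entropy(labels):
    """Entropy in bits of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain_missing_split(values, labels):
    """Idea 3 (simplified): gain from splitting raw values on missing vs. populated."""
    miss = [y for v, y in zip(values, labels) if v is None]
    pres = [y for v, y in zip(values, labels) if v is not None]
    n = len(labels)
    weighted = len(miss) / n * entropy(miss) + len(pres) / n * entropy(pres)
    return entropy(labels) - weighted

cd_flag = flag(cd_spend)
bond_flag = flag(bond_spend)

# Idea 2: combine sparse flags with OR (spent in any category) or AND (spent in both).
any_flag = [a | b for a, b in zip(cd_flag, bond_flag)]
both_flag = [a & b for a, b in zip(cd_flag, bond_flag)]

corr_cd = pearson(cd_flag, target)                 # univariate check for idea 1
gain_cd = info_gain_missing_split(cd_spend, target)
```

A flag with a very high `corr_cd` is worth a second look before it goes into the model, since that is the situation where it can crowd out the other variables.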

Best Regards

Jason Xin
