Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Safe to include high missing percentage variables

Regular Contributor
Posts: 185

I'm in the process of building a logistic regression model. Some variables have more than 50% missing values. After imputing the missing values with zero, they improve accuracy significantly. For example, the development dataset consists of 1 million records of retail customers. The objective of the model is to decide whether the bank should offer a Certificate of Deposit (Fixed Deposit) product. We are considering historical data. Very few customers own this product, so the variables for this product have very few populated values and a high missing rate. I used to remove all variables with more than 50% missing. Am I doing something wrong from a statistical point of view?
SAS Employee
Posts: 122

Re: Safe to include high missing percentage variables



First of all, thank you for your interest in the SAS community and SAS products. My name is Jason Xin; I am a solution architect at SAS, focused mainly on analytics.


Imputing zeros for the missing values in what I would call spending categories, where non-zero values are sparsely populated, is proper from a purely technical standpoint. It is also true to the facts, because those customers did not spend.
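
A minimal sketch of that zero imputation in plain Python, assuming a missing amount really does mean the customer never held the product. The column name `cd_balance` and the sample records are hypothetical, made up only for illustration:

```python
# Hypothetical records: None marks a missing value in the raw extract.
customers = [
    {"id": 1, "cd_balance": None},
    {"id": 2, "cd_balance": 5000.0},
    {"id": 3, "cd_balance": None},
    {"id": 4, "cd_balance": None},
]


def missing_rate(rows, column):
    """Share of rows where the column is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)


def impute_zero(rows, column):
    """Return copies of the rows with missing values replaced by 0.0."""
    return [
        {**row, column: 0.0 if row[column] is None else row[column]}
        for row in rows
    ]


# Record how sparse the variable was before the imputation hides it.
rate = missing_rate(customers, "cd_balance")
imputed = impute_zero(customers, "cd_balance")
```

Computing the missing rate first keeps a record of how sparse each variable was, which is lost once the zeros are filled in.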


Here are several ideas I would like to share.

1. Try creating a set of Boolean indicators: 1 if the spending is > 0, 0 otherwise. Such flags are often more predictive than the interval-scaled values. Depending on the case, pay close attention to the univariate correlation of each binary flag with the target variable. Some binary flags can suddenly turn out to be so 'relevant' to the target that other variables are blocked from contributing.

2. Explore the possibilities of combining the individual, sparsely populated spending categories. Sometimes the populated percentage is low because the modeler has broken the categories down too finely. Try to 'prune' the categories back a bit. You can try the same with the Boolean indicators, and you can be pretty creative with AND and OR in this exercise.

3. I know you are building logistic regression models. If you have access to decision trees, test the raw (not imputed) variables with them to get an idea of how informative they are before imputation. This can be done in parallel with, or before, ideas 1 and 2 above. Sometimes combining with the raw variables as they are makes more sense, especially if you need to explain your practice to business end users. Sometimes combining with only the 'significant' or informative ones makes more sense.
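
Ideas 1 and 2 could be sketched roughly as follows in plain Python. The column names, the sample rows, and the helper functions are all hypothetical, and the Pearson correlation is only one possible way to check a flag against a binary target:

```python
from math import sqrt

# Hypothetical zero-imputed spending columns and a binary target.
rows = [
    {"cd_spend": 0.0, "savings_spend": 120.0, "target": 0},
    {"cd_spend": 500.0, "savings_spend": 0.0, "target": 1},
    {"cd_spend": 0.0, "savings_spend": 0.0, "target": 0},
    {"cd_spend": 250.0, "savings_spend": 80.0, "target": 1},
]


def flag(value):
    """Boolean indicator: 1 if spending is > 0, 0 otherwise."""
    return 1 if value > 0 else 0


def pearson(xs, ys):
    """Pearson correlation; returns None when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


target = [row["target"] for row in rows]
cd_flag = [flag(row["cd_spend"]) for row in rows]
savings_flag = [flag(row["savings_spend"]) for row in rows]

# Combine sparse categories with OR (spent in any) and AND (spent in both).
any_flag = [a | b for a, b in zip(cd_flag, savings_flag)]
both_flag = [a & b for a, b in zip(cd_flag, savings_flag)]

cd_corr = pearson(cd_flag, target)
any_corr = pearson(any_flag, target)
```

In this toy data `cd_flag` matches the target exactly, which is the kind of suspiciously 'relevant' flag that deserves a closer look before it goes into the model.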


Best Regards

Jason Xin
