Ujjawal
Quartz | Level 8
I'm in the process of building a logistic regression model. Some variables have more than 50% missing values. After imputing the missing values with zero, these variables improve accuracy significantly. For example, the development dataset consists of 1 million records of retail customers. The objective of the model is to decide whether the bank should offer a Certificate of Deposit (Fixed Deposit) product. We are considering historical data. Very few customers own this product, so the variables for this product have very few values populated and a high missing rate. Until now I have removed all variables with more than 50% missing values. Am I doing something wrong from a statistical point of view?
JasonXin
SAS Employee

First of all, thank you for your interest in the SAS community and SAS products. My name is Jason Xin; I am a solution architect at SAS, mainly focused on analytics.

 

Imputing zeros for the missing values in what I would call spending categories, where non-zero values are sparsely populated, is proper from a purely technical standpoint. It is also true to the facts: the values are missing because those customers did not spend.
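To make the zero imputation concrete, here is a minimal sketch in plain Python. It is only an illustration: the records and the column name `cd_balance` are hypothetical, and `None` stands in for a missing value. It measures the missing rate first, then fills the gaps with zero.

```python
# Hypothetical customer records; None marks a missing value.
customers = [
    {"id": 1, "cd_balance": None},
    {"id": 2, "cd_balance": 5000.0},
    {"id": 3, "cd_balance": None},
    {"id": 4, "cd_balance": None},
]

def missing_rate(rows, column):
    """Fraction of rows where `column` is missing (None)."""
    return sum(1 for r in rows if r[column] is None) / len(rows)

def impute_zero(rows, column):
    """Return copies of the rows with missing `column` values replaced by 0.0."""
    return [
        {**r, column: 0.0 if r[column] is None else r[column]}
        for r in rows
    ]

rate = missing_rate(customers, "cd_balance")
imputed = impute_zero(customers, "cd_balance")
```

The original rows are left untouched, so the raw (not imputed) values remain available for later comparison.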

 

Here are several ideas I would like to share.

1. Try creating a set of Boolean indicators: 1 if the spending is > 0, 0 otherwise. Often the flags are more predictive than the interval-scaled values. Depending on the specific case, pay close attention to the univariate correlation of these binary flags with the target variable. Some binary flags can suddenly turn out to be so 'relevant' to the target that they keep other variables from contributing to the model.

2. Explore the possibility of combining the individual sparsely populated spending categories. Sometimes the populated percentage is low because the modeler has broken the categories down too far. Try to 'prune' the categories back a bit. You can try the same with the Boolean indicators. You can be pretty creative with AND and OR in this exercise.

3. I know you are building logistic regression models. If you have access to decision trees, test the raw (not imputed) variables with them to get an idea of how informative they are before your imputation. This can be done in parallel with, or before, 1 and 2 above. Sometimes combining the raw variables as they are makes more sense, especially if you need to explain your approach to business end users. Sometimes combining only the 'significant' or informative ones makes more sense.
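Ideas 1 and 2 can be sketched roughly as follows in plain Python. This is a sketch under assumptions, not a prescribed method: the column names (`cd_balance`, `mutual_fund_balance`) and the flag names are hypothetical, and `None` again stands in for a missing value.

```python
# Hypothetical rows with two sparsely populated spending columns.
rows = [
    {"cd_balance": None, "mutual_fund_balance": 1200.0},
    {"cd_balance": 5000.0, "mutual_fund_balance": None},
    {"cd_balance": None, "mutual_fund_balance": None},
    {"cd_balance": 250.0, "mutual_fund_balance": 800.0},
]

def has_value(row, column):
    """Return 1 if the column is populated and greater than zero, else 0."""
    value = row[column]
    return int(value is not None and value > 0)

def add_flags(row):
    """Add per-column indicators plus OR / AND combinations of them."""
    cd = has_value(row, "cd_balance")
    mf = has_value(row, "mutual_fund_balance")
    return {
        **row,
        "cd_flag": cd,
        "mf_flag": mf,
        "any_flag": cd | mf,   # OR: holds at least one of the products
        "both_flag": cd & mf,  # AND: holds both products
    }

flagged = [add_flags(r) for r in rows]
```

The combined `any_flag` is populated more often than either individual flag, which is the point of pruning the categories back.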

 

Best Regards

Jason Xin
