10-09-2015 04:40 PM
11-10-2015 03:54 PM
First of all, thank you for your interest in the SAS community and SAS products. My name is Jason Xin; I am a solution architect at SAS, focused mainly on analytics.
Imputing missing values with zeros in what I would call spending categories, where non-zero values are sparsely populated, is proper from a pure technique standpoint. It is also true to the facts: the values are missing because those customers did not spend.
Here are several ideas I would like to share.
1. Try creating a set of Boolean indicators: 1 if the spending is > 0, 0 otherwise. Such flags are often more predictive than the interval-scaled amounts. Depending on the specific case, pay close attention to the univariate correlation of each binary flag with the target variable. A binary flag can suddenly turn out to be so 'relevant' to the target that it blocks other variables from accessing the target.
2. Explore the possibilities for combining the individual sparsely spent categories. Sometimes the populated % is low because the modeler broke the categories down too finely. Try to 'prune' the categories back a bit. You can try the same with the Boolean indicators, and you can be pretty creative in using AND and OR in this exercise.
3. I know you are building logistic regression models. If you have access to decision trees, test the raw (not imputed) variables with them to get some idea of how informative the variables are before your imputation. This can be done in parallel with, or before, 1 and 2 above. Sometimes combining the raw variables as they are makes more sense, especially if you need to explain your approach to end business users. Sometimes combining only the 'significant' or informative ones makes more sense.
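To make ideas 1 and 2 concrete, here is a minimal sketch in plain Python rather than SAS. Everything in it is an assumption for illustration only: the column names `spend_travel` and `spend_dining`, the toy rows, and the helper functions `flag` and `pearson` are all hypothetical. It builds a spent/did-not-spend flag per category, combines the two sparse flags with OR, and compares each flag's univariate correlation with a binary target.

```python
from math import sqrt

# Hypothetical toy data: zero-imputed spend per customer plus a binary target.
rows = [
    {"spend_travel": 0.0,   "spend_dining": 12.5, "target": 1},
    {"spend_travel": 0.0,   "spend_dining": 0.0,  "target": 0},
    {"spend_travel": 250.0, "spend_dining": 0.0,  "target": 1},
    {"spend_travel": 0.0,   "spend_dining": 0.0,  "target": 0},
    {"spend_travel": 0.0,   "spend_dining": 40.0, "target": 1},
    {"spend_travel": 0.0,   "spend_dining": 0.0,  "target": 1},
]

def flag(value):
    """Boolean indicator: 1 if the customer spent anything, 0 otherwise."""
    return 1 if value > 0 else 0

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

target = [r["target"] for r in rows]

# Idea 1: one flag per sparse spending category.
travel_flag = [flag(r["spend_travel"]) for r in rows]
dining_flag = [flag(r["spend_dining"]) for r in rows]

# Idea 2: "prune back" by OR-ing sparse flags into one broader indicator.
any_flag = [t | d for t, d in zip(travel_flag, dining_flag)]

# Univariate correlation of each flag with the target.
correlations = {
    "travel_flag": pearson(travel_flag, target),
    "dining_flag": pearson(dining_flag, target),
    "any_flag": pearson(any_flag, target),
}

for name, r in correlations.items():
    print(f"{name}: {r:.3f}")
```

On this made-up data the combined flag correlates more strongly with the target than either individual flag, which is the kind of effect worth checking for. A flag whose correlation is far higher than everything else deserves a second look before it goes into the logistic regression.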