From what I have seen using Enterprise Miner (EM), the binning or grouping is done so as to maximize the Gini ratio or IV. The problem is that, although it is automated, you still have to handle the binning manually and merge or split bins, because otherwise the event rate doesn't make sense and will certainly cause problems when you check the estimates of your logistic regression (they will have the wrong signs).
Of course, the more bins, the higher the Gini ratio, but believe me, the regression signs will go bananas, so a large number of bins is quite difficult to handle.
I will try to write some code that keeps the event rate (or WOE, if preferred) monotonically increasing or decreasing, and play with it interactively just as in EM, because I don't have EM available now. After that, the Gini calculation is handled by SAS code.
I was reading the book Data Preparation for Data Mining Using SAS by Mamdouh Refaat (2006), which has an optimal binning macro, but believe me, it is too complicated to read through and always gives me errors when I try to apply it. Long story short, I believe there are smaller and more efficient ways to do it, and that is what I was looking for with my initial enquiry.
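The monotonic event rate idea described above can be sketched in a few lines. This is only an illustration in Python, not the EM algorithm and not the macro from the book: it assumes the fine bins are already ordered by the predictor and are given as (events, total) pairs, and it pools adjacent bins whenever their event rates break the requested order. The function name merge_to_monotonic is made up for this example.

```python
from typing import List, Tuple


def merge_to_monotonic(
    bins: List[Tuple[int, int]], increasing: bool = True
) -> List[Tuple[int, int]]:
    """Merge adjacent bins until the event rate is strictly monotonic.

    bins: (events, total) per bin, ordered by the predictor's value.
    Every bin is assumed to have total > 0.
    """
    merged: List[List[int]] = []
    for events, total in bins:
        merged.append([events, total])
        # Pool backwards while the last two bins violate the requested order.
        while len(merged) > 1:
            e1, n1 = merged[-2]
            e2, n2 = merged[-1]
            # Compare e1/n1 with e2/n2 by cross-multiplication (no division).
            if increasing:
                violates = (e1 * n2) >= (e2 * n1)
            else:
                violates = (e1 * n2) <= (e2 * n1)
            if not violates:
                break
            merged[-2:] = [[e1 + e2, n1 + n2]]
    return [(e, n) for e, n in merged]


# Event rates 5%, 12%, 9%, 20%: the 12% and 9% bins get pooled into 10.5%.
fine_bins = [(5, 100), (12, 100), (9, 100), (20, 100)]
print(merge_to_monotonic(fine_bins))  # -> [(5, 100), (21, 200), (20, 100)]
```

This only enforces the ordering. It does not try to maximize Gini or IV, so the merged bins would still need the kind of manual review discussed in this thread.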
1. Regarding conflicting signs: while recommending PCA or variable clustering may seem 'out of place' in a binning discussion, you may want to run variable clustering to see which variables go 'together'. That knowledge will help guide you when variable signs conflict.
2. If you are into binning, manual tweaking appears inevitable. The exception is when you are building a large number of scorecards and don't care, or don't have time to care, about 80% of them. In that case you set the criteria, and as long as the results stay within bounds you don't tweak them. That is one of several trends among EM users.
3. If you are still using EM, you may also want to let final model performance guide you. One of my favorite EM nodes is Model Comparison. In practice, copying and pasting a modeling flow is much easier than manual tweaking. Go ahead, be bold: lay out many model paths, name the nodes to reflect your subtle changes, and see what the performance has to say. More critically, the model target definition should be checked by the same performance-driven practice, since the correctness of the target definition is about the 'quality of the war', whereas binning is local and tactical.
Jason Xin
You need to give tech support a call. We got a fix for PROC OPTBIN. Long story short, our server was not fast enough for the data we handle. They gave us a hotfix to omit some max time option that was stopping the binning process prematurely.
If EM is not creating bins where it should, it sounds like you might want to remove a couple of constraints in the "advanced constrained optimal" options.
But since you have put so much time into this, just let SAS figure out what you need to tweak. That's the real value of your SAS license. My two cents.
I agree with the three points you make, with one addition to point 1. Based on what I have encountered before, I believe you can get the same sense that variable clustering offers by checking the IV or Gini ratio of the variables: variables with similar values on those metrics usually tend to be the conflicting ones. The beauty of logistic regression, though, is that it will eliminate one of them during the backward procedure.
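For readers who want to compute the IV mentioned above outside EM, here is a minimal Python sketch. It assumes each bin is given as an (events, non_events) pair with no zero counts, and it uses the common credit-scoring convention WOE = ln(share of non-events / share of events). Some references flip the ratio, which changes the sign of WOE but not the IV. The function name woe_iv is made up for this example.

```python
import math
from typing import List, Tuple


def woe_iv(bins: List[Tuple[int, int]]) -> Tuple[List[float], float]:
    """Return (WOE per bin, total IV) for bins given as (events, non_events)."""
    total_events = sum(e for e, _ in bins)
    total_non_events = sum(n for _, n in bins)
    woes: List[float] = []
    iv = 0.0
    for events, non_events in bins:
        if events <= 0 or non_events <= 0:
            # A zero count makes the logarithm undefined; merge such bins first.
            raise ValueError("every bin needs at least one event and one non-event")
        pct_events = events / total_events
        pct_non_events = non_events / total_non_events
        woe = math.log(pct_non_events / pct_events)
        woes.append(woe)
        iv += (pct_non_events - pct_events) * woe
    return woes, iv


# Two bins with event rates of 10% and 30%.
woes, iv = woe_iv([(10, 90), (30, 70)])
print([round(w, 4) for w in woes], round(iv, 4))
```

Running this per variable gives one IV figure each, which can then be compared across variables as described above.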
You just said a very important word: SENSE. Optimal binning should never be taken to mean a single pass. Thanks.
Jason Xin
Hi Chemicalab,
I also need the cutoff points for getting optimal bins. Please let me know if you get the answer. My email: chaitu_chnk@yahoo.com
What I sometimes do is split the variable into 8 bins so that each bin contains an equal share of the positive targets in the dataset. It is very simple to develop and it gives good, logical results. I'm not sure it will perform as well when modelling rare events.
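A rough Python sketch of the equal-share approach described above, assuming a binary target coded 0/1. The cutoffs are taken from the sorted values of the positive cases, so each bin receives about the same number of positives. Ties in the variable can produce fewer than the requested number of bins. The names equal_event_cutoffs and assign_bin are made up for this example.

```python
from bisect import bisect_left
from typing import List, Sequence


def equal_event_cutoffs(
    values: Sequence[float], targets: Sequence[int], n_bins: int = 8
) -> List[float]:
    """Upper cutoffs such that each bin holds about 1/n_bins of the positives."""
    positives = sorted(v for v, t in zip(values, targets) if t == 1)
    if not positives:
        raise ValueError("no positive targets in the data")
    cutoffs: List[float] = []
    for k in range(1, n_bins):
        # Index of the last positive case that belongs to bin k.
        idx = (k * len(positives)) // n_bins - 1
        if idx < 0:
            continue  # fewer positives than bins: skip empty leading bins
        cut = positives[idx]
        if not cutoffs or cut > cutoffs[-1]:
            cutoffs.append(cut)  # drop duplicate cutoffs caused by ties
    return cutoffs


def assign_bin(x: float, cutoffs: Sequence[float]) -> int:
    """Bin 0 is x <= cutoffs[0]; the last bin is x > cutoffs[-1]."""
    return bisect_left(cutoffs, x)


# Toy data: values 1..100, every fifth value is a positive (20 positives).
xs = list(range(1, 101))
ys = [1 if x % 5 == 0 else 0 for x in xs]
cuts = equal_event_cutoffs(xs, ys, n_bins=4)
print(cuts)  # -> [25, 50, 75]
```

With rare events the number of positives per bin gets small, which is probably where the concern raised above comes from.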
I wonder if the CLP procedure might be applicable. This procedure includes a PACK constraint for placing weighted items in bins, subject to capacity constraints. Here is the documentation: SAS/OR(R) 13.1 User's Guide: Constraint Programming
Thanks!
Lindsey