## Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Yes, I am trying to use PROC OPTMODEL in SAS 9.4 on a Linux grid to do monotonic supervised optimal binning of an ordinal predictor variable with a binary target (although continuous targets can be used, too).  I have an implementation of a (seemingly) correct formulation, but so far it is consistently outperformed by pure brute force exhaustive enumeration, so I thought this would be a good opportunity to appeal to the omniscience of the SAS® community hive mind to see whether any concrete improvements can be found.

*********************************************************************************

First some BACKGROUND:

(Apologies for the length, you can skip this and go to MY FORMULATION below if you just want to get to the heart of the problem.)

For those unfamiliar with the term, binning of an interval variable entails partitioning its range into an exhaustive, disjoint, discrete collection of subintervals.  For example, if the range of x is [0, 10], then one possible set of three bins would be x1 = [0, 2.8], x2 = (2.8, 6.3], x3 = (6.3, 10].  Binning is also referred to as bucketing, classing, discretizing, grouping, or partitioning.  The two most common forms of unsupervised binning are equal width and equal frequency (based on a data sample).  An equal width example for the x variable above would be x1 = [0, 2.5], x2 = (2.5, 5], x3 = (5, 7.5], x4 = (7.5, 10].  There are other kinds of unsupervised binning methods, too.  SAS/STAT has PROC HPBIN to do efficient unsupervised binning.

In supervised binning, the bins are chosen to magnify the relationship between the variable under consideration and a target variable.  One of the very first such binning algorithms was described by Walter Fisher in the Journal of the American Statistical Association in 1958, “On Grouping for Maximum Homogeneity,” as an attempt to minimize the within group variances of an interval target.  Many more discretization algorithms have been devised; a useful summary can be found in the 2013 IEEE Transactions on Knowledge and Data Engineering article, “A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning.”  Some of these techniques have been referred to as “optimal binning,” although few of them are optimal in the strict sense of guaranteeing the maximum or minimum value of a computable objective function.  The Transform Variables node of SAS® Enterprise Miner™ has an “optimal binning transformation” that uses (non-optimal) decision trees.  Fisher used a truly optimal dynamic programming procedure that has been rediscovered and published in prestigious journals a handful of times over the years.  A frequent contributor to this forum, @RobPratt, joined this exalted company of “Fisher propagators” in 2016 when he cobbled together a version of the dynamic programming binning scheme as a response to the communities.sas.com thread “Finding the optimal cut-off for each bucket.”  (There are a couple of R packages with implementations of Fisher’s algorithm, classInt and cartography, as well as the Excel add-in xlstat.)  Simply put, the main principle of the dynamic programming algorithm is that if you have the best set of bins over the entire range, then the subset of bins over any sub-range is the best set of bins for that sub-range; otherwise, you could swap in a better set of bins for the sub-range and thereby improve the binning of the whole range.  This allows the algorithm to build the set of bins like a proof by induction: if you have the best binnings of points 1,…,k for every k ≤ n, then you can construct the best binning for 1,…,n+1.
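For readers who want to see the induction step spelled out, here is a minimal Python sketch of a Fisher-style dynamic program (Python rather than SAS, purely for illustration). It assumes the objective is the total within-bin sum of squared deviations and that the values are already sorted by the predictor; the function name `fisher_dp` is hypothetical and is not taken from any of the packages mentioned above.

```python
from itertools import accumulate

def fisher_dp(y, k):
    """Split the ordered values y into k contiguous bins that minimize the
    total within-bin sum of squared deviations.
    Returns (cost, bins); bins is a list of (start, end) pairs, end exclusive."""
    n = len(y)
    s1 = [0.0] + list(accumulate(y))                 # prefix sums
    s2 = [0.0] + list(accumulate(v * v for v in y))  # prefix sums of squares

    def sse(a, b):
        # within-bin sum of squares for y[a:b], computed in O(1)
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    # best[g][m] = minimal cost of splitting y[:m] into g bins
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for m in range(g, n + 1):
            for a in range(g - 1, m):  # the last bin is y[a:m]
                cost = best[g - 1][a] + sse(a, m)
                if cost < best[g][m]:
                    best[g][m], back[g][m] = cost, a

    # walk the back-pointers to recover the bins
    bins, m = [], n
    for g in range(k, 0, -1):
        a = back[g][m]
        bins.append((a, m))
        m = a
    return best[k][n], bins[::-1]
```

The triple loop is O(k·n²), which is why this approach scales well when no monotonicity constraint is required.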

There is a constituency for binning in the credit scorecard modeling arena who have a particular requirement for the bins they produce: monotonicity of target response.  For binary event predictive models, this is usually defined in terms of the Weight of Evidence (WoE), which, within each bin, is just the log odds of the target variable for the bin minus the log odds of the entire data set, i.e.:

WoE(bin) = log(# bin events / # bin non-events) - log(# total events / # total non-events)

If the bins are numbered consecutively from left (smallest predictor value) to right (largest predictor value), then increasing (decreasing, resp.) monotonicity means that the WoE of bin j is ≤ (≥, resp.) the WoE of bin (j+1).  Note that monotonicity of WoE and monotonicity of event rate are exactly equivalent whenever WoE is defined; event rate monotonicity also can include bins with all events or all non-events, although WoE will be undefined for such bins.  But if you require monotonicity, you can see that the dynamic programming scheme won’t work: if you obtain the best monotonic binning for points 1,…,n, there’s no guarantee you can extend it monotonically to (n+1) and beyond.  Unfortunately, the long list of binning algorithms in the IEEE article, optimal or not, won’t help you out; there’s not even a passing mention of the monotonicity constraint.  What to do?
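As a concrete illustration of the WoE definition and the monotonicity check, here is a small Python sketch. The helper names `woe` and `is_monotonic` are mine, and the (events, non-events) counts are the four bins of the SGF2018 solution quoted later in this post.

```python
import math

def woe(events, non_events, total_events, total_non_events):
    """Weight of Evidence of one bin: bin log odds minus overall log odds."""
    return math.log(events / non_events) - math.log(total_events / total_non_events)

def is_monotonic(values, increasing=True):
    """True if consecutive values never move in the wrong direction."""
    pairs = zip(values, values[1:])
    return all((a <= b) if increasing else (a >= b) for a, b in pairs)

# (events, non-events) per bin, ordered from smallest to largest predictor value
bins = [(513, 186), (96, 43), (37, 23), (54, 48)]
total_events = sum(e for e, _ in bins)       # 700
total_non_events = sum(z for _, z in bins)   # 300

woes = [woe(e, z, total_events, total_non_events) for e, z in bins]
rates = [e / (e + z) for e, z in bins]
```

Both `woes` and `rates` are decreasing here, in line with the equivalence of WoE monotonicity and event rate monotonicity whenever WoE is defined.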

If we just concentrate on achieving monotonicity, we can use isotonic regression, for which, if x[i] are the predictor values and y[i] are the response values, 1 ≤ i ≤ N, we attempt to find a transformation f(x[i]) that minimizes the sum from 1 to N of (y[i] – f(x[i]))**2, where f(x[i]) ≤ f(x[i+1]) (or f(x[i]) ≥ f(x[i+1]), resp) for increasing (decreasing, resp.) monotonicity.  This transformation is optimal in the least sum of squares sense by design, and there are reasonably efficient algorithms to compute it.  If the response variable, y[i], is already correspondingly monotonic, then f(x[i]) = y[i], and the transformation is perfect replication.  But for most binary response variables, this will not be the case.  In fact, as Wensui Liu has demonstrated in his wonderful blog on statistical computing, cited in the communities.sas.com thread “Optimal monotonic binning,” the isotonic transform will consist of piecewise constant subintervals, and in each subinterval the values of f(x[i]) will equal the average event rate over that subinterval; in other words, bins.  So, isotonic regression produces optimal monotonic binning!  The “catch” is that you have absolutely no control over the number of bins or their sizes.  Can you acquire control over the number of bins and their sizes and still retain optimality and monotonicity?
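One of the reasonably efficient algorithms alluded to above is pool-adjacent-violators (PAVA). The following is a minimal Python sketch of it for the increasing case, not the code from the cited blog; the function name `pava` and the optional weights argument are my own choices.

```python
def pava(y, w=None):
    """Pool-adjacent-violators: weighted least-squares non-decreasing fit to y,
    where y is ordered by the predictor."""
    w = [1.0] * len(y) if w is None else list(w)
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # pool the last two blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)  # piecewise constant: one value per block
    return fit
```

For a 0/1 target, each pooled block's fitted value is the event rate of that block, which is exactly the "bins" behaviour described above. A decreasing fit can be obtained by negating `y`, fitting, and negating the result.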

Credit Scoring for SAS® Enterprise Miner™ is an add-on product aimed specifically at credit scorecard modelers, and its “Interactive Grouping” node has a method called “Constrained Optimized Binning” that appears inspired by isotonic regression, and is designed to attain monotonicity and optimality within additional user-specified bounds.  (Note that this node is distinct from the general SAS EM “Interactive Binning” node that does non-monotonic, non-optimal binning with user-specified bounds.)  But there’s still a catch (besides the additional cost).  From the patent application description, the objective is to minimize the sum of absolute differences between individual WoE values and their associated decision variables, which are proxies for the bin WoE.  This cannot be done directly with pointwise data, for which WoE is undefined; the data must be pre-aggregated into “fine-grained bins” to an extent that each “fine-grained bin” has at least one event and at least one non-event.  Differences in pre-aggregation can affect the final optimality, but the patent description doesn’t include any details on the pre-aggregation part.

*********************************************************************************

MY WOEFUL BINNING FORMULATION:

This is a pure BLIP (Binary Linear Integer Program) formulation, unlike the SAS EM method, which is a full MILP (Mixed Integer Linear Program).  In this formulation, every possible bin gets its own decision variable.  Since bins are just sub-intervals, fully specified by their two endpoints, this means the number of variables (columns) is O(N**2), where N is the number of data points.  This is a big practical disadvantage.  I think it should be less of a disadvantage than it has turned out to be in practice, but overcoming it would require a more clever formulation.

For the objective function, just compute the appropriate metric, m[i,j], for each bin (i,j), and take the sum over all (i,j) of m[i,j]*v[i,j], where v[i,j] is the corresponding binary decision variable for bin (i,j):
v[i,j] = 1 if (i,j) is chosen as one of the bins,
v[i,j] = 0 if (i,j) is not chosen as one of the bins.

Some metric examples are:

Chi-square: m[i,j] = ((N*e[i,j]) - (E*n[i,j]))**2 / (N*e[i,j]*(n[i,j] - e[i,j]))

Information Value: m[i,j] = ((e[i,j] / E) - ((n[i,j] - e[i,j]) / (N - E))) * log((e[i,j] / E) / ((n[i,j] - e[i,j]) / (N - E)))

Mean Sum of Squares: m[i,j] = e[i,j]*(n[i,j] - e[i,j]) / (N*n[i,j])

where n[i,j] is the number of points in (i,j), e[i,j] is the number of events in (i,j), N is the total number of points, and E is the total number of events.  Chi-square and Information Value should be maximized over the chosen bins; Mean Sum of Squares should be minimized over the chosen bins.
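Here is a small Python sketch of the Information Value and Mean Sum of Squares terms, using the same n, e, N, E notation. The function name `bin_metrics` is hypothetical. As a sanity check, the first bin of the SGF2018 solution shown later in this post (699 points, 513 events, out of 1,000 points and 700 events) should give roughly iv = 0.0189 and sos = 0.1365.

```python
import math

def bin_metrics(n, e, N, E):
    """Per-bin Information Value and Mean Sum of Squares terms.
    n, e: points and events in the bin; N, E: points and events overall.
    The IV term needs 0 < e < n, otherwise the log is undefined."""
    event_share = e / E                  # share of all events in this bin
    non_event_share = (n - e) / (N - E)  # share of all non-events in this bin
    iv = (event_share - non_event_share) * math.log(event_share / non_event_share)
    sos = e * (n - e) / (N * n)
    return iv, sos

iv_1, sos_1 = bin_metrics(699, 513, 1000, 700)
```

Because each metric is a sum of per-bin terms, it can be precomputed once per candidate bin and used as a fixed objective coefficient.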

The constraints are:
1. Every point should be in exactly one bin.  This can be expressed in different ways, but the number of such constraints is O(N).
2. Upper bound on number of bins, one constraint if desired: the sum of all v[i,j] ≤ upper bound.
3. Lower bound on number of bins, one constraint if desired: the sum of all v[i,j] ≥ lower bound.
4. Upper and / or lower bounds on number of points per bin do not require constraints; just eliminate the decision variables for bins that are outside of the bounds.
5. Event rate monotonicity.  At any bin endpoint p, the sum over i of ((e[i,p]*v[i,p]) / n[i,p]) must be ≤ the sum over j of ((e[(p+1),j]*v[(p+1),j]) / n[(p+1),j]) (for increasing monotonicity; reverse the inequality for decreasing).  You can also require the difference to be at least a positive constant, to force neighboring bins to behave differently.  The number of such constraints is O(N).  On top of the monotonicity constraints, I have experimented with more complex constraints to require event rate differences between neighboring bins to be statistically significant; I needed O(N**2) such constraints in my formulation.
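To make the column structure and constraints 1, 4 and 5 more tangible, here is a hedged Python sketch. It is not the attached OPTMODEL code; `candidate_bins` and `is_feasible` are hypothetical helpers that enumerate the (i,j) columns over distinct predictor values and check whether a given selection of bins is feasible.

```python
def candidate_bins(counts, min_size=1, max_size=None):
    """Enumerate bins (i, j), 1-based and inclusive over distinct predictor
    values, dropping bins outside the size bounds (constraint 4).
    counts[k] = (points, events) for the k-th distinct value."""
    n_vals = len(counts)
    cols = {}
    for i in range(1, n_vals + 1):
        n = e = 0
        for j in range(i, n_vals + 1):
            n += counts[j - 1][0]
            e += counts[j - 1][1]
            if n >= min_size and (max_size is None or n <= max_size):
                cols[(i, j)] = (n, e)
    return cols

def is_feasible(chosen, cols, n_vals, increasing=True):
    """Check constraints 1 and 5 for a chosen list of bins."""
    chosen = sorted(chosen)
    nxt = 1
    for i, j in chosen:
        # constraint 1: bins must tile 1..n_vals with no gaps or overlaps
        if i != nxt or (i, j) not in cols:
            return False
        nxt = j + 1
    if nxt != n_vals + 1:
        return False
    # constraint 5: event rates are monotone across adjacent bins
    rates = [cols[b][1] / cols[b][0] for b in chosen]
    sign = 1 if increasing else -1
    return all(sign * (b - a) >= 0 for a, b in zip(rates, rates[1:]))
```

With 923 distinct values and no size bounds, `candidate_bins` yields 923*924/2 = 426,426 columns, which is the O(N**2) growth mentioned above.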

I did some experiments using a well-known, publicly available “German Credit” data set with 1,000 points.  The binary target is called Creditability, with 700 events and 300 non-events, and the continuous predictor is Credit_Amount, with 923 distinct values ranging from 250 to 18,424.  The attached code to read in the data (_import_GRIDWORK.GC_CREDIT_AMOUNT_AGG01IC.sas) gives the results already aggregated to distinct predictor values (nv_predictor_value).  The number of points for each distinct predictor value is nv_w_ind_all, and the number of events for each distinct predictor value is nv_w_ind_one.  I chose this data set because there was a paper about binning at SGF2018, called “Get Better Weight of Evidence for Scorecards Using a Genetic Algorithm.”  Their criteria for good binning are a bit vague: they think the difference of WoE between groups should be as large as possible and the WoE values should look linear.  Here is what they came up with, four monotonically decreasing bins:

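| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 631 | 699 | 513 | 186 | 0.7339 | 0.1365 | 0.0189 | 0.1672 | 0.6629 | 0.1237 | . | . | . |
| 2 | 632 | 764 | 139 | 96 | 43 | 0.6907 | 0.0297 | 0.0003 | -0.0442 | 0.1408 | 0.0292 | -0.0433 | 0.0835 | -0.2114 |
| 3 | 765 | 821 | 60 | 37 | 23 | 0.6167 | 0.0142 | 0.0089 | -0.3719 | 0.0654 | 0.0132 | -0.0740 | 0.1451 | -0.3277 |
| 4 | 822 | 923 | 102 | 54 | 48 | 0.5294 | 0.0254 | 0.0604 | -0.7295 | 0.1155 | 0.0254 | -0.0873 | 0.1566 | -0.3576 |
| Total | | | 1,000 | 700 | 300 | | 0.2058 | 0.0884 | | 0.9845 | 0.1915 | | | |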

Note that the minimum bin size (wsize) is 60 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0433, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.2114.  To measure the linearity, the authors regressed WoE against Obs and found adjusted R**2 of 0.9814.  The overall information value (iv) is 0.0884.
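The linearity measure is easy to reproduce. Here is a minimal Python sketch of the adjusted R**2 for a simple linear regression, applied to the four WoE values in the table above; the function name `adjusted_r2` is mine. In a one-predictor regression, R**2 is the same whichever variable is treated as the response, so regressing WoE on Obs or Obs on WoE gives the same figure.

```python
def adjusted_r2(x, y):
    """Adjusted R-square of a simple (one-predictor) linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r2 = sxy * sxy / (sxx * syy)              # squared correlation
    return 1 - (1 - r2) * (n - 1) / (n - 2)   # one predictor plus intercept

obs = [1, 2, 3, 4]
woe_values = [0.1672, -0.0442, -0.3719, -0.7295]
adj = adjusted_r2(obs, woe_values)  # about 0.981
```

Note that this treats all bins equally, regardless of how many points each bin contains.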

Using the BLIP formulation with the objective of maximizing information value (iv), specifying four bins, with a minimum bin size of 60 and a minimum absolute event rate difference between adjacent bins of 0.08, I found the following (attachment _122_gco099_fullstimer_optmodel_milpsolve_max_iv_clean_4-g_60-minsz_no-maxsz_0.08-mindif_desc_impure_modified_metrics.sas):

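| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 654 | 724 | 537 | 187 | 0.7417 | 0.1387 | 0.0299 | 0.2076 | 0.6771 | 0.1259 | . | . | . |
| 2 | 655 | 782 | 133 | 87 | 46 | 0.6541 | 0.0301 | 0.0061 | -0.2100 | 0.1404 | 0.0296 | -0.0876 | 0.0869 | -0.4176 |
| 3 | 783 | 862 | 82 | 47 | 35 | 0.5732 | 0.0201 | 0.0274 | -0.5525 | 0.0916 | 0.0191 | -0.0810 | 0.1342 | -0.3425 |
| 4 | 863 | 923 | 61 | 29 | 32 | 0.4754 | 0.0152 | 0.0617 | -0.9457 | 0.0691 | 0.0152 | -0.0978 | 0.1648 | -0.3932 |
| Total | | | 1,000 | 700 | 300 | | 0.2041 | 0.1250 | | 0.9782 | 0.1897 | | | |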

Note that the minimum bin size (wsize) is 61 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0810, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.3425.  To measure the linearity, regressing WoE against Obs gives adjusted R**2 of 0.9980.  The overall information value (iv) is 0.1250.  It took much longer to run the optimization than to find the answer by exhaustive enumeration.

Using brute force exhaustive enumeration, it is possible to find a set of four bins whose minimum event rate difference is 0.08573 and minimum WoE difference is 0.3623.  Its adjusted R**2 is 0.9991, and its overall information value (iv) is 0.1222.  I was able to modify my BLIP formulation to formulate a MILP to find the binning with the maximum possible minimum event rate difference, but the MILP ran out of memory and would not converge on this data.
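For anyone curious what the exhaustive enumeration looks like, here is a simplified Python sketch, not my production code. It assumes decreasing event rates and maximum total Information Value as the objective; `best_binning` and its parameters are illustrative names only.

```python
import math
from itertools import combinations

def best_binning(counts, k, min_size=1, min_diff=0.0):
    """Exhaustively search k-bin, monotonically decreasing binnings that
    maximize total Information Value. counts[i] = (points, events) for the
    i-th distinct predictor value. Returns (iv, edges); edges is None if
    no binning is feasible."""
    m = len(counts)
    N = sum(n for n, _ in counts)
    E = sum(e for _, e in counts)
    # prefix sums so that any bin's totals cost O(1)
    cum_n, cum_e = [0], [0]
    for n, e in counts:
        cum_n.append(cum_n[-1] + n)
        cum_e.append(cum_e[-1] + e)
    best = (-math.inf, None)
    for cuts in combinations(range(1, m), k - 1):
        edges = (0,) + cuts + (m,)
        total_iv, prev_rate, ok = 0.0, None, True
        for a, b in zip(edges, edges[1:]):
            n, e = cum_n[b] - cum_n[a], cum_e[b] - cum_e[a]
            # need enough points, and both classes so that WoE is defined
            if n < min_size or e == 0 or e == n:
                ok = False
                break
            rate = e / n
            # decreasing event rate, by at least min_diff between neighbors
            if prev_rate is not None and prev_rate - rate < min_diff:
                ok = False
                break
            prev_rate = rate
            ev_share, non_share = e / E, (n - e) / (N - E)
            total_iv += (ev_share - non_share) * math.log(ev_share / non_share)
        if ok and total_iv > best[0]:
            best = (total_iv, edges)
    return best
```

Every candidate is scored in O(k) thanks to the prefix sums, so the run time is driven almost entirely by the number of cut-point combinations.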

I also ran the BLIP where I still used the monotonicity constraints, but removed the minimum event rate difference specification and added in the significant difference constraints.  If I removed the constraints on the number of bins, it ran out of memory and did not converge.  I was able to get it to run to completion with an upper bound of seven bins, no upper or lower limit on the number of points per bin.  The solution has three bins (attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_1-7-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure_.sas):

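| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 668 | 739 | 550 | 189 | 0.7443 | 0.1407 | 0.0344 | 0.2209 | 0.6878 | 0.1278 | . | . | . |
| 2 | 669 | 848 | 186 | 116 | 70 | 0.6237 | 0.0437 | 0.0231 | -0.3422 | 0.2017 | 0.0422 | -0.1206 | 0.0764 | -0.5631 |
| 3 | 849 | 923 | 75 | 34 | 41 | 0.4533 | 0.0186 | 0.0911 | -1.0345 | 0.0846 | 0.0186 | -0.1703 | 0.1324 | -0.6923 |
| Total | | | 1,000 | 700 | 300 | | 0.2029 | 0.1487 | | 0.9741 | 0.1886 | | | |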

I ran many other experiments, but most of the time the optimization would not converge, even when the number of possible solutions was only a few million.  When the optimization does converge, it takes a long time to do so.  If you look at my optimization code and see some ways to improve it, please post!  Thanks!

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hey. That is my paper. The WoE criteria come from Naqi's book.

This optimization problem is complicated, so I solved it with a GA (genetic algorithm); I am not sure whether SAS/OR can solve it.

You could also try:

%let var=duration;
%let group=6 ;
%let n_iter=100;

If the resulting line is not linear, reduce the number of groups:

%let var=duration;
%let group=5 ;
%let n_iter=100;

And so on....

%let var=duration;
%let group=4 ;
%let n_iter=100;

until you get a straight line.

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi @Ksharp !

Thank you so much for responding.  I enjoyed your paper, you took a very interesting approach, I think there's a lot of potential for GAs.  Your solution is quite practical to fill your needs, and probably runs waaaaay faster than the BLIP / MILP formulations.  But if you're interested in tinkering further with your algorithm, now you also know that there are four bin solutions for the German Credit data that are more linear, have greater differences between bin WoE values, and higher overall information value than what you found.  There are only 130 million possible four bin solutions for the German Credit data, so if you continue to tune your approach, I'll bet you can improve it quickly, if you wish to do so.  Good luck!
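The 130 million figure is just a count of cut-point choices: with 923 distinct predictor values there are 922 gaps between neighbors, and a four-bin solution picks 3 of them. A short Python check, with a helper name of my own choosing and ignoring any bin size or monotonicity restrictions:

```python
import math

def count_binnings(num_values, num_bins):
    """Number of ways to cut num_values ordered distinct values into
    num_bins contiguous, non-empty bins: choose num_bins - 1 cut positions."""
    return math.comb(num_values - 1, num_bins - 1)

four_bin_total = count_binnings(923, 4)  # 130,204,840
```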

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hmmmm, let's see if I can correct some of my own mistakes.

Okay, for the last optimization I displayed, the one using the attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_1-7-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure_.sas, I claimed there was no upper or lower limit on the number of points per bin, but there is a lower bound of 60 points per bin.  And in both of the optimization programs, it also requires that each bin have at least one event and at least one non-event; otherwise it would try to choose an all-event bin at the left end, and an all-non-event bin on the right side (although the 60 point minimum size should overcome that problem for this data set).

I think the formulas I gave for information value (iv column in my tables) and mean sum of squares (sos column in my tables) are correct.  I didn't include chi-square in the tables, and it looks like I got the formula wrong.  I believe the correct chi-square formula is:

m[i,j] = ((N*e[i,j]) - (E*n[i,j]))**2 / (n[i,j]*E*(N - E))
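One way to double-check this formula is to confirm that its per-bin terms add up to the ordinary Pearson chi-square statistic of the bins-by-(event, non-event) table. Here is a hedged Python sketch of that check; the function names are mine, and the (n, e) pairs are the four bins from the SGF2018 solution in my original post.

```python
def chi_square_terms(bins, N, E):
    """Per-bin m[i,j] from the corrected formula; bins holds (n, e) pairs."""
    return [(N * e - E * n) ** 2 / (n * E * (N - E)) for n, e in bins]

def pearson_chi_square(bins, N, E):
    """Pearson statistic computed cell by cell from observed and expected counts."""
    stat = 0.0
    for n, e in bins:
        exp_events = n * E / N
        exp_non_events = n * (N - E) / N
        stat += (e - exp_events) ** 2 / exp_events
        stat += ((n - e) - exp_non_events) ** 2 / exp_non_events
    return stat

bins = [(699, 513), (139, 96), (60, 37), (102, 54)]
from_formula = sum(chi_square_terms(bins, 1000, 700))
from_table = pearson_chi_square(bins, 1000, 700)
```

The two totals agree up to floating-point error, because the event and non-event cells of a bin share the same squared deviation.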

Also, I didn't mention this, but one way I like to do monotonic binning that works for pretty large size data sets without problems is to start with the isotonic regression, and then follow that by running Fisher's dynamic programming algorithm on the results to constrain the number and sizes of the bins.  DP can't impose monotonicity, but once the data is already monotonic, like the results from the isotonic regression, it can't break that monotonicity.  The final results aren't guaranteed optimal, but they often agree with the optimal solution under the desired conditions.  If people are interested, I can post some results and discuss more.  Thanks!

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Sorry, I do not know OR well. @RobPratt might give you a hand.

According to Naqi's book, the good/bad distribution of each bin must be >= 0.05. You can see this in my IML code.

/*
proc import datafile='c:\temp\1--German Credit.xlsx' dbms=xlsx out=have replace;
run;
*/

%let var=duration;
%let group=6 ;
%let n_iter=100;

data temp;
set have;
run;

proc sql noprint;
floor(min(&var)),ceil(max(&var)) into : n_bad,: n_good,: min,: max
from temp;
quit;
proc sort data=temp;by &var ;run;
proc iml;
use temp(where=(&var is not missing));
close;

if countunique(x)=group-1 then do;

col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(&var ,cutpoints,'right');

if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
n_good=sum(temp='good');
good_dist=n_good/&n_good ;
else woe[i]=.;
end;

if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;

end;
else sse=999999;

end;
else sse=999999;

return (sse);
finish;

group=&group ;
bin=t(1:group);
woe=j(group,1,.);

encoding=j(2,group-1,&min );
encoding[2,]=&max ;

id=gasetup(2,group-1,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);

niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(&var ,cutpoints,'right');

create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;

print value[l = "Min Value:"] ;
call gaend(id);
quit;

data all_group;
set temp(keep=&var rename=(&var=b) where=(b is missing)) group;
run;
data all;
merge all_group temp;
rename b=group;
run;

title "Variable: &var" ;
proc sql;
create table woe_&var as
select group label=' ',
min(&var) as min label='Minimum',max(&var) as max label='Maximum',count(*) as n label='Frequency',
calculated n/(select count(*) from all) as per format=percent7.2 label='Proportion',
from all
group by group
order by woe;

create index group on woe_&var;

select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_&var ;

quit;
title ' ';

data fmt_&var ;
set cutpoints;
start=lag(cutpoints);
end=cutpoints;
if start=.M then hlo='IL';
if end=.I then hlo='IH';
if _n_ ne 1 then do;group+1;output;end;
run;
data fmt_&var(index=(group));
merge fmt_&var woe_&var(keep=group woe);
by group;
retain fmtname "&var" type 'I';
keep group fmtname type start end woe hlo;
rename woe=label;
label group=' ';
run;
proc format cntlin=fmt_&var library=z;
run;

/*
proc print data=woe_&var noobs label;run;
proc sgplot data=woe_&var;
reg y=group x=woe/degree=2 cli clm jitter;
run;
*/
proc sgplot data=woe_&var noautolegend;
vbar group/response=woe nostatlabel missing;
vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled
size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
FILLEDOUTLINEDMARKERS;
run;

ods select fitplot;
proc reg data=woe_&var;
model group=woe/ cli clm ;
quit;

proc copy in=work out=z;
select woe_: fmt_: ;
run;

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 654 | 724 | 537 | 187 | 0.7417 | 0.1387 | 0.0299 | 0.2076 | 0.6771 | 0.1259 | . | . | . |
| 2 | 655 | 782 | 133 | 87 | 46 | 0.6541 | 0.0301 | 0.0061 | -0.2100 | 0.1404 | 0.0296 | -0.0876 | 0.0869 | -0.4176 |
| 3 | 783 | 862 | 82 | 47 | 35 | 0.5732 | 0.0201 | 0.0274 | -0.5525 | 0.0916 | 0.0191 | -0.0810 | 0.1342 | -0.3425 |
| 4 | 863 | 923 | 61 | 29 | 32 | 0.4754 | 0.0152 | 0.0617 | -0.9457 | 0.0691 | 0.0152 | -0.0978 | 0.1648 | -0.3932 |
| Total | | | 1,000 | 700 | 300 | | 0.2041 | 0.1250 | | 0.9782 | 0.1897 | | | |

This does not satisfy the condition >= 0.05, which is from the book: Siddiqi, Naeem. 2006. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring.

But mine satisfies this criterion.

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi @Ksharp !

Thank you for following up.  I modified my optimization program to include your distribution criteria.  Here is a four bin solution which meets those requirements:

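| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 650 | 720 | 533 | 187 | 0.7403 | 0.1384 | 0.0276 | 0.2001 | 0.6751 | 0.1256 | . | . | . |
| 2 | 651 | 771 | 125 | 83 | 42 | 0.6640 | 0.0279 | 0.0036 | -0.1661 | 0.1306 | 0.0274 | -0.0763 | 0.0888 | -0.3662 |
| 3 | 772 | 831 | 63 | 37 | 26 | 0.5873 | 0.0153 | 0.0167 | -0.4945 | 0.0699 | 0.0143 | -0.0767 | 0.1471 | -0.3284 |
| 4 | 832 | 923 | 92 | 47 | 45 | 0.5109 | 0.0230 | 0.0666 | -0.8038 | 0.1044 | 0.0230 | -0.0764 | 0.1588 | -0.3093 |
| Total | | | 1,000 | 700 | 300 | | 0.2046 | 0.1145 | | 0.9800 | 0.1903 | | | |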

The smallest bin has 63 points, the smallest WoE difference is 0.3093, the smallest event rate difference is 0.0763, the information value is 0.1145, and the adjusted R-square is 0.9978.  Thanks!

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi @Ksharp !

I also reworked the final, significant difference optimization to use your good/bad distribution criteria (attachment _125_gco102_fullstimer_optmodel_milpsolve_max_iv_clean_1-5-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure-naqi_modified_metrics_.sas).  In this case I cut down from seven to five bin upper bound (so that it wouldn't take all day to run), but kept the minimum size at 60.  Once again, it chooses a three bin solution:

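| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 668 | 739 | 550 | 189 | 0.7443 | 0.1407 | 0.0344 | 0.2209 | 0.6878 | 0.1278 | . | . | . |
| 2 | 669 | 845 | 183 | 114 | 69 | 0.6230 | 0.0430 | 0.0232 | -0.3452 | 0.1985 | 0.0415 | -0.1213 | 0.0769 | -0.5661 |
| 3 | 846 | 923 | 78 | 36 | 42 | 0.4615 | 0.0194 | 0.0887 | -1.0015 | 0.0881 | 0.0194 | -0.1614 | 0.1310 | -0.6562 |
| Total | | | 1,000 | 700 | 300 | | 0.2030 | 0.1463 | | 0.9745 | 0.1887 | | | |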

Now the smallest bin has 78 points, the event rate differences are at least 0.1213 and statistically significant at 95% confidence, the WoE differences are at least 0.5661, and the IV is 0.1463.  While it's a bit silly to talk about linearity with only three points, the adjusted R-square is 0.9964.  This seems like a decently robust set of bins.  And I think it satisfies all your criteria.  Thanks!

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

How interesting.

As the number of bins goes up, the IV should get larger too.

But you got something different, which is interesting.

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi @Ksharp !

You appear to admire Siddiqi's book, and I must confess I have not read it, but there are two things you cite as having emanated from there that I would love to have explained to me.

First, the idea that the bin WoEs should have a linear trend.  I would think one would want to get maximum separation of events and non-events, which would be actualized by having bins with extreme values of WoE, positive and negative.  Linear means you're going to have one or two bins with WoE around zero, which is the same characteristic as the entire population, and thus uninformative.  Furthermore, even if you want linearity, why would you treat all the bins equally, as you do in your regression, when they have very different sizes?  Why wouldn't you weight them by size?

My second question is about the good / bad distribution 5% lower bound rule.  I can understand having a lower bound on the size of each whole bin, you want to ensure a chosen bin is not just a random fluctuation.  But, especially at the extreme ends, wouldn't you like to have a bin at one end that is heavily skewed to events, and a bin at the other end that leans mightily toward non-events?  If the bins are large enough overall, why do you care that they have at least 5% of both categories?  That's why I like the statistical significance condition, it pretty much bakes in the requirement of sufficient size without having to set it explicitly.

I would love to get your thoughts, or anyone else's, about these issues.

Thanks!

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi Top,

For your first question, I just check whether the WoE is linear; I do not really fit a regression model or a weighted regression.

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi @Ksharp !

Thank you for responding.  I asked the linearity question, because I saw the following in your paper:

1. (page 1)  "3) For the continuous variable (e.g. age), WOE should be monotonous increase or decrease, better is linear."

2. (page 2)  "For the continuous variable, since its WOE must be monotonous increase or decrease, so I fit a linear regression model, take WOE as x variable, group number (1 2 3 4 …) as y variable"

3. (page 4)  "proc reg data=woe_&var;
model group=woe/ cli clm ;
quit;"

Then I ran the same regression on the examples I generated.  But are you saying that you normally don't run the regression, you just look for linearity by eye?  Either way, what I'm really wondering is whether a "linear" solution, in which some of the bins would have WoE close to zero, is somehow preferable to a solution that does not look linear, but has all bins with either very positive or very negative WoE.  It seems to me that bins with WoE near zero don't do any better than random guessing for separating events from non-events.

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi Top,

"But are you saying that you normally don't run the regression, you just look for linearity by eye? "

Yeah. You can check linearity by eye; there is no need to run PROC REG.

"Either way, what I'm really wondering is whether a "linear" solution, in which some of the bins would have WoE close to zero, is somehow preferable to a solution that does not look linear, but has all bins with either very positive or very negative WoE.  "

No. It must look linear because of the assumption of the logistic model, which is under the GLM framework; all these LINEAR model better have linear relation between Y and X, that is why we bin to get linear WoE. You could bin it into a U or reverse-U shape, but that is not easy to explain.

"It seems to me that bins with WoE near zero don't do any better than random guessing for separating events from non-events."

"WoE near zero" bins don't have any predictive power, but that is the price you pay to build a scorecard model. You can't get away from it.

Thanks. That is just my opinion.

I hope SAS employee Sidd (the author of the book) could appear here and say something.

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

Hi @Ksharp !

Thank you for continuing the conversation.  From your comments, "all these LINEAR model better have linear relation between Y and X," we are certainly in agreement there.  And, of course, in the case of logistic regression, that linear relationship is specifically between the log odds of Y and the linear predictor X.  For binning, the linear predictor X is either: the original variable transformed to the bin WoE values (as a single DoF), or the set of indicator functions of the individual bins (for multiple DoF).  But that linear relationship is not the same as plotting the bin WoE values in sequential order and expecting the result to look like a line.  Although you say you normally look for this linearity by eye, in your code you actually regressed bin sequence number against WoE value; I think that may be a spurious relationship.  Does Siddiqi claim that the sequence of bin WoE values should look like a line?  If so, where in the book is that claim?

Ksharp
Super User

## Re: Trying to use PROC OPTMODEL for monotonic supervised optimal binning of a continuous predictor

“But that linear relationship is not the same as plotting the bin WoE values in sequential order and expecting the result to look like a line. ”

Yeah, I know the reason: it is because of the link function. But it must be a monotonic relationship. And, as I said before, I also prefer a linear relationship, because the logistic model is still a GLM.

"Although you say you normally look for this linearity by eye, in your code you actually regressed bin sequence number against WoE value; I think that may be a spurious relationship. "

I don't think so. I fit the regression model to check the linearity of the WoE, and in the IML code the regression is there to try to make the WoE linear.

Then I plot the WoE to see whether it is linear, and it is.

You could also use 10 20 30 for the regression, but the values must have the same step, to make the WoE linear and the differences between bins larger.

"Does Siddiqi claim that the sequence of bin WoE values should look like a line?  If so, where in the book is that claim?"