Yes, I am trying to use PROC OPTMODEL in SAS 9.4 on a Linux grid to do monotonic supervised optimal binning of an ordinal predictor variable with a binary target (although continuous targets can be used, too). I have an implementation of a (seemingly) correct formulation, but so far it is consistently outperformed by pure brute force exhaustive enumeration, so I thought this would be a good opportunity to appeal to the omniscience of the SAS® community hive mind to see whether any concrete improvements can be found.
*********************************************************************************
First some BACKGROUND:
(Apologies for the length, you can skip this and go to MY FORMULATION below if you just want to get to the heart of the problem.)
For those unfamiliar with the term, binning of an interval variable entails partitioning its range into an exhaustive, disjoint, discrete collection of subintervals. For example, if the range of x is [0, 10], then one possible set of three bins would be x1 = [0, 2.8], x2 = (2.8, 6.3], x3 = (6.3, 10]. Binning is also referred to as bucketing, classing, discretizing, grouping, or partitioning. The two most common forms of unsupervised binning are equal width and equal frequency (based on a data sample). An equal width example for the x variable above would be x1 = [0, 2.5], x2 = (2.5, 5], x3 = (5, 7.5], x4 = (7.5, 10]. There are other kinds of unsupervised binning methods, too. SAS/STAT has PROC HPBIN to do efficient unsupervised binning.
In supervised binning, the bins are chosen to magnify the relationship between the variable under consideration and a target variable. One of the very first such binning algorithms was described by Walter Fisher in the Journal of the American Statistical Association in 1958, “On Grouping for Maximum Homogeneity,” as an attempt to minimize the within-group variances of an interval target. Many more discretization algorithms have been devised; a useful summary can be found in the 2013 IEEE Transactions on Knowledge and Data Engineering article, “A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning.” Some of these techniques have been referred to as “optimal binning,” although few of them are optimal in the strict sense of guaranteeing the maximum or minimum value of a computable objective function. The Transform Variables node of SAS® Enterprise Miner™ has an “optimal binning transformation” that uses (non-optimal) decision trees. Fisher used a truly optimal dynamic programming procedure that has been rediscovered and published in prestigious journals a handful of times over the years. A frequent contributor to this forum, @RobPratt, joined this exalted company of “Fisher propagators” in 2016 when he cobbled together a version of the dynamic programming binning scheme as a response to the communities.sas.com thread “Finding the optimal cutoff for each bucket.” (There are a couple of R packages with implementations of Fisher’s algorithm: classInt and cartography; also the Excel add-in xlstat.) Simply put, the main principle of the dynamic programming algorithm is that if you have the best set of bins over the entire range, then the subset of bins over any subrange is the best set of bins for that subrange; otherwise, you could swap in a better set of bins for the subrange and thereby improve the binning of the whole range.
This allows the algorithm to build the set of bins like a proof by induction: if you have the best binnings of points 1,…,k for every k ≤ n, then you can construct the best binning for 1,…,n+1.
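For readers who want to see the recursion concretely, here is a small illustrative sketch of Fisher-style dynamic programming in Python (this is my own toy code, not the SAS implementation discussed in this thread; the function name is made up):

```python
from itertools import accumulate

def fisher_dp(y, k):
    """Split the ordered values y into k contiguous bins minimizing the
    total within-bin sum of squares (Fisher 1958 / Jenks natural breaks).
    Returns (optimal cost, list of bin start indices)."""
    n = len(y)
    # prefix sums give O(1) within-bin SSE for any slice y[i:j]
    ps = [0.0] + list(accumulate(y))
    ps2 = [0.0] + list(accumulate(v * v for v in y))

    def sse(i, j):  # sum of squares of y[i:j] about its own mean
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # cost[g][j] = best cost of splitting the first j points into g bins
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):  # the last bin is y[i:j]
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j], back[g][j] = c, i
    # walk the backpointers to recover where each bin starts
    starts, j = [], n
    for g in range(k, 0, -1):
        j = back[g][j]
        starts.append(j)
    return cost[k][n], starts[::-1]
```

The inner loop is exactly the induction described above: the best g-bin solution for the first j points is the best (g-1)-bin solution for some prefix, plus one more bin.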
There is a constituency for binning in the credit scorecard modeling arena who have a particular requirement for the bins they produce: monotonicity of target response. For binary event predictive models, this is usually defined in terms of the Weight of Evidence (WoE), which, within each bin, is just the log odds of the target variable for the bin minus the log odds of the entire data set, i.e.:
log(# bin events / # bin nonevents) – log(# total events / # total nonevents)
If the bins are numbered consecutively from left (smallest predictor value) to right (largest predictor value), then increasing (decreasing, resp.) monotonicity means that the WoE of bin j is ≤ (≥, resp.) the WoE of bin (j+1). Note that monotonicity of WoE and monotonicity of event rate are exactly equivalent whenever WoE is defined; event rate monotonicity also can include bins with all events or all nonevents, although WoE will be undefined for such bins. But if you require monotonicity, you can see that the dynamic programming scheme won’t work: if you obtain the best monotonic binning for points 1,…,n, there’s no guarantee you can extend it monotonically to (n+1) and beyond. Unfortunately, the long list of binning algorithms in the IEEE article, optimal or not, won’t help you out; there’s not even a passing mention of the monotonicity constraint. What to do?
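To make the WoE definition and the monotonicity check concrete, here is a small Python sketch (illustrative only, with made-up function names; every bin must contain at least one event and one nonevent for the logs to be defined):

```python
import math

def bin_woe(events, counts):
    """Per-bin Weight of Evidence: log(bin events / bin nonevents)
    minus log(total events / total nonevents). Undefined for bins
    with zero events or zero nonevents."""
    E, N = sum(events), sum(counts)
    base = math.log(E / (N - E))          # population log odds
    return [math.log(e / (n - e)) - base for e, n in zip(events, counts)]

def rates_monotone(events, counts, decreasing=True):
    """Check event-rate monotonicity across consecutive bins."""
    r = [e / n for e, n in zip(events, counts)]
    pairs = list(zip(r, r[1:]))
    if decreasing:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)
```

As a sanity check, feeding in the per-bin counts from one of the four-bin German Credit solutions shown later in the thread (events [537, 87, 47, 29], totals [724, 133, 82, 61]) reproduces its WoE column.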
If we just concentrate on achieving monotonicity, we can use isotonic regression, for which, if x[i] are the predictor values and y[i] are the response values, 1 ≤ i ≤ N, we attempt to find a transformation f(x[i]) that minimizes the sum from 1 to N of (y[i] – f(x[i]))**2, where f(x[i]) ≤ f(x[i+1]) (or f(x[i]) ≥ f(x[i+1]), resp) for increasing (decreasing, resp.) monotonicity. This transformation is optimal in the least sum of squares sense by design, and there are reasonably efficient algorithms to compute it. If the response variable, y[i], is already correspondingly monotonic, then f(x[i]) = y[i], and the transformation is perfect replication. But for most binary response variables, this will not be the case. In fact, as Wensui Liu has demonstrated in his wonderful blog on statistical computing, cited in the communities.sas.com thread “Optimal monotonic binning,” the isotonic transform will consist of piecewise constant subintervals, and in each subinterval the values of f(x[i]) will equal the average event rate over that subinterval; in other words, bins. So, isotonic regression produces optimal monotonic binning! The “catch” is that you have absolutely no control over the number of bins or their sizes. Can you acquire control over the number of bins and their sizes and still retain optimality and monotonicity?
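For reference, the pool-adjacent-violators algorithm (PAVA) that computes isotonic regression can be sketched in a few lines of Python (an illustrative sketch, not production code); the piecewise-constant output is exactly the "bins" described above:

```python
def pava(y, w=None, increasing=True):
    """Pool Adjacent Violators: isotonic regression of the sequence y
    with optional weights w. Returns the fitted values, which are
    piecewise constant -- i.e., bins whose value is the local mean."""
    if w is None:
        w = [1.0] * len(y)
    sign = 1 if increasing else -1
    blocks = []  # each block: [weighted mean, total weight, run length]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge the last two blocks while they violate monotonicity
        while len(blocks) > 1 and sign * (blocks[-2][0] - blocks[-1][0]) > 0:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    fitted = []
    for m, _, n in blocks:
        fitted.extend([m] * n)
    return fitted
```

On a 0/1 response, each merged block's fitted value is the block's event rate, which is the observation made in Wensui Liu's blog.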
Credit Scoring for SAS® Enterprise Miner™ is an add-on product aimed specifically at credit scorecard modelers, and its “Interactive Grouping” node has a method called “Constrained Optimized Binning” that appears inspired by isotonic regression and is designed to attain monotonicity and optimality within additional user-specified bounds. (Note that this node is distinct from the general SAS EM “Interactive Binning” node, which does non-monotonic, non-optimal binning with user-specified bounds.) But there’s still a catch (besides the additional cost). From the patent application description, the objective is to minimize the sum of absolute differences between individual WoE values and their associated decision variables, which are proxies for the bin WoE. This cannot be done directly with pointwise data, for which WoE is undefined; the data must be pre-aggregated into “fine-grained bins” to an extent that each fine-grained bin has at least one event and at least one nonevent. Differences in pre-aggregation can affect the final optimality, but the patent description doesn’t include any details on the pre-aggregation part.
*********************************************************************************
MY WOEFUL BINNING FORMULATION:
This is a pure BLIP (Binary Linear Integer Program) formulation, unlike the SAS EM method, which is a full MILP (Mixed Integer Linear Program). In this formulation, every possible bin gets its own decision variable. Since bins are just subintervals, fully specified by their two endpoints, the number of variables (columns) is O(N**2), where N is the number of data points. This is a big practical disadvantage, although I believe it should be less of a handicap than it has turned out to be in practice; finding a cleverer formulation is the open question.
For the objective function, just compute the appropriate metric, m[i,j], for each bin (i,j), and take the sum over all (i,j) of m[i,j]*v[i,j], where v[i,j] is the corresponding binary decision variable for bin (i,j):
v[i,j] = 1 if (i,j) is chosen as one of the bins,
v[i,j] = 0 if (i,j) is not chosen as one of the bins
Some metric examples are:
Chi-square: m[i,j] = ((N*e[i,j]) – (E*n[i,j]))**2 / (N*e[i,j]*(n[i,j] – e[i,j]))
Information Value: m[i,j] = ((e[i,j] / E) – ((n[i,j] – e[i,j]) / (N – E)))*(log((e[i,j] / E) / ((n[i,j] – e[i,j]) / (N – E))))
Mean Sum of Squares: m[i,j] = (e[i,j]*(n[i,j] – e[i,j])) / (N*n[i,j])
where n[i,j] is the number of points in (i,j), e[i,j] is the number of events in (i,j), N is the total number of points, and E is the total number of events. Chi-square and Information Value should be maximized over the chosen bins; Mean Sum of Squares should be minimized over the chosen bins.
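These per-bin contributions are straightforward to compute; here is an illustrative Python sketch (my own function name; for chi-square it uses the corrected denominator that I post further down in the thread, not the flawed one above):

```python
import math

def bin_metrics(e, n, E, N):
    """Per-bin contributions m[i,j] for a bin with n points and e events,
    given N total points and E total events. The bin must contain at
    least one event and one nonevent for iv to be defined."""
    # chi-square contribution (corrected denominator from later in the thread)
    chisq = (N * e - E * n) ** 2 / (n * E * (N - E))
    # information value contribution
    ev_share, nonev_share = e / E, (n - e) / (N - E)
    iv = (ev_share - nonev_share) * math.log(ev_share / nonev_share)
    # mean within-bin sum of squares for a 0/1 target
    mss = e * (n - e) / (N * n)
    return chisq, iv, mss
```

Plugging in the first bin of the four-bin solution shown below (e = 537, n = 724, E = 700, N = 1000) reproduces that table's iv and sos entries.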
The constraints are:
1. Every point should be in exactly one bin. This can be expressed in different ways, but the number of such constraints is O(N).
2. Upper bound on number of bins, one constraint if desired. Sum of all v[i,j] ≤ upper bound.
3. Lower bound on number of bins, one constraint if desired. Sum of all v[i,j] ≥ lower bound.
4. Upper and / or lower bounds on number of points per bin do not require constraints, just eliminate the decision variables for bins that are outside of the bounds.
5. Event rate monotonicity. At any bin endpoint p, the sum over i of ((e[i,p]*v[i,p]) / n[i,p]) must be ≤ the sum over j of ((e[(p+1),j]*v[(p+1),j]) / n[(p+1),j]) (for increasing monotonicity, reverse for decreasing). You can also make the difference ≥ a positive constant to require neighboring bins to behave differently. The number of such constraints is O(N). On top of the monotonicity constraints, I have experimented with more complex constraints to require event rate differences between neighboring bins to be statistically significant; I needed O(N**2) such constraints in my formulation.
I did some experiments using a well-known, publicly available “German Credit” data set with 1,000 points. The binary target is called Creditability, with 700 events and 300 nonevents, and the continuous predictor is Credit_Amount, with 923 distinct values ranging from 250 to 18,424. The attached code to read in the data (_import_GRIDWORK.GC_CREDIT_AMOUNT_AGG01IC.sas) gives the results already aggregated to distinct predictor values (nv_predictor_value). The number of points for each distinct predictor value is nv_w_ind_all, and the number of events for each distinct predictor value is nv_w_ind_one. I chose this data set because there was a paper about binning at SGF 2018, called “Get Better Weight of Evidence for Scorecards Using a Genetic Algorithm.” Their criterion for good binning is a bit vague: they think the differences in WoE between groups should be as large as possible and look linear. Here is what they came up with, four monotonically decreasing bins:
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe
1 | 1 | 631 | 699 | 513 | 186 | 0.7339 | 0.1365 | 0.0189 | 0.1672 | 0.6629 | 0.1237 | . | . | .
2 | 632 | 764 | 139 | 96 | 43 | 0.6907 | 0.0297 | 0.0003 | -0.0442 | 0.1408 | 0.0292 | -0.0433 | 0.0835 | -0.2114
3 | 765 | 821 | 60 | 37 | 23 | 0.6167 | 0.0142 | 0.0089 | -0.3719 | 0.0654 | 0.0132 | -0.0740 | 0.1451 | -0.3277
4 | 822 | 923 | 102 | 54 | 48 | 0.5294 | 0.0254 | 0.0604 | -0.7295 | 0.1155 | 0.0254 | -0.0873 | 0.1566 | -0.3576
Total | | | 1,000 | 700 | 300 | | 0.2058 | 0.0884 | | 0.9845 | 0.1915 | | |
Note that the minimum bin size (wsize) is 60 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0433, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.2114. To measure the linearity, the authors regressed WoE against Obs and found adjusted R**2 of 0.9814. The overall information value (iv) is 0.0884.
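For anyone who wants to reproduce the linearity check, the adjusted R**2 of the simple regression of WoE on bin number can be computed as follows (a plain-Python sketch with a made-up function name; in SAS this is just PROC REG):

```python
def adj_r2(y, x):
    """Adjusted R**2 of the simple linear regression of y on x
    (two parameters: intercept and slope)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2 = sxy * sxy / (sxx * syy)          # R**2 of the simple regression
    return 1 - (1 - r2) * (n - 1) / (n - 2)
```

With only four bins there are just two residual degrees of freedom, so the adjustment is severe; a perfectly linear WoE sequence returns exactly 1.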
Using the BLIP formulation with the objective of maximizing information value (iv), specifying four bins, with a minimum bin size of 60 and a minimum absolute event rate difference between adjacent bins of 0.08, I found the following (attachment _122_gco099_fullstimer_optmodel_milpsolve_max_iv_clean_4g_60minsz_nomaxsz_0.08mindif_desc_impure_modified_metrics.sas):
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe
1 | 1 | 654 | 724 | 537 | 187 | 0.7417 | 0.1387 | 0.0299 | 0.2076 | 0.6771 | 0.1259 | . | . | .
2 | 655 | 782 | 133 | 87 | 46 | 0.6541 | 0.0301 | 0.0061 | -0.2100 | 0.1404 | 0.0296 | -0.0876 | 0.0869 | -0.4176
3 | 783 | 862 | 82 | 47 | 35 | 0.5732 | 0.0201 | 0.0274 | -0.5525 | 0.0916 | 0.0191 | -0.0810 | 0.1342 | -0.3425
4 | 863 | 923 | 61 | 29 | 32 | 0.4754 | 0.0152 | 0.0617 | -0.9457 | 0.0691 | 0.0152 | -0.0978 | 0.1648 | -0.3932
Total | | | 1,000 | 700 | 300 | | 0.2041 | 0.1250 | | 0.9782 | 0.1897 | | |
Note that the minimum bin size (wsize) is 61 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0810, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.3425. To measure the linearity, regressing WoE against Obs gives adjusted R**2 of 0.9980. The overall information value (iv) is 0.1250. It took much longer to run the optimization than to find the answer by exhaustive enumeration.
Using brute force exhaustive enumeration, it is possible to find a set of four bins whose minimum event rate difference is 0.08573 and minimum WoE difference is 0.3623. Its adjusted R**2 is 0.9991, and its overall information value (iv) is 0.1222. I was able to modify my BLIP formulation to formulate a MILP to find the binning with the maximum possible minimum event rate difference, but the MILP ran out of memory and would not converge on this data.
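As a point of comparison, the brute-force enumeration itself is simple; here is an illustrative Python sketch over the aggregated (events, counts) data with a pluggable objective (my own function names, not the SAS code attached to this thread). For the German Credit data, k = 4 over 923 distinct values gives C(922, 3), about 130 million candidate partitions:

```python
from itertools import accumulate, combinations

def best_bins_bruteforce(events, counts, k, min_size=1,
                         objective=None, decreasing=True):
    """Enumerate every split of the aggregated data (events[i], counts[i]
    per distinct predictor value, in sorted order) into k contiguous bins;
    keep only monotone solutions meeting the minimum bin size; return
    (best objective value, list of bin boundary indices)."""
    n = len(counts)
    pe = [0] + list(accumulate(events))   # prefix sums of events
    pc = [0] + list(accumulate(counts))   # prefix sums of counts
    best, best_b = None, None
    for cuts in combinations(range(1, n), k - 1):
        b = [0, *cuts, n]                 # bin t covers indices b[t]:b[t+1]
        sizes = [pc[b[t + 1]] - pc[b[t]] for t in range(k)]
        if min(sizes) < min_size:
            continue
        es = [pe[b[t + 1]] - pe[b[t]] for t in range(k)]
        rates = [e / s for e, s in zip(es, sizes)]
        if decreasing:
            mono = all(x >= y for x, y in zip(rates, rates[1:]))
        else:
            mono = all(x <= y for x, y in zip(rates, rates[1:]))
        if not mono:
            continue
        val = objective(es, sizes)
        if best is None or val > best:
            best, best_b = val, b
    return best, best_b

def min_rate_gap(es, ns):
    """Example objective: the smallest adjacent event-rate difference,
    to be maximized (the max-min criterion mentioned above)."""
    r = [e / n for e, n in zip(es, ns)]
    return min(abs(a - b) for a, b in zip(r, r[1:]))
```

On a tiny synthetic example this picks the split with the largest minimum rate gap; on the real data the same loop just runs many millions of times.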
I also ran the BLIP where I still used the monotonicity constraints, but removed the minimum event rate difference specification and added in the significant difference constraints. If I removed the constraints on the number of bins, it ran out of memory and did not converge. I was able to get it to run to completion with an upper bound of seven bins, no upper or lower limit on the number of points per bin. The solution has three bins (attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_17g_60minsz_nomaxsz_0mindif_sigrdif_desc_impure_.sas):
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe
1 | 1 | 668 | 739 | 550 | 189 | 0.7443 | 0.1407 | 0.0344 | 0.2209 | 0.6878 | 0.1278 | . | . | .
2 | 669 | 848 | 186 | 116 | 70 | 0.6237 | 0.0437 | 0.0231 | -0.3422 | 0.2017 | 0.0422 | -0.1206 | 0.0764 | -0.5631
3 | 849 | 923 | 75 | 34 | 41 | 0.4533 | 0.0186 | 0.0911 | -1.0345 | 0.0846 | 0.0186 | -0.1703 | 0.1324 | -0.6923
Total | | | 1,000 | 700 | 300 | | 0.2029 | 0.1487 | | 0.9741 | 0.1886 | | |
I ran many other experiments, but most of the time the optimization would not converge, even when the number of possible solutions was only a few million. When the optimization does converge, it takes a long time to do so. If you look at my optimization code and see some ways to improve it, please post! Thanks!
Hey, that is my paper. The WoE criteria come from Naqi's book.
This optimization problem is complicated, so I solved it with a GA; I'm not sure whether SAS/OR can solve it.
You could also try:
%let var=duration;
%let group=6 ;
%let n_iter=100;
If that line is not linear, reduce the group number:
%let var=duration;
%let group=5 ;
%let n_iter=100;
And so on....
%let var=duration;
%let group=4 ;
%let n_iter=100;
until you get a line.
Hi @Ksharp !
Thank you so much for responding. I enjoyed your paper, you took a very interesting approach, and I think there's a lot of potential for GAs. Your solution is quite practical for your needs, and probably runs waaaaay faster than the BLIP / MILP formulations. But if you're interested in tinkering further with your algorithm, now you also know that there are four-bin solutions for the German Credit data that are more linear, have greater differences between bin WoE values, and higher overall information value than what you found. There are only 130 million possible four-bin solutions for the German Credit data, so if you continue to tune your approach, I'll bet you can improve it quickly, if you wish to do so. Good luck!
Hmmmm, let's see if I can correct some of my own mistakes.
Okay, for the last optimization I displayed, the one using the attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_17g_60minsz_nomaxsz_0mindif_sigrdif_desc_impure_.sas, I claimed there was no upper or lower limit on the number of points per bin, but there is a lower bound of 60 points per bin. And both of the optimization programs also require that each bin have at least one event and at least one nonevent; otherwise it would try to choose an all-event bin at the left end and an all-nonevent bin at the right end (although the 60-point minimum size should overcome that problem for this data set).
I think the formulas I gave for information value (iv column in my tables) and mean sum of squares (sos column in my tables) are correct. I didn't include chi-square in the tables, and it looks like I got the formula wrong. I believe the correct chi-square formula is:
m[i,j] = ((N*e[i,j]) – (E*n[i,j]))**2 / (n[i,j]*E*(N – E))
Also, I didn't mention this, but one way I like to do monotonic binning that works for pretty large size data sets without problems is to start with the isotonic regression, and then follow that by running Fisher's dynamic programming algorithm on the results to constrain the number and sizes of the bins. DP can't impose monotonicity, but once the data is already monotonic, like the results from the isotonic regression, it can't break that monotonicity. The final results aren't guaranteed optimal, but they often agree with the optimal solution under the desired conditions. If people are interested, I can post some results and discuss more. Thanks!
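Here is an illustrative Python sketch of that two-stage idea: first pool-adjacent-violators on the aggregated counts, then a merge down to k bins (a tiny exhaustive search stands in for Fisher's DP here, since merging adjacent monotone blocks cannot break monotonicity). Names and details are mine, not the production SAS code:

```python
from itertools import accumulate, combinations

def pava_blocks(events, counts):
    """Stage 1: pool-adjacent-violators on aggregated counts so the block
    event rates are non-increasing. Returns [events, total] per block."""
    blocks = []
    for e, n in zip(events, counts):
        blocks.append([e, n])
        # merge while rate(prev) < rate(last); compare via cross-multiplication
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            e2, n2 = blocks.pop()
            e1, n1 = blocks.pop()
            blocks.append([e1 + e2, n1 + n2])
    return blocks

def reduce_to_k(blocks, k):
    """Stage 2: merge adjacent (already monotone) blocks down to exactly
    k bins, minimizing the pooled within-bin sum of squares for a 0/1
    target. Exhaustive over block cut points, so intended for the small
    block counts PAVA typically leaves."""
    pe = [0] + list(accumulate(b[0] for b in blocks))
    pn = [0] + list(accumulate(b[1] for b in blocks))
    m = len(blocks)

    def cost(i, j):  # within-bin SS of blocks i..j-1 pooled together
        e, n = pe[j] - pe[i], pn[j] - pn[i]
        return e * (n - e) / n

    best, best_b = None, None
    for cuts in combinations(range(1, m), k - 1):
        b = [0, *cuts, m]
        c = sum(cost(b[t], b[t + 1]) for t in range(k))
        if best is None or c < best:
            best, best_b = c, b
    return best, best_b
```

As stated above, the combined result is not guaranteed globally optimal among all monotone binnings, but the second stage cannot undo the monotonicity the first stage establishes.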
Sorry, I do not know OR well. @RobPratt might give you a hand.
According to Naqi's book, the good/bad distribution must be >= 0.05; you can see this in my IML code.
/*
proc import datafile='c:\temp\1German Credit.xlsx' dbms=xlsx out=have replace;
run;
*/
%let var=duration;
%let group=6 ;
%let n_iter=100;
data temp;
set have;
keep &var good_bad ;
run;
proc sql noprint;
select sum(good_bad='bad'),sum(good_bad='good'),
floor(min(&var)),ceil(max(&var)) into : n_bad,: n_good,: min,: max
from temp;
quit;
%put &n_bad &n_good &min &max;
proc sort data=temp;by &var ;run;
proc iml;
use temp(where=(&var is not missing));
read all var {&var good_bad};
close;
start function(x) global(bin,&var ,good_bad,group,woe);
if countunique(x)=group-1 then do;
col_x=t(x);
call sort(col_x,1);
cutpoints= .M//col_x//.I ;
b=bin(&var ,cutpoints,'right');
if countunique(b)=group then do;
do i=1 to group;
idx=loc(b=i);
temp=good_bad[idx];
n_bad=sum(temp='bad');
n_good=sum(temp='good');
bad_dist=n_bad/&n_bad ;
good_dist=n_good/&n_good ;
if Bad_Dist>0.05 & Good_Dist>0.05 then woe[i]=log(Bad_Dist/Good_Dist);
else woe[i]=.;
end;
if countmiss(woe)=0 then do;
/*
xx=j(group,1,1)||woe||woe##2;
*/
xx=j(group,1,1)||woe;
beta=solve(xx`*xx,xx`*bin);
yhat=xx*beta;
sse=ssq(bin-yhat);
end;
else sse=999999;
end;
else sse=999999;
end;
else sse=999999;
return (sse);
finish;
group=&group ;
bin=t(1:group);
woe=j(group,1,.);
encoding=j(2,group-1,&min );
encoding[2,]=&max ;
id=gasetup(2,group-1,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,1);
call gainit(id,1000,encoding);
niter = &n_iter ;
do i = 1 to niter;
call garegen(id);
call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);
col_mem=t(mem);
call sort(col_mem,1);
cutpoints= .M//col_mem//.I ;
b=bin(&var ,cutpoints,'right');
create cutpoints var {cutpoints};
append;
close;
create group var {b};
append;
close;
print value[l = "Min Value:"] ;
call gaend(id);
quit;
data all_group;
set temp(keep=&var rename=(&var=b) where=(b is missing)) group;
run;
data all;
merge all_group temp;
rename b=group;
run;
title "Variable: &var" ;
proc sql;
create table woe_&var as
select group label=' ',
min(&var) as min label='Minimum',max(&var) as max label='Maximum',count(*) as n label='Frequency',
calculated n/(select count(*) from all) as per format=percent7.2 label='Percent',
sum(good_bad='bad') as n_bad label='Number of bad',sum(good_bad='good') as n_good label='Number of good',
sum(good_bad='bad')/(select sum(good_bad='bad') from all ) as bad_dist label='Bad distribution',
sum(good_bad='good')/(select sum(good_bad='good') from all ) as good_dist label='Good distribution',
log(calculated Bad_Dist/calculated Good_Dist) as woe
from all
group by group
order by woe;
create index group on woe_&var;
select *,sum( (Bad_Dist-Good_Dist)*woe ) as iv
from woe_&var ;
quit;
title ' ';
data fmt_&var ;
set cutpoints;
start=lag(cutpoints);
end=cutpoints;
if start=.M then hlo='IL';
if end=.I then hlo='IH';
if _n_ ne 1 then do;group+1;output;end;
run;
data fmt_&var(index=(group));
merge fmt_&var woe_&var(keep=group woe);
by group;
retain fmtname "&var" type 'I';
keep group fmtname type start end woe hlo;
rename woe=label;
label group=' ';
run;
proc format cntlin=fmt_&var library=z;
run;
/*
proc print data=woe_&var noobs label;run;
proc sgplot data=woe_&var;
reg y=group x=woe/degree=2 cli clm jitter;
run;
*/
proc sgplot data=woe_&var noautolegend;
vbar group/response=woe nostatlabel missing;
vline group/response=woe nostatlabel missing markers MARKERATTRS=(symbol=circlefilled
size=12) MARKERFILLATTRS=(color=white) MARKEROUTLINEATTRS=graphdata1
FILLEDOUTLINEDMARKERS;
run;
ods select fitplot;
proc reg data=woe_&var;
model group=woe/ cli clm ;
quit;
proc copy in=work out=z;
select woe_: fmt_: ;
run;
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe
1 | 1 | 654 | 724 | 537 | 187 | 0.7417 | 0.1387 | 0.0299 | 0.2076 | 0.6771 | 0.1259 | . | . | .
2 | 655 | 782 | 133 | 87 | 46 | 0.6541 | 0.0301 | 0.0061 | -0.2100 | 0.1404 | 0.0296 | -0.0876 | 0.0869 | -0.4176
3 | 783 | 862 | 82 | 47 | 35 | 0.5732 | 0.0201 | 0.0274 | -0.5525 | 0.0916 | 0.0191 | -0.0810 | 0.1342 | -0.3425
4 | 863 | 923 | 61 | 29 | 32 | 0.4754 | 0.0152 | 0.0617 | -0.9457 | 0.0691 | 0.0152 | -0.0978 | 0.1648 | -0.3932
Total | | | 1,000 | 700 | 300 | | 0.2041 | 0.1250 | | 0.9782 | 0.1897 | | |
Hi, your bad distribution count of 29 is less than 700*0.05 = 35,
so it does not satisfy the condition >= 0.05 (which is from the book:
Siddiqi, Naeem. 2006. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring).
But mine satisfies this criterion.
Hi @Ksharp !
Thank you for following up. I modified my optimization program to include your distribution criteria. Here is a four-bin solution that meets those requirements:
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe
1 | 1 | 650 | 720 | 533 | 187 | 0.7403 | 0.1384 | 0.0276 | 0.2001 | 0.6751 | 0.1256 | . | . | .
2 | 651 | 771 | 125 | 83 | 42 | 0.6640 | 0.0279 | 0.0036 | -0.1661 | 0.1306 | 0.0274 | -0.0763 | 0.0888 | -0.3662
3 | 772 | 831 | 63 | 37 | 26 | 0.5873 | 0.0153 | 0.0167 | -0.4945 | 0.0699 | 0.0143 | -0.0767 | 0.1471 | -0.3284
4 | 832 | 923 | 92 | 47 | 45 | 0.5109 | 0.0230 | 0.0666 | -0.8038 | 0.1044 | 0.0230 | -0.0764 | 0.1588 | -0.3093
Total | | | 1,000 | 700 | 300 | | 0.2046 | 0.1145 | | 0.9800 | 0.1903 | | |
The smallest bin has 63 points, the smallest absolute WoE difference is 0.3093, the smallest absolute event rate difference is 0.0763, the information value is 0.1145, and the adjusted R-square is 0.9978. Thanks!
Hi @Ksharp !
I also reworked the final, significant difference optimization to use your good/bad distribution criteria (attachment _125_gco102_fullstimer_optmodel_milpsolve_max_iv_clean_15g_60minsz_nomaxsz_0mindif_sigrdif_desc_impurenaqi_modified_metrics_.sas). In this case I cut the upper bound down from seven bins to five (so that it wouldn't take all day to run), but kept the minimum size at 60. Once again, it chooses a three-bin solution:
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe
1 | 1 | 668 | 739 | 550 | 189 | 0.7443 | 0.1407 | 0.0344 | 0.2209 | 0.6878 | 0.1278 | . | . | .
2 | 669 | 845 | 183 | 114 | 69 | 0.6230 | 0.0430 | 0.0232 | -0.3452 | 0.1985 | 0.0415 | -0.1213 | 0.0769 | -0.5661
3 | 846 | 923 | 78 | 36 | 42 | 0.4615 | 0.0194 | 0.0887 | -1.0015 | 0.0881 | 0.0194 | -0.1614 | 0.1310 | -0.6562
Total | | | 1,000 | 700 | 300 | | 0.2030 | 0.1463 | | 0.9745 | 0.1887 | | |
Now the smallest bin has 78 points, the event rate differences are at least 0.1213 in absolute value and statistically significant at 95% confidence, the absolute WoE differences are at least 0.5661, and the IV is 0.1463. While it's a bit silly to talk about linearity with only three points, the adjusted R-square is 0.9964. This seems like a decently robust set of bins, and I think it satisfies all your criteria. Thanks!
What an interesting thing.
As the number of bins goes up, the IV should get larger too.
But you got the opposite result here, which is interesting.
Hi @Ksharp !
You appear to admire Siddiqi's book, and I must confess I have not read it, but there are two things you cite as having emanated from there that I would love to have explained to me.
First, the idea that the bin WoEs should have a linear trend. I would think one would want to get maximum separation of events and nonevents, which would be actualized by having bins with extreme values of WoE, positive and negative. Linear means you're going to have one or two bins with WoE around zero, which is the same characteristic as the entire population, and thus uninformative. Furthermore, even if you want linearity, why would you treat all the bins equally, as you do in your regression, when they have very different sizes? Why wouldn't you weight them by size?
My second question is about the good / bad distribution 5% lower bound rule. I can understand having a lower bound on the size of each whole bin, you want to ensure a chosen bin is not just a random fluctuation. But, especially at the extreme ends, wouldn't you like to have a bin at one end that is heavily skewed to events, and a bin at the other end that leans mightily toward nonevents? If the bins are large enough overall, why do you care that they have at least 5% of both categories? That's why I like the statistical significance condition, it pretty much bakes in the requirement of sufficient size without having to set it explicitly.
I would love to get your thoughts, or anyone else's, about these issues.
Thanks!
Hi Top,
For your first Q: I just check whether the WoE is linear; I don't really do a regression model or a weighted regression.
For your second Q: I have no idea about it; maybe Siddiqi could answer it.
Hi @Ksharp !
Thank you for responding. I asked the linearity question, because I saw the following in your paper:
Then I ran the same regression on the examples I generated. But are you saying that you normally don't run the regression, you just look for linearity by eye? Either way, what I'm really wondering is whether a "linear" solution, in which some of the bins would have WoE close to zero, is somehow preferable to a solution that does not look linear, but has all bins with either very positive or very negative WoE. It seems to me that bins with WoE near zero don't do any better than random guessing for separating events from nonevents.
Hi Top,
"But are you saying that you normally don't run the regression, you just look for linearity by eye? "
Yeah. You can check the linearity by eye; no need to do PROC REG.
"Either way, what I'm really wondering is whether a "linear" solution, in which some of the bins would have WoE close to zero, is somehow preferable to a solution that does not look linear, but has all bins with either very positive or very negative WoE. "
No, it must look linear, due to the assumption of the logistic model, which sits in the GLM framework; all these LINEAR models had better have a linear relation between Y and X, and that is why we bin to get linear WoE. You could bin it into a U or reverse-U shape, but that is not easy to explain.
"It seems to me that bins with WoE near zero don't do any better than random guessing for separating events from nonevents."
"WoE near zero" doesn't have any predictive power, but that is the price you pay to build a scorecard model; you can't get away from it.
Thanks, that is just my opinion.
I hope SAS employee Sidd (the author of the book) can appear here and say something.
Hi @Ksharp !
Thank you for continuing the conversation. From your comments, "all these LINEAR model better have linear relation between Y and X," we are certainly in agreement there. And, of course, in the case of logistic regression, that linear relationship is specifically between the log odds of Y and the linear predictor X. For binning, the linear predictor X is either: the original variable transformed to the bin WoE values (as a single DoF), or the set of indicator functions of the individual bins (for multiple DoF). But that linear relationship is not the same as plotting the bin WoE values in sequential order and expecting the result to look like a line. Although you say you normally look for this linearity by eye, in your code you actually regressed bin sequence number against WoE value; I think that may be a spurious relationship. Does Siddiqi claim that the sequence of bin WoE values should look like a line? If so, where in the book is that claim?
“But that linear relationship is not the same as plotting the bin WoE values in sequential order and expecting the result to look like a line. ”
Yeah, I know the reason; it is because of the link function. But it must be a monotonic relationship. And I also prefer a linear relationship, as I said before, because the logistic model is still a GLM.
"Although you say you normally look for this linearity by eye, in your code you actually regressed bin sequence number against WoE value; I think that may be a spurious relationship. "
I don't think so. I do the regression model to check the linearity of the WoE, and in the IML code the regression model is there to try to make the WoE linear,
and I plot the WoE to see whether it is linear, and it is.
You could also use 10, 20, 30 to do the regression, but then the WoE values must have the same step, to make them linear and larger between each other.
"Does Siddiqi claim that the sequence of bin WoE values should look like a line? If so, where in the book is that claim?"