Yes, I am trying to use PROC OPTMODEL in SAS 9.4 on a Linux grid to do monotonic supervised optimal binning of an ordinal predictor variable with a binary target (although continuous targets can be used, too). I have an implementation of a (seemingly) correct formulation, but so far it is consistently outperformed by pure brute force exhaustive enumeration, so I thought this would be a good opportunity to appeal to the omniscience of the SAS® community hive mind to see whether any concrete improvements can be found. ********************************************************************************* First some BACKGROUND: (Apologies for the length, you can skip this and go to MY FORMULATION below if you just want to get to the heart of the problem.) For those unfamiliar with the term, binning of an interval variable entails partitioning its range into an exhaustive, disjoint, discrete collection of subintervals. For example, if the range of x is [0, 10], then one possible set of three bins would be x1 = [0, 2.8], x2 = (2.8, 6.3], x3 = (6.3, 10]. Binning is also referred to as bucketing, classing, discretizing, grouping, or partitioning. The two most common forms of unsupervised binning are equal width and equal frequency (based on a data sample). An equal width example for the x variable above would be x1 = [0, 2.5], x2 = (2.5, 5], x3 = (5, 7.5], x4 = (7.5, 10]. There are other kinds of unsupervised binning methods, too. SAS/STAT has PROC HPBIN to do efficient unsupervised binning. In supervised binning, the bins are chosen to magnify the relationship between the variable under consideration and a target variable. One of the very first such binning algorithms was described by Walter Fisher in the Journal of the American Statistical Association in 1958, “On Grouping for Maximum Homogeneity,” as an attempt to minimize the within group variances of an interval target. Many more discretization algorithms have been devised; a useful summary can be found in the 2013 IEEE Transactions on Knowledge and Data Engineering article, “A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning.” Some of these techniques have been referred to as “optimal binning,” although few of them are optimal in the strict sense of guaranteeing the maximum or minimum value of a computable objective function. The Transform Variables node of SAS® Enterprise MinerTM has an “optimal binning transformation” that uses (non-optimal) decision trees. Fisher used a truly optimal dynamic programming procedure that has been rediscovered and published in prestigious journals a handful of times over the years. A frequent contributor to this forum, @RobPratt, joined this exalted company of “Fisher propagators” in 2016 when he cobbled together a version of the dynamic programming binning scheme as a response to the communities.sas.com thread “Finding the optimal cut-off for each bucket.” (There are a couple of R packages with implementations of Fisher’s algorithm: classInt and cartography; also the Excel add-in xlstat) Simply put, the main principle of the dynamic programming algorithm is that if you have the best set of bins over the entire range, then the subset of bins over any sub-range is the best set of bins for that sub-range; otherwise, you could swap in a better set of bins for the sub-range and thereby improve the binning of the whole range. This allows the algorithm to build the set of bins like a proof by induction: if you have the best binnings of points 1,…,k for every k ≤ n, then you can construct the best binning for 1,…,n+1. There is a constituency for binning in the credit scorecard modeling arena who have a particular requirement for the bins they produce: monotonicity of target response. For binary event predictive models, this is usually defined in terms of the Weight of Evidence (WoE), which, within each bin, is just the log odds of the target variable for the bin minus the log odds of the entire data set, i.e.: log(# bin events / # bin non-events) - log(# total events / # total non-events) If the bins are numbered consecutively from left (smallest predictor value) to right (largest predictor value), then increasing (decreasing, resp.) monotonicity means that the WoE of bin j is ≤ (≥, resp.) the WoE of bin (j+1). Note that monotonicity of WoE and monotonicity of event rate are exactly equivalent whenever WoE is defined; event rate monotonicity also can include bins with all events or all non-events, although WoE will be undefined for such bins. But if you require monotonicity, you can see that the dynamic programming scheme won’t work: if you obtain the best monotonic binning for points 1,…,n, there’s no guarantee you can extend it monotonically to (n+1) and beyond. Unfortunately, the long list of binning algorithms in the IEEE article, optimal or not, won’t help you out; there’s not even a passing mention of the monotonicity constraint. What to do? If we just concentrate on achieving monotonicity, we can use isotonic regression, for which, if x[i] are the predictor values and y[i] are the response values, 1 ≤ i ≤ N, we attempt to find a transformation f(x[i]) that minimizes the sum from 1 to N of (y[i] – f(x[i]))**2, where f(x[i]) ≤ f(x[i+1]) (or f(x[i]) ≥ f(x[i+1]), resp) for increasing (decreasing, resp.) monotonicity. This transformation is optimal in the least sum of squares sense by design, and there are reasonably efficient algorithms to compute it. If the response variable, y[i], is already correspondingly monotonic, then f(x[i]) = y[i], and the transformation is perfect replication. But for most binary response variables, this will not be the case. In fact, as Wensui Liu has demonstrated in his wonderful blog on statistical computing, cited in the communities.sas.com thread “Optimal monotonic binning,” the isotonic transform will consist of piecewise constant subintervals, and in each subinterval the values of f(x[i]) will equal the average event rate over that subinterval; in other words, bins. So, isotonic regression produces optimal monotonic binning! The “catch” is that you have absolutely no control over the number of bins or their sizes. Can you acquire control over the number of bins and their sizes and still retain optimality and monotonicity? Credit Scoring for SAS® Enterprise MinerTM is an add-on product aimed specifically at credit scorecard modelers, and its “Interactive Grouping” node has a method called “Constrained Optimized Binning” that appears inspired by isotonic regression, and is designed to attain monotonicity and optimality within additional user-specified bounds. (Note that this node is distinct from the general SAS EM “Interactive Binning” node that does non-monotonic, non-optimal binning with user-specified bounds.) But there’s still a catch (besides the additional cost). From the patent application description, the objective is to minimize the sum of absolute differences between individual WoE values and their associated decision variables, which are proxies for the bin WoE. This cannot be done directly with pointwise data, for which WoE is undefined; the data must be pre-aggregated into “fine-grained bins” to an extent that each “fine-grained bin” has at least one of both events and one non-events. Differences in pre-aggregation can affect the final optimality, but the patent description doesn’t include any details on the pre-aggregation part. ********************************************************************************* MY WOEFUL BINNING FORMULATION: This is a pure BLIP (Binary Linear Integer Program) formulation, unlike the SAS EM method, which is a full MILP (Mixed Integer Linear Program). In this formulation, every possible bin gets its own decision variable. Since bins are just sub-intervals, fully specified by their two endpoints, this means the number of variables (columns) is O(N**2), where N is the number of data points. This is a big practical disadvantage, although I think it should be less of a disadvantage than it has turned out to be in practice, but I would need to find a more clever formulation. For the objective function, just compute the appropriate metric, m[i,j], for each bin (i,j), and take the sum over all (i,j) of m[i,j]*v[i,j], where v[i,j] is the corresponding binary decision variable for bin (i,j): v[i,j] = 1 if (i,j) is chosen as one of the bins, v[i,j] = 0 if (i,j) is not chosen as one of the bins Some metric examples are: Chi-square: m[i,j] = ((N*e[i,j]) – (E*n[i,j]))**2 / (N*e[i,j]*(n[i,j] - e[i,j])) Information Value: m[i,j] = ((e[i,j] / E) – ((n[i,j] - e[i,j]) / (N - E)))*(log((e[i,j] / E) / ((n[i,j] - e[i,j]) / (N - E)))) Mean Sum of Squares: m[i,j] = (e[i,j]*(n[i,j] - e[i,j]) / (N*n[i,j]) where n[i,j] is the number of points in (i,j), e[i,j] is the number of events in (i,j), N is the total number of points, and E is the total number of events. Chi-square and Information Value should be maximized over the chosen bins, Mean Sum of Squares should be minimized over the chosen bins. The constraints are: 1. Every point should be in exactly one bin. This can be expressed in different ways, but the number of such constraints is O(N). 2. Upper bound on number of bins, one constraint if desired. Sum of all v[i,j] ≤ upper bound. 3. Lower bound on number of bins, one constraint if desired. Sum of all v[i,j] ≥ lower bound. 4. Upper and / or lower bounds on number of points per bin do not require constraints, just eliminate the decision variables for bins that are outside of the bounds. 5. Event rate monotonicity. At any bin endpoint p, the sum over i of ((e[i,p]*v[i,p]) / n[i,p]) must be ≤ the sum over j of ((e[(p+1),j]*v[(p+1),j]) / n[(p+1),j]) (for increasing monotonicity, reverse for decreasing). You can also make the difference ≥ a positive constant to require neighboring bins to behave differently. The number of such constraints is O(N). On top of the monotonicity constraints, I have experimented with more complex constraints to require event rate differences between neighboring bins to be statistically significant; I needed O(N**2) such constraints in my formulation. I did some experiments using a well-known, publicly available “German Credit” data set with 1,000 points. The binary target is called Creditability, with 700 events and 300 non-events, and the continuous predictor is Credit_Amount, with 923 distinct values ranging from 250 to 18,424. The attached code to read in the data (_import_GRIDWORK.GC_CREDIT_AMOUNT_AGG01IC.sas) gives the results already aggregated to distinct predictor values (nv_predictor_value). The number of points for each distinct predictor value is nv_w_ind_all, and the number of events for each distinct predictor value is nv_w_ind_one. I chose this data set because there was a paper about binning at SGF2018, called “Get Better Weight of Evidence for Scorecards Using a Genetic Algorithm.” Their criteria for good binning is a bit vague, they think the difference of WoE between groups should be as large as possible and look linear. Here is what they came up with, four monotonically decreasing bins: Obs i j wsize wones wzero wrate sos iv woe ent css diffwrate sigdifrate diffwoe 1 1 631 699 513 186 0.7339 0.1365 0.0189 0.1672 0.6629 0.1237 . . . 2 632 764 139 96 43 0.6907 0.0297 0.0003 -0.0442 0.1408 0.0292 -0.0433 0.0835 -0.2114 3 765 821 60 37 23 0.6167 0.0142 0.0089 -0.3719 0.0654 0.0132 -0.0740 0.1451 -0.3277 4 822 923 102 54 48 0.5294 0.0254 0.0604 -0.7295 0.1155 0.0254 -0.0873 0.1566 -0.3576 Total 1,000 700 300 0.2058 0.0884 0.9845 0.1915 Note that the minimum bin size (wsize) is 60 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0433, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.2114. To measure the linearity, the authors regressed WoE against Obs and found adjusted R**2 of 0.9814. The overall information value (iv) is 0.0884. Using the BLIP formulation with the objective of maximizing information value (iv), specifying four bins, with a minimum bin size of 60 and a minimum absolute event rate difference between adjacent bins of 0.08, I found the following (attachment _122_gco099_fullstimer_optmodel_milpsolve_max_iv_clean_4-g_60-minsz_no-maxsz_0.08-mindif_desc_impure_modified_metrics.sas): Obs i j wsize wones wzero wrate sos iv woe ent css diffwrate sigdifrate diffwoe 1 1 654 724 537 187 0.7417 0.1387 0.0299 0.2076 0.6771 0.1259 . . . 2 655 782 133 87 46 0.6541 0.0301 0.0061 -0.2100 0.1404 0.0296 -0.0876 0.0869 -0.4176 3 783 862 82 47 35 0.5732 0.0201 0.0274 -0.5525 0.0916 0.0191 -0.0810 0.1342 -0.3425 4 863 923 61 29 32 0.4754 0.0152 0.0617 -0.9457 0.0691 0.0152 -0.0978 0.1648 -0.3932 Total 1,000 700 300 0.2041 0.1250 0.9782 0.1897 Note that the minimum bin size (wsize) is 61 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0810, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.3425. To measure the linearity, regressing WoE against Obs gives adjusted R**2 of 0.9980. The overall information value (iv) is 0.1250. It took much longer to run the optimization than to find the answer by exhaustive enumeration. Using brute force exhaustive enumeration, it is possible to find a set of four bins whose minimum event rate difference is 0.08573 and minimum WoE difference is 0.3623. Its adjusted R**2 is 0.9991, and its overall information value (iv) is 0.1222. I was able to modify my BLIP formulation to formulate a MILP to find the binning with the maximum possible minimum event rate difference, but the MILP ran out of memory and would not converge on this data. I also ran the BLIP where I still used the monotonicity constraints, but removed the minimum event rate difference specification and added in the significant difference constraints. If I removed the constraints on the number of bins, it ran out of memory and did not converge. I was able to get it to run to completion with an upper bound of seven bins, no upper or lower limit on the number of points per bin. The solution has three bins (attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_1-7-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure_.sas): Obs i j wsize wones wzero wrate sos iv woe ent css diffwrate sigdifrate diffwoe 1 1 668 739 550 189 0.7443 0.1407 0.0344 0.2209 0.6878 0.1278 . . . 2 669 848 186 116 70 0.6237 0.0437 0.0231 -0.3422 0.2017 0.0422 -0.1206 0.0764 -0.5631 3 849 923 75 34 41 0.4533 0.0186 0.0911 -1.0345 0.0846 0.0186 -0.1703 0.1324 -0.6923 Total 1,000 700 300 0.2029 0.1487 0.9741 0.1886 I ran many other experiments, but most of the time the optimization would not converge, even when the number of possible solutions was only a few million. When the optimization does converge, it takes a long time to do so. If you look at my optimization code and see some ways to improve it, please post! Thanks!
... View more