About Top_Katz

Top_Katz · 04-30-2019

Hi @ballardw ! Thank you for responding. My strata variable, binvar, has two levels. The original data set has 1,000 records, 300 with binvar=0 and 700 with binvar=1. Suppose I want to create three sample groups. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=3;
  STRATA binvar;
RUN;

then I get two groups with 100 binvar=0 and 233 binvar=1, and a third group with 100 binvar=0 and 234 binvar=1. It tries to make the groups of equal size, and it maintains the overall strata variable proportionality within each group. But suppose I want three groups:

group one with 500 total records, 150 binvar=0 and 350 binvar=1
group two with 300 total records, 90 binvar=0 and 210 binvar=1
group three with 200 total records, 60 binvar=0 and 140 binvar=1

Can I do that with one call to PROC SURVEYSELECT? Or do I have to do successive calls? For example, I could do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS1 GROUPS=2;
  STRATA binvar;
RUN;

PROC SURVEYSELECT DATA=OUTDS1 (WHERE=(GroupID=2)) OUT=OUTDS2 N=(90,210) OUTALL;
  STRATA binvar;
RUN;

DATA OUTDS;
  SET OUTDS1 (WHERE=(GroupID=1)) OUTDS2;
  GroupID=IFN((Selected=0),3,GroupID);
RUN;

This gives me the three groups I sought, but it's clunky. Is it possible to do all this with one PROC SURVEYSELECT call?
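For readers who want to see the target allocation spelled out, here is a minimal sketch in Python (not SAS, and not a PROC SURVEYSELECT feature) of a stratified partition into unequal groups. The function name stratified_partition, the record layout, and the 5:3:2 weights are illustrative assumptions; the idea is simply to shuffle within each stratum and cut each stratum at the same cumulative proportions.

```python
import random
from collections import defaultdict

def stratified_partition(records, stratum_of, weights, seed=0):
    """Split records into len(weights) groups whose sizes follow `weights`,
    keeping each stratum's share the same in every group (up to rounding)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    total = sum(weights)
    groups = [[] for _ in weights]
    for members in by_stratum.values():
        rng.shuffle(members)              # random order within the stratum
        n = len(members)
        start, cumulative = 0, 0
        for g, w in enumerate(weights):
            cumulative += w
            # Integer rounding of cumulative * n / total, so the cut points
            # always end exactly at n for the last group.
            end = (2 * cumulative * n + total) // (2 * total)
            groups[g].extend(members[start:end])
            start = end
    return groups

# Hypothetical data matching the post: 300 records with binvar=0, 700 with binvar=1.
records = [{"id": k, "binvar": 0 if k < 300 else 1} for k in range(1000)]
groups = stratified_partition(records, lambda r: r["binvar"], [5, 3, 2])
counts = [(sum(r["binvar"] == 0 for r in g), sum(r["binvar"] == 1 for r in g))
          for g in groups]
print(counts)
```

With these inputs the three groups hold 500, 300, and 200 records, split 150/350, 90/210, and 60/140 between the two strata.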

Top_Katz · 04-30-2019

Hi @Ksharp ! Thank you for responding. I don't think you're likely to see a trade-off like in your example WoE: -0.1, 0, 0.1, 0.8 OR -0.1, 0.2, 0.5, 0.8. Even if it were possible for the same data set to produce those two results (and I don't think it is possible, but I don't have a proof), the IV of the second binning is higher than the IV of the first binning, so there'd be no incentive to choose the first binning. But suppose you saw the following trade-off: WoE1: -0.6, -0.2, 0.2, 0.6 WoE2: -0.6, -0.4, 0.4, 0.6 The first one is linear, the second one isn't. But I think you'll find the second one is superior in every other aspect: higher IV, higher log-likelihood, higher chi-square, lower sum of squares, lower entropy, etc. This is the kind of trade-off I think Siddiqi was describing. The linearity is very appealing visually, but you sacrifice on the fit. And the visual non-linearity of the second binning has nothing to do with its GLM properties; it will provide a better logistic regression fit than the first binning.

Top_Katz · 04-30-2019

Thank you @Watts ! That does the trick. Thank you also to @Reeza and @PGStats for contributing. But the method of choosing one group and then doing OUTALL to get the rest to come along only works for two unequal sized stratified groups. I'm curious about whether there is a way to get more than two unequal sized groups while maintaining the strata proportions.

Top_Katz · 04-29-2019

Hi! I have a very simple sampling request. I want to randomly partition a dataset into two groups, stratified on a binary variable. The original data has 1000 observations, 300 with binvar=0, 700 with binvar=1. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=2;
  STRATA binvar;
RUN;

I get two groups, identified in the output data set by the variable GroupID. Each group has 500 observations, 150 with binvar=0 and 350 with binvar=1. Now, how do I get the groups to be of different sizes but still stratified? I want one group of size 600, with 180 binvar=0 and 420 binvar=1, and a second group of size 400, with 120 binvar=0 and 280 binvar=1. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=(600,400);
RUN;

I get the two right size groups, but not the exact binvar counts in each one. If I do:

PROC SURVEYSELECT DATA=INDS OUT=OUTDS GROUPS=(600,400);
  STRATA binvar;
RUN;

I get an error message:

ERROR: The sum of the GROUPS= values must equal the total number of units.
NOTE: The above message was for the following stratum: binvar=0.
ERROR: The sum of the GROUPS= values must equal the total number of units.
NOTE: The above message was for the following stratum: binvar=1.

Can PROC SURVEYSELECT do what I want? Thanks!

Top_Katz · 04-29-2019

Hi @Ksharp ! Maybe I can illustrate this another way. Not all quantitative relationships are monotonic. For example, in marketing, it is often found that likelihood to respond increases with the number of contacts up to a point, after which more contact is associated with lower likelihood to respond (too much contact actually annoys people, making them less likely to respond). Linear models require monotonic relationships between predictors and targets, so regression modelers transform their original predictors to create linear relationships; binning is one way to do that. So, in this marketing scenario, if you bin the number of contacts according to the log odds of response, you'll see the bin WoEs rise and then fall, so they won't have a linear appearance. But you can use the bins as a linear predictor in a logistic regression; the binning has linearized the originally non-linear relationship between the response target and the number-of-contacts predictor. So, you have the GLM linear relationship, but not a linear graph by bin sequence number. And the fact is, even for monotonic responses, the bins will give you a linear relationship between predictor and target, whatever monotonic shape the sequence of bin WoE values resembles.
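A small numeric illustration of the point above, written in Python rather than SAS. The bin counts are invented for the sketch (they are not from any data set in this thread): response odds rise and then fall across the contact bins, so the WoE sequence is not monotonic, yet each bin's log odds equals its WoE plus one constant, which is exactly a straight line of slope 1 in the WoE-transformed predictor.

```python
import math

# Hypothetical bins of "number of contacts": (responders, non-responders).
# The response rate rises through the third bin and then falls.
bins = [(20, 180), (60, 140), (90, 110), (50, 150), (15, 185)]

total_events = sum(e for e, _ in bins)
total_non_events = sum(ne for _, ne in bins)
base_log_odds = math.log(total_events / total_non_events)

# WoE of a bin = log odds of the bin minus log odds of the whole sample.
woe = [math.log(e / ne) - base_log_odds for e, ne in bins]
log_odds = [math.log(e / ne) for e, ne in bins]

# Deviation of each bin's log odds from the line: intercept + 1.0 * WoE.
residuals = [lo - (base_log_odds + w) for lo, w in zip(log_odds, woe)]

for k, (w, lo) in enumerate(zip(woe, log_odds), start=1):
    print(f"bin {k}: WoE = {w:+.4f}, log odds = {lo:+.4f}")
print("largest deviation from the line:", max(abs(r) for r in residuals))
```

Plotting WoE against bin number here gives a hump, not a line, while plotting log odds against WoE gives a perfect line by construction.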

Top_Katz · 04-29-2019

Hi @Ksharp ! No, what I'm saying is that the process of binning tries to create the linear relationship for the GLM assumptions, but that the appearance of linearity by bin sequence number is irrelevant to the GLM assumptions, which concern the predictor and the target, not the predictor and the bin sequence number. For a logistic regression where the only predictors are either the bin-transformed WoE variable, or the set of bin indicator functions, a better binning by IV is likely to give you a better fit; in fact, if instead of maximizing IV as your binning metric, you use entropy minimization, that's equivalent to logistic regression maximum likelihood estimation, so it certainly will give you a better fit (and in my experience, maximum IV bins are typically the same as minimum entropy bins, although I think they can disagree). Not surprisingly, multiple regression fits can be more complicated, and the best univariate binning may not be the best contributor to a multivariate model. (I'm ignoring the fact, for now, that many statisticians disapprove of using binned variables as regression covariates, mainly because they naturally tend to violate some GLM assumptions.)

Top_Katz · 04-29-2019

Hi @Ksharp ! Thank you for following up. It looks like Siddiqi encourages using grouping to find what he calls "logical relationships" which are basically linear trends in the sequence, and acknowledges that you may sacrifice information value to establish that relationship: "The process of arriving at a logical trend is one of trial and error, in which one balances the creation of logical trends while maintaining a sufficient IV value." I can see the attraction of using such trend groupings, as long as they don't detract from predictive power, because of their explanatory ability. But I still differ with you about whether a non-linear appearance violates the GLM assumption. The GLM assumption is about the relationship between the predictor and the (log odds of the) target; binning is designed specifically to try to establish that relationship, which might not exist between the target and the original predictor variable. The linear appearance, however, is a relationship between the predictor and the bin sequence number, and is not relevant to the GLM assumption.

Top_Katz · 04-28-2019

Hi @Ksharp ! Thank you for continuing the conversation. From your comments, "all these LINEAR model better have linear relation between Y and X," we are certainly in agreement there. And, of course, in the case of logistic regression, that linear relationship is specifically between the log odds of Y and the linear predictor X. For binning, the linear predictor X is either: the original variable transformed to the bin WoE values (as a single DoF), or the set of indicator functions of the individual bins (for multiple DoF). But that linear relationship is not the same as plotting the bin WoE values in sequential order and expecting the result to look like a line. Although you say you normally look for this linearity by eye, in your code you actually regressed bin sequence number against WoE value; I think that may be a spurious relationship. Does Siddiqi claim that the sequence of bin WoE values should look like a line? If so, where in the book is that claim?

Top_Katz · 04-26-2019

Hi @Ksharp ! Thank you for responding. I asked the linearity question, because I saw the following in your paper: (page 1) "3) For the continuous variable (e.g. age), WOE should be monotonous increase or decrease, better is linear." (page 2) "For the continuous variable, since its WOE must be monotonous increase or decrease, so I fit a linear regression model, take WOE as x variable, group number (1 2 3 4 …) as y variable" (page 4) "proc reg data=woe_&var; model group=woe/ cli clm ; quit;" Then I ran the same regression on the examples I generated. But are you saying that you normally don't run the regression, you just look for linearity by eye? Either way, what I'm really wondering is whether a "linear" solution, in which some of the bins would have WoE close to zero, is somehow preferable to a solution that does not look linear, but has all bins with either very positive or very negative WoE. It seems to me that bins with WoE near zero don't do any better than random guessing for separating events from non-events.

Top_Katz · 04-26-2019

Hi @Ksharp ! You appear to admire Siddiqi's book, and I must confess I have not read it, but there are two things you cite as having emanated from there that I would love to have explained to me. First, the idea that the bin WoEs should have a linear trend. I would think one would want to get maximum separation of events and non-events, which would be actualized by having bins with extreme values of WoE, positive and negative. Linear means you're going to have one or two bins with WoE around zero, which is the same characteristic as the entire population, and thus uninformative. Furthermore, even if you want linearity, why would you treat all the bins equally, as you do in your regression, when they have very different sizes? Why wouldn't you weight them by size? My second question is about the good / bad distribution 5% lower bound rule. I can understand having a lower bound on the size of each whole bin, you want to ensure a chosen bin is not just a random fluctuation. But, especially at the extreme ends, wouldn't you like to have a bin at one end that is heavily skewed to events, and a bin at the other end that leans mightily toward non-events? If the bins are large enough overall, why do you care that they have at least 5% of both categories? That's why I like the statistical significance condition, it pretty much bakes in the requirement of sufficient size without having to set it explicitly. I would love to get your thoughts, or anyone else's, about these issues. Thanks!

Top_Katz · 04-25-2019

Hi @Ksharp ! I also reworked the final, significant difference optimization to use your good/bad distribution criteria (attachment _125_gco102_fullstimer_optmodel_milpsolve_max_iv_clean_1-5-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure-naqi_modified_metrics_.sas). In this case I cut down from seven to five bin upper bound (so that it wouldn't take all day to run), but kept the minimum size at 60. Once again, it chooses a three bin solution:

Obs      i    j  wsize  wones  wzero   wrate     sos      iv      woe     ent     css  diffwrate  sigdifrate  diffwoe
1        1  668    739    550    189  0.7443  0.1407  0.0344   0.2209  0.6878  0.1278          .           .        .
2      669  845    183    114     69  0.6230  0.0430  0.0232  -0.3452  0.1985  0.0415    -0.1213      0.0769  -0.5661
3      846  923     78     36     42  0.4615  0.0194  0.0887  -1.0015  0.0881  0.0194    -0.1614      0.1310  -0.6562
Total            1,000    700    300          0.2030  0.1463           0.9745  0.1887

Now the smallest bin has 78 points, the event rate differences are at least 0.1213 and statistically significant at 95% confidence, the WoE differences are at least 0.5661, and the IV is 0.1463. While it's a bit silly to talk about linearity with only three points, the adjusted R-square is 0.9964. This seems like a decently robust set of bins. And I think it satisfies all your criteria. Thanks!
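As a cross-check, the woe and iv columns of the three bin solution above can be recomputed from the wones and wzero counts alone. This is a minimal Python sketch (not the SAS program in the attachment); the helper name woe_and_iv is made up for illustration, and the formulas are the WoE and Information Value definitions used elsewhere in this thread.

```python
import math

TOTAL_EVENTS, TOTAL_NON_EVENTS = 700, 300

# (wones, wzero) for the three bins in the table above.
bins = [(550, 189), (114, 69), (36, 42)]

def woe_and_iv(events, non_events):
    """Return (WoE, IV contribution) for one bin."""
    event_share = events / TOTAL_EVENTS
    non_event_share = non_events / TOTAL_NON_EVENTS
    woe = math.log(event_share / non_event_share)
    return woe, (event_share - non_event_share) * woe

rows = [woe_and_iv(e, ne) for e, ne in bins]
total_iv = sum(iv for _, iv in rows)

for k, (woe, iv) in enumerate(rows, start=1):
    print(f"bin {k}: woe = {woe:.4f}, iv = {iv:.4f}")
print(f"total iv = {total_iv:.4f}")
```

The printed values should agree with the table to four decimal places.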

Top_Katz · 04-25-2019

Hi @Ksharp ! Thank you for following up. I modified my optimization program to include your distribution criteria. Here is a four bin solution which meets those requirements:

Obs      i    j  wsize  wones  wzero   wrate     sos      iv      woe     ent     css  diffwrate  sigdifrate  diffwoe
1        1  650    720    533    187  0.7403  0.1384  0.0276   0.2001  0.6751  0.1256          .           .        .
2      651  771    125     83     42  0.6640  0.0279  0.0036  -0.1661  0.1306  0.0274    -0.0763      0.0888  -0.3662
3      772  831     63     37     26  0.5873  0.0153  0.0167  -0.4945  0.0699  0.0143    -0.0767      0.1471  -0.3284
4      832  923     92     47     45  0.5109  0.0230  0.0666  -0.8038  0.1044  0.0230    -0.0764      0.1588  -0.3093
Total            1,000    700    300          0.2046  0.1145           0.9800  0.1903

The smallest bin has 63 points, the smallest WoE difference is 0.3093, the smallest event rate difference is 0.0763, the information value is 0.1145, and the adjusted R-square is 0.9978. Thanks!

Top_Katz · 04-25-2019

Hi @Ksharp ! Thank you so much for responding. I enjoyed your paper, you took a very interesting approach, I think there's a lot of potential for GAs. Your solution is quite practical to fill your needs, and probably runs waaaaay faster than the BLIP / MILP formulations. But if you're interested in tinkering further with your algorithm, now you also know that there are four bin solutions for the German Credit data that are more linear, have greater differences between bin WoE values, and higher overall information value than what you found. There are only 130 million possible four bin solutions for the German Credit data, so if you continue to tune your approach, I'll bet you can improve it quickly, if you wish to do so. Good luck!
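The "130 million" figure can be reproduced with a one-line count, shown here in Python. It assumes the 923 distinct Credit_Amount values reported for this data set and ignores any bin size or monotonicity constraints: a four bin solution is a choice of 3 cut points among the 922 gaps between consecutive distinct values.

```python
import math

distinct_values = 923          # distinct Credit_Amount values in the data
gaps = distinct_values - 1     # possible cut positions between neighbors
n_four_bin_solutions = math.comb(gaps, 3)   # choose 3 cuts for 4 bins
print(f"{n_four_bin_solutions:,}")
```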

Top_Katz · 04-25-2019

Hmmmm, let's see if I can correct some of my own mistakes. Okay, for the last optimization I displayed, the one using the attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_1-7-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure_.sas, I claimed there was no upper or lower limit on the number of points per bin, but there is a lower bound of 60 points per bin. And in both of the optimization programs, it also requires that each bin have at least one event and at least one non-event; otherwise it would try to choose an all-event bin at the left end, and an all-non-event bin on the right side (although the 60 point minimum size should overcome that problem for this data set). I think the formulas I gave for information value (iv column in my tables) and mean sum of squares (sos column in my tables) are correct. I didn't include chi-square in the tables, and it looks like I got the formula wrong. I believe the correct chi-square formula is: m[i,j] = ((N*e[i,j]) – (E*n[i,j]))**2 / (n[i,j]*E*(N - E)) Also, I didn't mention this, but one way I like to do monotonic binning that works for pretty large size data sets without problems is to start with the isotonic regression, and then follow that by running Fisher's dynamic programming algorithm on the results to constrain the number and sizes of the bins. DP can't impose monotonicity, but once the data is already monotonic, like the results from the isotonic regression, it can't break that monotonicity. The final results aren't guaranteed optimal, but they often agree with the optimal solution under the desired conditions. If people are interested, I can post some results and discuss more. Thanks!
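To make the second stage of that two-step approach concrete, here is a minimal Python sketch of a Fisher-style dynamic program. It is an illustration under stated assumptions, not the code used for the results in this thread: the function name fisher_dp and the toy counts are invented, the input is taken to be already aggregated into ordered blocks (for example the output of an isotonic regression), and the objective is the within-bin sum of squares for a binary target, e*(n - e)/n per bin.

```python
def fisher_dp(counts, events, k, min_size=1):
    """Split ordered blocks into exactly k contiguous bins that minimize the
    total within-bin sum of squares. Assumes min_size >= 1.
    Returns (best_cost, [(start, end), ...]) with end exclusive."""
    INF = float("inf")
    m = len(counts)

    # Prefix sums so any bin's size and event count are O(1) lookups.
    cum_n, cum_e = [0], [0]
    for n, e in zip(counts, events):
        cum_n.append(cum_n[-1] + n)
        cum_e.append(cum_e[-1] + e)

    def cost(a, b):
        n = cum_n[b] - cum_n[a]
        e = cum_e[b] - cum_e[a]
        if n < min_size:
            return INF
        return e * (n - e) / n

    # best[g][b] = cheapest way to cover blocks 0..b-1 with g bins.
    best = [[INF] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for b in range(1, m + 1):
            for a in range(g - 1, b):
                candidate = best[g - 1][a] + cost(a, b)
                if candidate < best[g][b]:
                    best[g][b] = candidate
                    back[g][b] = a

    # Walk the back-pointers to recover the chosen bins.
    bins, b = [], m
    for g in range(k, 0, -1):
        a = back[g][b]
        bins.append((a, b))
        b = a
    return best[k][m], bins[::-1]

# Toy monotonically decreasing input: six blocks of 10 points each.
counts = [10, 10, 10, 10, 10, 10]
events = [9, 8, 5, 5, 2, 1]
best_cost, chosen = fisher_dp(counts, events, k=3)
print(best_cost, chosen)
```

Because every bin is a union of adjacent blocks, each bin's event rate is a weighted average of a contiguous run of block rates. If the block rates are already monotonic, the bin rates are monotonic too, which is why the dynamic program cannot break monotonicity here.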

Top_Katz · 04-24-2019

Yes, I am trying to use PROC OPTMODEL in SAS 9.4 on a Linux grid to do monotonic supervised optimal binning of an ordinal predictor variable with a binary target (although continuous targets can be used, too). I have an implementation of a (seemingly) correct formulation, but so far it is consistently outperformed by pure brute force exhaustive enumeration, so I thought this would be a good opportunity to appeal to the omniscience of the SAS® community hive mind to see whether any concrete improvements can be found.

*********************************************************************************

First some BACKGROUND: (Apologies for the length, you can skip this and go to MY FORMULATION below if you just want to get to the heart of the problem.)

For those unfamiliar with the term, binning of an interval variable entails partitioning its range into an exhaustive, disjoint, discrete collection of subintervals. For example, if the range of x is [0, 10], then one possible set of three bins would be x1 = [0, 2.8], x2 = (2.8, 6.3], x3 = (6.3, 10]. Binning is also referred to as bucketing, classing, discretizing, grouping, or partitioning.

The two most common forms of unsupervised binning are equal width and equal frequency (based on a data sample). An equal width example for the x variable above would be x1 = [0, 2.5], x2 = (2.5, 5], x3 = (5, 7.5], x4 = (7.5, 10]. There are other kinds of unsupervised binning methods, too. SAS/STAT has PROC HPBIN to do efficient unsupervised binning.

In supervised binning, the bins are chosen to magnify the relationship between the variable under consideration and a target variable. One of the very first such binning algorithms was described by Walter Fisher in the Journal of the American Statistical Association in 1958, “On Grouping for Maximum Homogeneity,” as an attempt to minimize the within group variances of an interval target.
Many more discretization algorithms have been devised; a useful summary can be found in the 2013 IEEE Transactions on Knowledge and Data Engineering article, “A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning.” Some of these techniques have been referred to as “optimal binning,” although few of them are optimal in the strict sense of guaranteeing the maximum or minimum value of a computable objective function. The Transform Variables node of SAS® Enterprise Miner™ has an “optimal binning transformation” that uses (non-optimal) decision trees.

Fisher used a truly optimal dynamic programming procedure that has been rediscovered and published in prestigious journals a handful of times over the years. A frequent contributor to this forum, @RobPratt, joined this exalted company of “Fisher propagators” in 2016 when he cobbled together a version of the dynamic programming binning scheme as a response to the communities.sas.com thread “Finding the optimal cut-off for each bucket.” (There are a couple of R packages with implementations of Fisher’s algorithm: classInt and cartography; also the Excel add-in xlstat.)

Simply put, the main principle of the dynamic programming algorithm is that if you have the best set of bins over the entire range, then the subset of bins over any sub-range is the best set of bins for that sub-range; otherwise, you could swap in a better set of bins for the sub-range and thereby improve the binning of the whole range. This allows the algorithm to build the set of bins like a proof by induction: if you have the best binnings of points 1,…,k for every k ≤ n, then you can construct the best binning for 1,…,n+1.

There is a constituency for binning in the credit scorecard modeling arena who have a particular requirement for the bins they produce: monotonicity of target response.
For binary event predictive models, this is usually defined in terms of the Weight of Evidence (WoE), which, within each bin, is just the log odds of the target variable for the bin minus the log odds of the entire data set, i.e.:

log(# bin events / # bin non-events) - log(# total events / # total non-events)

If the bins are numbered consecutively from left (smallest predictor value) to right (largest predictor value), then increasing (decreasing, resp.) monotonicity means that the WoE of bin j is ≤ (≥, resp.) the WoE of bin (j+1). Note that monotonicity of WoE and monotonicity of event rate are exactly equivalent whenever WoE is defined; event rate monotonicity also can include bins with all events or all non-events, although WoE will be undefined for such bins.

But if you require monotonicity, you can see that the dynamic programming scheme won’t work: if you obtain the best monotonic binning for points 1,…,n, there’s no guarantee you can extend it monotonically to (n+1) and beyond. Unfortunately, the long list of binning algorithms in the IEEE article, optimal or not, won’t help you out; there’s not even a passing mention of the monotonicity constraint. What to do?

If we just concentrate on achieving monotonicity, we can use isotonic regression, for which, if x[i] are the predictor values and y[i] are the response values, 1 ≤ i ≤ N, we attempt to find a transformation f(x[i]) that minimizes the sum from 1 to N of (y[i] - f(x[i]))**2, where f(x[i]) ≤ f(x[i+1]) (or f(x[i]) ≥ f(x[i+1]), resp.) for increasing (decreasing, resp.) monotonicity. This transformation is optimal in the least sum of squares sense by design, and there are reasonably efficient algorithms to compute it. If the response variable, y[i], is already correspondingly monotonic, then f(x[i]) = y[i], and the transformation is perfect replication. But for most binary response variables, this will not be the case.
In fact, as Wensui Liu has demonstrated in his wonderful blog on statistical computing, cited in the communities.sas.com thread “Optimal monotonic binning,” the isotonic transform will consist of piecewise constant subintervals, and in each subinterval the values of f(x[i]) will equal the average event rate over that subinterval; in other words, bins. So, isotonic regression produces optimal monotonic binning! The “catch” is that you have absolutely no control over the number of bins or their sizes. Can you acquire control over the number of bins and their sizes and still retain optimality and monotonicity?

Credit Scoring for SAS® Enterprise Miner™ is an add-on product aimed specifically at credit scorecard modelers, and its “Interactive Grouping” node has a method called “Constrained Optimized Binning” that appears inspired by isotonic regression, and is designed to attain monotonicity and optimality within additional user-specified bounds. (Note that this node is distinct from the general SAS EM “Interactive Binning” node that does non-monotonic, non-optimal binning with user-specified bounds.) But there’s still a catch (besides the additional cost). From the patent application description, the objective is to minimize the sum of absolute differences between individual WoE values and their associated decision variables, which are proxies for the bin WoE. This cannot be done directly with pointwise data, for which WoE is undefined; the data must be pre-aggregated into “fine-grained bins” to an extent that each “fine-grained bin” has at least one event and at least one non-event. Differences in pre-aggregation can affect the final optimality, but the patent description doesn’t include any details on the pre-aggregation part.
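The claim that isotonic regression yields bins whose fitted value is the bin's average event rate can be seen in a short pool-adjacent-violators sketch. This is written in Python as an illustration, not taken from the blog or from any SAS procedure; the function name pava_bins and the toy counts are assumptions, and the input is taken to be aggregated by distinct predictor value.

```python
def pava_bins(counts, events, increasing=True):
    """Pool adjacent violators on aggregated binary data.
    Returns a list of (n, e) blocks whose event rates e/n are monotonic;
    each block is one bin of the isotonic fit."""
    blocks = []
    for n, e in zip(counts, events):
        blocks.append((n, e))
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1:
            (n1, e1), (n2, e2) = blocks[-2], blocks[-1]
            # Compare e1/n1 with e2/n2 by cross-multiplying (no division).
            if increasing:
                violates = e1 * n2 > e2 * n1
            else:
                violates = e1 * n2 < e2 * n1
            if not violates:
                break
            blocks[-2:] = [(n1 + n2, e1 + e2)]
    return blocks

# Toy data: five distinct predictor values with 10 points each.
counts = [10, 10, 10, 10, 10]
events = [2, 5, 3, 8, 7]
fitted = pava_bins(counts, events, increasing=True)
print(fitted)                          # pooled (n, e) blocks
print([e / n for n, e in fitted])      # monotonic event rates per bin
```

Note that the number and sizes of the resulting blocks fall out of the data; nothing in the procedure lets you ask for, say, exactly four bins of at least 60 points.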
*********************************************************************************

MY WOEFUL BINNING FORMULATION:

This is a pure BLIP (Binary Linear Integer Program) formulation, unlike the SAS EM method, which is a full MILP (Mixed Integer Linear Program). In this formulation, every possible bin gets its own decision variable. Since bins are just sub-intervals, fully specified by their two endpoints, this means the number of variables (columns) is O(N**2), where N is the number of data points. This is a big practical disadvantage, although I think it should be less of a disadvantage than it has turned out to be in practice, but I would need to find a more clever formulation.

For the objective function, just compute the appropriate metric, m[i,j], for each bin (i,j), and take the sum over all (i,j) of m[i,j]*v[i,j], where v[i,j] is the corresponding binary decision variable for bin (i,j):

v[i,j] = 1 if (i,j) is chosen as one of the bins,
v[i,j] = 0 if (i,j) is not chosen as one of the bins

Some metric examples are:

Chi-square: m[i,j] = ((N*e[i,j]) - (E*n[i,j]))**2 / (N*e[i,j]*(n[i,j] - e[i,j]))
Information Value: m[i,j] = ((e[i,j] / E) - ((n[i,j] - e[i,j]) / (N - E)))*(log((e[i,j] / E) / ((n[i,j] - e[i,j]) / (N - E))))
Mean Sum of Squares: m[i,j] = (e[i,j]*(n[i,j] - e[i,j])) / (N*n[i,j])

where n[i,j] is the number of points in (i,j), e[i,j] is the number of events in (i,j), N is the total number of points, and E is the total number of events. Chi-square and Information Value should be maximized over the chosen bins, Mean Sum of Squares should be minimized over the chosen bins.

The constraints are:

1. Every point should be in exactly one bin. This can be expressed in different ways, but the number of such constraints is O(N).
2. Upper bound on number of bins, one constraint if desired. Sum of all v[i,j] ≤ upper bound.
3. Lower bound on number of bins, one constraint if desired. Sum of all v[i,j] ≥ lower bound.
4. Upper and / or lower bounds on number of points per bin do not require constraints, just eliminate the decision variables for bins that are outside of the bounds.
5. Event rate monotonicity. At any bin endpoint p, the sum over i of ((e[i,p]*v[i,p]) / n[i,p]) must be ≤ the sum over j of ((e[(p+1),j]*v[(p+1),j]) / n[(p+1),j]) (for increasing monotonicity, reverse for decreasing). You can also make the difference ≥ a positive constant to require neighboring bins to behave differently. The number of such constraints is O(N).

On top of the monotonicity constraints, I have experimented with more complex constraints to require event rate differences between neighboring bins to be statistically significant; I needed O(N**2) such constraints in my formulation.

I did some experiments using a well-known, publicly available “German Credit” data set with 1,000 points. The binary target is called Creditability, with 700 events and 300 non-events, and the continuous predictor is Credit_Amount, with 923 distinct values ranging from 250 to 18,424. The attached code to read in the data (_import_GRIDWORK.GC_CREDIT_AMOUNT_AGG01IC.sas) gives the results already aggregated to distinct predictor values (nv_predictor_value). The number of points for each distinct predictor value is nv_w_ind_all, and the number of events for each distinct predictor value is nv_w_ind_one.

I chose this data set because there was a paper about binning at SGF2018, called “Get Better Weight of Evidence for Scorecards Using a Genetic Algorithm.” Their criteria for good binning are a bit vague; they think the difference of WoE between groups should be as large as possible and the WoE values should look linear. Here is what they came up with, four monotonically decreasing bins:

Obs      i    j  wsize  wones  wzero   wrate     sos      iv      woe     ent     css  diffwrate  sigdifrate  diffwoe
1        1  631    699    513    186  0.7339  0.1365  0.0189   0.1672  0.6629  0.1237          .           .        .
2      632  764    139     96     43  0.6907  0.0297  0.0003  -0.0442  0.1408  0.0292    -0.0433      0.0835  -0.2114
3      765  821     60     37     23  0.6167  0.0142  0.0089  -0.3719  0.0654  0.0132    -0.0740      0.1451  -0.3277
4      822  923    102     54     48  0.5294  0.0254  0.0604  -0.7295  0.1155  0.0254    -0.0873      0.1566  -0.3576
Total            1,000    700    300          0.2058  0.0884           0.9845  0.1915

Note that the minimum bin size (wsize) is 60 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0433, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.2114. To measure the linearity, the authors regressed WoE against Obs and found adjusted R**2 of 0.9814. The overall information value (iv) is 0.0884.

Using the BLIP formulation with the objective of maximizing information value (iv), specifying four bins, with a minimum bin size of 60 and a minimum absolute event rate difference between adjacent bins of 0.08, I found the following (attachment _122_gco099_fullstimer_optmodel_milpsolve_max_iv_clean_4-g_60-minsz_no-maxsz_0.08-mindif_desc_impure_modified_metrics.sas):

Obs      i    j  wsize  wones  wzero   wrate     sos      iv      woe     ent     css  diffwrate  sigdifrate  diffwoe
1        1  654    724    537    187  0.7417  0.1387  0.0299   0.2076  0.6771  0.1259          .           .        .
2      655  782    133     87     46  0.6541  0.0301  0.0061  -0.2100  0.1404  0.0296    -0.0876      0.0869  -0.4176
3      783  862     82     47     35  0.5732  0.0201  0.0274  -0.5525  0.0916  0.0191    -0.0810      0.1342  -0.3425
4      863  923     61     29     32  0.4754  0.0152  0.0617  -0.9457  0.0691  0.0152    -0.0978      0.1648  -0.3932
Total            1,000    700    300          0.2041  0.1250           0.9782  0.1897

Note that the minimum bin size (wsize) is 61 points, the smallest absolute event rate difference (diffwrate) between adjacent bins is 0.0810, and the smallest absolute WoE difference (diffwoe) between adjacent bins is 0.3425. To measure the linearity, regressing WoE against Obs gives adjusted R**2 of 0.9980. The overall information value (iv) is 0.1250. It took much longer to run the optimization than to find the answer by exhaustive enumeration.

Using brute force exhaustive enumeration, it is possible to find a set of four bins whose minimum event rate difference is 0.08573 and minimum WoE difference is 0.3623. Its adjusted R**2 is 0.9991, and its overall information value (iv) is 0.1222. I was able to modify my BLIP formulation to formulate a MILP to find the binning with the maximum possible minimum event rate difference, but the MILP ran out of memory and would not converge on this data.

I also ran the BLIP where I still used the monotonicity constraints, but removed the minimum event rate difference specification and added in the significant difference constraints. If I removed the constraints on the number of bins, it ran out of memory and did not converge. I was able to get it to run to completion with an upper bound of seven bins, no upper or lower limit on the number of points per bin. The solution has three bins (attachment _123_gco100_fullstimer_optmodel_milpsolve_max_iv_clean_1-7-g_60-minsz_no-maxsz_0-mindif_sigrdif_desc_impure_.sas):

Obs      i    j  wsize  wones  wzero   wrate     sos      iv      woe     ent     css  diffwrate  sigdifrate  diffwoe
1        1  668    739    550    189  0.7443  0.1407  0.0344   0.2209  0.6878  0.1278          .           .        .
2      669  848    186    116     70  0.6237  0.0437  0.0231  -0.3422  0.2017  0.0422    -0.1206      0.0764  -0.5631
3      849  923     75     34     41  0.4533  0.0186  0.0911  -1.0345  0.0846  0.0186    -0.1703      0.1324  -0.6923
Total            1,000    700    300          0.2029  0.1487           0.9741  0.1886

I ran many other experiments, but most of the time the optimization would not converge, even when the number of possible solutions was only a few million. When the optimization does converge, it takes a long time to do so. If you look at my optimization code and see some ways to improve it, please post! Thanks!
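For comparison with the optimization approach, brute force exhaustive enumeration can be sketched in a few lines of Python. This is an illustration on invented toy data, not the enumeration program used for the results above; the function name best_monotone_binning is hypothetical. It tries every placement of k-1 cut points, discards binnings that break the size, purity, or decreasing event rate conditions, and keeps the one with the highest Information Value.

```python
import math
from itertools import combinations

def best_monotone_binning(counts, events, k, min_size=1):
    """Enumerate all k-bin partitions of the ordered blocks and return
    (best_iv, cuts) for the monotonically decreasing binning of maximum IV."""
    m = len(counts)
    total_n, total_e = sum(counts), sum(events)
    total_ne = total_n - total_e
    best_iv, best_cuts = float("-inf"), None

    for cuts in combinations(range(1, m), k - 1):
        edges = (0, *cuts, m)
        bins = [(sum(counts[a:b]), sum(events[a:b]))
                for a, b in zip(edges, edges[1:])]
        # Skip bins that are too small or pure (WoE undefined for pure bins).
        if any(n < min_size or e == 0 or e == n for n, e in bins):
            continue
        # Require non-increasing event rates from left to right.
        rates = [e / n for n, e in bins]
        if any(r1 < r2 for r1, r2 in zip(rates, rates[1:])):
            continue
        iv = sum((e / total_e - (n - e) / total_ne)
                 * math.log((e / total_e) / ((n - e) / total_ne))
                 for n, e in bins)
        if iv > best_iv:
            best_iv, best_cuts = iv, cuts
    return best_iv, best_cuts

# Toy input: six distinct predictor values, 10 points each.
counts = [10, 10, 10, 10, 10, 10]
events = [9, 8, 5, 5, 2, 1]
best_iv, best_cuts = best_monotone_binning(counts, events, k=3)
print(best_cuts, round(best_iv, 4))
```

The work grows as the number of ways to choose k-1 cuts, which is why this is only practical for a small number of bins, but each candidate is cheap to evaluate and the loop needs almost no memory.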
