Hi Miguel! Thank you for continuing the conversation. I created this test data set to compare different implementations of "optimal binning," and I designed it specifically to make it tricky to find the best four bins. In practice, if you're looking in detail at only a few predictors, you wouldn't restrict yourself so severely, and if there were six perfect bins, as in this case, you'd find them. But when you're dealing with hundreds or thousands of predictors, as you might in a typical data mining exercise, it's not at all uncommon to settle for something like the best four or five bins, and you'd like to feel confident that your algorithm can locate them.

Right now I feel a bit queasy about Enterprise Miner. If I allowed the Transform Variables node to search for 25 bins with my example data set, it found the six perfect ones. But if I limited it to 17 bins, it found only four of the uniform bins and split the two smallest ones across three pieces, one of which had only 92 points, less than 1% of the data. A bin that small could be significant, or it could be a blip. In my opinion, that's another weakness of the Transform Variables node: you can't specify a minimum bin size.

The R package smbinning has the opposite constraint: you can specify a minimum bin size but not a maximum number of bins. Incidentally, the smbinning function has a default minimum bin size of 5%, which is why it finds five bins for this data set with its default setting. If you submit

bt3_xg.t3 <- smbinning(bt3_xg.data, y="bt3_xg", x="ipxg", p=0.04)

it will find the six uniform bins (because the two smallest bins each hold 4% of the data).

About three years ago Ivan Oliveira of the Enterprise Miner development team described to me some improvements in optimal binning that were being added to EM 7.1. It strikes me now that he was talking only about the Interactive Grouping node in the Credit Scoring application, but at the time I thought he was referring to the Transform Variables node. I wish SAS would bring the optimal binning in the Transform Variables node up to speed with the Interactive Grouping node. Okay, I promise to stop ranting now (or at least in the not too distant future).

Chi-square is a useful measure of dependence/association that goes back to Karl Pearson and the early days of statistics in the late nineteenth and early twentieth centuries. One of its best features is that its distribution is well understood, so you can generate p-values and perform significance tests on your results. I don't know whether anyone has ever bothered to work that out for the distributions of Gini or entropy values. The original automated decision tree algorithms eventually coalesced (about forty years ago?) into CHAID, which uses chi-square as the measure of association in its splitting rule.

One useful property of chi-square is the way it scales fractally: if you break a big group into k identical subgroups, the sum of the subgroup chi-squares equals the big group's chi-square. That's actually a bit of a drawback for optimal binning, though, because if a bin can split into two identical sub-bins, you'd rather just keep the original large bin, and chi-square gives you no incentive to do so. Most of the association measures I've seen (Gini, entropy, information value, weight of evidence, within-group sum of squares, ...) have the property that they improve with increased granularity, so if you removed all restrictions on the number or size of bins, they'd give you as many bins as data points; even with that drawback, chi-square will at least favor some lumpiness over complete granularity. And if you remove the chi-square denominators, so that you take the sum of squared differences between the actual and expected number of hits in each group, you naturally seek out larger bins.

Thanks!
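P.S. Here's a quick numeric sketch of that last point in R, using made-up counts (one bin of 1,000 points containing 300 hits, an overall hit rate of 0.2, split into k = 4 identical sub-bins); the helper functions are just mine for illustration, not anything from smbinning:

chisq_bin <- function(n, h, p) {
  # chi-square contribution of one bin: n points, h hits, overall hit rate p
  exp_hit  <- n * p
  exp_miss <- n * (1 - p)
  (h - exp_hit)^2 / exp_hit + ((n - h) - exp_miss)^2 / exp_miss
}

n <- 1000; h <- 300; p <- 0.2; k <- 4

whole <- chisq_bin(n, h, p)                # the big bin
parts <- k * chisq_bin(n / k, h / k, p)    # sum over k identical sub-bins
c(whole = whole, parts = parts)            # equal, so splitting earns nothing
pchisq(whole, df = 1, lower.tail = FALSE)  # and the known distribution gives a p-value

# Drop the denominators (plain squared differences from expected) and the
# identical split is now strictly worse, so larger bins win out:
ssq_bin <- function(n, h, p) (h - n * p)^2
c(whole = ssq_bin(n, h, p), parts = k * ssq_bin(n / k, h / k, p))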