TMKAIG1
Fluorite | Level 6

Hi!

I was playing around with the optimal binning transformation in the Transform Variables node of Enterprise Miner 13.1.  I created a 10,000 point data set, with an interval-valued predictor variable, the sequence 1 to 10000, and a binary target variable that has the following pattern (repeat each entry 400 times):

0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1

I wanted to create four bins.  You can pick them out by eyeball:

0 0 0 0 0  (1 - 2000 in the predictor)

1 1 1 1 1  (2001 - 4000 in the predictor)

0 0 0 0 0 0 0 (4001 - 6800 in the predictor)

1 0 1 1 1 1 1 1  (6801 - 10000 in the predictor)

Well, there's some ambiguity in those last two bins, maybe a better choice would be:

0 0 0 0 0 0 0 1 0  (4001 - 7600 in the predictor)

1 1 1 1 1 1  (7601 - 10000 in the predictor)

It turns out the first choice was better by the two most popular metrics, minimum gini or maximum chi-square.  R has a new binning package called smbinning, which uses conditional inference trees, and comes up with the second choice in this example.

So, which one did Enterprise Miner come up with?  Neither, actually.  It chose the following four bins in the predictor:

1 - 2000

2001 - 2693

2694 - 7600

7601 - 10000

which is markedly inferior in either chi-square or gini.  So, in what sense is it optimal?
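As a rough check of the chi-square comparison, here is a minimal Python sketch that rebuilds the data set from the pattern above and computes the Pearson chi-square of the bin-by-target table for each four-bin candidate. The helper name `chi_square` and the convention of describing a binning by its inclusive upper bounds are assumptions made for illustration; this is not what Enterprise Miner runs internally.

```python
# Pearson chi-square of the (bin x target) table for candidate binnings.
# Data: the 25-entry pattern from the post, each entry repeated 400 times.
PATTERN = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
target = [t for t in PATTERN for _ in range(400)]  # predictor value i + 1 -> target[i]


def chi_square(upper_bounds):
    """Chi-square for bins given by inclusive upper bounds (hypothetical helper)."""
    p = sum(target) / len(target)      # overall event rate
    stat, lo = 0.0, 0
    for hi in upper_bounds:
        n = hi - lo                    # bin size
        events = sum(target[lo:hi])    # observed events in the bin
        # One term for the event cell and one for the non-event cell.
        for observed, expected in ((events, n * p), (n - events, n * (1 - p))):
            stat += (observed - expected) ** 2 / expected
        lo = hi
    return stat


candidates = {
    "eyeball, first choice": [2000, 4000, 6800, 10000],
    "eyeball, second choice": [2000, 4000, 7600, 10000],
    "Enterprise Miner": [2000, 2693, 7600, 10000],
}
for name, bounds in candidates.items():
    print(f"{name:24s} chi-square = {chi_square(bounds):8.1f}")
```

Under these assumptions the two eyeball choices come out close together (roughly 8598 and 8576), while the Enterprise Miner bins score far lower (roughly 5540).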

Enterprise Miner's standard documentation doesn't say what metric it uses, and doesn't offer any tuning options for its decision tree.

EM has a bunch of hidden features, some of which are privately documented and can be wheedled occasionally from sympathetic technical support staff.  Are there any hidden features for optimal binning, or is this the best you can get?

Thanks!

Accepted Solutions
M_Maldonado
Barite | Level 11

TMKAIG,

You just made me miss my old job at credit scoring--just a little bit. I had that feeling of "I hate to love when computers beat me!".

I ran the default IGN, which came up with a different grouping. I was expecting it to be a little worse than our eyeball, because c'mon proc eyeball rocks...

Then I manually added the alternative groupings of eyeball and smbinning, because it only takes a minute to do it on the interactive menu.

Comparison results: IGN wins, eyeball is a close second, smbinning last.

Here is a table with the results, and screenshots as an appendix. I hope another set of eyes can make sure everything is OK... I have embarrassed myself more than once on the SAS communities.

Metric             Interactive Grouping Node (default)   Eyeball   SMBinning
Information Value  14.201                                12.279    11.991
Gini               97.756                                95.513    94.872
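The Gini row can be checked independently. The sketch below assumes Gini means 100 * (2 * AUC - 1), with every observation scored by its bin's event rate and ties counted as half; the node's actual formula isn't stated in the thread, but this definition gives the three figures in the table. The helper name `gini` is hypothetical. The Information Value row is not attempted, since pure bins make an unadjusted weight of evidence infinite and the adjustment used isn't given.

```python
from collections import defaultdict
from fractions import Fraction

# Data: the 25-entry pattern from the original post, each entry repeated 400 times.
PATTERN = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
target = [t for t in PATTERN for _ in range(400)]  # predictor value i + 1 -> target[i]


def gini(upper_bounds):
    """100 * (2 * AUC - 1) for bins given by inclusive upper bounds (assumed definition)."""
    by_rate = defaultdict(lambda: [0, 0])  # event rate -> [events, non-events]
    lo = 0
    for hi in upper_bounds:
        events = sum(target[lo:hi])
        rate = Fraction(events, hi - lo)   # exact, so equal rates merge cleanly
        by_rate[rate][0] += events
        by_rate[rate][1] += (hi - lo) - events
        lo = hi
    concordant, nonevents_below = 0.0, 0
    for rate in sorted(by_rate):
        events, nonevents = by_rate[rate]
        # Events outrank non-events at lower rates; same-rate pairs count half.
        concordant += events * (nonevents_below + nonevents / 2)
        nonevents_below += nonevents
    total_events = sum(target)
    auc = concordant / (total_events * (len(target) - total_events))
    return 100 * (2 * auc - 1)


print(f"IGN default: {gini([2000, 4000, 6500, 8000, 10000]):.3f}")  # 97.756
print(f"Eyeball:     {gini([2000, 4000, 6800, 10000]):.3f}")        # 95.513
print(f"smbinning:   {gini([2000, 4000, 7600, 10000]):.3f}")        # 94.872
```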

TMKAIG, last favor. You mentioned Chi-square. I never really used it, I always used IV. Do you recommend it, yes, no, why? Also if you have data sets that I can play with, please send them my way. I frequently beat the default IGN, but sometimes it is really useful, so I can never make up my mind whether to run my preferred settings or the default first. I will try that when I have some free time.

Thanks for the interesting exercise, and I hope this helps!

-Miguel

Appendix 1: SAS Credit Scoring for EM - Interactive Grouping Node results

0-2000

2001-4000

4001-6500

6501-8000

8001-10000

Appendix 2: Screenshots (images not reproduced here)

IGN default run (ign_default.png)

IGN vs eyeball (ign_default vs eyeball.png)

IGN vs smbinning (ign_default vs smbinning.png)


7 REPLIES 7
M_Maldonado
Barite | Level 11

Hi TMKAIG1,

Do you mind if I borrow the SAS code you used to generate that data set?

I am really interested in optimal binning, and I use the Interactive Grouping Node from Credit Scoring in Enterprise Miner a lot.

Really curious to see the binning solution from the default IG node.

Please share that data set (or code). Really interested in seeing some comparisons vs smbinning...

Thanks!

-Miguel

gergely_batho
SAS Employee

OP uses the Transform Variables node!

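data result;
  /* 25-entry target pattern; each entry is repeated 400 times */
  array temp [25] _temporary_ (0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1);
  do k=1 to dim(temp);
    do i=1 to 400;
      x+1;             /* interval predictor: 1 to 10000 */
      target=temp[k];  /* index the array by the outer loop counter */
      output;
    end;
  end;
run;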

TMKAIG1
Fluorite | Level 6

Hi Miguel!

Thank you for responding.  Gergely's code might be slightly more efficient (it definitely would be if he'd put the target assignment before the start of the inner loop), but my code to create the SAS data set is similar:

%let pattern  = %str(0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1) ;
%let pattern_length = %length(&pattern.) ;
%let pattern_count = %sysfunc(countw(&pattern.)) ;
%let expansion = 400 ; %* number of data points per pattern point, data size = pattern_count*expansion ;
%let number_of_points = %sysevalf(&pattern_count. * &expansion.) ;

data staging.gini_test_input ;
keep interval_predictor binary_target ;
length pattern $&pattern_length.. ;
pattern = "&pattern." ;
put pattern= ;
do paix = 1 to &pattern_count. ;
  binary_target = input(scan(pattern, paix), best32.) ;
  do expix = 1 to &expansion. ;
   interval_predictor + 1 ;
   output ;
  end ;
end ;
run ;

I'm also curious to see how the Interactive Grouping Node will partition this example.  We don't license the Credit Scoring Application, so I can't test it directly.

The R code for using smbinning is:

library(smbinning)
bt3p<-as.integer(c(rep(0,5),rep(1,5),rep(0,7),1,0,rep(1,6)))
bt3_xg<-rep(bt3p,each=400)
ipxg<-as.double(c(1:length(bt3_xg)))
bt3_xg.data<-data.frame(bt3_xg,ipxg)
bt3_xg.t1<-smbinning(bt3_xg.data,y="bt3_xg",x="ipxg",p=0.2)

Thanks!

TMKAIG1
Fluorite | Level 6

Hi Miguel!

Thank you for running the Interactive Grouping node and reporting back the results.  For your results you got five bins.  I had been restricting my search to four bins.  If I let the Enterprise Miner Transform Variables node try for five bins, it comes up with:

1 - 2000

2001 - 5117

5118 - 6800

6801 - 7600

7601 - 10000

Interestingly enough, it's not a refinement of its four bin solution, but at least it has a higher Gini: 86.829 > 78.115.  But it's not nearly as good as the Interactive Grouping solution -- are you paying attention, Enterprise Miner development team?

But you know what?  Here's the "proc eyeball" five bin solution:

1 - 2000

2001 - 4000

4001 - 6800

6801 - 7600

7601 - 10000

and the Gini is... 99.358

So I'm not ready to give up on "proc eyeball" just yet.

By the way, if you run smbinning as in the code I posted previously, but drop the p=0.2 restriction, you'll get the same five bin solution as "proc eyeball."

Thanks!

M_Maldonado
Barite | Level 11

TMKAIG,

I just confirmed that proc arbor, EM's first decision tree procedure, runs behind the scenes for the optimal binning in the Transform node. The option "Number of bins" should probably be renamed to "Maximum number of bins" because it controls the property MaxBranch (maximum number of branches) for that decision tree.

The good news is that if you set this property to a large number, e.g. 50, the Transform node comes back with 6 bins that beat eyeball, with a Gini of 100.

I still hope that you get a chance to use Credit Scoring in the near future. It has a lot more flexibility, including visual guidance and manual overrides. It also has extra steps that help IGN come up with better binnings than the Transform node.

When you have a chance, please comment on Chisquare. I am still wondering why smbinning considers it important and if we should consider including it in the near future.

Thanks,

Miguel

TMKAIG1
Fluorite | Level 6

Hi Miguel!

Thank you for continuing the conversation.  I created this test data set to compare against different implementations of "optimal binning."  I designed it specifically to be tricky to find the best four bins.  In practice, if you're looking in detail at only a few predictors, you wouldn't restrict yourself so severely, and if there were six perfect bins, as in this case, you'd find them.  But when you're dealing with hundreds or thousands of predictors as you might in a typical data mining exercise, then it's not at all uncommon to go with something like the best four or five bins, and you'd like to feel confident that your algorithm can locate them.

Right now I feel a bit queasy about Enterprise Miner.  If I allowed the Transform Variables node to search for 25 bins with my example data set, it found the six perfect ones.  But if I limited it to 17 bins, it only found four of the uniform bins, and it split the two smallest ones into three pieces, one of which had only 92 points, less than 1% of the data.  A bin that small could be significant, or it could be a blip.  In my opinion, that's another weakness of the Transform Variables node: you can't specify a minimum bin size.

The R package, smbinning, has the opposite constraint: you can specify a minimum bin size, but not a maximum bin number.  By the way, the smbinning function has a default minimum bin size of 5%, which is why it finds five bins for this data set with its default setting.  If you submit:

bt3_xg.t3<-smbinning(bt3_xg.data,y="bt3_xg",x="ipxg",p=0.04)

it will find the six uniform bins (because the two smallest bins each have 4% of the data).
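To see where the six uniform bins and the 4% figure come from, here is a small Python sketch that run-length encodes the target pattern and scales each run by the expansion factor of 400. The variable names are illustrative; it only describes the test data and does not reproduce what smbinning does internally.

```python
from itertools import groupby

# The 25-entry pattern from the original post; each entry covers 400 points.
PATTERN = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
EXPANSION = 400
total = len(PATTERN) * EXPANSION

runs = []   # (first predictor value, last predictor value, target, share of data)
start = 1
for value, group in groupby(PATTERN):
    size = len(list(group)) * EXPANSION
    runs.append((start, start + size - 1, value, size / total))
    start += size

for lo, hi, value, share in runs:
    print(f"{lo:5d} - {hi:5d}  target={value}  share={share:.0%}")
```

The two single-entry runs (6801 - 7200 and 7201 - 7600) each hold 400 of the 10,000 points, so a 5% minimum bin size cannot keep them as separate bins while a 4% minimum can.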

About three years ago Ivan Oliveira in the Enterprise Miner development team described to me some improvements in optimal binning that were being added to EM 7.1.  It strikes me now that he was talking only about the Interactive Grouping node in the Credit Scoring application, but at the time I thought he was referring to the Transform Variables node.  I wish SAS would get the optimal binning in the Transform Variables node up to speed with the Interactive Grouping node.  Okay, I promise to stop ranting now (or at least in the not too distant future).

Chi-square is a useful measure of dependence / association that goes back to Karl Pearson and the early days of the science of Statistics in the late nineteenth / early twentieth centuries.  One of the best things about it is that its distribution is well-understood, so you can generate p-values and perform significance tests based on your results.  I don't know whether anyone has ever bothered to do that for Gini or entropy value distributions.  The original automated decision tree algorithms eventually coalesced (about forty years ago?) into CHAID, which uses chi-square as its splitting rule measure of association.  One useful property of chi-square is the way it scales fractally -- if you break up a big group into k identical subgroups, then the sum of the subgroup chi-squares equals the big group chi-square.  But that's actually a bit of a drawback for optimal binning, because if a bin can split into two identical sub-bins, you'd rather just keep the original large bin, and this gives you no incentive to do so.  Most of the association measures I've seen: Gini, entropy, information value, weight of evidence, within group sum of squares,... all have the property that they improve with increased granularity, so that if you removed all restrictions on the number or size of bins, they'd give you as many bins as data points; even with the drawback described above, chi-square will favor some lumpiness over complete granularity.  And if you remove the chi-square denominators, so that you take the sum of the squared differences between the actual and expected number of hits in each group, you naturally seek out larger bins.
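The splitting behaviour described above can be illustrated with a short Python sketch. It takes the four "eyeball" bins from the start of the thread, splits the first bin (1 - 2000, all zeros) into two identical halves, and compares the chi-square with the denominator-free sum of squared differences between actual and expected hits. The helper `chi_square_terms` and the (size, events) representation are assumptions made for this example.

```python
def chi_square_terms(bins, p):
    """bins: list of (size, events); p: overall event rate.

    Returns (Pearson chi-square, sum of squared event deviations without denominators).
    """
    chi2, raw = 0.0, 0.0
    for n, events in bins:
        dev = events - n * p                      # actual minus expected events
        # The event cell and the non-event cell share the same squared deviation.
        chi2 += dev ** 2 / (n * p) + dev ** 2 / (n * (1 - p))
        raw += dev ** 2
    return chi2, raw


p = 0.48  # 4800 events out of 10000 points in the test data set

whole = [(2000, 0), (2000, 2000), (2800, 0), (3200, 2800)]
split = [(1000, 0), (1000, 0), (2000, 2000), (2800, 0), (3200, 2800)]

chi2_whole, raw_whole = chi_square_terms(whole, p)
chi2_split, raw_split = chi_square_terms(split, p)

print(f"chi-square:   whole={chi2_whole:.2f}  split={chi2_split:.2f}")
print(f"no-denominator sum:  whole={raw_whole:.0f}  split={raw_split:.0f}")
```

The chi-square is the same for both binnings, so it gives no reason to prefer the unsplit bin, while the denominator-free sum is larger for the unsplit version and so favors the larger bin.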

Thanks!
