Solved: Re: Binning and Pre-Binning in Interactive Grouping

ronya · Posted 08-04-2020 08:03 AM

Hello all, I'm exploring the use of interactive grouping in SAS Enterprise Miner as a method to bin the values of an interval characteristic, and I'd like to ask about the pre-binning process. Why do we need to use the quantile or bucket method to pre-bin the interval variable values rather than apply tree-based binning to the interval values directly?

Ksharp · Posted 08-04-2020 08:59 AM

The quantile and bucket methods are simple and easy to use.
If you are building a credit scorecard, tree-based binning can't guarantee that the WOE (weight of evidence) is monotonic across the bins.
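To make the monotonicity point concrete, here is a minimal sketch of a WOE check. It assumes WOE is computed as ln(share of goods / share of bads) per bin, which is one common convention (some tools flip the sign). The helper names `woe_per_bin` and `is_monotonic` and the counts are made up for illustration; they are not part of Enterprise Miner.

```python
import math


def woe_per_bin(goods, bads):
    """WOE per bin as ln(share of goods / share of bads).

    Assumes every bin has at least one good and one bad; a zero count
    would need smoothing before taking the log.
    """
    total_good, total_bad = sum(goods), sum(bads)
    return [
        math.log((g / total_good) / (b / total_bad))
        for g, b in zip(goods, bads)
    ]


def is_monotonic(seq):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(seq, seq[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Hypothetical counts for four ordered bins of one characteristic.
goods = [80, 90, 85, 95]
bads = [20, 10, 15, 5]
woe = woe_per_bin(goods, bads)

# Merging the two middle bins is one manual way to restore monotonicity.
goods_merged = [80, 90 + 85, 95]
bads_merged = [20, 10 + 15, 5]
woe_merged = woe_per_bin(goods_merged, bads_merged)

print(is_monotonic(woe), is_monotonic(woe_merged))
```

With these toy counts the four-bin WOE goes up, down, then up again, while the three-bin version rises steadily, which is the kind of manual regrouping a scorecard builder often ends up doing.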

ronya · Posted 08-05-2020 03:52 AM

Thanks Ksharp. If I understand correctly we can use either quantile, bucket OR tree method for binning? Is that correct?

The documentation states that quantile/bucket binning is a pre-bin stage before a Tree based method can be applied:

"The Interactive Grouping node first performs binning on the interval characteristic. You can choose between two binning methods: quantile and bucket. The quantile method generates groups. The groups are formed by ranked quantities with approximately the same frequency in each group. The bucket method generates groups by dividing the data into evenly spaced intervals that are based on the difference between the maximum and minimum values.
After the interval variables have been pre-binned, a decision tree model is fitted for each characteristic."
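The difference between the two pre-binning methods in the quoted passage can be sketched in a few lines. This is only an illustration of the idea, not the node's actual implementation; the function names and the toy data are assumptions.

```python
def bucket_edges(values, n_bins):
    """Equal-width cut points between the minimum and maximum ('bucket')."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Cut points chosen so each group holds roughly the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]


def bin_counts(values, edges):
    """Count rows per bin; a value belongs to the bin numbered by how many edges it reaches."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= edge for edge in edges)] += 1
    return counts


# Skewed toy data: many small values and a couple of large ones.
values = [1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 20, 100]

bucket_counts = bin_counts(values, bucket_edges(values, 4))
quantile_counts = bin_counts(values, quantile_edges(values, 4))
print(bucket_counts)    # [11, 0, 0, 1]
print(quantile_counts)  # [3, 3, 3, 3]
```

On skewed data the equal-width buckets leave most rows in a single group and some groups empty, whereas the quantile groups stay balanced.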

So is tree binning a sequential process that starts with quantile/bucket pre-binning, or can we use quantile, bucket and tree as alternative binning methods?

Ksharp · Posted 08-05-2020 06:57 AM

I think quantile, bucket and tree are just three binning methods; you can use any one of them.
Some people prefer tree, others prefer quantile.

You could first create many groups (say 20) with the quantile or bucket method, and then repeatedly merge two groups into one so that chi-square or Gini is maximized, and so on. I think that is what the tree method does.
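One way to read the merging idea above is as a bottom-up, ChiMerge-style procedure: keep merging the pair of neighbouring groups that differ least by chi-square until a target number of groups remains. The sketch below shows that reading only. It is not the top-down decision tree the Interactive Grouping node fits, and the function names and counts are hypothetical.

```python
def chi_square(bin_a, bin_b):
    """Pearson chi-square for the 2x2 table formed by two (goods, bads) bins."""
    rows = [bin_a, bin_b]
    total = sum(sum(row) for row in rows)
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_adjacent(bins, target):
    """Merge the most similar neighbouring pair until `target` bins remain."""
    bins = [tuple(b) for b in bins]
    while len(bins) > target:
        scores = [chi_square(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))  # least different pair of neighbours
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins


# Hypothetical (goods, bads) counts for five ordered pre-bins.
fine_bins = [(90, 10), (88, 12), (70, 30), (68, 32), (40, 60)]
coarse_bins = merge_adjacent(fine_bins, 3)
print(coarse_bins)  # [(178, 22), (138, 62), (40, 60)]
```

Only neighbouring groups are merged here, so the ordering of the interval variable is preserved.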

pvareschi · Posted 08-05-2020 10:42 AM

Someone from SAS may be able to provide a more accurate response but, as far as I know, the algorithm behind the Interactive Grouping node uses a two-step approach for interval variables:

1. First, it "discretizes" the variables by creating groups, essentially transforming the variables from interval to nominal

2. Second, it applies tree-based logic to find the optimal binning based on the groups from step (1)

My understanding is that this approach is used purely for computational efficiency: in general, an interval variable may have hundreds, if not thousands, of distinct values, which would make it too computationally intensive for a tree algorithm to evaluate every possible split.

Therefore, by carrying out a pre-binning step, you end up with far fewer categories which then can be optimised based on a Tree-like algorithm.

Lastly, from my experience, unless you have a good reason for using "bucket", my advice is to always go for "quantile" (i.e. that should be the default approach unless, for some specific reason, you want groups of equal width).
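As a rough illustration of why the two-step approach is cheaper, the sketch below runs a single tree-style split search over pre-binned counts. With k pre-bins there are only k - 1 boundaries to test, rather than one per distinct raw value. The Gini criterion, the function names and the counts are assumptions for the example; the node's actual splitting rule may differ.

```python
def gini(goods, bads):
    """Gini impurity of a group with two classes."""
    n = goods + bads
    if n == 0:
        return 0.0
    p = goods / n
    return 2 * p * (1 - p)


def best_split(bins):
    """Return (boundary index, weighted Gini) of the best cut between ordered bins."""
    total = sum(g + b for g, b in bins)
    best = None
    for i in range(1, len(bins)):  # only len(bins) - 1 candidate boundaries
        left_g = sum(g for g, _ in bins[:i])
        left_b = sum(b for _, b in bins[:i])
        right_g = sum(g for g, _ in bins[i:])
        right_b = sum(b for _, b in bins[i:])
        score = (
            (left_g + left_b) * gini(left_g, left_b)
            + (right_g + right_b) * gini(right_g, right_b)
        ) / total
        if best is None or score < best[1]:
            best = (i, score)
    return best


# Hypothetical (goods, bads) counts for five ordered pre-bins.
prebins = [(90, 10), (88, 12), (70, 30), (68, 32), (40, 60)]
split_index, split_score = best_split(prebins)
print(split_index, round(split_score, 4))
```

A full tree would repeat the same search inside each resulting side, still working only on the pre-bin boundaries.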

WendyCzika · Posted 08-07-2020 11:12 AM

Yes, that is all correct!

ronya · Posted 08-09-2020 06:46 AM

Many thanks for the detailed response and recommendation. I've had a chance to run the two-stage process and see the binning and grouping steps along with their coarse and fine views.