ronya
Calcite | Level 5

Hello all, I'm exploring the use of the Interactive Grouping node in SAS Enterprise Miner as a method to bin the values of interval characteristics, and I wish to ask about the pre-binning process. Why do we need to use the quantile or bucket method to pre-bin the interval variable values rather than apply tree-based binning to the interval values directly?

6 REPLIES
Ksharp
Super User
The quantile and bucket methods are simple and easy to use.
If you are building a credit scorecard, tree-based binning can't guarantee that the WOE (weight of evidence) is monotonic.
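To make the monotonicity concern concrete, here is a minimal sketch that computes the weight of evidence for each bin and checks whether the sequence is monotonic. The good/bad counts are made up for illustration, the function names are hypothetical, and the sign convention (log of the share of goods over the share of bads) is one common choice that may differ from what a given tool reports.

```python
import math

def woe_per_bin(goods, bads):
    """WOE per bin as ln(share of goods / share of bads). Assumes no zero counts."""
    total_good, total_bad = sum(goods), sum(bads)
    return [math.log((g / total_good) / (b / total_bad))
            for g, b in zip(goods, bads)]

def is_monotonic(values):
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Illustrative counts for four ordered bins (not real data).
goods = [400, 300, 350, 150]
bads = [20, 30, 25, 45]

woe = woe_per_bin(goods, bads)
print([round(w, 3) for w in woe])
print("monotonic:", is_monotonic(woe))
```

With these counts the WOE falls, rises, then falls again, so a scorecard built on these bins would need some of them merged or their boundaries adjusted.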
ronya
Calcite | Level 5

Thanks Ksharp. If I understand correctly we can use either quantile, bucket OR tree method for binning? Is that correct? 

The documentation states that quantile/bucket binning is a pre-bin stage before a Tree based method can be applied:

"The Interactive Grouping node first performs binning on the interval characteristic. You can choose between two binning methods: quantile and bucket. The quantile method generates groups. The groups are formed by ranked quantities with approximately the same frequency in each group. The bucket method generates groups by dividing the data into evenly spaced intervals that are based on the difference between the maximum and minimum values.
After the interval variables have been pre-binned, a decision tree model is fitted for each characteristic. "
So is tree binning a sequential process that starts with quantile/bucket pre-binning, or can we use quantile, bucket, and tree as alternative binning methods?
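The difference between the two pre-binning methods described in the quoted documentation can be sketched in a few lines. This is only an illustration under simple assumptions (made-up values, hypothetical function names, a basic rank-based quantile rule); it is not how the Interactive Grouping node computes its cut points.

```python
def bucket_edges(values, n_bins):
    """Evenly spaced cut points between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def quantile_edges(values, n_bins):
    """Cut points taken at equally spaced ranks of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def bin_counts(values, edges):
    """Count values per bin; a value equal to an edge goes to the upper bin."""
    counts = [0] * (len(edges) + 1)
    for x in values:
        counts[sum(x >= e for e in edges)] += 1
    return counts

# Skewed illustrative data: most values are small, two are large.
values = [1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 40, 100]

print("bucket:  ", bin_counts(values, bucket_edges(values, 4)))
print("quantile:", bin_counts(values, quantile_edges(values, 4)))
```

On skewed data like this, bucket binning puts nearly everything in the first group and leaves one group empty, while quantile binning spreads the observations evenly.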
Ksharp
Super User
I think quantile, bucket, and tree are just three binning methods; you can use any one of them.
Some people prefer tree, some prefer quantile.

You could bin into many groups (say 20) with the quantile or bucket method, then merge pairs of groups into one so as to maximize chi-square or Gini, and so on. I think that is a tree method.
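One possible reading of the merge idea above is a bottom-up procedure in the style of ChiMerge: compute a chi-square statistic for each pair of adjacent groups and merge the pair that is least different. Note that this is an interpretation, not something stated in the thread, and that a decision tree normally works top-down by splitting rather than merging. The counts and function names below are hypothetical.

```python
def chi_square_2x2(group_a, group_b):
    """Pearson chi-square for two groups given as (good, bad) counts."""
    a, b = group_a
    c, d = group_b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: treat the groups as indistinguishable
    return n * (a * d - b * c) ** 2 / denom

def merge_once(groups):
    """Merge the adjacent pair of groups with the smallest chi-square."""
    scores = [chi_square_2x2(groups[i], groups[i + 1])
              for i in range(len(groups) - 1)]
    i = scores.index(min(scores))
    merged = (groups[i][0] + groups[i + 1][0], groups[i][1] + groups[i + 1][1])
    return groups[:i] + [merged] + groups[i + 2:]

# Illustrative (good, bad) counts for four ordered pre-bins.
groups = [(90, 10), (85, 15), (50, 50), (40, 60)]
print(merge_once(groups))
```

Repeating `merge_once` until a target number of groups remains, or until every adjacent pair differs by more than some threshold, gives a coarse grouping from fine pre-bins.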
pvareschi
Quartz | Level 8

Someone from SAS may be able to provide a more accurate response, but, as far as I know, the algorithm behind the Interactive Grouping Node uses a two-step approach for interval variables:

1. First, it "discretizes" the variables by creating groups, essentially transforming the variables from interval to nominal

2. Second, it applies tree-based logic to find the optimal binning based on the groups from step (1)

My understanding is that the above approach is used only for reasons of computational efficiency: in general, interval variables may have hundreds, if not thousands, of distinct values, which would make it too computationally intensive for a tree algorithm to evaluate every possible split.

Therefore, by carrying out a pre-binning step, you end up with far fewer categories, which can then be optimised with a tree-like algorithm.

Lastly, from my experience, unless you have a good reason for using "bucket", my advice is to always go for "quantile" (i.e. that should be the default approach unless, for some specific reason, you want groups of equal width).
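The second step can be sketched as follows, assuming the pre-bins are already summarised as (good, bad) counts. With k ordered pre-bins, a tree only has k - 1 candidate split points to evaluate, instead of one per distinct raw value. This sketch finds a single best split by weighted Gini impurity; the criterion, the counts, and the function names are assumptions for illustration and are not taken from the SAS implementation.

```python
def gini(good, bad):
    """Gini impurity of a node with the given class counts."""
    total = good + bad
    if total == 0:
        return 0.0
    p = bad / total
    return 2 * p * (1 - p)

def best_split(groups):
    """Return (index, score) of the boundary with the lowest weighted Gini.

    groups is a list of (good, bad) counts for ordered pre-bins; index i
    means the split falls between groups[i - 1] and groups[i].
    """
    n_total = sum(g + b for g, b in groups)
    best = None
    for i in range(1, len(groups)):
        left_good = sum(g for g, _ in groups[:i])
        left_bad = sum(b for _, b in groups[:i])
        right_good = sum(g for g, _ in groups[i:])
        right_bad = sum(b for _, b in groups[i:])
        score = ((left_good + left_bad) * gini(left_good, left_bad)
                 + (right_good + right_bad) * gini(right_good, right_bad)) / n_total
        if best is None or score < best[1]:
            best = (i, score)
    return best

# Illustrative (good, bad) counts for four ordered pre-bins.
groups = [(90, 10), (85, 15), (50, 50), (40, 60)]
index, score = best_split(groups)
print(index, round(score, 4))
```

A full tree would apply `best_split` recursively to each side until a stopping rule is met; only the first split is shown here.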

WendyCzika
SAS Employee
Yes, that is all correct!
ronya
Calcite | Level 5

Many thanks for the detailed response and recommendation. I've had a chance to run the two-stage process and see the binning/grouping process and its coarse/fine views.
