Hello all, I'm exploring the Interactive Grouping node in SAS Enterprise Miner as a way to bin the values of an interval characteristic, and I have a question about the pre-binning process. Why do we need to pre-bin the interval variable with the quantile or bucket method rather than apply tree-based binning to the interval values directly?
Someone from SAS may be able to provide a more accurate response, but, as far as I know, the algorithm behind the Interactive Grouping node uses a two-step approach for interval variables:
1. First, it "discretizes" the variables by creating groups, essentially transforming the variables from interval to nominal.
2. Second, it applies tree-based logic to find the optimal binning based on the groups from step (1).
My understanding is that this approach is used only for computational efficiency: in general, interval variables may have hundreds, if not thousands, of distinct values, which would make it too computationally intensive for a tree algorithm to evaluate every candidate split.
Therefore, by carrying out a pre-binning step, you end up with far fewer categories, which can then be optimised with a tree-like algorithm.
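To make the two steps concrete, here is a minimal Python sketch of the idea. It is not the Interactive Grouping node's actual algorithm: the function names (`quantile_edges`, `best_split`, `tree_groups`), the Gini criterion, the depth limit, and the toy data are all assumptions chosen for illustration. The point is only that the tree step searches 19 pre-bin boundaries instead of 999 raw cut points.

```python
from bisect import bisect_right

def quantile_edges(values, n_bins):
    """Step 1: cut points that give roughly equal counts per pre-bin."""
    s = sorted(values)
    edges = []
    for k in range(1, n_bins):
        e = s[(k * len(s)) // n_bins]
        if not edges or e > edges[-1]:   # skip duplicate cut points
            edges.append(e)
    return edges

def bin_counts(values, targets, edges):
    """(size, events) for each ordered pre-bin."""
    counts = [[0, 0] for _ in range(len(edges) + 1)]
    for v, t in zip(values, targets):
        b = bisect_right(edges, v)
        counts[b][0] += 1
        counts[b][1] += t
    return [tuple(c) for c in counts]

def gini(events, total):
    """Gini impurity of a binary target."""
    if total == 0:
        return 0.0
    p = events / total
    return 2.0 * p * (1.0 - p)

def best_split(counts):
    """Best boundary between adjacent pre-bins, by Gini reduction."""
    n_tot = sum(n for n, _ in counts)
    e_tot = sum(e for _, e in counts)
    if n_tot == 0:
        return None, 0.0
    parent = gini(e_tot, n_tot)
    best_i, best_gain = None, 0.0
    n_left = e_left = 0
    for i in range(1, len(counts)):
        n_left += counts[i - 1][0]
        e_left += counts[i - 1][1]
        n_right, e_right = n_tot - n_left, e_tot - e_left
        child = (n_left * gini(e_left, n_left)
                 + n_right * gini(e_right, n_right)) / n_tot
        if parent - child > best_gain:
            best_i, best_gain = i, parent - child
    return best_i, best_gain

def tree_groups(counts, lo=0, hi=None, depth=2, min_gain=1e-4):
    """Step 2: recursive splits; returns pre-bin indices where a group starts."""
    if hi is None:
        hi = len(counts)
    if depth == 0 or hi - lo < 2:
        return []
    i, gain = best_split(counts[lo:hi])
    if i is None or gain < min_gain:
        return []
    cut = lo + i
    return (tree_groups(counts, lo, cut, depth - 1, min_gain)
            + [cut]
            + tree_groups(counts, cut, hi, depth - 1, min_gain))

# Toy data (made up): event rate is 10% below x = 300 and 50% from 300 up.
x = list(range(1000))
y = [1 if (i < 300 and i % 10 == 0) or (i >= 300 and i % 2 == 0) else 0
     for i in x]

edges = quantile_edges(x, 20)        # 1000 distinct values -> 20 pre-bins
counts = bin_counts(x, y, edges)
cuts = tree_groups(counts)           # tree only looks at 19 boundaries
print(cuts, [edges[c - 1] for c in cuts])   # [6] [300]
```

On this toy data the tree step merges the 20 pre-bins into two final groups with the boundary at x = 300, which is where the event rate changes.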
Lastly, from my experience, unless you have a good reason for using "bucket", my advice is to always go for "quantile" (i.e. that should be the default approach unless, for some specific reason, you want to have groups defined by having the same width).
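As a rough illustration of why quantile is usually the safer default, the sketch below compares the two pre-binning methods on a small skewed variable. The helper names and the data are made up for the example and are not taken from SAS Enterprise Miner.

```python
from bisect import bisect_right

def bucket_edges(values, n_bins):
    """Equal-width cut points between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]

def quantile_edges(values, n_bins):
    """Cut points that give roughly equal counts per bin."""
    s = sorted(values)
    edges = []
    for k in range(1, n_bins):
        e = s[(k * len(s)) // n_bins]
        if not edges or e > edges[-1]:   # skip duplicate cut points
            edges.append(e)
    return edges

def bin_sizes(values, edges):
    """Number of observations that fall in each bin."""
    sizes = [0] * (len(edges) + 1)
    for v in values:
        sizes[bisect_right(edges, v)] += 1
    return sizes

# Hypothetical right-skewed variable: 0, 1, 4, 9, ..., 9801
values = [i ** 2 for i in range(100)]

bucket_sizes = bin_sizes(values, bucket_edges(values, 4))
quantile_sizes = bin_sizes(values, quantile_edges(values, 4))
print("bucket:  ", bucket_sizes)     # [50, 21, 15, 14]
print("quantile:", quantile_sizes)   # [25, 25, 25, 25]
```

With bucket, half of the observations land in the first bin and the upper bins are thin, so the later tree step has fewer well-populated candidate cut points where most of the data sits. Quantile keeps every pre-bin equally populated.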
Thanks Ksharp. If I understand correctly we can use either quantile, bucket OR tree method for binning? Is that correct?
The documentation states that quantile/bucket binning is a pre-bin stage before a Tree based method can be applied:
"The Interactive Grouping node first performs binning on the interval characteristic. You can choose between two binning methods: quantile and bucket. The quantile method generates groups. The groups are formed by ranked quantities with approximately the same frequency in each group. The bucket method generates groups by dividing the data into evenly spaced intervals that are based on the difference between the maximum and minimum values. After the interval variables have been pre-binned, a decision tree model is fitted for each characteristic."
Many thanks for the detailed response and recommendation. I've had a chance to run the two-stage process and look at the binning/grouping results in both the coarse and fine views.