Hi everyone,
This question might appear silly, but it is really important for my work. When using Random Forest, bagging, and boosting for decision tree models, is there a need to:
1. Split the data into training and validation sets, considering that these methods create and use different samples of the data to train the model?
2. Transform the variables, considering that decision trees normally don't require variable transformations?
3. Handle missing values, considering that decision trees handle missing values automatically?
Thanks,
1. The Random Forest algorithm has a mechanism for internal validation (the out-of-bag estimate). However, from a modeling perspective, I feel one should still keep some portion of the data as a separate validation sample to check whether the model generalizes well.
2. Sometimes converting a continuous variable into a discrete one using binning techniques helps decision trees from a stability perspective. It is somewhat context dependent.
3. One can always compare the results with and without imputation of missing values. Sometimes the difference is significant and sometimes it is not.
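To make point 1 concrete, here is a minimal sketch in Python with scikit-learn (an assumption on my part; the thread doesn't name a tool) comparing the forest's internal out-of-bag estimate with a separate hold-out validation score, on a synthetic dataset:

```python
# Compare Random Forest out-of-bag (OOB) accuracy with a held-out
# validation set. Dataset and all parameters here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# oob_score=True scores each tree on the bootstrap rows it did NOT see,
# which is the "internal validation" mechanism mentioned above.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print(f"OOB accuracy:      {rf.oob_score_:.3f}")
print(f"Hold-out accuracy: {rf.score(X_valid, y_valid):.3f}")
```

If the two numbers diverge noticeably, that is a signal worth investigating before trusting the OOB estimate alone.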
Hope this helps,
Best,
abhijit
1. If there is enough data, yes, partition it. Single and boosted trees can use the validation data to shrink the model and mitigate overfitting. The main alternatives are cross-validation, or assuming that overfitting doesn't matter. Leo Breiman believed that out-of-bag data obviated the need for validation data in forests. However, when comparing predictions against models that do require validation data, the comparison should be based on the same data, i.e., the data partition required for the other models.
2. Classic trees depend on the ranks of interval inputs but not on an interval target. Log-transforms or truncation of targets with highly skewed distributions, such as the extreme value distribution, are appropriate. Also, to increase speed, some algorithms combine input values, which can produce different results after transformations; this is not usually important.
3. Indeed, if the algorithm handles missing values, the missing values are predictive, and enough observations with missing values are in the data, then letting the algorithm handle them is better.
Very thorough reply by Neville, who implemented all of these methods in SAS. A couple of supporting comments:
1. Outliers -- The base model used in RF is a large decision tree. Decision trees are robust to outliers because they isolate them in small regions of the feature space. Since the prediction for each leaf is the average (for regression) or the majority class (for classification), outliers isolated in their own leaves won't influence the rest of the predictions (in regression, for instance, they would not affect the means of the other leaves).
2. Validation data -- Yes, please use a validation partition in the common case of rare events, where the OOB estimate might not be sufficient.
3. Transforms -- Consider transforming a skewed continuous Y, as with many algorithms.
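A small sketch of the transform point, again in Python with scikit-learn (an assumed toolset): log-transform a right-skewed continuous target before fitting a forest, then invert the transform at prediction time. `TransformedTargetRegressor`, the synthetic skewed target, and the `log1p`/`expm1` pair are all illustrative choices.

```python
# Fit a Random Forest on a skewed target, raw vs. log1p-transformed.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 5))
# Exponentiating a linear signal gives a heavily right-skewed target.
y = np.exp(X[:, 0] + 0.5 * rng.normal(size=1500))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestRegressor(n_estimators=100, random_state=0)
# Fit on log1p(y); predictions are mapped back through expm1 automatically.
logged = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    func=np.log1p, inverse_func=np.expm1)

raw.fit(X_tr, y_tr)
logged.fit(X_tr, y_tr)
print(f"R^2, raw target:   {raw.score(X_te, y_te):.3f}")
print(f"R^2, log1p target: {logged.score(X_te, y_te):.3f}")
```

Because tree splits depend only on the ranks of the inputs, the payoff of a transform like this shows up on the target side, not the input side, consistent with Neville's point 2 above.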