Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Boosting, bagging and Random Forest

Contributor
Posts: 38

Boosting, bagging and Random Forest

Hi everyone,

This question might appear silly, but it is really important for my work. When using Random Forest, bagging, and boosting for decision tree models, is there a need to:

1. Split the data into training and validation data sets, considering that these methods already create and use different samples of the data to train the model?

2. Transform the variables, considering that decision trees normally don't require variable transformations?

3. Handle missing values, considering that decision trees handle missing values automatically?

Thanks,

Accepted Solutions
Solution
05-02-2017 05:33 PM
SAS Employee
Posts: 3

Re: Boosting, bagging and Random Forest

1) If you use Random Forest (RF), there is no need for a training/validation split, because RF internally fits many trees on bagged random samples and provides fit statistics on the out-of-bag sample, which serves as its own internal validation.

2) Tree-based models are not functional models. A functional model such as linear regression assumes linearity between the independent and dependent variables and therefore benefits from transformations; for trees, a transformation may not give you any uplift and can be redundant.

3) One of the benefits of a tree-based model is that it handles missing data, so let the tree do this for you.
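For illustration only, the out-of-bag idea in point 1 can be sketched outside of SAS. This is a minimal Python/scikit-learn example (not SAS Enterprise Miner code), using synthetic data:

```python
# Sketch: a random forest's out-of-bag (OOB) sample acts as a built-in
# validation set. Python/scikit-learn illustration on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True evaluates each tree on the roughly 37% of rows left
# out of that tree's bootstrap sample.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(round(rf.oob_score_, 3))  # internal validation accuracy on OOB rows
```

The `oob_score_` attribute is the forest's own generalization estimate, obtained without ever setting aside a separate validation partition.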



All Replies
SAS Employee
Posts: 3

Re: Boosting, bagging and Random Forest

1. The Random Forest algorithm has a mechanism for internal validation. However, from a modeling perspective, I feel one should still keep some portion of the data as a separate validation sample to check whether the model generalizes well.

2. Sometimes converting a continuous variable into a discrete variable using binning techniques helps decision trees from a stability perspective. It is somewhat context dependent.

3. One can always compare the results with and without imputation of missing values. Sometimes the difference is significant and sometimes it is not.
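The first point above, keeping a held-out validation sample alongside the forest's internal OOB estimate, can be sketched outside of SAS. This is a hedged Python/scikit-learn illustration on synthetic data, not SAS code:

```python
# Sketch: compare the forest's internal OOB estimate with accuracy on a
# separately held-out validation sample. Python/scikit-learn illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
rf.fit(X_tr, y_tr)

# Two views of generalization: the internal OOB estimate, and the
# held-out rows the forest never saw during training.
print(round(rf.oob_score_, 3), round(rf.score(X_va, y_va), 3))
```

If the two numbers diverge noticeably, that is a signal worth investigating before trusting the OOB estimate alone.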

Hope this helps,

Best,
abhijit

SAS Employee
Posts: 37

Re: Boosting, bagging and Random Forest

1. If there is enough data, yes, partition. Single and boosted trees can use the validation data to shrink the model and mitigate overfitting. The main alternatives are cross-validation and assuming overfitting doesn't matter. Leo Breiman believed that out-of-bag (OOB) data obviated the need for validation data in forests. However, when comparing predictions with models that do require validation data, the comparison should be based on the same data: the data partition required for the other models.

2. Classic trees depend on the ranks of interval inputs, but not of an interval target. Log transforms or truncation of targets with highly skewed distributions, such as the extreme value distribution, are appropriate. Also, to increase speed, some algorithms combine input values, which can produce different results after transformations. This is usually not important.

3. Indeed, if the algorithm handles missing values, the missing values are predictive, and enough observations with missing values are in the data, then letting the algorithm handle them is better.

SAS Employee
Posts: 31

Re: Boosting, bagging and Random Forest

Posted in reply to PadraicGNeville

Very thorough reply by Neville, who implemented all of these methods in SAS. A couple of supporting comments:

1. Outliers -- The base model used in RF is a large decision tree. Decision trees are robust to outliers because they isolate them in small regions of the feature space. Since the prediction for each leaf is the average (for regression) or the majority class (for classification), outliers isolated in separate leaves won't influence the rest of the predictions (in regression, for instance, they would not affect the means of the other leaves).

2. Validation data -- Yes, please use it in the common case of rare events, where the OOB sample might not be sufficient.

3. Transforms -- Consider transforming a continuous Y, as with many algorithms.
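The outlier isolation described in point 1 is easy to demonstrate with a single tree. A Python/scikit-learn sketch (not SAS code):

```python
# Sketch: a regression tree isolates an extreme outlier in its own leaf,
# so predictions for the other points are unaffected by it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)
y = X.ravel().copy()
y[19] = 1000.0  # one extreme outlier

# A fully grown tree on distinct x values gives each point its own leaf.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

print(tree.predict([[5.0]]))   # unaffected by the outlier
print(tree.predict([[19.0]]))  # the outlier sits alone in its leaf
```

Contrast this with a linear regression on the same data, where the single outlier would drag the fitted line and distort every prediction.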

