frupaul
Quartz | Level 8

Hi everyone,

This question might appear silly, but it is really important for my work. When using Random Forest, bagging, and boosting for decision tree models, is there a need to:

1. Split the data into training and validation data sets, considering that these methods create and use different samples of data to train the model?

2. Transform the variables, considering that decision trees normally don't require variable transformations?

3. Handle missing values, considering that decision trees automatically handle missing values?

Thanks,

 

1 ACCEPTED SOLUTION

Accepted Solutions
ssoti2001
SAS Employee
1) If you use Random Forest (RF), there is no need for a training/validation split, because RF internally fits an ensemble of trees, each on a bagged random sample, and provides fit statistics on the out-of-bag sample, which is its own internal validation (a minimal sketch follows after these points).

2) Tree-based models are not functional models. A functional model such as linear regression assumes linearity between the independent and dependent variables and therefore benefits from transformations; for trees, a transformation may not give you an uplift and can be redundant.

3) One of the benefits of a tree-based model is that it handles missing data, so let the tree do this for you.
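
A minimal sketch of the out-of-bag idea, assuming scikit-learn's RandomForestClassifier as a stand-in (the thread is about SAS tools, so treat this as a conceptual analogue, not SAS syntax):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; in practice this would be your full modeling table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Each tree trains on a bootstrap sample; the rows it never saw (roughly a
# third of them) form its out-of-bag sample, scored as internal validation.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy (internal validation): {rf.oob_score_:.3f}")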


4 REPLIES
sinabl
SAS Employee

1. The Random Forest algorithm has a mechanism for internal validation. However, from a modeling perspective, I feel one should still keep some portion of the data as a separate validation sample to check whether the model is generalizing well (see the sketch after these points).

2. Sometimes converting a continuous variable into a discrete one using binning techniques helps decision trees from a stability perspective. It is somewhat context dependent.

3. One can always compare the results with and without imputation of missing values. Sometimes the difference is significant and sometimes it is not.
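
A sketch of points 1 and 3, again assuming scikit-learn as a stand-in: keep a holdout alongside the OOB estimate, and measure the effect of a missing-value treatment empirically. Point 2 could be tested the same way by adding a binning step such as KBinsDiscretizer to the pipeline.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.05] = np.nan          # inject 5% missing values

# Point 1: hold out a validation sample even though OOB exists.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=1)

# Point 3: compare results under different missing-value treatments.
# (Imputing keeps this portable: not every implementation accepts NaN directly.)
for strategy in ("median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy)
    rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=1)
    rf.fit(imp.fit_transform(X_tr), y_tr)
    print(f"{strategy:>13}: OOB={rf.oob_score_:.3f}  "
          f"holdout={rf.score(imp.transform(X_va), y_va):.3f}")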

 

Hope this helps, 

 

Best,

abhijit

PadraicGNeville
SAS Employee

1. If there is enough data, yes, partition. Single and boosted trees can use the validation data to shrink the model and mitigate overfitting; the main alternatives are cross-validation and assuming overfitting doesn't matter. Leo Breiman believed that out-of-bag data obviated the need for validation data in forests. However, when comparing predictions with models that do require validation data, the comparison should be based on the same data: the data partition required for the other models (see the sketch after these points).

 

2. Classic trees depend on the ranks of interval inputs but not on an interval target. Log transforms or truncation of targets with highly skewed distributions, such as the extreme value distribution, are appropriate. Also, to increase speed, some algorithms combine input values, which can produce different results after transformations; this is not usually important.

 

3. Indeed, if the algorithm handles missing values, missing values are predictive, and enough observations with missing values are in the data, then letting the algorithm handle them is better.
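
A sketch of all three points in one place, assuming scikit-learn's HistGradientBoostingRegressor as a stand-in: a validation partition shrinks the boosted model via early stopping, a log transform tames a skewed target, and NaNs go straight to a learner that handles them.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=10, noise=10.0, random_state=2)
y = np.exp(y / np.ptp(y) * 4.0)            # fabricate a right-skewed target
rng = np.random.default_rng(2)
X[rng.random(X.shape) < 0.05] = np.nan     # point 3: learner accepts NaN natively

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

# Point 1: part of the training data is held out, and boosting stops once the
# validation score stops improving, shrinking the model.
gbm = HistGradientBoostingRegressor(max_iter=1000, early_stopping=True,
                                    validation_fraction=0.15,
                                    n_iter_no_change=20, random_state=2)
gbm.fit(X_tr, np.log1p(y_tr))              # point 2: log-transform the skewed target
pred = np.expm1(gbm.predict(X_te))         # invert the transform before scoring
print(f"stopped after {gbm.n_iter_} of 1000 boosting iterations")
print(f"holdout RMSE: {np.sqrt(np.mean((pred - y_te) ** 2)):.2f}")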

WayneThompson
SAS Employee

A very thorough reply by Neville, who implemented all of these methods in SAS. A couple of supporting comments:

 

1. Outliers -- The base model used in RF is a large decision tree. Decision trees are robust to outliers because they isolate them in small regions of the feature space. Since the prediction for each leaf is the average (for regression) or the majority class (for classification), outliers isolated in separate leaves won't influence the rest of the predictions; in the regression case, for instance, they would not affect the means of the other leaves (a small sketch follows after these points).

2. Validation data -- yes, please use it in the common case of rare events, where the OOB sample might not be sufficient.

3. Transforms -- consider transforming a continuous Y, as with many algorithms.
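
A small sketch of the outlier point, assuming scikit-learn: a regression tree tends to wall off a wild value in its own leaf, so it never enters the means of the other leaves.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(100, dtype=float).reshape(-1, 1)
y = X.ravel().copy()
y[50] = 10_000.0                           # one wild outlier in otherwise linear data

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)
leaves = tree.apply(X)                     # leaf id of every training row
print("rows sharing the outlier's leaf:", int((leaves == leaves[50]).sum()))
print("prediction just next door:", tree.predict(np.array([[51.0]]))[0])

Because the squared-error criterion pays heavily for pooling the outlier with its neighbors, the tree typically isolates it, and the prediction at x = 51 stays near its neighbors' values rather than being dragged toward 10,000.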
