Hi everyone,
This question might appear silly, but it is really important for my work. When using Random Forest, bagging, and boosting for decision tree models, is there a need to:
1. Split the data into training and validation sets, considering that these methods create and use different samples of the data to train the model?
2. Transform the variables, considering that decision trees normally don't require variable transformations?
3. Handle missing values, considering that decision trees handle missing values automatically?
Thanks,
1. The Random Forest algorithm has a mechanism for internal validation (the out-of-bag estimate). However, from a modeling perspective, I feel one should still keep some portion of the data as a separate validation sample to check whether the model generalizes well.
2. Sometimes converting a continuous variable into a discrete one using binning techniques helps decision trees from a stability perspective. It is somewhat context dependent.
3. One can always compare the results with and without imputation of missing values. Sometimes the difference is significant and sometimes it is not.
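To make point 1 concrete, here is a minimal sketch in Python with scikit-learn (an assumption on my part; the thread doesn't name a tool) comparing the forest's internal out-of-bag estimate with a separate hold-out validation score, on a synthetic dataset:

```python
# Compare Random Forest out-of-bag (OOB) accuracy with a held-out
# validation set. Dataset and all parameters here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# oob_score=True scores each tree on the bootstrap rows it did NOT see,
# which is the "internal validation" mechanism mentioned above.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print(f"OOB accuracy:      {rf.oob_score_:.3f}")
print(f"Hold-out accuracy: {rf.score(X_valid, y_valid):.3f}")
```

If the two numbers diverge noticeably, that is a signal worth investigating before trusting the OOB estimate alone.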
Hope this helps,
Best,
abhijit
1. If there is enough data, yes, partition it. Single and boosted trees can use the validation data to shrink the model and mitigate overfitting. The main alternatives are cross-validation, or assuming that overfitting doesn't matter. Leo Breiman believed that out-of-bag data obviated the need for validation data in forests. However, when comparing predictions against models that do require validation data, the comparison should be based on the same data, i.e., the data partition required for the other models.
2. Classic trees depend on the ranks of interval inputs but not on an interval target. Log-transforms or truncation of targets with highly skewed distributions, such as the extreme value distribution, are appropriate. Also, to increase speed, some algorithms combine input values, which can produce different results after transformations; this is not usually important.
3. Indeed, if the algorithm handles missing values, the missing values are predictive, and enough observations with missing values are in the data, then letting the algorithm handle them is better.
Very thorough reply by Neville, who implemented all of these methods in SAS. A couple of supporting comments:
1. Outliers -- The base model used in RF is a large decision tree. Decision trees are robust to outliers because they isolate them in small regions of the feature space. Since the prediction for each leaf is the average (for regression) or the majority class (for classification), outliers isolated in their own leaves won't influence the rest of the predictions (in regression, for instance, they would not affect the means of the other leaves).
2. Validation data -- Yes, please use a validation partition in the common case of rare events, where the OOB estimate might not be sufficient.
3. Transforms -- Consider transforming a skewed continuous Y, as with many algorithms.
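A small sketch of the transform point, again in Python with scikit-learn (an assumed toolset): log-transform a right-skewed continuous target before fitting a forest, then invert the transform at prediction time. `TransformedTargetRegressor`, the synthetic skewed target, and the `log1p`/`expm1` pair are all illustrative choices.

```python
# Fit a Random Forest on a skewed target, raw vs. log1p-transformed.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 5))
# Exponentiating a linear signal gives a heavily right-skewed target.
y = np.exp(X[:, 0] + 0.5 * rng.normal(size=1500))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestRegressor(n_estimators=100, random_state=0)
# Fit on log1p(y); predictions are mapped back through expm1 automatically.
logged = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    func=np.log1p, inverse_func=np.expm1)

raw.fit(X_tr, y_tr)
logged.fit(X_tr, y_tr)
print(f"R^2, raw target:   {raw.score(X_te, y_te):.3f}")
print(f"R^2, log1p target: {logged.score(X_te, y_te):.3f}")
```

Because tree splits depend only on the ranks of the inputs, the payoff of a transform like this shows up on the target side, not the input side, consistent with Neville's point 2 above.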