zzzzz
Calcite | Level 5

Hello- RF is a bit different from many other ML algorithms in that the out-of-bag (OOB) error is effectively a built-in form of validation. So the question is: exactly what value does a separate validation dataset add? My thinking is that I would get more value from my data by having one larger training dataset than by splitting it into separate training and validation datasets. Of course, I would still keep my test dataset regardless. Thoughts?
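To make the question concrete, here is a minimal sketch (in Python with scikit-learn, which is an assumption on my part, not something from this thread) showing how the OOB score acts as built-in validation: each training row is scored using only the trees whose bootstrap sample excluded that row, so no separate validation split is consumed.

```python
# Sketch (scikit-learn assumed): OOB score as built-in validation,
# shown next to an ordinary held-out test score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# oob_score=True scores each training row using only the trees that
# did not see that row in their bootstrap sample (~1/3 of the trees).
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X_train, y_train)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```

The two numbers are usually close, which is what motivates the question of whether a separate validation split is worth the data it costs.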

 

Thanks, -Ted

1 ACCEPTED SOLUTION

Accepted Solutions
PadraicGNeville
SAS Employee

Re:  more value from my data by having one larger training dataset than two separate training and validation datasets.

Yes, in the common situation where more training data is useful.   Leo Breiman agreed.  

 

Reasons to use validation data despite this:

 

1. In practice, OOB error rates are often biased on the conservative side. Error rates decrease as the number of trees grows, at least initially. For any given observation, the OOB estimate uses only the trees that did not train on that observation, which is roughly 1/3 of them. Consequently, if the forest has 100 trees, the OOB error rate is closer to the test-data error rate of a 33-tree forest than of a 100-tree forest.

 

2. OOB estimates are not directly comparable to the validation estimates used by other algorithms. So how would one confirm that a forest is better than a neural network unless the forest is scored on the same validation data? Using the test set to select a model is risky if the data is easily overfit.
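Point 2 can be sketched as follows (again in Python with scikit-learn, an assumption for illustration only): score both candidate models on the same held-out validation rows, so the comparison is like-for-like, rather than pitting the forest's OOB estimate against another model's validation estimate.

```python
# Sketch (scikit-learn assumed): comparing a forest and a neural net
# on the SAME validation split, instead of OOB vs. validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
nn = MLPClassifier(max_iter=500, random_state=1).fit(X_tr, y_tr)

# Both models are scored on the identical held-out rows, so the
# comparison is apples-to-apples; an OOB estimate would not be.
print(f"Forest validation accuracy: {rf.score(X_val, y_val):.3f}")
print(f"MLP    validation accuracy: {nn.score(X_val, y_val):.3f}")
```

The test set stays untouched for the final, one-time assessment of whichever model the validation comparison selects.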

 

Hope this helps,

-Padraic


