🔒 This topic is solved and locked.
zzzzz
Calcite | Level 5

Hello- RF is a bit different from many other ML algorithms in that OOB error is already a kind of built-in validation. So the question is: exactly what value does a separate validation dataset add? My thinking is that I would get more value from my data by using one larger training dataset rather than two separate training and validation datasets. Of course, I would still keep my test dataset regardless. Thoughts?
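(For readers outside SAS: the idea above can be sketched with scikit-learn, which exposes the OOB estimate directly. This is an illustration, not the original SAS setup; `oob_score=True` asks the forest to score each tree on the observations left out of its bootstrap sample, giving a validation-like estimate without holding out any data.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True: OOB accuracy is computed as a by-product of fitting,
# so all 500 rows can be used for training.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
```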

 

Thanks, -Ted

1 ACCEPTED SOLUTION

Accepted Solutions
PadraicGNeville
SAS Employee

Re:  more value from my data by having one larger training dataset than two separate training and validation datasets.

Yes, in the common situation where more training data is useful. Leo Breiman agreed.

 

Reasons to use validation data despite this:

 

1. In practice, OOB error rates are often biased on the conservative side. Error rates decrease with the number of trees, at least initially. For a given observation, the OOB estimate is based on only about 1/3 of the trees. Consequently, if the forest has 100 trees, the OOB error rate is closer to the test-data error rate of a 33-tree forest than that of a 100-tree forest.

 

2. OOB estimates are not directly comparable to the validation estimates used by other algorithms. So how would one confirm that a forest is better than a neural network unless the forest is applied to the same validation data? Using the test set to select a model is risky if the data is easily overfit.
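The "about 1/3" in point 1 follows from bootstrap sampling: an observation is left out of a given tree's bootstrap sample with probability (1 - 1/n)^n ≈ 1/e ≈ 0.368. A quick simulation (illustrative numbers, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trees = 1000, 200

# Count, for each observation, how many bootstrap samples leave it out.
oob_counts = np.zeros(n)
for _ in range(n_trees):
    in_bag = rng.integers(0, n, size=n)       # one tree's bootstrap sample
    mask = np.ones(n, dtype=bool)
    mask[in_bag] = False                       # True where observation is OOB
    oob_counts += mask

frac = oob_counts.mean() / n_trees
print(f"average OOB fraction per observation: {frac:.3f}")  # close to 1/e ~ 0.368
```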

 

Hope this helps,

-Padraic
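To make point 2 concrete, here is a hedged sketch of the apples-to-apples comparison described above: fit a forest and a neural network on the same training split and score both on the same validation split (scikit-learn, illustrative only; the model settings are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Both models see the same training data...
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)

# ...and are scored on the same held-out validation data,
# so the two accuracies are directly comparable.
print(f"forest: {forest.score(X_val, y_val):.3f}")
print(f"net:    {net.score(X_val, y_val):.3f}")
```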




Discussion stats
  • 1 reply
  • 1681 views
  • 2 likes
  • 2 in conversation