
Random Forests: Difference between OOB and Validation?

User · Posts: 1

Hello. Random forests (RF) differ from many other ML algorithms in that the out-of-bag (OOB) error is itself a form of validation. So the question is: what value does a separate validation dataset actually add? My thinking is that I would get more out of my data with one larger training dataset than with separate training and validation datasets. Either way, I would of course still keep my test dataset. Thoughts?

 

Thanks, -Ted


Accepted Solutions
Solution (07-10-2017 04:07 PM)
SAS Employee
Posts: 32

Re: Random Forests: Difference between OOB and Validation?

Re: "more value from my data by having one larger training dataset than two separate training and validation datasets."

Yes, in the common situation where more training data is useful. Leo Breiman agreed.

 

Reasons to use validation data despite this:

 

1. In practice, OOB error rates are often biased on the conservative side. Error rates decrease with the number of trees, at least initially. For a given observation, the OOB error rate is based on only about 1/3 of the trees (the trees whose bootstrap sample excluded that observation). Consequently, if the forest has 100 trees, the OOB error rate is closer to the test-data error rate of a 33-tree forest than to that of a 100-tree forest.

 

2. OOB estimates are not directly comparable to the validation estimates used by other algorithms. So how would one confirm that a forest is better than a neural network unless the forest is scored on the same validation data? Using the test set to select a model is risky if the data is easily overfit.
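As an aside on point 1 (not from the original thread, and illustrated here in Python rather than SAS): in a bootstrap sample of size n, each observation is left out with probability (1 - 1/n)^n ≈ 1/e ≈ 0.368, which is where the "about 1/3 of the trees" figure comes from. A minimal stdlib-only simulation:

```python
import random

def oob_fraction(n_obs=1000, n_trees=200, seed=42):
    """Estimate, for a single observation, the fraction of bootstrap
    samples (i.e., trees) in which that observation is out-of-bag."""
    rng = random.Random(seed)
    out_of_bag = 0
    for _ in range(n_trees):
        # One bootstrap sample: n_obs draws with replacement.
        sample = {rng.randrange(n_obs) for _ in range(n_obs)}
        if 0 not in sample:  # observation 0 was left out of this tree
            out_of_bag += 1
    return out_of_bag / n_trees

print(f"OOB fraction: {oob_fraction():.3f}")  # near 1/e ~ 0.368
```

So each observation's OOB prediction is an ensemble vote over roughly a third of the forest, which is why the OOB error rate tends to look slightly worse than a validation error rate computed with the full forest.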

 

Hope this helps,

-Padraic



