<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Random Forests: Difference between OOB and Validation? in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Difference-between-OOB-and-Validation/m-p/373367#M5545</link>
    <description>&lt;P&gt;Re: &amp;nbsp;more value from my data by having one larger training dataset than two separate training and validation datasets.&lt;/P&gt;
&lt;P&gt;Yes, in the common situation where more training data is useful. &amp;nbsp; Leo Breiman agreed. &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Reasons to use validation data despite this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. In practice, OOB error rates are often biased on the conservative side. &amp;nbsp; Error rates decrease with the number of trees, at least initially. &amp;nbsp; OOB error rates are based on about 1/3 of the trees for a specific data observation. &amp;nbsp;Consequently, if the forest has 100 trees then the OOB error rate is closer to the test data error rate on 33 trees than on 100 trees.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. OOB estimates are not directly comparable to other algorithms that use validation estimates. &amp;nbsp;So how would one confirm that a forest is better than a neural network unless the forest is applied to the same validation data? &amp;nbsp;Using the test set to select a model is risky if the data is easily overfit.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps,&lt;/P&gt;
&lt;P&gt;-Padraic&lt;/P&gt;</description>
    <pubDate>Wed, 05 Jul 2017 17:27:10 GMT</pubDate>
    <dc:creator>PadraicGNeville</dc:creator>
    <dc:date>2017-07-05T17:27:10Z</dc:date>
    <item>
      <title>Random Forests: Difference between OOB and Validation?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Difference-between-OOB-and-Validation/m-p/372288#M5535</link>
      <description>&lt;P&gt;Hello- RF is kind of different&amp;nbsp;from many other ML algorithms in that OOB is really a type of validation.&amp;nbsp;So then the question is, exactly what value does the validation data have? My thinking is that I would get more value from my data by having one&amp;nbsp;larger training dataset than two separate training and validation datasets. Of course, regardless, I would still have my test dataset. Thoughts?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks, -Ted&lt;/P&gt;</description>
      <pubDate>Fri, 30 Jun 2017 19:42:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Difference-between-OOB-and-Validation/m-p/372288#M5535</guid>
      <dc:creator>zzzzz</dc:creator>
      <dc:date>2017-06-30T19:42:39Z</dc:date>
    </item>
    <item>
      <title>Re: Random Forests: Difference between OOB and Validation?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Difference-between-OOB-and-Validation/m-p/373367#M5545</link>
      <description>&lt;P&gt;Re: &amp;nbsp;more value from my data by having one larger training dataset than two separate training and validation datasets.&lt;/P&gt;
&lt;P&gt;Yes, in the common situation where more training data is useful. &amp;nbsp; Leo Breiman agreed. &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Reasons to use validation data despite this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. In practice, OOB error rates are often biased on the conservative side. &amp;nbsp; Error rates decrease with the number of trees, at least initially. &amp;nbsp; OOB error rates are based on about 1/3 of the trees for a specific data observation. &amp;nbsp;Consequently, if the forest has 100 trees then the OOB error rate is closer to the test data error rate on 33 trees than on 100 trees.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. OOB estimates are not directly comparable to other algorithms that use validation estimates. &amp;nbsp;So how would one confirm that a forest is better than a neural network unless the forest is applied to the same validation data? &amp;nbsp;Using the test set to select a model is risky if the data is easily overfit.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
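&lt;P&gt;A minimal sketch of both points (assuming Python with scikit-learn rather than SAS, on made-up synthetic data): fit a forest with OOB scoring enabled and compare the OOB accuracy against accuracy on a held-out validation set. The same validation score can then be compared directly against other model families, which the OOB score cannot.&lt;/P&gt;

```python
# Sketch: OOB estimate vs. held-out validation estimate for a random forest.
# Assumes scikit-learn; the dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# oob_score=True makes the forest record each observation's prediction
# from only the trees that did NOT see it in their bootstrap sample
# (roughly a third of the 100 trees per observation).
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("OOB accuracy (100 trees):", rf.oob_score_)
print("Validation accuracy (100 trees):", rf.score(X_val, y_val))
```

&lt;P&gt;Because the OOB prediction for each observation averages only the trees that left it out, it behaves like the error of a smaller ensemble, which is the conservative bias described in point 1.&lt;/P&gt;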
&lt;P&gt;Hope this helps,&lt;/P&gt;
&lt;P&gt;-Padraic&lt;/P&gt;</description>
      <pubDate>Wed, 05 Jul 2017 17:27:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Difference-between-OOB-and-Validation/m-p/373367#M5545</guid>
      <dc:creator>PadraicGNeville</dc:creator>
      <dc:date>2017-07-05T17:27:10Z</dc:date>
    </item>
  </channel>
</rss>

