🔒 This topic is solved and locked.
zzzzz
Calcite | Level 5

Hello- RF is a bit different from many other ML algorithms in that OOB error is already a kind of built-in validation. So the question is: exactly what value does a separate validation dataset add? My thinking is that I would get more value from my data by using one larger training dataset rather than two separate training and validation datasets. Of course, I would still keep my test dataset regardless. Thoughts?
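(For readers outside SAS: the idea above can be sketched with scikit-learn, which exposes the OOB estimate directly. This is an illustration, not the original SAS setup; `oob_score=True` asks the forest to score each tree on the observations left out of its bootstrap sample, giving a validation-like estimate without holding out any data.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True: OOB accuracy is computed as a by-product of fitting,
# so all 500 rows can be used for training.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
```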

 

Thanks, -Ted

1 ACCEPTED SOLUTION

Accepted Solutions
PadraicGNeville
SAS Employee

Re:  more value from my data by having one larger training dataset than two separate training and validation datasets.

Yes, in the common situation where more training data is useful. Leo Breiman agreed.

 

Reasons to use validation data despite this:

 

1. In practice, OOB error rates are often biased on the conservative side. Error rates decrease with the number of trees, at least initially. For a given observation, the OOB estimate is based on only about 1/3 of the trees. Consequently, if the forest has 100 trees, the OOB error rate is closer to the test-data error rate of a 33-tree forest than that of a 100-tree forest.

 

2. OOB estimates are not directly comparable to the validation estimates used by other algorithms. So how would one confirm that a forest is better than a neural network unless the forest is applied to the same validation data? Using the test set to select a model is risky if the data is easily overfit.
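The "about 1/3" in point 1 follows from bootstrap sampling: an observation is left out of a given tree's bootstrap sample with probability (1 - 1/n)^n ≈ 1/e ≈ 0.368. A quick simulation (illustrative numbers, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trees = 1000, 200

# Count, for each observation, how many bootstrap samples leave it out.
oob_counts = np.zeros(n)
for _ in range(n_trees):
    in_bag = rng.integers(0, n, size=n)       # one tree's bootstrap sample
    mask = np.ones(n, dtype=bool)
    mask[in_bag] = False                       # True where observation is OOB
    oob_counts += mask

frac = oob_counts.mean() / n_trees
print(f"average OOB fraction per observation: {frac:.3f}")  # close to 1/e ~ 0.368
```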

 

Hope this helps,

-Padraic
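To make point 2 concrete, here is a hedged sketch of the apples-to-apples comparison described above: fit a forest and a neural network on the same training split and score both on the same validation split (scikit-learn, illustrative only; the model settings are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Both models see the same training data...
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)

# ...and are scored on the same held-out validation data,
# so the two accuracies are directly comparable.
print(f"forest: {forest.score(X_val, y_val):.3f}")
print(f"net:    {net.score(X_val, y_val):.3f}")
```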




Discussion stats
  • 1 reply
  • 1681 views
  • 2 likes
  • 2 in conversation