Solved: Re: Stratified bootstrap sampling with random forest

mmaccora · Posted 10-24-2017 02:54 AM

Hi,

The problem is the following:

I have a data set of 500,000 obs. The goal is to predict an imbalanced binary target where 0.05% of the obs. are labelled as the minority class.

I would like to train a random forest with SAS Enterprise Miner. Is there a way to tell PROC HPFOREST to perform stratified random bootstrap sampling for each tree, so that event observations are guaranteed to be in the sample before each tree is built?

Thank you for your help,
Marco

PadraicGNeville · Posted 10-25-2017 09:37 AM

No, the bagged samples are simple random samples.

If the root node of a tree is not split for any reason, that tree is thrown out, a new sample is drawn, and splitting is attempted on that one.

The number of attempted trees is capped at twice the number of requested trees. For example, if 100 trees are requested, then up to 200 samples might be drawn in total while trying to build them.
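
The retry rule described above can be sketched in Python as follows. This is only an illustration of the behaviour as stated in this post, not PROC HPFOREST code: `grow_forest`, `draw_sample`, and `root_splits` are hypothetical names, and the "both classes present" check is a stand-in assumption for whatever condition actually prevents a root node from splitting.

```python
import random


def grow_forest(n_requested, draw_sample, root_splits):
    """Keep samples whose root can be split; stop after 2 * n_requested draws."""
    max_attempts = 2 * n_requested  # cap on draws, as described in the post
    kept, attempts = [], 0
    while len(kept) < n_requested and attempts < max_attempts:
        attempts += 1
        sample = draw_sample()        # a new simple random sample
        if root_splits(sample):       # otherwise the tree is thrown out
            kept.append(sample)
    return kept, attempts


# Toy data (assumption): 2 events among 1,000 observations, 60% in-bag,
# drawn without replacement.
rng = random.Random(0)
labels = [1] * 2 + [0] * 998


def draw():
    return rng.sample(labels, 600)


def has_both_classes(sample):
    # Hypothetical stand-in for "the root node can be split".
    return 0 < sum(sample) < len(sample)


kept, attempts = grow_forest(100, draw, has_both_classes)
print(f"trees kept: {len(kept)}, samples drawn: {attempts}")
```

With so few events in the toy data, some draws contain no event at all and are discarded, which is the wasted effort the cap limits.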

-Padraic

mmaccora · Posted 10-25-2017 12:00 PM

Thank you very much for this very clear answer.

This is bad news, because this way of training uses a lot of resources, often without producing good results.

Do you perhaps know another way of training a random forest with a stratified bootstrap?

Thank you,
Marco

PadraicGNeville · Posted 10-25-2017 02:28 PM

My only ideas are:

A. increase the in-bag-fraction to, say, .9 from .6.

B. randomly delete most observations from the dominant target class before running the forest.

A more elaborate B would average the predictions of, say, 10 forests of 10 trees, where each forest is trained with different randomly deleted observations from the dominant target class.
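
The more elaborate version of B might look like the following Python sketch. It only builds the undersampled training sets and averages predictions. Training each forest is left to whatever tool is used, and every name here (`undersample`, `make_training_sets`, `average_predictions`, `majority_keep`) is hypothetical rather than a SAS option.

```python
import random
from statistics import fmean


def undersample(labels, majority_keep, rng):
    """Return row indices: every event plus a random subset of non-events."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, min(majority_keep, len(majority)))
    return sorted(minority + kept_majority)


def make_training_sets(labels, n_forests=10, majority_keep=1000, seed=0):
    """One index set per forest; each deletes different non-events."""
    rng = random.Random(seed)
    return [undersample(labels, majority_keep, rng) for _ in range(n_forests)]


def average_predictions(per_forest_probs):
    """Average predicted event probabilities across forests, row by row."""
    return [fmean(column) for column in zip(*per_forest_probs)]


# Toy usage (assumption): 5 events among 100 rows, 3 forests, 20 non-events each.
labels = [1] * 5 + [0] * 95
training_sets = make_training_sets(labels, n_forests=3, majority_keep=20, seed=1)
print([len(s) for s in training_sets])
```

Every training set keeps all the events, so no forest is built without them. Because the non-events are thinned out, the averaged probabilities will overstate the event rate of the original data and may need to be adjusted back to the original prior.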

I will use your note to advocate adding stratified sampling in PROC FOREST, the SAS Viya PROC superseding PROC HPFOREST.

-Padraic
