mmaccora
Obsidian | Level 7
Hi,

The problem is the following:

I have a data set of 500,000 observations. The goal is to predict an imbalanced binary target in which 0.05% of the observations (about 250) belong to the minority class.

I would like to train a random forest with SAS Enterprise Miner. Is there a way to tell PROC HPFOREST to perform stratified random bootstrap sampling for each tree, so that event observations are guaranteed to be in the sample before each tree is built?

Thank you for your help,
Marco
1 ACCEPTED SOLUTION

PadraicGNeville
SAS Employee

No, the bagged samples are simple random samples.

If the root node of a tree is not split for any reason, that tree is thrown out, a new sample is drawn, and splitting is attempted on that one.

The number of attempted trees is at most twice the number of requested trees. For example, if 100 trees are requested, up to 200 samples might be drawn in the attempt to build them.
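To gauge how often a simple random sample would miss the events entirely, here is a back-of-the-envelope sketch in Python (not SAS). It assumes the figures from the question (500,000 observations, 250 events), an in-bag fraction of 0.6, and sampling without replacement; the last two are assumptions made for the arithmetic, not confirmed details of how PROC HPFOREST draws its samples. The function names are made up for this illustration.

```python
import math

def expected_events(n_events, inbag_fraction):
    """Expected number of events in one in-bag sample."""
    return n_events * inbag_fraction

def log10_prob_no_events(n_obs, n_events, inbag_fraction):
    """log10 of P(sample contains zero events), drawing without replacement.

    Hypergeometric result: prod_{i=0}^{K-1} (N - n - i) / (N - i),
    accumulated in log space to avoid underflow.
    """
    n_sample = round(n_obs * inbag_fraction)
    return sum(
        math.log10((n_obs - n_sample - i) / (n_obs - i))
        for i in range(n_events)
    )

N_OBS = 500_000
N_EVENTS = 250   # 0.05% of 500,000, from the question
INBAG = 0.6      # assumed in-bag fraction

print(expected_events(N_EVENTS, INBAG))
print(log10_prob_no_events(N_OBS, N_EVENTS, INBAG))
```

Under these assumptions each sample holds about 150 events on average, and the chance of a sample with no events at all is far below 10^-90. The sample is still just as imbalanced as the full data, though, which simple random sampling does nothing to change.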

 

-Padraic

mmaccora
Obsidian | Level 7
Thank you very much for this very clear answer.

This is bad news, because this way of training uses a lot of resources, often without producing good results.

Do you perhaps know another way to train a random forest with stratified bootstrap sampling?

Thank you,
Marco
PadraicGNeville
SAS Employee

My only ideas are:

A. Increase the in-bag fraction from 0.6 to, say, 0.9.

B. Randomly delete most observations from the dominant target class before running the forest.

A more elaborate B would average the predictions of, say, 10 forests of 10 trees, where each forest is trained with different randomly deleted observations from the dominant target class.
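The elaborate version of B can be sketched outside SAS as follows. This is only an illustration in plain Python: `fit_stump` is a placeholder learner standing in for a small forest, and the data, function names, `keep_fraction`, and seeds are all invented for the example. Every model keeps all events but sees a different random subset of the non-events, and the final score is the average over the models.

```python
import random

def undersample(rows, keep_fraction, rng):
    """Keep every event (label 1); keep each non-event with probability keep_fraction."""
    return [(x, y) for x, y in rows if y == 1 or rng.random() < keep_fraction]

def fit_stump(rows):
    """Placeholder learner (stands in for a small forest):
    event rate on each side of the mean of x."""
    cut = sum(x for x, _ in rows) / len(rows)

    def rate(above):
        ys = [y for x, y in rows if (x > cut) == above]
        return sum(ys) / len(ys) if ys else 0.0

    hi, lo = rate(True), rate(False)
    return lambda x: hi if x > cut else lo

def train_ensemble(rows, fit, n_models=10, keep_fraction=0.01, seed=1):
    """Train n_models learners, each on a different undersampled copy of rows."""
    rng = random.Random(seed)
    return [fit(undersample(rows, keep_fraction, rng)) for _ in range(n_models)]

def predict_average(models, x):
    """Average the scores of all models for one input value."""
    return sum(m(x) for m in models) / len(models)

# Synthetic data: 50,000 non-events and 25 events (0.05%), one input variable.
data_rng = random.Random(0)
rows = [(data_rng.gauss(0, 1), 0) for _ in range(50_000)]
rows += [(data_rng.gauss(2, 1), 1) for _ in range(25)]

models = train_ensemble(rows, fit_stump)
print(predict_average(models, 3.0), predict_average(models, -1.0))
```

Because the non-events are thinned out, the averaged scores overstate the true event rate; they are fine for ranking, but would need adjusting for the original class proportions before being read as probabilities.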

 

I will use your note to advocate adding stratified sampling in PROC FOREST, the SAS Viya PROC superseding PROC HPFOREST.

-Padraic

 

 
