mmaccora
Obsidian | Level 7
Hi,

The problem is the following:

I have a data set of 500,000 observations. The goal is to predict an imbalanced binary target where 0.05% of the observations are labelled as the minority class.

I would like to train a random forest with SAS Enterprise Miner. Is there a way to tell PROC HPFOREST to perform stratified random bootstrap sampling for each tree, so that event observations are guaranteed to be in the sample before each tree is built?

Thank you for your help,
Marco
ACCEPTED SOLUTION
PadraicGNeville
SAS Employee

No, the bagged samples are simple random samples.
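To get a feel for what simple random sampling means with the numbers in the question, here is a rough back-of-the-envelope sketch. The 500,000 rows and 0.05% event rate come from the question and the .6 in-bag fraction is the value mentioned later in this thread; the function names are made up for illustration, and the "no events" estimate assumes independent draws with replacement, which may not match how PROC HPFOREST actually samples.

```python
def expected_events(n_obs, event_rate, inbag_fraction):
    """Expected number of event rows in one simple random sample."""
    n_events = round(n_obs * event_rate)      # 250 events for the numbers in the question
    return n_events * inbag_fraction


def prob_no_events(n_obs, event_rate, inbag_fraction):
    """Chance that one sample holds no event row at all.

    Assumption: rows are drawn independently with replacement.
    """
    sample_size = round(n_obs * inbag_fraction)
    return (1.0 - event_rate) ** sample_size


if __name__ == "__main__":
    # Roughly 150 events are expected per sample at an in-bag fraction of .6.
    print(expected_events(500_000, 0.0005, 0.6))
    # Under the stated assumption this probability is vanishingly small.
    print(prob_no_events(500_000, 0.0005, 0.6))
```

So under these assumptions a sample with no events at all is very unlikely; the difficulty is that events stay a tiny share of every sample.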

If the root node of a tree is not split for any reason, that tree is thrown out, a new sample is drawn, and splitting is attempted on that one.

The number of attempted trees is capped at twice the number of requested trees. For example, if 100 trees are requested, then up to 200 samples might be drawn to create them.
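The retry rule described above can be sketched roughly as follows. This only illustrates the stated behavior and is not PROC HPFOREST's actual code; `grow_forest`, `draw_sample` and `try_build_tree` are hypothetical names, and `try_build_tree` is assumed to return None when the root node cannot be split.

```python
def grow_forest(n_requested, draw_sample, try_build_tree):
    """Collect up to n_requested trees, drawing at most 2 * n_requested samples."""
    trees = []
    attempts = 0
    max_attempts = 2 * n_requested            # cap described in the answer
    while len(trees) < n_requested and attempts < max_attempts:
        attempts += 1
        tree = try_build_tree(draw_sample())  # a fresh sample for every attempt
        if tree is not None:                  # None: root did not split, tree is thrown out
            trees.append(tree)
    return trees, attempts
```

If every root fails to split, the loop stops after 200 attempts for 100 requested trees and returns fewer trees than requested.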

 

-Padraic


3 REPLIES
mmaccora
Obsidian | Level 7
Thank you very much for this very clear answer.

This is bad news, because this way of training uses a lot of resources, often without producing good results.

Do you know of another way to build a random forest with stratified bootstrap sampling?

Thank you,
Marco
PadraicGNeville
SAS Employee

My only ideas are:

A. Increase the in-bag fraction from .6 to, say, .9.

B. Randomly delete most observations from the dominant target class before running the forest.

A more elaborate B would average the predictions of, say, 10 forests of 10 trees, where each forest is trained with different randomly deleted observations from the dominant target class.
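The more elaborate version of B might look something like the sketch below. It is written in plain Python rather than SAS, purely to show the idea. Everything named here is hypothetical: rows are assumed to be dicts with a 0/1 "target" key, `train_forest` stands for whatever routine fits a forest on a list of rows and returns a scoring function, and `keep_fraction` is an arbitrary illustrative value.

```python
import random


def undersample_majority(rows, keep_fraction, rng):
    """Keep every event row and a random keep_fraction of the non-event rows."""
    events = [r for r in rows if r["target"] == 1]
    non_events = [r for r in rows if r["target"] == 0]
    n_keep = min(len(non_events), max(1, round(len(non_events) * keep_fraction)))
    return events + rng.sample(non_events, n_keep)


def train_ensemble(rows, train_forest, n_forests=10, keep_fraction=0.01, seed=0):
    """Fit n_forests models, each on a different random deletion of non-events."""
    rng = random.Random(seed)
    return [train_forest(undersample_majority(rows, keep_fraction, rng))
            for _ in range(n_forests)]


def average_predictions(models, row):
    """Average the scores that the individual forests give to one row."""
    scores = [model(row) for model in models]
    return sum(scores) / len(scores)
```

Each forest sees all the events but a different slice of the dominant class, so averaging should reduce the dependence on any single random deletion.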

 

I will use your note to advocate adding stratified sampling in PROC FOREST, the SAS Viya PROC superseding PROC HPFOREST.

-Padraic

 

 
