I’m working with the HP Forest node using an imbalanced training set where the ratio of non-events to events is 6:1. I’m using approximately 60 trees and want the training data for each tree to be balanced 50:50 (non-event:event).
Do I need to use a Sample node to adjust the training data beforehand, or does the HP Forest node automatically select a balanced sample for each iteration/bag?
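To make the question concrete, here is a rough sketch of the per-tree balancing I have in mind. It is plain Python rather than SAS, the function name `balanced_bootstrap` and the 600/100 toy data are made up for illustration, and it is not a claim about what the HP Forest node actually does internally.

```python
import random
from collections import Counter

def balanced_bootstrap(labels, rng):
    """Return row indices for one tree's bag: equal numbers of events
    and non-events, each drawn with replacement (hypothetical helper)."""
    events = [i for i, y in enumerate(labels) if y == 1]
    non_events = [i for i, y in enumerate(labels) if y == 0]
    n = len(events)  # each class contributes as many rows as there are events
    return rng.choices(events, k=n) + rng.choices(non_events, k=n)

# Toy training set with a 6:1 ratio of non-events (0) to events (1).
labels = [0] * 600 + [1] * 100

rng = random.Random(42)
bags = [balanced_bootstrap(labels, rng) for _ in range(60)]  # one bag per tree

# Class counts inside the first bag.
counts = Counter(labels[i] for i in bags[0])
```

Each of the 60 bags holds the same number of events and non-events, while different non-events land in different bags, so across the forest far fewer non-event rows are discarded than with a single up-front sample.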
I currently have two models set up: the first uses the pre-sampling approach (throwing away a large proportion of the non-event observations), and the second feeds the imbalanced training set directly into the HP Forest node. The second approach gives me the better ROC/lift on my holdout sample, so I’m guessing the HP Forest node is doing something smart under the hood.
I’ve taken a look at the limited documentation, and unfortunately this is not covered.
Any help would be greatly appreciated.
Thanks, Jason, for such a comprehensive answer; it’s very much appreciated.
Just one follow-up question, if I may: I’ve built a model using HP Forest and I’m now trying to evaluate the variable importance.
In the variable importance table (within the HP Forest results), a number of different metrics are reported, including “Number of Splitting Rules”, “Train: Gini Reduction”, “Train: Margin Reduction”, “OOB: Gini Reduction” and “OOB: Margin Reduction”.
I’m trying to find some SAS documentation on how these are calculated. For “OOB: Margin Reduction” I’m getting some negative values, which is a little concerning. Is there any SAS documentation available?
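For context, my working understanding, which I have not been able to confirm against SAS documentation, is that the margin of an observation is the predicted probability of its true class minus the largest predicted probability of any other class. Under that assumption, the sketch below (plain Python, with made-up node probabilities and out-of-bag labels) shows how a split that looks good on training data could lower the average margin on out-of-bag rows. The names `margin`, `mean_margin` and `margin_gain` are mine, not SAS terms.

```python
def margin(probs, true_class):
    """Probability of the true class minus the largest probability
    among the other classes (assumed definition)."""
    others = [p for c, p in probs.items() if c != true_class]
    return probs[true_class] - max(others)

def mean_margin(observations):
    """Average margin over (predicted probabilities, true class) pairs."""
    return sum(margin(p, y) for p, y in observations) / len(observations)

# Class probabilities estimated from TRAINING rows (illustrative values).
parent = {0: 0.5, 1: 0.5}   # before the split
left = {0: 0.1, 1: 0.9}     # left child after the split
right = {0: 0.9, 1: 0.1}    # right child after the split

# True classes of OUT-OF-BAG rows routed to each child (illustrative values).
oob_left = [0, 0, 1]
oob_right = [1, 1, 0]

before = mean_margin([(parent, y) for y in oob_left + oob_right])
after = mean_margin([(left, y) for y in oob_left] +
                    [(right, y) for y in oob_right])

# Change in average out-of-bag margin attributable to the split.
margin_gain = after - before
```

In this toy case `margin_gain` comes out negative, because the split's confident predictions are mostly wrong on the out-of-bag rows. If HP Forest computes something along these lines, negative values would point to splits that do not generalize rather than to an error, but I would like to confirm that from the documentation.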
Many thanks in advance.