This topic is solved and locked.
ShaneMc
Fluorite | Level 6

I’m working with the HP Forest node using an imbalanced training set where the ratio of non-events to events is 6:1. I’m using approximately 60 trees and want the training data for each tree to be balanced 50:50 non-event to event.

Do I need to use a Sample node to adjust the training data beforehand, or does the random forest node automatically select a balanced sample for each iteration/bag?


I currently have two models set up: the first uses the pre-sampling approach (throwing away a large proportion of the non-event observations), and the second feeds the imbalanced training set straight into the HP Forest node. The second approach is giving me the better ROC/lift on my holdout sample, so I’m guessing the HP Forest node is doing something smart under the hood.
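For reference, outside Enterprise Miner the two setups would roughly correspond to the sketch below. This is only an illustration, not the nodes' exact internals: the dataset TRAIN, target EVENT, and inputs X1-X20 are made-up names, and the SURVEYSELECT step is just one way to draw an event-matched 50:50 sample.

/* Approach 1 (pre-sampling): keep all events, then draw an equally  */
/* sized simple random sample of non-events so the training data is  */
/* 50:50. Dataset and variable names here are hypothetical.          */
proc sql noprint;
   select count(*) into :n_events from train where event = 1;
quit;

data events nonevents;
   set train;
   if event = 1 then output events;
   else output nonevents;
run;

proc surveyselect data=nonevents out=nonevents_sample
                  method=srs sampsize=&n_events seed=2016;
run;

data train_balanced;
   set events nonevents_sample;
run;

/* Approach 2: pass the raw 6:1 training data straight to the forest. */
proc hpforest data=train maxtrees=60 seed=2016;
   target event / level=binary;
   input x1-x20 / level=interval;
run;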

 

I’ve taken a look at the (limited) documentation, and unfortunately this is not covered.

 

Any help would be greatly appreciated.


2 REPLIES
JasonXin
SAS Employee
Hi, first, thanks for using SAS. My name is Jason Xin, an advanced analytics solution architect at SAS Institute.

Your raw data, with its 6:1 ratio, is not really that imbalanced from a predictive modeling perspective. A "response rate" (the percentage of 1s in the model universe) anywhere from 40% down to 0.5% is considered normal, not a rare event. In fact, your raw response rate of roughly 14% (one event for every seven records) is close to ideal for seeking lift from a predictive model. The lower the incoming response rate, the easier it is to show strong lift; if the raw rate is fairly high, say 35%, it is hard for any model to show great lift or ROC. That said, a convenient 6:1 ratio in the data does not necessarily make it right, or true to the business at hand. The constraints you face when collecting the data and assembling the model universe may well differ from where and when you want to deploy the model; in statistical terms, the sample may not reflect the source population or target audience. That is typical, and frankly it is the main legitimate reason to adjust the sample at all. Everything above is independent of random forest being the method you are tinkering with; it is general model design practice.

Now back to the HP Forest (random forest) procedure. Unlike HPLOGISTIC, it has no WEIGHT statement. Weighting tells a procedure to treat one physical record as if the data set contained many copies of it; by giving event records one weight and non-event records another, you effectively alter the count ratio between YES and NO. But machine learning methods like random forest build models by repeatedly splitting samples and then assembling/voting them back together, and there is no practical way (this is not a SAS limitation, it is everyone's problem) to push a weight that was imposed on the whole model universe, as in HPLOGISTIC, down correctly to each subsample. In fact, random forest tolerates the target ratio being skewed as it splits and builds deeper and deeper.

Back to your question.

1. If I were you, I would stop the first approach entirely (pre-sampling by throwing away a large proportion of the non-event observations). If 50:50 were actually true to your business, you could simply target the group at random and build the model from the response data of that random campaign; with a true 50:50 base rate, a random selection would perform almost as well as a model.

2. You can very well stick with your second approach if you are comfortable that the 6:1 ratio is representative of your population. SAS implemented random forest as a high-performance procedure because, with big data (tall tables and/or wide tables with many variables and complex relationships), a random forest generally gains accuracy as you train deeper and engage more data. It also has bagging built in, so it is less prone to overfitting. PROC HPFOREST does not automatically (or deliberately) seek to balance the classes; as a tree splits randomly from the root, a node may well end up near 50:50, but that is coincidental rather than by design.

Hope this helps. Happy holiday. Thanks. Best regards, Jason Xin
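To make the weighting contrast concrete, here is a rough sketch of what is described above; the dataset and variable names are hypothetical. HPLOGISTIC can take a per-record weight that rebalances the effective event/non-event counts, whereas HPFOREST is simply handed the raw 6:1 data and relies on bagging.

/* Hypothetical weight that makes events and non-events count equally: */
/* each event counts as 6 records, each non-event as 1 (6:1 raw ratio).*/
data train_wt;
   set train;
   wt = ifn(event = 1, 6, 1);
run;

/* HPLOGISTIC has a WEIGHT statement, so the rebalancing can be */
/* expressed directly in the fit.                                */
proc hplogistic data=train_wt;
   model event(event='1') = x1-x20;
   weight wt;
run;

/* HPFOREST has no WEIGHT statement; the raw, skewed data goes in */
/* as-is and each tree is grown on a bagged sample of it.         */
proc hpforest data=train maxtrees=60 seed=2016;
   target event / level=binary;
   input x1-x20 / level=interval;
run;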
ShaneMc
Fluorite | Level 6

Thanks Jason for such a comprehensive answer – it’s really much appreciated.

 

Just one additional follow-on question if I may: I’ve built a model using HP Forest and I’m now trying to evaluate the variable importance.

In the variable importance table (within the HP Forest results) a number of different metrics are captured, including “Number of Splitting Rules”, “Train: Gini Reduction”, “Train: Margin Reduction”, “OOB: Gini Reduction” and “OOB: Margin Reduction”.

 

I’m trying to find some SAS documentation on how these are calculated. For “OOB: Margin Reduction” I’m getting some negative values, which is a little concerning. Is there any SAS documentation available?
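In case it helps while hunting for the documentation, the importance measures can be pulled into a dataset for inspection. This is only a sketch: VariableImportance is the ODS table name I believe HPFOREST uses for that output, and the dataset and variable names are made up.

/* Capture the variable importance table so the train/OOB Gini and    */
/* margin reduction columns can be examined outside the node results. */
ods output VariableImportance=varimp;

proc hpforest data=train maxtrees=60 seed=2016;
   target event / level=binary;
   input x1-x20 / level=interval;
run;

/* Check the exact column names first, then print; a negative OOB     */
/* margin reduction shows up as a negative value for that variable.   */
proc contents data=varimp; run;
proc print data=varimp; run;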

 

Many thanks in advance.

