Hi,
First, thanks for using SAS. My name is Jason Xin, an advanced analytics solution architect at SAS Institute.
Your raw data with the 6-to-1 ratio is not really that imbalanced from a predictive modeling perspective. A 'response rate' (the percentage of 1s in the model universe) anywhere from 40% down to 0.5% is considered 'normal', 'not a rare event', or 'just fine'. As a matter of fact, your raw response rate of roughly 14-16% is ideal for seeking lift from predictive models. If the raw response rate is very low, it is easy to build a model that shows great lift; we might even say the lower the incoming response rate, the easier it is to boost apparent performance. If the raw rate is fairly high, say 35%, it becomes challenging to build a model with great lift or ROC.
An ideal 6-to-1 response ratio does not necessarily make the sample right, or true to the business at hand. The reality is that the constraints you face in collecting the data and/or assembling the model universe may well differ from where and when you want to deploy the model. In statistical terms, the sample may not reflect the source population or target audience. This is typical, and quite frankly it is the only legitimate incentive to adjust the sample.
All the remarks above are independent of random forest being the method you are tinkering with. They are general model design practice.
Now back to the HP Forest (random forest, RF) procedure. Unlike HPLOGISTIC, it does not have a WEIGHT statement. Weighting tells a procedure to treat one physical record as if the data set contained many copies of it. By assigning one weight to the event entries and another to the non-event entries, you virtually alter the effective count ratio between YES and NO. But machine learning methods like RF build models by splitting the sample into subsamples and finally assembling/voting them back together. There is no practical way (this is not a SAS problem; this is everyone's problem) to propagate a weight properly down to the subsamples once it has been imposed on the whole model universe, the way HPLOGISTIC does. RF actually thrives on the target ratio being skewed around as it splits and builds, going deeper and deeper.
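To make the contrast concrete, here is a minimal sketch of weighting in HPLOGISTIC. The data set and variable names (mydata, response, x1-x3, w) are placeholders, not from your post:

```sas
/* Sketch only: names are placeholders. */
data mydata_w;
   set mydata;
   /* treat each event record as if the data set held 3 copies of it */
   if response = 1 then w = 3;
   else w = 1;
run;

proc hplogistic data=mydata_w;
   model response(event='1') = x1 x2 x3;
   weight w;        /* PROC HPFOREST has no equivalent statement */
run;
```

The WEIGHT statement rescales every record's contribution to the likelihood in one pass over the whole table, which is exactly the step that has no clean analogue once RF starts drawing subsamples.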
Return to your question.
1. If I were you, I would stop the first approach entirely, i.e., "the pre-sampling approach (throwing away a large proportion of the non-event observations)". If 50-50 is true to your business, you could randomly target this group and then use the response data from that random campaign to build a model; if the true rate really is 50-50, a random toss should perform very close to a model anyway.
2. You can very well stick with your second practice, if you are comfortable that the 6-to-1 ratio is representative of your population.
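For reference, the pre-sampling in point 1 is typically done with PROC SURVEYSELECT; a sketch, with placeholder names, shown only so it is clear what I am advising against:

```sas
/* Sketch only: mydata and response are placeholders. */
proc sort data=mydata;
   by response;                 /* STRATA requires sorted input */
run;

proc surveyselect data=mydata out=balanced seed=12345
                  method=srs samprate=(0.17 1);
   /* with response sorted ascending (0 first), keep ~1/6 of the
      non-events and all of the events, moving 6:1 toward ~1:1 */
   strata response;
run;
```

Note that any model fitted on `balanced` sees an artificial 50-50 world, so its predicted probabilities would need to be re-calibrated back to the true base rate before deployment.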
SAS enabled random forest as a high-performance procedure because with big data (tall tables and/or wide tables, i.e., many variables and complex relationships), implementing RF generally yields better model accuracy as you train deeper and engage more data. RF has bagging built in, so it is less prone to over-fitting.
Inside PROC HPFOREST, the procedure does not automatically (or deliberately) seek to balance the target, although as the trees split randomly from the root, a node may well hit a ratio near 50-50. That is automatic, but coincidental.
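A minimal PROC HPFOREST sketch on the unsampled 6-to-1 data; the data set and variable names are placeholders, and the option values are common starting points rather than recommendations for your data:

```sas
/* Sketch only: names and option values are placeholders. */
proc hpforest data=mydata
              maxtrees=200          /* number of trees to grow          */
              vars_to_try=10        /* candidate inputs per split       */
              trainfraction=0.6;    /* bagging fraction per tree        */
   target response / level=binary;
   input x1 x2 x3  / level=interval;
   input region    / level=nominal;
   /* out-of-bag fit statistics by tree count, for choosing maxtrees */
   ods output fitstatistics=fitstats;
run;
```

Because each tree is grown on its own bagged subsample, the event/non-event mix a given split sees drifts naturally, which is the behavior described above.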
Hope this helps. Happy holidays. Thanks.
Best Regards
Jason Xin