12-30-2016 05:12 AM
I am trying to understand how to account for oversampling in Enterprise Miner - I did find and read some related topics, however I am still at a loss.
I am trying to predict a binary event that has a probability of 4%. I've gone through the following three scenarios:
1. Oversample the data in SAS BASE so that the event (TARGET=1) is 30% of the sample. Did not create any decisions in EM, ran an HP Forest and got some decent results on the test / validation data sets.
2. Oversample in SAS BASE to 30%, then set up decisions in EM: in the Prior Probabilities tab (Data Source Wizard, if that matters) I put 0.04 for Level 1 and 0.96 for Level 0 in Adjusted Priors. The HP Forest model produced some bizarre results: all predictions were TARGET=0, so not a single case with TARGET=1 was predicted correctly. Moreover, the ROC index for test data was 0.88 and the Cumulative Lift for the 10th percentile was 6.2. These values seem to indicate a good model, which is obviously not the case.
3. No oversampling done, no decisions. I got the exact same results as point 2 above. What's the point of prior probabilities then?
What am I doing wrong? I thought scenario 2 would be the way to go, but I am making a mistake and I can't figure out what it is.
01-02-2017 08:46 PM
Short answer (it's very late over here) ...
You can do the oversampling in Enterprise Miner with the Sampling node. No need to sample upfront with base SAS.
All your predictions are zero (0). That's because EMiner uses 0.5 (50%) as the default cut-off value for binary classification, and after adjusting for the real priors none of your posterior probabilities is higher than 0.5.
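To see why nothing clears 0.5, here is a minimal Python sketch of the standard prior-adjustment formula for posteriors from an oversampled model. It is an illustration, not Enterprise Miner code: the function name `adjust_posterior` is made up, and it assumes EM's adjustment matches this textbook formula, with the 30% sample rate and 4% prior from the question.

```python
def adjust_posterior(p, sample_rate=0.30, true_prior=0.04):
    """Rescale a posterior p from an oversampled model to the true event prior.

    Assumption: the usual reweighting by (true prior / sample proportion),
    followed by renormalisation so the two class probabilities sum to 1.
    """
    event = p * true_prior / sample_rate
    nonevent = (1.0 - p) * (1.0 - true_prior) / (1.0 - sample_rate)
    return event / (event + nonevent)


# Raw posteriors on the oversampled scale and their adjusted counterparts.
for p in (0.30, 0.50, 0.90, 0.95):
    print(f"raw {p:.2f} -> adjusted {adjust_posterior(p):.3f}")
```

Under these assumptions a raw posterior has to exceed roughly 0.91 before its adjusted value passes 0.5, which is why a 0.5 cut-off can end up labelling every case as 0.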
Use the Decision node to enter a *complete* target profile:
* Enter the correct prior probabilities (as you did)
* Go to the Decisions tab and select Yes, then click Default to Inverse Prior Weights. This will move your cut-off from 50% to 4% (your prior probability for an event). See the Decision Weights tab for the resulting matrix.
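Why inverse prior weights put the cut-off at the prior can be sketched in a few lines of Python. This is an illustration under assumptions, not what EM runs internally: the helper `decide` is hypothetical, and it assumes a decision matrix with 1/prior on the diagonal and 0 elsewhere.

```python
def decide(p_adj, prior_event=0.04):
    """Return 1 if predicting the event has the higher expected weight.

    Assumed decision weights: 1/prior for each correct classification,
    0 for each misclassification.
    """
    w_event = 1.0 / prior_event              # 25 for a 4% prior
    w_nonevent = 1.0 / (1.0 - prior_event)   # about 1.04 for a 96% prior
    expected_event = p_adj * w_event
    expected_nonevent = (1.0 - p_adj) * w_nonevent
    return 1 if expected_event > expected_nonevent else 0


# One adjusted posterior just above the prior, one just below it.
print(decide(0.05), decide(0.03))
```

Rearranging `p / prior > (1 - p) / (1 - prior)` gives `p > prior`, so with these weights a case is classified as an event as soon as its adjusted posterior exceeds 4%.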
The ROC index and the Cumulative Lift for the xxth percentile are solely based on the ranking of the observations (by descending posterior probability). Adjusting the posterior probabilities for the real priors is just a downward adjustment of the posterior probabilities to make them 'honest'. The ranking of the predicted observations does not change by this operation, hence the ROC index and the Cumulative Lift for the xxth percentile stay the same.
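A small Python check of that claim, using made-up scores and labels (not data from this thread). The adjustment function is the standard prior-correction formula, assumed here to match what EM applies, and `auc` is a hand-rolled pairwise version of the ROC index.

```python
def adjust_posterior(p, sample_rate=0.30, true_prior=0.04):
    """Rescale an oversampled-model posterior to the true prior (assumed formula)."""
    event = p * true_prior / sample_rate
    nonevent = (1.0 - p) * (1.0 - true_prior) / (1.0 - sample_rate)
    return event / (event + nonevent)


def auc(scores, labels):
    """Share of (event, non-event) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical raw posteriors from an oversampled model, with true labels.
raw = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
labels = [0, 0, 1, 0, 1, 1]
adjusted = [adjust_posterior(p) for p in raw]

print(auc(raw, labels), auc(adjusted, labels))
```

Both calls return the same value because the adjustment is strictly increasing in the raw posterior, even though every adjusted score in this toy example falls below 0.5.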
Note: there is a Cutoff node for finding better cut-offs than 4% (your prior probability for an event)! It may help in finding a better balance between True Positive Rate (Sensitivity) and Precision.
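The trade-off can be illustrated with a short Python sketch. The scores and labels are invented for the example and `tpr_precision` is a hypothetical helper; it only shows the kind of comparison involved and does not reproduce the node's own output.

```python
def tpr_precision(scores, labels, cutoff):
    """Return (sensitivity, precision) when predicting 1 for scores above cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y == 1)
    tpr = tp / (tp + fn)
    # Precision is undefined when nothing is predicted as an event.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    return tpr, precision


# Hypothetical adjusted posteriors with their true labels.
scores = [0.02, 0.03, 0.05, 0.06, 0.10, 0.15, 0.20, 0.35]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

for cutoff in (0.04, 0.12, 0.25):
    print(cutoff, tpr_precision(scores, labels, cutoff))
```

In this toy data, raising the cut-off from 0.04 trades sensitivity for precision, which is the balance you would be tuning.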
Hope this helps,
01-03-2017 11:29 AM
Here's some more info on the Cutoff node and how to use it.
SAS Global Forum 2012
Use of Cutoff and SAS Code Nodes in SAS® Enterprise Miner™ to Determine Appropriate Probability Cutoff Point for Decision Making with Binary Target Models
Yogen Shah, Oklahoma State University, Stillwater, OK