Many articles say you need to adjust the predicted probabilities after oversampling a rare target. I am really confused here. I thought the purpose of oversampling is that you believe your target subgroup is underrepresented, so you just copy and paste records from the rare target group. So you might end up with a higher predicted probability for a given X. But if you are then required to adjust that probability using the ratio of the original odds to the oversampled odds, aren't you just reverting everything back to the original state before oversampling? So why do it?
Hello,
> But if you are then required to adjust that probability using the ratio of the original odds to the oversampled odds, aren't you just reverting everything back to the original state before oversampling?
No, because by then the predictive model has already been built, and the model-building process has profited from oversampling the very rare target. I say "very rare" deliberately, because I find many analysts are too quick to oversample. Oversampling should be used primarily to avoid too much input data and for the smoothness and speed of modelling, as most techniques can deal perfectly well with a rare outcome category.
In any case, you have to correct for the real priors (that seems logical to me), otherwise your predicted probabilities are not honest. Not correcting for the real priors will still give you the correct ranking, but your predicted probabilities for the rare category will be artificially high. In general you do not want that: you want the probabilities to be honest, so that they can really be interpreted as probabilities / likelihoods.
But, in order to reassure you, correcting for the real priors is not a complex analysis task. It's just an option in a procedure or checking a box in Enterprise Miner / Model Studio VDMML.
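To show what that option is doing under the hood, here is a minimal sketch (my own Python illustration, not SAS code and not necessarily the exact formula any particular procedure uses): the predicted odds from the oversampled model are rescaled by the ratio of the true prior odds to the sample odds.

```python
# Illustrative sketch (not SAS code): convert a probability predicted on
# oversampled data back to the true-prior scale.
#   pi1  = event rate in the real population (the "true prior")
#   rho1 = event rate in the oversampled training data
# The function and parameter names are mine, chosen only for this example.

def adjust_probability(p_over, pi1, rho1):
    """Rescale the predicted odds by (true prior odds) / (sample odds)."""
    pi0, rho0 = 1.0 - pi1, 1.0 - rho1
    odds = (p_over / (1.0 - p_over)) * (pi1 * rho0) / (pi0 * rho1)
    return odds / (1.0 + odds)
```

Because the rescaling factor is a positive constant, the adjustment is monotone: the ranking of cases is preserved, and only the probability scale is put back on an honest footing.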
Kind regards,
Koen
Hi, I am having the same scenario. I have oversampled my data and now I'm stuck on how to calibrate the predicted probabilities. Kindly assist.
How to adjust probabilities after oversampling:
22601 - Adjusting for oversampling the event level in a binary logistic model (sas.com)
Paige Miller
Assume you have 1001 records, of which 1000 are nonevents and 1 is an event, so the ratio is 1000:1. You build a logistic regression model on those 1001 rows and get log(p/(1-p)) = 1 + 5*age. You use this model to score the training data, and for the event record you get a predicted probability of 0.1.
Now you use oversampling to boost the rare target, so you get a 1000:1000 ratio of events to nonevents. You build a model log(p/(1-p)) = 1 + 200*age, use it to score the training data, and for each event record you get, say, a probability of 0.9. This 0.9 is an unadjusted probability and needs to be adjusted back to the original data proportions. After doing the math, you may get 0.1 again as the adjusted probability. So what is the point of doing the whole thing? (Please ignore the actual numbers above; they are only for demonstration.)
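To make the arithmetic concrete anyway, here is the kind of adjustment I mean, with the made-up numbers above plugged into the usual prior-correction formula (a rough Python sketch, just for illustration):

```python
# Illustrative numbers from the example above: 1 event per 1001 records in the
# original data, a 50/50 split after oversampling, and an unadjusted score of 0.9.
pi1, rho1 = 1 / 1001, 0.5            # true and oversampled event rates
p_over = 0.9                         # probability from the oversampled model

pi0, rho0 = 1 - pi1, 1 - rho1
odds = (p_over / (1 - p_over)) * (pi1 * rho0) / (pi0 * rho1)
p_adj = odds / (1 + odds)
print(round(p_adj, 4))               # about 0.0089
```

So the adjusted probability becomes small again, but it is still roughly nine times the base rate of about 0.001, not a plain reversal to the raw event rate.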
Your model has a better chance of predicting the bads if there are more bads in the data set used to create the model.
You will get a different model fit, and different predicted probabilities, if your data set is 1000:1 versus 50% good and 50% bad.
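To see this concretely, here is a quick synthetic sketch (Python with scikit-learn; the data, the coefficients, and the 0.05 * age effect are all made up for illustration) comparing a fit on raw 1000:1-style data with a fit on data balanced by duplicating the events:

```python
# Synthetic illustration: the same logistic regression fit on imbalanced data
# versus on data oversampled to a roughly 50/50 split gives a different fit
# (mainly in the intercept) and very different predicted probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
age = rng.uniform(18, 80, n)
p = 1 / (1 + np.exp(-(-9.0 + 0.05 * age)))   # rare event, log-odds rise with age
y = rng.binomial(1, p)
X = age.reshape(-1, 1)

# Fit on the original, highly imbalanced data
m_raw = LogisticRegression(max_iter=1000).fit(X, y)

# Fit on oversampled data: duplicate the event rows until classes are ~balanced
events = np.where(y == 1)[0]
reps = (y == 0).sum() // max(len(events), 1)
X_bal = np.vstack([X, np.repeat(X[events], reps, axis=0)])
y_bal = np.concatenate([y, np.ones(len(events) * reps, dtype=int)])
m_bal = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

print(m_raw.intercept_, m_raw.coef_)        # the two fits differ, mainly in the intercept
print(m_bal.intercept_, m_bal.coef_)
print(m_raw.predict_proba(X[:5])[:, 1])     # tiny probabilities on the raw data...
print(m_bal.predict_proba(X[:5])[:, 1])     # ...much larger ones from the balanced fit
```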
Paige Miller