Many articles say you need to adjust the predicted probability after oversampling a rare target. I am really confused here. I thought the purpose of oversampling is that you believe your target subgroup is underrepresented, so you do some copy-and-paste work on the rare target group. You might then end up with a higher predicted probability for a given X. But if you are required to adjust the probability from the oversampled data using the ratio of the original odds to the oversampled odds, aren't you just reverting everything back to the original state without oversampling? So why oversample at all?
Hello,
> But if you are required to adjust the probability from the oversampled data using the ratio of the original odds to the oversampled odds, aren't you just reverting everything back to the original state without oversampling?
No, because by then the predictive model has already been built, and the model-building process has profited from oversampling the very rare target. I deliberately say "very rare" target, because I find many analysts are too quick to oversample. Oversampling should be used primarily to avoid too much input data and for the smoothness and speed of modelling, as most techniques can deal perfectly well with a rare outcome category.
Anyway, you have to correct for the real priors (that seems logical to me), otherwise your predicted probabilities are not honest. Not correcting for the true priors will still give you the correct ranking, but your predicted probabilities for the rare category will be artificially high. In general you do not want the latter; you want the probabilities to be honest, such that they can really be interpreted as probabilities / likelihoods.
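For a binary target, the usual posterior correction looks like this (here p is the raw predicted probability from the model built on the oversampled data, rho1 is the event proportion in the training sample, and pi1 is the true population event rate; the names are just for illustration):

p_adjusted = p * (pi1 / rho1) / ( p * (pi1 / rho1) + (1 - p) * ((1 - pi1) / (1 - rho1)) )

It leaves the odds contribution of the predictors untouched and only swaps the sample prior odds for the population prior odds, which is exactly why the ranking of cases is preserved.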
But, to reassure you: correcting for the real priors is not a complex analysis task. It's just an option in a procedure, or checking a box in Enterprise Miner / Model Studio (VDMML).
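If you score outside those tools, here is a minimal DATA step sketch of the same correction (the data set and variable names scored, p_1, rho1 and pi1, and the two rates themselves, are assumptions for illustration):

```sas
/* Rescale oversampled posterior probabilities back to the true priors */
data adjusted;
   set scored;                /* assumed: scored data with P_1 = raw   */
                              /* predicted probability of the event    */
   rho1 = 0.5;                /* event proportion in the training data */
   pi1  = 0.02;               /* true event proportion in population   */
   num        = p_1 * (pi1 / rho1);
   den        = num + (1 - p_1) * ((1 - pi1) / (1 - rho1));
   p_adjusted = num / den;    /* honest probability under true priors  */
run;
```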
Kind regards,
Koen
Hi, I am having the same scenario. I have oversampled my data and now I'm stuck on how to calibrate the predicted probabilities. Kindly assist.
How to adjust probability after oversampling: see SAS Usage Note 22601 - Adjusting for oversampling the event level in a binary logistic model (sas.com)
Your model has a better chance of learning to predict the bads if there are more bads in the data set used to build it.
You will get a different model fit, and different predicted probabilities, if your data set is 1000:1 versus 50% good and 50% bad.
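As a worked illustration of how large the correction can be (the numbers are made up): suppose the true event rate is about 1 in 1000 (pi1 = 0.001), but the model was trained on a 50/50 sample (rho1 = 0.5). A raw predicted probability of p = 0.80 then adjusts to

p_adjusted = 0.8 * (0.001 / 0.5) / ( 0.8 * (0.001 / 0.5) + 0.2 * (0.999 / 0.5) ) ≈ 0.004,

i.e. roughly 0.4%: the case is still far riskier than average, but nowhere near an 80% probability.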