gyambqt
Obsidian | Level 7

Many articles say that you need to adjust the predicted probabilities after oversampling a rare target. I am really confused here. I thought the purpose of oversampling is that you believe your target subgroup is underrepresented, so you copy and paste records from the rare target group. You might then end up with a higher predicted probability for a given set of X variables. But if you are required to adjust the probabilities from the oversampled data using the ratio of the original odds to the oversampled odds, aren't you just reverting everything back to where it was before oversampling? So why do it?

5 REPLIES
sbxkoenk
SAS Super FREQ

Hello,

 

> But if you are required to adjust the probabilities from the oversampled data using the ratio of the original odds to the oversampled odds, aren't you just reverting everything back to where it was before oversampling?

 

No, because by then the predictive model has already been built, and the model-building process has profited from oversampling the very rare target. I deliberately say "very rare" target, because I find that many analysts are too quick to oversample. Oversampling should be used primarily to avoid too much input data and for the smoothness and speed of modelling, as most techniques can deal perfectly well with a rare outcome category.

 

Anyway, you have to correct for the real priors (that seems logical to me), otherwise your predicted probabilities are not honest. Not correcting for the true priors will still give you the correct ranking, but your predicted probabilities for the rare category will be artificially high. In general you do not want that; you want the probabilities to be honest, such that they can really be interpreted as probabilities / likelihoods.
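
For reference, a minimal sketch of that prior correction in Python (the function name and the example rates below are my own illustration, not something from this thread): it re-weights the event and non-event parts of the oversampled-model probability by the ratio of true prior to sample prior, which rescales the level of the probabilities without changing their ranking.

```python
def adjust_for_priors(p_hat, sample_event_rate, true_event_rate):
    """Map a probability estimated on an oversampled data set back to the
    original population priors (standard prior/offset correction)."""
    rho1, rho0 = sample_event_rate, 1.0 - sample_event_rate
    pi1, pi0 = true_event_rate, 1.0 - true_event_rate
    # Re-weight the event and non-event parts of the posterior by the
    # ratio of true prior to sample prior, then renormalize.
    num = p_hat * (pi1 / rho1)
    den = num + (1.0 - p_hat) * (pi0 / rho0)
    return num / den

# Example: model built on a 50/50 oversample, true event rate 1%.
print(adjust_for_priors(0.9, sample_event_rate=0.5, true_event_rate=0.01))
# ~0.083 -- much lower than 0.9, but the ranking of cases is unchanged,
# because the mapping is monotone in p_hat.
```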

 

But, to reassure you, correcting for the real priors is not a complex analysis task. It is just an option in a procedure, or a box to check in Enterprise Miner / Model Studio (VDMML).

 

Kind regards,

Koen

Solly7
Pyrite | Level 9

Hi, I am having the same scenario. I have oversampled my data and now I am stuck on how to calibrate the predicted probabilities. Kindly assist.

gyambqt
Obsidian | Level 7
Hi, let's consider an extreme case.
Assume you have 1001 records, of which 1000 are non-events and 1 is an event, so the ratio is 1000:1. You build a logistic regression model on those 1001 rows and get log(p/(1-p)) = 1 + 5*age. You use this model to score the training data and, for the event record, you get a probability of 0.1.
Now you oversample to boost the rare target, so the ratio becomes 1000:1000 for events vs. non-events. You build a model log(p/(1-p)) = 1 + 200*age, use it to score the training data, and for each event record you get a probability of, say, 0.9. This 0.9 is an unadjusted probability and needs to be adjusted according to the original data proportions. So you do some math and, after adjusting, you may get 0.1 back as the adjusted probability. So what is the point of doing the whole thing? (Please ignore the exact calculations above; they are only for demonstration.)
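Plugging those made-up numbers into the standard prior correction (a sketch only; the 0.9, the 1/1001 event rate and the 50/50 oversample come from the hypothetical example above) gives a small probability again, but not simply a reversal of the oversampling: the model producing the 0.9 is still the one fitted on the balanced data.

```python
# Hypothetical numbers from the example above: the oversampled model gives
# p_hat = 0.9, the oversampled event rate is 1000/2000 = 0.5, and the
# original event rate is 1/1001.
p_hat, rho1, pi1 = 0.9, 0.5, 1 / 1001

num = p_hat * (pi1 / rho1)                           # event part, re-weighted
den = num + (1 - p_hat) * ((1 - pi1) / (1 - rho1))   # plus non-event part
print(num / den)  # about 0.009, not 0.1 -- the correction rescales the
                  # level, but the coefficients (and the ranking they give)
                  # come from the model fitted on the oversampled data.
```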
PaigeMiller
Diamond | Level 26

Your model has a better chance of predicting the bads if there are more bads in the data set used to create the model.

 

You will get a different model fit, and different predicted probabilities, if your data set is 1000:1 versus 50% good and 50% bad.
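
As a rough illustration of this (a simulated-data sketch using numpy and scikit-learn; nothing here comes from the thread itself): fitting the same logistic regression on the original imbalanced data and on a naively oversampled copy gives an intercept shifted by roughly the log of the duplication factor, and slope estimates that are usually close but not identical, so the two fits and their predicted probabilities do differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=(n, 1))
# True model with a very rare event.
p = 1 / (1 + np.exp(-(-6 + 2 * x[:, 0])))
y = rng.binomial(1, p)

# Fit on the original, imbalanced data (large C ~= unpenalized MLE).
orig = LogisticRegression(C=1e6, max_iter=1000).fit(x, y)

# Naive oversampling: duplicate the event rows until classes are balanced.
events, nonevents = x[y == 1], x[y == 0]
k = len(nonevents) // max(len(events), 1)
x_over = np.vstack([nonevents, np.repeat(events, k, axis=0)])
y_over = np.concatenate([np.zeros(len(nonevents)), np.ones(len(events) * k)])
over = LogisticRegression(C=1e6, max_iter=1000).fit(x_over, y_over)

print("original   intercept/slope:", orig.intercept_, orig.coef_)
print("oversample intercept/slope:", over.intercept_, over.coef_)
# The intercepts differ by roughly log(k); the slopes are usually close but
# not identical, so the two fits are not exact copies of each other.
```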

--
Paige Miller
