
Help with over/under sampling of the rare event in predictive modelling

Occasional Contributor
Posts: 16

Help with over/under sampling of the rare event in predictive modelling

Dear All

I have built a model that performs well when I oversample the rare event to 25%, but it produces too many false positives. When I remove the oversampling, the misclassification rate is good, but the number of observations that are both predicted to churn and have actually churned is too small to be useful.

I need your help on what I can do.

I am using SAS Enterprise Miner.

Thanks


Accepted Solutions
SAS Employee
Posts: 120

Re: Help with over/under sampling of the rare event in predictive modelling

The notions of lift and misclassification rate are both problematic in rare event scenarios.  

Consider Lift...

The maximum possible lift for a ...
     ... 50% overall response rate is 100% / 50% =  2
     ... 25% overall response rate is 100% / 25% =  4
     ... 10% overall response rate is 100% / 10% = 10
     ...  5% overall response rate is 100% /  5% = 20
     ...  2% overall response rate is 100% /  2% = 50

which demonstrates that you only get dramatic sounding values for lift with rare events.   However, a lift of 5 (identifying a group with a rate five times the overall response rate) represents a 5% probability if the overall rate is 1% (a gain of 4%), but a lift of 2 represents a 20% probability if the overall response rate is 10% (a gain of 10%).   Therefore, you must be careful not to compare lift across models fit to populations with different overall response rates.
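The arithmetic above can be sketched in a few lines. This is a minimal illustration, not SAS code; `max_lift` and `gain` are hypothetical helper names used only to mirror the numbers in the paragraph above.

```python
# Sketch: why raw lift is not comparable across different overall
# response rates (numbers match the table and paragraph above).
def max_lift(overall_rate):
    """Maximum achievable lift: a perfect segment has a 100% response rate."""
    return 1.0 / overall_rate

def gain(lift, overall_rate):
    """Absolute probability gain of a segment with the given lift."""
    return lift * overall_rate - overall_rate

print(round(max_lift(0.02), 4))   # 50.0 -- rare events allow dramatic lift values
print(round(gain(5, 0.01), 4))    # 0.04 -- lift of 5 at a 1% rate gains only 4 points
print(round(gain(2, 0.10), 4))    # 0.1  -- lift of 2 at a 10% rate gains 10 points
```

A lift of 5 sounds far more impressive than a lift of 2, yet the second segment delivers more than twice the absolute gain.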

Similarly, consider misclassification rates...

In a group of 1000 observations with a ...
     ... 50% overall response rate, you have 500 events, or approximately 50/100 in each decile on average
     ... 25% overall response rate, you have 250 events, or approximately 25/100 in each decile on average
     ... 10% overall response rate, you have 100 events, or approximately 10/100 in each decile on average
     ...  5% overall response rate, you have  50 events, or approximately  5/100 in each decile on average
     ...  2% overall response rate, you have  20 events, or approximately  2/100 in each decile on average

Even an excellent-fitting model with a lift of 5 would still only have 5 * 2/100, or 10/100, events in its top decile if the response rate is 2%. If you choose to predict that all of those observations have the event, you will end up misclassifying 90% of them. Compare this to the model that classifies nobody as having the event -- it correctly classifies 98% of all observations. As a result, modeling rare events almost always means your misclassification rate will be worse than the null (intercept-only) model, except in the most extreme cases (or never, in my experience).
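The comparison above works out exactly with integer counts. A quick sketch of the arithmetic, assuming the 1000-observation, 2%-response-rate scenario just described:

```python
# Exact integer counts for a group of 1000 observations at a 2% rate.
n = 1000
events = 20                # 2% overall response rate -> 20 true events
top = 100                  # the top decile holds 100 observations
lift = 5                   # the model's lift within that top decile

# Lift of 5 means the top decile's response rate is 5 * 2% = 10%.
events_in_top = lift * events * top // n               # 10 events

# Predict "event" for the whole top decile: 90 of those 100 are wrong.
miscls_top_decile = (top - events_in_top) / top        # 0.9

# Predict "non-event" for everyone: only the 20 true events are wrong.
accuracy_null_model = (n - events) / n                 # 0.98

print(miscls_top_decile, accuracy_null_model)  # 0.9 0.98
```

So even with an excellent lift, the trivial "classify nobody" model wins on overall misclassification rate, which is exactly why that metric misleads here.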

 

Your choice of cutoff should be based on combining the probability scores with your business needs. In one case, I had a customer that didn't want to reject claims, so they only considered the top couple of percentiles for rejection, where there was still an extremely high chance of the claims being fraudulent. In another case, the customer had limited exposure and didn't mind going much deeper into the list, even though there was a low chance of response, since they only needed a 2% response rate to be profitable. In the end, overall confusion matrices are not helpful for rare events. It is often more important to focus on performance in the group on which you wish to take action (e.g., the top 3%) than on overall statistics, which are so wildly affected by your oversampling rate.
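The "score the action group, not the whole population" idea above can be sketched as follows. The data here is purely synthetic (`scores` and `actuals` are made-up stand-ins for model output and observed churn flags), and `precision_in_top` is a hypothetical helper, not a SAS Enterprise Miner feature:

```python
# Sketch: act on the top k% of scores and evaluate the response rate
# inside that action group only, rather than using a fixed 0.5 cutoff.
import random

random.seed(0)

# Synthetic population: ~2% rare event, with toy scores that are
# loosely correlated with the outcome.
actuals = [1 if random.random() < 0.02 else 0 for _ in range(10_000)]
scores = [random.random() * 0.5 + 0.5 * y for y in actuals]

def precision_in_top(scores, actuals, top_fraction):
    """Response rate among the top_fraction highest-scored observations."""
    ranked = sorted(zip(scores, actuals), reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(y for _, y in ranked[:k]) / k

# Response rate in the top 3% -- well above the 2% base rate.
print(precision_in_top(scores, actuals, 0.03))
```

The same ranking lets you sweep `top_fraction` until the group's response rate crosses whatever break-even threshold your business case requires (2% in the mailing example above).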


I hope this helps!

Doug 



