Analyze_this
Calcite | Level 5

Hello SASers,

I am working on a project with a binary target.  The target distribution is 13.6% (event) vs. 86.4% (non-event).  The decision tree, regression, and gradient boosting models are scoring around a 19% misclassification rate on the validation data.  I have two questions, but first here are some details of my process flow:

I tried using inverse priors with the models' assessment statistic set to decision, but switched to misclassification after I realized the models performed marginally better under that setting.

Data partition node is set to 70% (train) and 30% (validation).

I tried oversampling the event case to 33% of the data, but the misclassification rate rose to 20%.

First question:  If I oversample, does the 20% misclassification rate take into account that I oversampled (i.e., the oversampled 20% misclassification is worse than the non-oversampled 19%)?  Or is the oversampled 20% better than the non-oversampled 19%, because the event was observed in 33% of the oversampled observations, so 20% is clearly an improvement?

Second question: Do y'all have any suggestions as to what is causing the models to perform worse than random, and how I might fix the problem?

Thank y'all so much for your time.

Best,

RWB

4 REPLIES
Analyze_this
Calcite | Level 5

Oops, I made a rookie mistake.  I calculated the distribution from the histograms derived from the Explore variable process, and I forgot to change my settings from (Top, Default) to (Random, Max).  In actuality, the target distribution is around 30% (event) vs. 70% (non-event), so the models are adding to our predictive power.

I'm still curious about the first question I asked above.  I'll restate it:

First question:  If I oversample, does the 20% misclassification rate take into account that I oversampled (i.e., the oversampled 20% misclassification is worse than the non-oversampled 19%)?  Or is the oversampled 20% better than the non-oversampled 19%, because the event was observed in 33% of the oversampled observations, so 20% is clearly an improvement?


If y'all could help me solve this one, that would be great.


Thank you.



WendyCzika
SAS Employee

No, oversampling is not being accounted for unless you adjust your prior probabilities and/or decision matrix, either in the Input Data node or a Decisions node after you have sampled.  The "Detecting Rare Classes" section under Analytics > Predictive Modeling in the Enterprise Miner Reference Help provides the best practices for handling rare events.
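To illustrate what the prior adjustment does (this is a generic Python sketch of the standard prior-correction formula, not SAS Enterprise Miner code; the function name and parameters are illustrative): a model trained on oversampled data produces event probabilities calibrated to the sample's event proportion, and adjusting the priors re-weights those probabilities back to the true population rate.

```python
def adjust_posterior(p_model, rho_event, pi_event):
    """Correct a model's event probability estimated on oversampled data.

    p_model:   event probability from a model trained on oversampled data
    rho_event: event proportion in the oversampled training data (e.g. 0.33)
    pi_event:  true event prior in the population (e.g. 0.136)
    """
    # Re-weight each class's posterior by (population prior / sample proportion)
    num = p_model * pi_event / rho_event
    den = num + (1 - p_model) * (1 - pi_event) / (1 - rho_event)
    return num / den

# A score of 0.5 from a model trained on data oversampled to 33% events
# corresponds to a noticeably lower probability at the true 13.6% prior.
p_adj = adjust_posterior(0.5, rho_event=0.33, pi_event=0.136)
```

When the sample proportion already equals the population prior, the adjustment is a no-op, which matches the intuition that no correction is needed without oversampling.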

Hope that helps,

Wendy Czika

SAS Enterprise Miner R&D

Analyze_this
Calcite | Level 5

Thank you, Wendy.  I'm using inverse priors in the decision matrix, so would the misclassification rate of, let's say, a decision tree take into account that the data is sampled?  Here's the situation driving my question: in cases where I deal with rare events (the event happens in 5% of the data), I'll sometimes get a misclassification rate of, let's say, 15% on validation data.  I then try oversampling (with inverse priors, of course), increasing the event proportion from 5% to 10%, 20%, 30%, etc., and I end up getting misclassification rates higher than the original 15%.  Is there a way to compare results across different subsampling proportions?  SAS's training material usually suggests oversampling in situations with rare events, but I've been experiencing worse results when I do this.

WendyCzika
SAS Employee

I'm unclear about what you are doing exactly when you say oversampling with inverse priors.  If you are using the Sample node to sample a higher proportion of rare events, then you would need a Decisions node following it to adjust the prior probabilities.  When using the same prior probabilities, it is valid to compare the models with different event proportions from oversampling.  The "Prior Probabilities" section of the same part of the EM Reference Help that I mentioned above explains this better than I can!
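To make the comparability point concrete with a toy sketch (plain Python, not EM; all names are illustrative): if each model's scores are corrected back to the same population prior before thresholding, misclassification rates computed on the same validation set become comparable regardless of the event proportion each model was trained on.

```python
def misclassification_rate(scores, labels, rho_event, pi_event, threshold=0.5):
    """Misclassification on validation data after prior correction.

    scores:    raw event probabilities from a model trained on data
               oversampled to event proportion rho_event
    labels:    true 0/1 validation labels
    pi_event:  population event prior, used to put models trained at
               different sampling proportions on equal footing
    """
    errors = 0
    for p, y in zip(scores, labels):
        # Correct the oversampled posterior back to the population prior
        num = p * pi_event / rho_event
        p_adj = num / (num + (1 - p) * (1 - pi_event) / (1 - rho_event))
        pred = 1 if p_adj >= threshold else 0
        errors += (pred != y)
    return errors / len(labels)

# Models trained at different oversampling rates can now be scored on the
# same validation set under the same prior (e.g. a 13.6% event rate):
rate = misclassification_rate([0.9, 0.2, 0.6], [1, 0, 0],
                              rho_event=0.33, pi_event=0.136)
```

The key is that the prior (and any decision matrix) stays fixed across the candidate models, so only the sampling proportion varies.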

