- Gradient Boosting is performing worse than random ...


03-06-2014 03:58 PM

Hello SASers,

I am working on a project with a binary target. The target distribution is 13.6% (event) vs. 86.4% (non-event). The decision tree, regression, and gradient boosting models all score around a 19% misclassification rate on the validation data. I have two questions, but first, here are some details of my process flow:

I tried using inverse priors with the models' assessment statistic set to Decision, but switched to Misclassification after I realized the models performed marginally better under that setting.

The Data Partition node is set to 70% (train) and 30% (validation).

I tried oversampling the event cases to 33% of the data, but the misclassification rate rose to 20%.
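For reference, the oversampling step amounts to duplicating event rows until they reach the target proportion. A minimal sketch of that idea in Python (not what the Sample node literally does internally; all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_events(y, target_rate):
    """Return row indices that keep all non-events (y == 0) and sample
    events (y == 1) with replacement until events make up roughly
    `target_rate` of the resampled data."""
    event_idx = np.flatnonzero(y == 1)
    nonevent_idx = np.flatnonzero(y == 0)
    # Solve n_events / (n_events + n_nonevents) = target_rate for n_events.
    n_events = int(round(target_rate * len(nonevent_idx) / (1 - target_rate)))
    sampled_events = rng.choice(event_idx, size=n_events, replace=True)
    return np.concatenate([nonevent_idx, sampled_events])

# 13.6% event rate, as in the original data
y = (rng.random(10_000) < 0.136).astype(int)
idx = oversample_events(y, 0.33)
print(round(float(y[idx].mean()), 2))  # event rate of the resampled data, ~0.33
```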

First question: If I oversample, does the 20% misclassification rate take into account that I oversampled (i.e., the oversampled 20% misclassification is worse than the non-oversampled 19%)? Or is the oversampled 20% better than the non-oversampled 19%, because the event now makes up 33% of the observations, so 20% is clearly an improvement?

Second question: Do y'all have any suggestions about what is causing the models to perform worse than random, and how I might fix the problem?

Thank y'all so much for your time.

Best,

RWB


03-07-2014 09:56 AM

Oops, I made a rookie mistake. I calculated the distribution from the histograms derived from the explore variable process, and I forgot to change my settings from (Top, Default) to (Random, Max). In actuality, the target distribution is around 30% (event) vs. 70% (non-event), so the models are adding to our predictive power.

I'm still curious about the first question I asked above. I'll restate it:

First question: If I oversample, does the 20% misclassification rate take into account that I oversampled (i.e., the oversampled 20% misclassification is worse than the non-oversampled 19%)? Or is the oversampled 20% better than the non-oversampled 19%, because the event now makes up 33% of the observations, so 20% is clearly an improvement?

If y'all could help me solve this one, that would be great.

Thank you.


03-14-2014 12:26 PM

No, oversampling is not being accounted for unless you adjust your prior probabilities and/or decision matrix, either in the Input Data node or a Decisions node after you have sampled. The "Detecting Rare Classes" section under Analytics > Predictive Modeling in the Enterprise Miner Reference Help provides the best practices for handling rare events.
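For anyone reading along, the prior adjustment described here is, for a binary target, the standard correction that maps a posterior estimated on oversampled data back to the population scale (this is a generic sketch of the formula, not the Enterprise Miner implementation; `pi1` is the true population prior and `rho1` is the event proportion in the sample):

```python
def adjust_for_priors(p_sample, pi1, rho1):
    """Map a posterior p_sample, estimated on data where the event makes up
    rho1 of the observations, back to a population with true prior pi1."""
    pi0, rho0 = 1.0 - pi1, 1.0 - rho1
    num = p_sample * pi1 / rho1
    den = num + (1.0 - p_sample) * pi0 / rho0
    return num / den

# With no oversampling (rho1 == pi1) the posterior is unchanged:
print(adjust_for_priors(0.30, 0.136, 0.136))       # -> 0.3
# A 50% posterior on a 33%-event sample shrinks to ~24% at a 13.6% true prior:
print(round(adjust_for_priors(0.50, 0.136, 0.33), 3))
```

Decisions made from the adjusted posteriors (and error rates computed with them) are then on the original population scale, which is what makes models comparable.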

Hope that helps,

Wendy Czika

SAS Enterprise Miner R&D


03-14-2014 01:46 PM

Thank you, Wendy. I'm using inverse priors in the decision matrix, so would the misclassification rate of, let's say, a decision tree take into account that the data is sampled? Here's the situation driving my question: when I deal with rare events (the event happens in 5% of the data), I'll sometimes get a misclassification rate of, let's say, 15% on validation data. I then try oversampling (with inverse priors, of course), increasing the event proportion from 5% to 10%, 20%, 30%, etc., and I end up getting misclassification rates higher than the original 15%. Is there a way to compare across different subsampling proportions? SAS's training material usually suggests oversampling in situations with rare events, but I've been experiencing worse results when I do this.


03-14-2014 03:27 PM

I'm unclear about what exactly you are doing when you say oversampling with inverse priors. If you are using the Sample node to take a higher proportion of rare events, then you need a Decisions node following it to adjust the prior probabilities. When the same prior probabilities are used, it is valid to compare models built with different event proportions from oversampling. The "Prior Probabilities" section in the same part of the EM Reference Help that I mentioned above explains this better than I can!
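One way to see why raw misclassification rates from differently sampled data sets aren't directly comparable: the baseline error of the trivial "predict everyone non-event" model already depends on the event proportion. A quick illustration (proportions taken from the examples in this thread):

```python
# The majority-class baseline shifts with the event proportion, so a 20%
# misclassification rate on 33%-event data can reflect a better model than
# a 19% rate on 13.6%-event data.
for event_rate in (0.05, 0.136, 0.33):
    # Error of always predicting the majority (non-event) class:
    baseline = min(event_rate, 1.0 - event_rate)
    print(f"event rate {event_rate:>5.1%} -> baseline misclassification {baseline:.1%}")
```

This is exactly why the rates only become comparable once the same prior probabilities are applied across all the models.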