- How to implement oversampling in Enterprise Miner?

02-25-2013 10:50 PM

How do I implement oversampling in Enterprise Miner when the success rate is small for logistic regression?

Thanks.

Accepted Solutions

02-26-2013 10:47 AM

Mike,

There are several ways to implement oversampling in EM. The first step is to determine what flavor of oversampling you are after: oversampling, undersampling, weighting of observations, or duplication of rare events? The choice is influenced by many factors, including the proportion of rare events (is it 10%, 1%, 0.1%, ...?) and how many observations you have. The ultimate goal is to have enough examples of your rare class to allow the model to identify meaningful patterns.

Under a typical scenario your target has a rare class, say 10%. If you have enough observations, you can afford to oversample the rare class to 50%. You can do that with a Sample node with the following properties: Size/Type=Percentage, Size/Percentage=100, Stratified/Criterion=Equal. This results in a 50-50 sample where all of your rare events are used and only a sample of the 0's is chosen.
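EM's Sample node does this stratified draw for you in the GUI. Purely as an illustration of the same equal-stratified logic, here is a hypothetical Python sketch (synthetic data; all names are invented, nothing here is EM's own code):

```python
import random

random.seed(0)

# Synthetic data: 1000 observations with roughly a 10% rare class.
data = [{"id": i, "target": 1 if random.random() < 0.10 else 0}
        for i in range(1000)]

ones = [row for row in data if row["target"] == 1]
zeros = [row for row in data if row["target"] == 0]

# Equal-stratified sample: keep every rare event and draw an equal
# number of 0's at random, yielding a 50-50 training sample.
sample = ones + random.sample(zeros, len(ones))
random.shuffle(sample)

print(len(ones), len(sample))  # every 1 is kept; the sample is twice that size
```

Note that every rare event survives the sampling step; only the majority class is thinned.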

At this point you can already start running models; however, all of your posterior probabilities and many performance metrics will not reflect the true priors. The oversampled data is still good for model comparison, performance evaluation, and ranking of observations.

If you want your priors to be adjusted, add a Decision node (after the Data Partition node, for example). Under the Custom Editor, add the real priors. This prompts EM to adjust all of your posterior probabilities.
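The adjustment in question is the standard prior-correction formula for posteriors estimated on a resampled training set. As a sketch (assuming a binary target with a 10% true prior oversampled to 50%; the function name and defaults are made up for illustration):

```python
def adjust_posterior(p_sample, true_prior=0.10, sample_prior=0.50):
    """Correct a posterior estimated on an oversampled training set back
    to the true prior. Each class's likelihood contribution is reweighted
    by (true prior / sample prior) for that class, then renormalized."""
    w1 = true_prior / sample_prior              # reweight class 1
    w0 = (1 - true_prior) / (1 - sample_prior)  # reweight class 0
    return p_sample * w1 / (p_sample * w1 + (1 - p_sample) * w0)

# A model trained on the 50-50 sample says 0.5; under the true 10% prior,
# that same observation is far less likely to be an event.
print(adjust_posterior(0.5))  # → 0.1
```

This is why a score of 0.5 on oversampled data should not be read as a coin flip: once the true priors are restored, it corresponds to the base rate.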

However, and this is something to be careful with, the Decision node alone will NOT prompt EM to use the real priors as the cutoff value when deciding whether an observation is a 0 or a 1. In our example, even after using the Decision node, EM would still use 0.5 as the cutoff.

In order to get the cutoff right, go back to the Decision node, open the Decisions tab, select Yes, and then click Default to Inverse Prior Weights.

Under the Decision Weights tab, copy the value in the lower right corner to the lower left corner, but add a minus sign in front of it, and replace the lower right corner with a 0. Just keep in mind that, even after all of this work, some metrics (Misclassification in particular) will not reflect the actual priors. But the posteriors will be right, and the 0/1 decision will be right.
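The net effect of inverse prior weights is that the break-even cutoff moves from 0.5 to the true rare-event prior. A hypothetical sketch of that decision rule (the weighting scheme, not EM's internal code, and all names here are assumed):

```python
def decide(p, prior_0=0.90, prior_1=0.10):
    """Classify with inverse-prior decision weights: a correct decision on
    class i earns 1/prior_i. The break-even posterior solves
    p/prior_1 == (1 - p)/prior_0, i.e. p == prior_1."""
    w0, w1 = 1.0 / prior_0, 1.0 / prior_1
    # Compare the expected profit of deciding 1 versus deciding 0.
    return 1 if p * w1 > (1 - p) * w0 else 0

# With a 10% true prior, the effective cutoff sits at 0.10, not 0.50.
print(decide(0.15), decide(0.05))  # → 1 0
```

So an observation scoring just above the base rate is flagged as an event, which is usually what you want when events are rare.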

G

All Replies


03-01-2013 04:58 AM

Great! Thanks.