How to implement oversampling in Enterprise Miner?

Posted 02-25-2013 10:50 PM

How do I implement oversampling in Enterprise Miner when the success rate is small for logistic regression?

Thanks.

1 ACCEPTED SOLUTION

Mike,

There are several ways to implement oversampling in EM. The first step is to decide which flavor you are after: oversampling, undersampling, weighting of observations, or duplication of rare events. The choice depends on many factors, including the proportion of rare events (is it 10%, 1%, 0.1%...?) and how many observations you have. The ultimate goal is to have enough examples of your rare class for the model to identify meaningful patterns.

In a typical scenario your target has a rare class, say 10%. If you have enough observations, you can afford to oversample the rare class to 50%. You can do that with a Sample node set to the following properties: Size/Type=Percentage, Size/Percentage=100, Stratified/Criterion=Equal. This produces a 50-50 sample in which all of your rare events are used and only a sample of the 0’s is chosen.
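To make the resulting sample concrete, here is a minimal Python sketch of the same idea outside EM: keep every rare event and draw an equally sized random sample of the 0's. It is not what the Sample node runs internally; the function name, the `target` key, and the toy data are assumptions made up for illustration.

```python
import random

def equal_stratified_sample(rows, target_key="target", seed=0):
    """Keep every rare event (1) and draw an equally sized random sample of 0s."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target_key] == 1]
    non_events = [r for r in rows if r[target_key] == 0]
    # Sample without replacement; never ask for more 0s than exist.
    k = min(len(events), len(non_events))
    balanced = events + rng.sample(non_events, k)
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 1,000 rows with a 10% event rate.
rows = [{"id": i, "target": 1 if i % 10 == 0 else 0} for i in range(1000)]
balanced = equal_stratified_sample(rows)
event_rate = sum(r["target"] for r in balanced) / len(balanced)
```

With this toy data the balanced sample has 200 rows (all 100 events plus 100 sampled non-events), so the event rate in the sample is 50% instead of the original 10%.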

At this point you can already start running models; however, all of your posterior probabilities and many performance metrics will not reflect the true priors. The results are still good for model comparison and performance evaluation, as well as for ranking observations.

If you want your priors to be adjusted, add a Decision node (after Data Partition, for example). Under the Custom Editor, enter the real priors. This will prompt EM to adjust all of your posterior probabilities.
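For intuition, the sketch below applies the standard prior-correction formula that rescales a posterior from the sample priors back to the true priors. It assumes EM's adjustment is equivalent to this formula; the function name and the example values are illustrative only.

```python
def adjust_posterior(p_sample, true_prior, sample_prior):
    """Rescale an event posterior from sample priors to true priors."""
    # Reweight each class by (true prior / sample prior), then renormalize.
    event = p_sample * true_prior / sample_prior
    non_event = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return event / (event + non_event)

# Model trained on the 50-50 sample; the true event rate is 10%.
p_adjusted = adjust_posterior(0.5, true_prior=0.10, sample_prior=0.50)
```

In this example a score of 0.5 on the balanced sample maps to roughly 0.10 after adjustment, that is, to the true prior. The ranking of observations does not change, because the adjustment is monotone in the original score.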

However, and this is something to be careful with, the Decision node alone will NOT prompt EM to use the real priors as the cutoff value when choosing whether an observation is a 0 or a 1. In our example, even after using the Decision node, EM would use 0.5 as the cutoff value.

To get the cutoff right, go back to the Decision node, go to the Decisions tab and select Yes, then click Default to Inverse Prior Weights.

Under the Decision Weights tab, copy the value in the lower right corner to the lower left corner, but add a minus sign in front of it. Then replace the lower right corner with a 0. Just keep in mind that, even after all of this work, some metrics (Misclassification in particular) will not reflect the actual priors. But the posteriors will be right and the 0/1 decision will be right.
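The effect of these weights on the cutoff can be checked with a small Python sketch. It assumes the decision matrix has target levels (1, 0) as rows and decisions (1, 0) as columns, and that the decision with the highest expected weight is chosen; that layout and all names below are assumptions for illustration, not EM's actual internals.

```python
def best_decision(p_event, weights):
    """Return the decision (1 or 0) with the highest expected weight.

    weights[i][j]: row i is the target level (1, then 0),
    column j is the decision (1, then 0). Layout is assumed.
    """
    exp_d1 = p_event * weights[0][0] + (1 - p_event) * weights[1][0]
    exp_d0 = p_event * weights[0][1] + (1 - p_event) * weights[1][1]
    return 1 if exp_d1 > exp_d0 else 0

true_prior = 0.10

# Inverse prior weights: 1/prior on the diagonal.
inverse_prior = [[1 / true_prior, 0.0],
                 [0.0, 1 / (1 - true_prior)]]

# Lower right value moved to the lower left with a minus sign, lower right set to 0.
modified = [[1 / true_prior, 0.0],
            [-1 / (1 - true_prior), 0.0]]

scores = [0.02, 0.05, 0.09, 0.11, 0.30, 0.50]  # prior-adjusted posteriors
decisions_inverse = [best_decision(p, inverse_prior) for p in scores]
decisions_modified = [best_decision(p, modified) for p in scores]
```

Under these assumptions both matrices assign a 1 exactly when the adjusted posterior exceeds the true prior (0.10 here), so scores of 0.11 and above become 1 and the rest become 0. In other words, the cutoff moves from 0.5 to the true prior.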

G

2 REPLIES

Great! Thanks.
