Use of oversampling and cut-off node results inter...
06-08-2013 03:10 AM

Hi All,

I have a dataset with a binary target variable whose target to non-target proportion is 20:80, along with some variables to be used in a decision tree analysis. The dataset has around 20,000 rows.

To start with, I ran the decision tree with no adjusted priors. However, I couldn't see any results from the run (no tree map with subsequent partitions, no cumulative lift chart, etc.). I then tried oversampling so that the new target to non-target proportion is 50:50. This time, unlike the earlier scenario, the decision tree produced results.

My questions are:

1) Is there a reason why the tree output with no adjusted priors wasn't created? I am using the default decision tree node settings.

2) How do I decide on the cut-off (percent of Y or 1) for the overall tree, so that I can pick out the important leaf nodes? Or should it be fixed at 50% (since I have oversampled the data from 20:80 to 50:50), such that all leaf nodes above 50% have a higher probability of predicting target = Y or 1?

3) Cutoff node usage: I read in a technical paper that a Cutoff node can be attached to any model node (where the target is binary) and that the actual cut-off can be determined from some of its result tables. Any leads on how to read these? Is there a way to determine whether a cut-off other than 50% is better for my model?

4) Is there a choice between deciding a cut-off based on oversampling (point 2) or from the results of the Cutoff node?

Kindly advise.


Thanks.

Accepted Solutions


There are several issues involved here that need to be separated in order to provide a clearer understanding. For categorical target variables, by default SAS Enterprise Miner assigns the observation to the most likely target level based on the predicted value stored in a variable of the form

P_<target variable name><target variable level>

Using the SAMPSIO.HMEQ data set (which is available by clicking on **Help** --> **Generate Sample Data Sources...** inside Enterprise Miner and adding the **Home Equity** data) as an example, there is a categorical target variable named BAD which has levels 1 and 0. SAS Enterprise Miner generates several variables from any modeling node and in this case it would create the variables

P_BAD1: the probability that BAD=1

P_BAD0: the probability that BAD=0

These probabilities reflect the training data by default, so the probabilities of the rare event will be inflated if you oversample so that there is a higher proportion of observations with BAD=1 (the rare event) in the sample than in the population. If you are only concerned about the predicted outcome, you can simply adjust the cutoff probabilities later using a Cutoff node to get the desired proportion of the data classified as events.
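
As a back-of-the-envelope illustration of what the Cutoff node does (a Python sketch with made-up scores, not Enterprise Miner's internals), you can move the cutoff so that the share of observations classified as events matches a target rate:

```python
import numpy as np

# Hypothetical posterior probabilities (e.g., P_BAD1) from a model
# trained on 50:50 oversampled data.
rng = np.random.default_rng(0)
p_bad1 = rng.beta(2, 2, size=1000)

# Default decision: classify as an event when the probability is >= 0.5.
default_pred = (p_bad1 >= 0.5).astype(int)

# To classify roughly the top 20% (the original population event rate)
# as events, move the cutoff to the 80th percentile of the scores.
cutoff = np.quantile(p_bad1, 0.80)
adjusted_pred = (p_bad1 >= cutoff).astype(int)

print(adjusted_pred.mean())  # about 0.20 of observations flagged as events
```

Only the threshold changes here; the model and the ordering of the observations stay the same.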

Should you be more interested in the actual probabilities themselves (rather than just the ordering of the observations from most likely to least likely) and wish to have the probability scores reflect values closer to the original population rather than the training data, you can accomplish this by creating a Target profile in the Input Data Source node. A Target profile allows you to adjust the prior probability and the weight/value attached to correctly predicting each outcome. Adjusting the prior probability for an oversampled target will adjust the probability scores to be centered closer to the overall population average you provide. Depending on which criteria you are using for choosing the model, it might also be useful to apply additional weight/value to correctly predicting the rare event.
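
For intuition, the effect of specifying priors can be sketched with the standard prior-correction arithmetic (a Python sketch of the usual formula; I am not claiming this is Enterprise Miner's exact implementation):

```python
def adjust_for_priors(p_event, sample_rate=0.5, population_rate=0.2):
    """Rescale a posterior probability estimated on oversampled data
    back toward the population event rate (standard prior correction)."""
    num = p_event * population_rate / sample_rate
    den = num + (1 - p_event) * (1 - population_rate) / (1 - sample_rate)
    return num / den

# A 0.5 score on 50:50 oversampled data maps back to the 0.2 prior,
# and the ordering of the observations is preserved.
print(adjust_for_priors(0.5))   # 0.2
print(adjust_for_priors(0.8) > adjust_for_priors(0.5))  # True
```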

By default, SAS Enterprise Miner defines two variables for a categorical target:

F_<variable name>: the target level for each observation

I_<variable name>: the predicted target level based on the fitted model (the most likely outcome)

When you request to use decision weights in your target profile, SAS Enterprise Miner will create a decision variable of the form

D_<target variable name>

with the predicted outcome based on choosing the most profitable (or least costly) outcome from the product of the predicted probability and the decision weight for each level. I_<variable name> and D_<variable name> provide reasonable approaches in many situations, but in rare-event scenarios the I_ variable will likely predict too few people as having the event and the D_ variable will predict too many. As a result, I generally advise people to take their business objectives into consideration when choosing a cutoff for their particular data set. Without specifying decision weights, you might end up with a tree with no branches if none of the leaves represents a higher probability for the rare event. It is therefore often helpful to specify your priors and decision weights.
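
The D_ logic amounts to picking the level whose probability-times-weight product is largest; a minimal sketch with hypothetical weights:

```python
def decide(p_event, w_event=4.0, w_nonevent=1.0):
    """Return 1 (event) if the weighted expected value favors the event.
    The weights here are hypothetical; in practice they come from the
    business costs/profits you set in the target profile."""
    return 1 if p_event * w_event >= (1 - p_event) * w_nonevent else 0

# Unweighted (the I_ behavior) needs p_event >= 0.5; with a weight of 4
# on the rare event the break-even point drops to 1 / (1 + 4) = 0.2.
print(decide(0.3))            # 1: weighted decision flags the event
print(decide(0.3, 1.0, 1.0))  # 0: most-likely-outcome rule does not
```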

It is easy to accomplish this task by following the instructions in Usage Note 47965: Using priors and decision weights in SAS® Enterprise Miner™, which is available at

http://support.sas.com/kb/47/965.html

where it shows the following:

/*** BEGIN USAGE NOTE 47965 EXCERPT ***/

Data mining problems routinely involve situations where one target level is more "rare" than others. By default, SAS Enterprise Miner assigns the most likely outcome as the predicted outcome. This assignment results in decision rules that strongly favor the common outcome, which is usually not of interest. The assignment often generates models with no predicted events of interest.

If you specify priors, then the posterior probabilities are adjusted, but the adjustment might lead to no variables selected. Even if a model is successfully fit, the predicted outcome might be the common target level.

Example: an event occurs 1% of the time. A person who is 10 times as likely to have the event still has only a 10% chance of having it. You can change this predicted outcome by modifying the default decision weights, either in a Decisions node or in an Input Data node.

To edit the default decision weights in the Input Data node, follow these steps:

- Click the Input Data node.
- Click the ellipsis (...) to the right of the Decisions property.
- Click Build to create a target profile.
- Click the Decisions tab.
- Click Default with Inverse Prior Weights. This selection enables you to find variables that are useful predictors.
- Click Decision Weights to see that the values changed from their default values.
- Click OK.

To determine the amount of weight to assign to the rare event in a binary target, calculate this ratio:

        probability of the common event
ratio = -------------------------------
        probability of the rare event

Specify the weight on the rare event to be equal to this ratio. For example, if you have a binary event where Prob(Yes)=0.1 and Prob(No)=0.9, then the ratio of the common event to the rare event is 0.9/0.1 = 9. Change the weight for Yes from the default of 1 to the value 9 in the Decision Weights tab. If your rare event is much rarer, for example 2%, then the ratio is 0.98/0.02 = 49. If you have an event that occurs much less than 1% of the time, then you might get better results by over-sampling and then adjusting the probabilities later. Even if you over-sample, the priors adjust the probabilities, but the predicted outcome is still the common event (if you do not modify the decision weights).

The choice of the predicted-probability cutoff for predicting an event or non-event relies on business expertise. In the case of a rare event, it is common to focus only on the predictions in the small range of values for which action is taken. A model that always predicts the common outcome is right as often as the common event occurs in the data (example: 95% of the time). SAS Enterprise Miner provides an automated choice that is based on the decision weights that you provide. If these weights do not represent how you expect to implement the results, then focus on the ordering of the probabilities and choose your own threshold for action.

For more information, see the chapter "Predictive Modeling" in SAS Enterprise Miner Help.

Note: you might be able to apply this technique to a target variable that contains more than two levels. In that case, you need to specify how you want the levels to be weighted with respect to each other.

/*** END USAGE NOTE 47965 EXCERPT ***/
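
The weight-ratio rule from the excerpt is simple enough to sketch (the function name is mine, not a SAS one):

```python
def rare_event_weight(p_rare):
    """Decision weight for the rare level: P(common) / P(rare)."""
    return (1 - p_rare) / p_rare

print(round(rare_event_weight(0.10), 3))  # 9.0
print(round(rare_event_weight(0.02), 3))  # 49.0

# With weight 9 on a 10% event, a person scored at p = 0.15 is now
# predicted as an event: 0.15 * 9 = 1.35 beats 0.85 * 1.
```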

You might also consider reviewing the paper Identifying and Overcoming Common Data Mining Mistakes, which is available at

http://www2.sas.com/proceedings/forum2007/073-2007.pdf

which discusses handling target variable event levels that occur in different proportions at the bottom of page 6.

I hope this helps!

Doug
