
Tip: How to model a rare target using an oversample approach in SAS® Enterprise Miner™

by Super Contributor on ‎02-06-2015 01:11 PM - edited on ‎10-11-2016 04:03 PM by SAS Super FREQ (9,964 Views)

Even today, in the Big Data era, data miners still frequently face the challenge of training a predictive model on data sets with a rare or relatively low count of events in the target variable.

 

 

This question was first posted in the community. There are a few ways to do it, and a reply to that post (https://communities.sas.com/t5/SAS-Data-Mining/How-to-implement-oversampling-in-Enterprise-Miner/td-...) gave a fantastic answer on how to model a rare target event using oversampling in Enterprise Miner.

I summarized this approach in this example so that you can both simulate a data set with a rare target event and try out the oversample, or balanced sampling, approach. If balanced sampling is impractical in your case, the SAS® Enterprise Miner™ Reference Help describes five other methods for dealing with rare target events in the section Detecting Rare Classes.

 

 

Simulate a data set with a rare target event

If you don’t have a data set with a rare target event handy, you can use this first part of the diagram to transform the German Credit data set into a mock-up with a rare target.

1_simulate_rare_target.png


On your Sample node specify:

 

| Property | Value | In plain English |
| --- | --- | --- |
| Criterion | Level Based | You want a stratified sample based on one of the levels of your target. |
| Level Selection | Rarest Level | You want to specify the sample and level proportions for the rarest level of your target variable. For this example, specifying the rarest level is a great shortcut because you know that you have fewer bads (events) than goods in your data. Otherwise, you need to make sure that bad is the specified event for your data set. |
| Level Proportion | 10 | To mock up a rare target, you want to keep only 10% of the bads, which are the rarest level you specified in Level Selection. |
| Sample Proportion | 5 | Coming out of your Sample node, you want 5 bads for every 95 goods. Feel free to experiment with lower sample proportions. |

 

Run this first part of the flow and notice from the Sample node results that the original data had 300 events and a proportion of 30 bads for each 70 goods. After your Sample node, you now have 10% of the original bads (30) and a proportion of 5 bads for each 95 goods.
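To check the arithmetic behind these counts, here is a minimal Python sketch. It is not what the Sample node runs internally; the function name `simulate_rare_target` and its parameters are hypothetical and simply mirror the Level Proportion and Sample Proportion properties described above.

```python
def simulate_rare_target(n_events, n_non_events, level_proportion, sample_proportion):
    """Return (events_kept, non_events_kept) for a level-based sample.

    level_proportion:  percent of the rarest level (events) to keep.
    sample_proportion: percent of the output sample made up by the rarest level.
    Both names are illustrative, not Enterprise Miner API calls.
    """
    events_kept = round(n_events * level_proportion / 100)
    # Total sample size such that events_kept is sample_proportion percent of it.
    total = round(events_kept * 100 / sample_proportion)
    non_events_kept = total - events_kept
    if non_events_kept > n_non_events:
        raise ValueError("not enough non-events to reach the requested proportion")
    return events_kept, non_events_kept


# German Credit: 300 bads (events) and 700 goods (non-events).
kept = simulate_rare_target(300, 700, level_proportion=10, sample_proportion=5)
print(kept)  # (30, 570)
```

With 30 bads and 570 goods, the bads make up 30 / 600 = 5% of the sample, which matches the 5-to-95 proportion reported by the node.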

 

01_sample_node_results.png

 

 

Oversample and train a model

 

Now that you have a data set with a rare target event, let's oversample. To do that, use another Sample node and specify Type as Percentage, Percentage as 100.0, and Criterion as Equal. This gives you a data set that has all your events and a random sample of your non-events. In this example, you keep the 30 bads (events) and only a random sample of 30 of the 570 goods (non-events) in the input data set.
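Outside Enterprise Miner, the same balanced sampling idea can be sketched in a few lines of Python. This is only an illustration of the technique, assuming records are dictionaries with a `target` field; the function `balanced_sample`, the field names, and the seed are hypothetical.

```python
import random


def balanced_sample(records, target_key="target", event="bad", seed=12345):
    """Keep every event and an equally sized random sample of non-events."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    events = [r for r in records if r[target_key] == event]
    non_events = [r for r in records if r[target_key] != event]
    n = min(len(events), len(non_events))
    # When events are the rarer level, all of them are kept.
    return rng.sample(events, n) + rng.sample(non_events, n)


# Mock-up of the rare-target data: 30 bads and 570 goods.
rare = [{"id": i, "target": "bad"} for i in range(30)]
rare += [{"id": 30 + i, "target": "good"} for i in range(570)]

balanced = balanced_sample(rare)
counts = {level: sum(r["target"] == level for r in balanced) for level in ("bad", "good")}
print(counts)  # {'bad': 30, 'good': 30}
```

Strictly speaking, this undersamples the goods rather than duplicating the bads, which is the behavior described for the Equal criterion above.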

 

02_oversample_proportions.png

 

 

Next, add a Data Partition node with 70% for training and 30% for validation. Then add a Decisions node to specify the correct decision consequences. In your Decisions node, specify Apply Decisions as Yes and Decisions as Custom, then open the Custom Editor. On the Decision Weights tab, enter the inverse priors based on the "original" proportion of rare events, 0.05. Your decision matrix should look like this:

DecisionWeights.png
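As a rough check of the numbers you would type in, the inverse priors for an event proportion of 0.05 can be computed as below. The helper `inverse_prior_weights` is hypothetical, and the diagonal layout (inverse prior on the correct decision, zero elsewhere) is an assumption about how the matrix is arranged.

```python
def inverse_prior_weights(event_prior):
    """Return inverse-prior weights for a binary target (illustrative helper)."""
    if not 0 < event_prior < 1:
        raise ValueError("event_prior must be strictly between 0 and 1")
    return {"event": 1 / event_prior, "non_event": 1 / (1 - event_prior)}


weights = inverse_prior_weights(0.05)

# Assumed layout: rows are target levels, columns are decisions.
decision_matrix = [
    [weights["event"], 0.0],      # target = event (bad)
    [0.0, weights["non_event"]],  # target = non-event (good)
]
for row in decision_matrix:
    print([round(value, 4) for value in row])
```

For a prior of 0.05 this gives a weight of 20 for events and about 1.0526 for non-events.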

 

 

 

Add some models and a Model Comparison node to select the best one. For this example I used a logistic regression, a logistic regression with stepwise selection, and a gradient boosting. Your diagram should look like below. Feel free to add more model nodes.

 

3_oversample_diagram.png


When you specify Apply Decisions=Yes using a Decisions node or directly on your data source node, the Model Comparison node selects the best model according to average profit. This is the statistic you want to use for oversampled data sets because none of your other fit statistics, such as misclassification or mean square error, are adjusted in this case. (NOTE: Alternatively, you could have entered the true prior probabilities in the Decisions node to have the posterior probabilities adjusted accordingly; these statistics would then be valid.)
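To give a feel for what adjusting posterior probabilities for true priors means, here is a sketch of the standard prior-correction formula for a binary target. Whether Enterprise Miner applies exactly this computation is an assumption; the function `adjust_posterior` and its parameter names are hypothetical.

```python
def adjust_posterior(p_event, sample_prior, true_prior):
    """Rescale a posterior from an oversampled model back to the true priors.

    p_event:      posterior probability of the event from the oversampled model.
    sample_prior: proportion of events in the oversampled training data.
    true_prior:   proportion of events in the original population.
    """
    event_term = p_event * true_prior / sample_prior
    non_event_term = (1 - p_event) * (1 - true_prior) / (1 - sample_prior)
    return event_term / (event_term + non_event_term)


# A score of 0.5 on 50/50 balanced data, with a true event rate of 5%.
adjusted = adjust_posterior(0.5, sample_prior=0.5, true_prior=0.05)
print(round(adjusted, 4))  # 0.05
```

A case that looks like a coin flip on the balanced data is, after correction, no more likely to be an event than the 5% base rate.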

 

Note from the results that, for this example, the Model Comparison node selected the gradient boosting model because it has a slightly better average profit.

Average profit is calculated depending on your target and your model. For a binary target and a decision tree model it is calculated as:

 

 

Expected profit = posterior_probability_of_non_event * corresponding_value_on_profit_matrix + posterior_probability_of_event * corresponding_value_on_profit_matrix
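The formula above can be sketched in Python as follows. The profit matrix values (20 and 1) and the decision names are made up for illustration, and `expected_profits` is a hypothetical helper, not an Enterprise Miner function.

```python
def expected_profits(p_event, profit_matrix):
    """Return the expected profit of each decision for one observation."""
    posterior = {"event": p_event, "non_event": 1 - p_event}
    decisions = next(iter(profit_matrix.values())).keys()
    return {
        decision: sum(posterior[level] * profit_matrix[level][decision] for level in posterior)
        for decision in decisions
    }


# Illustrative profit matrix: rows are target levels, columns are decisions.
profit_matrix = {
    "event": {"flag_event": 20.0, "flag_non_event": 0.0},
    "non_event": {"flag_event": 0.0, "flag_non_event": 1.0},
}

profits = expected_profits(0.1, profit_matrix)
best = max(profits, key=profits.get)  # decision with the highest expected profit
print(best)  # flag_event
```

With these illustrative weights, an observation with a posterior event probability of only 0.1 is still flagged as an event, because the expected profit of flagging it (about 2.0) exceeds that of not flagging it (about 0.9).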

 

 

Find more details in the Reference Help under the Decisions section of the Predictive Modeling chapter.

  

Since there is more than one way in Enterprise Miner to do this, I am curious here to know how you approach modeling data sets with rare target events. How would you model this?


I hope you find this useful. All comments welcome!

 

Good luck!

-Miguel

 

Comments
by SAS Employee gcjfernandez_gmail_com
on ‎05-21-2015 03:48 PM

In SAS EM, the expected profit is used to compute the decision threshold, whereas the average computed profit is used as an assessment statistic in model comparison and selection. The formula for computing average profit is different from the one for computing expected profit. Please see the details of the average profit computation from SAS EM help:

average profit.png

by Occasional Contributor sathya66
on ‎11-03-2016 07:14 AM

Hi,

I have rare linear targets in my data (3000000 observations; 1800 observations are in linear and the remaining observations are zeros).

How can I oversample the data for an interval target (I am using Linear Regression)? Or could you please suggest a procedure to build a model?

thanks,

sathya.

by Occasional Contributor jlh368
2 weeks ago

Hi, the diagram attached differs from the instructions listed above.  The Decision node in the diagram under Train shows the following. Decisions set to Property (Custom in the example above) and Matrix is set to inverse priors. Following along, I switched this to custom and checked the decision matrix. This is set to "Do you want to use decisions?" Yes and has decision weights 1.428/0, 0/3.33.  These would correspond to the weights of the original data set and not the first sample.
