N/A
Posts: 1

# A Question on Modeling Rare Events Data

Hello,

I am building a logistic regression model on rare-events data. Are you familiar with methods to overcome the underestimation of rare events?

From the internet, I learned that prior correction and weighting methods might be useful. I also found those options in Enterprise Miner. Would you please tell me how to use them?

Thank you.

Accepted Solutions
Solution
07-07-2017 01:50 PM
SAS Employee
Posts: 121

## Re: A Question on Modeling Rare Events Data

[ Edited ]

For categorical targets, SAS Enterprise Miner assigns each observation to the most likely (highest probability) outcome by default. For a binary target, this is the outcome with a probability higher than 0.50 (50%). Suppose you have applied a prior to adjust the probabilities to reflect a population where the event rate is 5% and the non-event rate is 95%. Using the same method to assess the outcome, you will likely see a great reduction in the number of predicted events, since an observation still needs a probability higher than 50% (over 10 times the background rate) to be predicted as having the target event. This typically results in selecting fewer (or no) input variables.

Enterprise Miner can account for this imbalance by adding extra 'weight' or value to the correct prediction of the target event. In fact, there is a button in the Target Profile that allows you to request Default with Inverse Prior Weights. This essentially classifies any observation that is more likely than average (probability higher than 5%) as having the event. Of course, this will greatly inflate the number of observations classified as having the event, but you should be less concerned with the predicted class than with the sort order of the probabilities. You can always review the actual probability scores and determine a cutoff (e.g. using the Cutoff node) at which you want to make the prediction; the benefit is that more variables can be considered in determining which observations are more likely than others, even in the case of a rare event.

It is easy to accomplish this task by following the instructions in Usage Note 47965: Using priors and decision weights in SAS® Enterprise Miner™.

You might also consider reviewing the paper Identifying and Overcoming Common Data Mining Mistakes, which discusses handling target variable event levels occurring in different proportions at the bottom of page 6.

Implementing both the prior probabilities and the decision weights as described above should give you probabilities in the range expected for your population and yield meaningful separation of your most likely and least likely observations.
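To make the arithmetic behind the prior adjustment and the inverse prior weights concrete, here is a minimal Python sketch. It assumes the standard Bayes rescaling of posterior probabilities and a decision rule that values a correct event at 1/prior(event) and a correct non-event at 1/prior(non-event). The function names are hypothetical, and this is not a statement of what Enterprise Miner computes internally.

```python
def adjust_to_prior(p_sample, sample_rate, population_rate):
    """Rescale an event probability estimated on a sample with event rate
    `sample_rate` to a population with event rate `population_rate`."""
    event = p_sample * population_rate / sample_rate
    nonevent = (1.0 - p_sample) * (1.0 - population_rate) / (1.0 - sample_rate)
    return event / (event + nonevent)


def decide_with_inverse_prior_weights(p_adjusted, population_rate):
    """Pick the decision with the larger expected value, where a correct event
    is worth 1/prior(event) and a correct non-event 1/prior(non-event)."""
    event_value = p_adjusted / population_rate
    nonevent_value = (1.0 - p_adjusted) / (1.0 - population_rate)
    return 1 if event_value > nonevent_value else 0


# Assumed scenario: model trained on a 50/50 oversample, population has 5% events.
for p in (0.40, 0.60, 0.90, 0.96):
    adj = adjust_to_prior(p, sample_rate=0.50, population_rate=0.05)
    print(f"sample p={p:.2f}  adjusted p={adj:.3f}  "
          f"0.5 cutoff -> {int(adj > 0.5)}  "
          f"inverse prior weights -> {decide_with_inverse_prior_weights(adj, 0.05)}")
```

Under these assumptions, a sample probability of 0.60 adjusts to roughly 0.07: far below the 0.5 cutoff, but above the 5% prior, so the inverse-prior rule still flags it as an event. Algebraically, that rule reduces to "adjusted probability greater than the prior".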

All Replies
Contributor
Posts: 35

## A Question on Modeling Rare Events Data

When I model rare (1%) events, I use oversampling as described in "Mastering Data Mining" (Berry). Basically, I train on a stratified sample in which the rare event is 10-30% dense. (I model in R, not in SAS.)
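A rough sketch of that kind of stratified oversampling follows. It is written in Python rather than the R mentioned above, the function name and toy data are hypothetical, and it simply keeps every event while randomly down-sampling non-events to reach a chosen event density.

```python
import random


def oversample_rare_events(labels, target_event_rate, seed=0):
    """Return sorted row indices that keep every event (label 1) and a random
    subset of non-events (label 0), so that events make up about
    `target_event_rate` of the returned sample."""
    if not 0.0 < target_event_rate < 1.0:
        raise ValueError("target_event_rate must be strictly between 0 and 1")
    events = [i for i, y in enumerate(labels) if y == 1]
    nonevents = [i for i, y in enumerate(labels) if y == 0]
    # Non-events needed so that events / (events + non-events) hits the target.
    wanted = round(len(events) * (1.0 - target_event_rate) / target_event_rate)
    rng = random.Random(seed)
    kept = rng.sample(nonevents, min(wanted, len(nonevents)))
    return sorted(events + kept)


# Toy data (made up for illustration): 10,000 rows with a 1% event rate.
labels = [1] * 100 + [0] * 9900
idx = oversample_rare_events(labels, target_event_rate=0.20)
sample_rate = sum(labels[i] for i in idx) / len(idx)
print(len(idx), round(sample_rate, 3))
```

A model trained on such a sample sees an inflated event rate, so its predicted probabilities describe the sample rather than the original population.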

SAS Employee
Posts: 122

## A Question on Modeling Rare Events Data

[ Edited ]

(I work for SAS. My remarks below are just my personal opinions on the subject.)

1. One way is exact logistic regression, or other permutation-based estimation.

2. 'Regular' logistic regression is known to produce biased estimates sometimes. (Long ago I read a paper from the American Statistical Association saying that bias may occur when the % of 1 is <0.57; I forget the title.) The bias stems from the percentage of 1s being too low in the whole model universe. I believe this is what you mean by rare events. In Enterprise Miner 6.2 (and later), when you drag in a data set, you can customize the decisions and priors, as well as cost, profit, and so on. This, however, is perhaps not the answer to your question. For example, say your model universe has 0.6% with target = 1 and 99.4% with target = 0, and you call that rare. If you believe the 'true' event (= 1) percentage in the population (as opposed to the sample you have on hand, given all the constraints you faced in collecting it) should be 5% (you know this beforehand, which is what the word PRIOR means), you can specify that in Decision and Prior. In a quantitative sense, that makes your model closer to the real population. It addresses the bias by reweighting the target counts toward a healthier, more normal percentage structure. It is, by definition, different from a good model sample whose original event rate is 5%. Through the weighting, you get the costs and benefits of artificially more homogeneous data.

3. In some sense, logistic regression is an unfortunate tool for rare event modeling (proc genmod is somewhat better than proc logistic, but ultimately has a similar shortcoming with respect to bias). In Enterprise Miner, look into Rule Induction as a possibly better prediction tool. Rare events do not necessarily imply insufficient event counts. In credit card fraud detection models, for example, you can have over 15,000 fraud cases (event = 1) in your modeling universe, yet the event is still rare because its % is <0.5. If your application has a sensitive real-time action requirement, consider dropping logistic regression entirely. Many applications in medical research or political science have rare event ratios where estimation / hypothesis testing is more critical than prediction; there, permutation is the method of choice, since those samples are more likely to be fairly small and PCs run much faster today.
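The reweighting in point 2 can be written out under one common convention. This is an assumption for illustration; the post does not say what Enterprise Miner computes internally. With $\tau$ the assumed population event rate and $\bar{y}$ the event rate in the modeling sample, each event receives weight $w_1$ and each non-event weight $w_0$:

$$
w_1 = \frac{\tau}{\bar{y}}, \qquad w_0 = \frac{1-\tau}{1-\bar{y}}
$$

With the numbers from point 2 ($\bar{y} = 0.006$, $\tau = 0.05$):

$$
w_1 = \frac{0.05}{0.006} \approx 8.33, \qquad w_0 = \frac{0.95}{0.994} \approx 0.956, \qquad \frac{\bar{y}\,w_1}{\bar{y}\,w_1 + (1-\bar{y})\,w_0} = \frac{0.05}{0.05 + 0.95} = 0.05
$$

so the weighted sample carries the 5% event share specified as the prior.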

Hope this helps.

Jason Xin

New Contributor
Posts: 3

## A Question on Modeling Rare Events Data

Hello Jason:

I think you can help me. I am using EM 6.2 to predict a crash severity event (Target “1”); however, it is a rare event, hence it is a challenge.

My sample size is 1374 crash observations, all involving injuries and/or fatalities. The three data sets are:

• ALL (N=1374) with crashes including single- and two-vehicle collisions.
• SINGLE (N=500) for observations involving a single vehicle.
• TWO (N=874) for two-vehicle collisions.

Each crash observation reports an injury and/or fatality. The prediction models are intended to target the probability of having a serious injury and/or a fatality outcome given any injury, P(Severe Injury or Fatality | Injury). The Target is defined as a binary variable (Target = 1 for an event and 0 for a non-event). It is an imbalanced sample with the following characteristics:

• 5% severe injuries or fatalities in the entire dataset.
• 3.7% severe injuries or fatalities in the two-vehicle crash dataset.
• 7.6% severe injuries or fatalities in the single-vehicle crash dataset.

I am modeling Target “1”, which happens just 5% of the time. Hence I did the oversampling and adjusted the prior probabilities as follows.

1st: OVERSAMPLING to include all the rare events (Target “1”) in the sample and an equal number of Target “0”

- I add a Sample node to the data source (the original population, N=1374) in the new diagram (without a Partition node)

- At the Sample node property panel, I set: 100.0 percent, Level Based, Rarest Level, Level Proportion 100.0, and Sample Proportion 50.0.

- I add a Score node to the model selected by the best model node

- I add a new data source to the diagram, which is the original population data table with the role set to “Score”

- I add a SAS Code node to the Score node

- I run the SAS Code node and then I run the Score node

2nd: ADJUSTING PROBABILITIES to correctly predict the original distribution

- I add a Decision node following the modeling node (selected model)

At the Decision node I set the prior probabilities as:

a) Level “1”, Count (70), Prior (0.5), Adjusted Prior (0.05)

b) Level “0”, Count (70), Prior (0.5), Adjusted Prior (0.95)

c) I applied the decisions by setting “yes” and I ran this node

Then I ran the Score node in the diagram again; the results are below.

The event classification table at the Decision node shows the following results:

FN (70), TN (70), FP (0), TP (0)

The Score node results after applying the Decision node with prior probabilities show the values:

Target “0” 99.6%

Target “1” 0.36%

These results do not make sense because in the original population the percentage of Target “1” was 5%.

I didn't know how to set the decision tab, the cost, or the weight.

Your advice on the best approach to optimize my prediction model would be much appreciated.

I look forward to hearing from you.

Regards

Mina
