A Question on Modeling Rare Events Data


06-07-2011 04:45 PM

Hello,

I am building a logistic regression model on rare-events data. Are you familiar with methods to overcome the underestimation of rare events?

From the internet, I learned that prior correction and weighting methods might be useful, and I found those options in Enterprise Miner. Could you please tell me how to use them?

Thank you.

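The weighting method asked about here can be sketched outside of Enterprise Miner. A minimal Python illustration of choice-based-sampling weights, assuming the population event rate (tau) is known and the training sample was enriched to a higher rate (rho); the function name and example values are hypothetical:

```python
# Sketch of the "weighting" correction for oversampled rare-events data.
# A weighted log-likelihood with these weights reflects the population
# event rate (tau) rather than the enriched sample rate (rho).
def sampling_weights(tau, rho):
    """Return (event_weight, nonevent_weight) for a weighted fit."""
    w_event = tau / rho              # events are down-weighted
    w_nonevent = (1 - tau) / (1 - rho)  # non-events are up-weighted
    return w_event, w_nonevent

# Example: population event rate 5%, balanced 50/50 training sample.
w1, w0 = sampling_weights(tau=0.05, rho=0.50)
print(w1, w0)
```

These weights would be passed as observation weights (frequency or prior weights) to whatever logistic regression routine is used.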

Accepted Solutions

Solution

07-07-2017
01:50 PM


07-07-2017 01:48 PM - edited 07-07-2017 01:51 PM

For categorical targets, SAS Enterprise Miner assigns each observation to the most likely (highest-probability) outcome by default. For a binary target, this is the outcome with a probability higher than 0.50 (50%). Suppose you have applied a prior to adjust the probabilities to reflect a population where the event rate is 5% and the non-event rate is 95%. Using the same method to assess the outcome, you will likely see a great reduction in the number of predicted events, since an observation still needs a probability higher than 50% (over ten times the background rate) to be predicted as having the target event. This typically results in selecting fewer (or no) input variables. Enterprise Miner can account for this imbalance by adding additional 'weight' or value to the correct prediction of the target event; in fact, there is a button in the Target Profile that allows you to request such decision weights.
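The effect of inverse-prior decision weights can be sketched in a few lines. This is an illustration of the decision logic, not Enterprise Miner's internals; the function and values are hypothetical:

```python
# Sketch: with inverse-prior decision weights, an observation is classified
# as an event when the expected value of the "event" decision beats the
# "non-event" decision. This effectively moves the cutoff from 0.50 to tau.
def classify(p_event, tau):
    w_event = 1.0 / tau            # weight for a correct event prediction
    w_nonevent = 1.0 / (1 - tau)   # weight for a correct non-event prediction
    return 1 if p_event * w_event > (1 - p_event) * w_nonevent else 0

# With a 5% prior, an observation scoring only 6% is still predicted
# as an event, while 4% is not:
print(classify(0.06, tau=0.05))
print(classify(0.04, tau=0.05))
```

With tau = 0.5 the rule reduces to the usual 50% cutoff, which is why the default assessment under-predicts rare events once priors are applied.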

It is easy to accomplish this task by following the instructions in Usage Note 47965: Using priors and decision weights in SAS® Enterprise Miner™, which is available at

http://support.sas.com/kb/47/965.html

You might also consider reviewing the paper Identifying and Overcoming Common Data Mining Mistakes, which is available at

http://www2.sas.com/proceedings/forum2007/073-2007.pdf

which discusses handling target-variable event levels that occur in different proportions (bottom of page 6).

Implementing both the prior probabilities and the decision weights as described above should give you probabilities in the range expected for your population and yield meaningful separation between your most likely and least likely observations.

All Replies


06-26-2011 12:14 PM

Hi,

Try this.

www.ecu.edu/cs-dhs/bios/upload/Logistic.pdf

http://www.talkstats.com/showthread.php/11801-Modeling-Rare-Event-Data

http://www.mathkb.com/Uwe/Forum.aspx/sas/12662/modeling-rare-event-rate

http://blog.data-miners.com/2008/05/adjusting-for-oversampling.html

Regards,

Murphy


07-15-2011 11:42 AM

When I model rare (1%) events, I use oversampling as described in "Mastering Data Mining" (Berry). Basically, I train on a stratified sample so the rare event is 10-30% dense. (I model in R, not in SAS.)
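The stratified oversampling described here can be sketched as follows. A Python illustration with made-up data: keep every rare event and subsample the common class until events reach the target density:

```python
import random

# Sketch: enrich a 1%-event dataset to ~20% event density by keeping
# all events and drawing only enough non-events to hit the target ratio.
def oversample(events, nonevents, target_density=0.20, seed=42):
    rng = random.Random(seed)
    n_nonevents = int(len(events) * (1 - target_density) / target_density)
    return events + rng.sample(nonevents, n_nonevents)

events = list(range(100))            # the rare class, kept in full
nonevents = list(range(100, 10000))  # the common class, subsampled
train = oversample(events, nonevents)
print(len(train))  # 100 events + 400 non-events
```

Probabilities estimated on such a sample must later be adjusted back to the true event rate, which is what the prior-correction options handle.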


07-23-2011 10:58 AM - last edited on 07-07-2017 01:32 PM by DougWielenga

(I work for SAS. My remarks below are just my personal opinions on the subject)

1. One way is exact logistic regression, or other permutation-based estimation.

2. 'Regular' logistic regression is known to sometimes produce biased estimates. (I read a paper from the American Statistical Association long ago suggesting that bias may occur when the percentage of 1s is below 0.57%; I forget the title.) The bias stems from the percentage of 1s being too low in the whole modeling universe, which I believe is what you mean by rare events. In Enterprise Miner 6.2 (and later), when you drag in a data set, you can customize the decision priors, as well as cost, profit, and so on. For example, suppose your modeling universe is 0.6% 1s and 99.4% 0s, which you call rare. If you believe the 'true' event (=1) percentage in the population (versus the sample on hand, given all the constraints you had in collecting it) should be 5% (you know this beforehand; that is what the word PRIOR means), you can specify that under Decisions and Priors. In a quantitative sense, that makes your model closer to the real population: it addresses the bias by reweighting the counts of the target toward a healthy, normal percentage structure. It is, by definition, different from a good model sample that originally has 5% events, but through the weighting you enjoy the costs and benefits of artificially more homogeneous data.
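The prior adjustment being described maps a probability estimated at the sample's event rate to the rate believed to hold in the population. A sketch of the standard choice-based-sampling correction (illustrative function, not EM's implementation):

```python
# Sketch: rescale a predicted probability from the sample's event rate
# to the believed population event rate by reweighting the odds.
def adjust_probability(p, sample_rate, population_rate):
    num = p * population_rate / sample_rate
    den = num + (1 - p) * (1 - population_rate) / (1 - sample_rate)
    return num / den

# A 50% score on a balanced (50/50) sample corresponds to only a 5%
# event probability once a 5% population prior is applied:
print(adjust_probability(0.5, sample_rate=0.5, population_rate=0.05))
```

When sample and population rates agree, the adjustment leaves probabilities unchanged, as expected.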

3. In some sense, logistic regression is an unfortunate tool for rare-event modeling (PROC GENMOD is better than PROC LOGISTIC by degree, but ultimately shares the same shortcoming on biasedness). In Enterprise Miner, look into Rule Induction as a possibly better prediction tool. Rare events do not necessarily imply insufficient event counts: in credit card fraud detection models, for example, you can have over 15,000 fraud cases (event=1) in your modeling universe, yet the events are still rare since the percentage is below 0.5%. If your application has a sensitive real-time action requirement, consider dropping logistic regression entirely. Many applications in medical research or political science have rare-event ratios where estimation and hypothesis testing are more critical than prediction; there, permutation is the method of choice, since those samples are more likely to be fairly small, considering how fast your PC runs today.

Hope this helps.

Jason Xin


03-22-2012 11:39 AM

Hello Jason:

I think you can help me. I am using EM 6.2 to predict a crash severity event (Target = "1"); however, it is a rare event, hence it is a challenge.

My sample size is 1374 crash observations involving all injuries and/or fatalities. The three data sets are:

- ALL (N=1374): all crashes, including single- and two-vehicle collisions.
- SINGLE (N=500): observations involving a single vehicle.
- TWO (N=874): two-vehicle collisions.

Each crash observation reports an injury and/or fatality. The prediction models are intended to target the probability of having either a serious injury and/or a fatality outcome given any injury, **P(Severe Injury or Fatality | Injury)**. The Target is defined as a binary variable (Target = 1 for an event and 0 for a non-event). It is an imbalanced sample with the following characteristics:

- 5% severe injuries or fatalities in the entire dataset.
- 3.7% severe injuries or fatalities in the two-vehicle crash dataset.
- 7.6% severe injuries or fatalities in the single-vehicle crash dataset.

I am modeling the Target "1", which happens just 5% of the time. Hence I did the oversampling and adjusted the priors as follows.

**1st: OVERSAMPLING** to include all the rare events (Target "1") in the sample and an equal number of Target "0":

- I add a Sample node to the Data Source (the original population, N=1374) in the new diagram (without a Partition node).
- In the Sample node property panel, I set: 100.0 percent, Level Based, Rarest Level, Level Proportion 100.0, and Sample Proportion 50.0.
- I add LOGISTIC REGRESSION nodes.
- I add a MODEL COMPARISON node.
- I add a SCORE node to the model selected by the best-model node.
- I add a new data source into the diagram, which is the original population data table, with the role set to "Score".
- I add a SAS CODE node to the Score node.
- I run the SAS Code node and then the Score node.
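The level-based sample step above (Rarest Level, Level Proportion 100%, Sample Proportion 50%) can be mimicked as follows. A Python sketch using the counts from this post (70 events out of N=1374); the data rows here are made up:

```python
import random

# Sketch of a level-based sample: keep all rows of the rarest target
# level and draw an equal number of rows from the other level, giving
# a balanced 50/50 training set.
def level_based_sample(rows, is_event, seed=1):
    rng = random.Random(seed)
    events = [r for r in rows if is_event(r)]
    nonevents = [r for r in rows if not is_event(r)]
    return events + rng.sample(nonevents, len(events))

rows = [{"id": i, "target": 1 if i < 70 else 0} for i in range(1374)]
sample = level_based_sample(rows, lambda r: r["target"] == 1)
print(len(sample))  # 70 events + 70 non-events
```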

**2nd: ADJUSTING PROBABILITIES** to correctly predict the original distribution:

- I add a DECISION node following the modeling node (selected model).

At the Decision node I set the prior probabilities as:

a) Level "1", Count (70), Prior (0.5), Adjusted Prior (0.05)

b) Level "0", Count (70), Prior (0.5), Adjusted Prior (0.95)

c) I applied the decisions by setting "yes" and ran this node.

Then **I ran the score node again** in the diagram; the results are below.

The event classification table at the Decision node shows the following results:

FN (70), TN (70), FP (0), TP (0)

The score node results after applying the Decision node with prior probabilities show the values:

Target "0": 99.6%

Target "1": 0.36%

These results do not make sense, because in the original population the percentage of Target "1" was 5%.

I didn't know how to set the decision tab, nor the cost, nor the weight.

Your advice on the best approach to optimize my prediction model is very much appreciated.

I look forward to hearing from YOU.

Regards

Mina
