For categorical targets, SAS Enterprise Miner assigns each observation to the most likely (highest probability) outcome by default. For a binary target, this is the outcome with a probability higher than 0.50 (50%). Suppose you have applied a prior to adjust the probabilities to reflect a population where the event rate is 5% and the non-event rate is 95%. Using the same decision method, you will likely see a large reduction in the number of predicted events, since an observation still needs a probability higher than 50% (more than 10 times the background rate) to be predicted as having the target event. This typically results in selecting fewer (or no) input variables.

Enterprise Miner can account for this imbalance by adding additional 'weight' or value to the correct prediction of the target event. In fact, there is a button in the Target Profile that lets you request Default with Inverse Prior Weights. This essentially classifies any observation that is more likely than average (higher than 5%) as having the event. Of course, this will greatly inflate the number of observations classified as having the event, but you should be less concerned about the actual predicted class than about the sort order of the probabilities. You can always review the actual probability scores and determine a cutoff (e.g. using the Cutoff node) at which you want to make the prediction; in the meantime, you allow more variables to be considered in helping you determine which observations are more likely than others, even in the case of a rare event.
It is easy to accomplish this task by following the instructions in Usage Note 47965: Using priors and decision weights in SAS® Enterprise Miner™, which is available at
http://support.sas.com/kb/47/965.html
You might also consider reviewing the paper Identifying and Overcoming Common Data Mining Mistakes, which is available at
http://www2.sas.com/proceedings/forum2007/073-2007.pdf
which includes a discussion of handling target variable event levels that occur in different proportions (see the bottom of page 6).
Implementing both the prior probabilities and the decision weights as described above should give you probabilities in the range expected for your population and yield meaningful separation of your most likely and least likely observations.
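To make the adjustment concrete, here is a minimal DATA step sketch (not Enterprise Miner output; the data set SCORED, the raw posterior P_RAW, and the 0.05 cutoff are hypothetical names). It shows the standard rescaling of posteriors from a 50/50 oversampled model back to 5%/95% population priors, followed by a custom cutoff similar in spirit to what the Cutoff node does.

/* Minimal sketch (hypothetical names): rescale posteriors from a 50/50      */
/* oversampled model back to 5%/95% population priors, then apply a cutoff.  */
data scored_adj;
   set scored;
   pi1  = 0.05;  pi0  = 0.95;   /* assumed population priors             */
   rho1 = 0.50;  rho0 = 0.50;   /* event/non-event shares in the sample  */
   p_adj = (p_raw*pi1/rho1) /
           (p_raw*pi1/rho1 + (1 - p_raw)*pi0/rho0);
   predicted_event = (p_adj > 0.05);   /* classify against a chosen cutoff */
run;

With these numbers, a posterior of 0.50 on the oversampled scale maps back to 0.05, the population base rate.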
Hi,
Try this.
www.ecu.edu/cs-dhs/bios/upload/Logistic.pdf
http://www.talkstats.com/showthread.php/11801-Modeling-Rare-Event-Data
http://www.mathkb.com/Uwe/Forum.aspx/sas/12662/modeling-rare-event-rate
http://blog.data-miners.com/2008/05/adjusting-for-oversampling.html
Regards,
Murphy
When I model rare (1%) events, I use oversampling as described in "Mastering Data Mining" (Berry). Basically, I train on a stratified sample so the rare event is 10-30% dense. (I model in R, not in SAS.)
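For anyone wanting to do the same kind of stratified oversampling in SAS rather than R, a minimal sketch with PROC SURVEYSELECT follows. The data set FULL, the binary TARGET variable, and the rates are hypothetical and illustrative; keeping all events and roughly 3% of non-events pushes a 1% raw event rate to about 25% density in the training sample.

/* Minimal sketch (hypothetical names): keep all rare events and about 3%   */
/* of non-events so a 1% event rate becomes roughly 25% in the sample.      */
proc sort data=full;
   by target;                       /* strata variable must be sorted       */
run;

proc surveyselect data=full out=train_over
     method=srs seed=20240101
     samprate=(3 100);              /* percentages: 3% of TARGET=0, 100% of TARGET=1 */
   strata target;
run;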
(I work for SAS. My remarks below are just my personal opinions on the subject)
1. One way is exact logistic regression, or another permutation-based estimation method (a short sketch appears after this list).
2. 'Regular' logistic regression is known to sometimes produce biased estimates (I read a paper from the American Statistical Association a long time ago suggesting that bias may occur when the percentage of 1s is below 0.57; I forget the title). The bias stems from the percentage of 1s being too low in the whole modeling universe, and I believe that is what you mean by rare events. In Enterprise Miner 6.2 (and later), when you drag in a data set you can customize the decisions, priors, costs, profits, and so on. This, however, may not be the answer to your question. For example, suppose your modeling universe has 0.6% events (=1) and 99.4% non-events, which you call rare. If you believe the 'true' event (=1) percentage in the population (as opposed to the sample in hand, given all the constraints you had in collecting it) should be 5%, and you know this beforehand (that is what the word PRIOR means), you can specify that in Decisions and Priors. In a quantitative sense, that makes your model closer to the real population: it addresses the bias by reweighting the counts of the target toward a healthier percentage structure. It is, by definition, different from a good modeling sample that truly has a 5% event rate, but through the weighting you enjoy the costs and benefits of artificially more homogeneous data.
3. In some sense, logistic regression is an unfortunate tool for rare event modeling (PROC GENMOD is better than PROC LOGISTIC in degree, but it ultimately has a similar shortcoming with respect to bias). In Enterprise Miner, look into Rule Induction as a possibly better prediction tool. Rare events do not necessarily imply insufficient event counts: in credit card fraud detection models, for example, you can have over 15,000 fraud cases (event=1) in your modeling universe, yet the events are still rare because the event percentage is below 0.5%. If your application has sensitive real-time action requirements, consider dropping logistic regression entirely. Many applications in medical research or political science involve rare event ratios where estimation and hypothesis testing are more critical than prediction; there, permutation-based methods are the way to go, since those samples tend to be fairly small and today's PCs are fast.
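Regarding point 1 above, here is a minimal sketch of exact logistic regression in SAS. MYDATA, Y, and X1-X3 are hypothetical names, and exact estimation is generally feasible only for small data sets with a handful of effects.

/* Minimal sketch (hypothetical names): exact (conditional) logistic        */
/* regression, practical mainly for small samples and few effects.          */
proc logistic data=mydata;
   model y(event='1') = x1 x2 x3;
   exact x1 x2 x3 / estimate=both;   /* exact parameter and odds-ratio estimates */
run;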
Hope this helps.
Jason Xin
Hello Jason:
I think you can help me. I am using EM 6.2 to predict a crash severity event (Target = "1"); however, it is a rare event and hence a challenge.
My sample size is 1374 crash observations involving all injuries and/or fatalities. The three data sets are:
Each crash observation reports an injury and/or fatality. The prediction models are intended to target the probability of having either a serious injury and/or a fatality outcome given any injury, P(Severe Injury or Fatality | Injury). The Target is defined as a binary variable (Target = 1 for an event and 0 for a non-event). It is an imbalanced sample with the following characteristics:
I am modeling the Target "1", which happens just 5% of the time. Hence I did the oversampling and adjusted the priors as follows.
1st: OVERSAMPLING to include all the rare events (Target "1") in the sample and an equal number of Target "0":
- I add a Sample node to the data source (the original population, N=1374) in the new diagram (without a Partition node)
- At the Sample node property panel, I set: 100.0 percent, Level Based, Rarest Level, Level Proportion 100.0, and Sample Proportion 50.0
- I add LOGISTIC REGRESSION nodes
- I add a MODEL COMPARISON node
- I add a SCORE node to the model selected by the best model node
- I add a new data source to the diagram, which is the original population data table, with the role set to "Score"
- I add a SAS CODE node to the Score node
- I run the SAS Code node and then I run the Score node
2nd: ADJUSTING PROBABILITIES to correctly predict the original distribution:
- I add a DECISION node following the modeling node (selected model)
At the Decision node I set the prior probabilities as:
a) Level "1", Count (70), Prior (0.5), Adjusted Prior (0.05)
b) Level "0", Count (70), Prior (0.5), Adjusted Prior (0.95)
c) I applied the decisions by setting "yes" and I ran this node
Then I ran the Score node in the diagram again; the results are below.
The event classification table at the Decision node shows the following results:
FN (70), TN (70), FP (0), TP (0)
The Score node results after applying the Decision node with prior probabilities show the values:
Target “0” 99.6%
Target “1” 0.36%
These results do not make sense because in the original population the percentage of Target "1" was 5%.
I didn't know how to set the Decisions tab, nor the cost, nor the weight.
Your advice on the best approach to optimize my prediction model is very much appreciated.
I look forward to hearing from you.
Regards
Mina
Hi,
Let me explain my situation:
1) I have a data set where the response rate is 0.6% (374 events in a total of 61279 records), and I need to build a logistic regression model on this data set.
2) Option 1: I can go with PROC LOGISTIC (conventional maximum likelihood), as the rule of thumb "you should have at least 10 events for each parameter estimated" should hold, considering that I start my model-building iterations with no more than 35 variables and finalize the model with fewer than 10 variables.
Please let me know: if I have more than 35 predictors initially to start the model-building process, is it still recommended to use PROC LOGISTIC (conventional ML), with the understanding that I may have to collapse certain categorical levels to rule out cases of quasi-complete or complete separation, and considering the rule of thumb "you should have at least 10 events for each parameter estimated"?
3) Option 2: I can go with PROC LOGISTIC (Firth's method, using penalized likelihood); the Firth method could be helpful in reducing any small-sample bias of the estimators (a short sketch follows this list).
Please also let me know: if I have more than 35 predictors initially to start the model-building process, is it recommended to use PROC LOGISTIC (Firth's method), with the understanding that I do NOT have to collapse any categorical levels to rule out cases of quasi-complete or complete separation?
4) Option 3: If the above two options are not recommended, then the last option is to use an oversampling strategy for the rare events. As the total number of events (374) and the total number of records (61279) are both small enough to pose no challenge to computing time or hardware, I would go with an oversampling rate of only 5% (number of records to be modeled = 6487), since I want to keep as many non-event records as possible; if I go for an oversampling rate above 5%, the total number of records that can be modeled falls below 6487.
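For Option 2, a minimal sketch of Firth's penalized likelihood in PROC LOGISTIC is shown below; the data set RARE_EVENTS, the response RESPONSE, and the predictors are hypothetical names.

/* Minimal sketch (hypothetical names): Firth's penalized-likelihood logistic */
/* regression, which shrinks estimates and remains usable under separation.   */
proc logistic data=rare_events;
   class cat1 cat2 / param=ref;                    /* hypothetical categorical inputs */
   model response(event='1') = cat1 cat2 x1 x2 / firth clodds=pl;
run;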
My thoughts on Option 1, Option 2, and Option 3 are given below:
-- With a 5.77% oversampled event rate (374 events and 6113 non-events, a total of 6487 records) and a 70:30 split between TRAIN and VALIDATION, I can build my model on 4541 records and perform in-time validation on 1946 records.
-- In comparison, for Option 1 and Option 2, with a 70:30 split between TRAIN and VALIDATION, I can build my model on 42896 records and perform in-time validation on 18383 records.
Please help me with which of Option 1, Option 2, or Option 3 is recommended in my case. If Option 3, is it recommended to use an oversampling rate of 2% or 3% instead, in order to increase the number of records to be modeled to something above 6487?
Thanks
Surajit
"With a 5.77% oversampled event rate (374 events and 6113 non-events, a total of 6487 records) and a 70:30 split between TRAIN and VALIDATION, I can build my model on 4541 records and perform in-time validation on 1946 records."
"With a 70:30 split between TRAIN and VALIDATION, I can build my model on 42896 records and perform in-time validation on 18383 records."
Some things that I would share based on my personal opinion and my limited knowledge of your modeling scenario:
* I tend to think in terms of how many actual events I have rather than how many total records I have. If I have 100 events and 100 non-events, I have 200 observations. If I have 100 events and 99,900 non-events, do I really have that much more information? The signal in that case is so low (0.1%) that it would be difficult to have much confidence in any fitted model.
* The total number of events (374) is so low that I would consider not even splitting the data in this situation. Data Mining methods like partitioning assume that there are sufficient observations to represent the population in every partitioned data set. Splitting 70/30 leaves barely over 100 events in validation. Your data set might be better handled by classical statistical approaches given the limited data available.
* I always fit a tree to better understand structure in the data. Trees can show you relationships in the data that you might not find by looking at a logistic regression model. They can also reveal when you have specified a variable that you shouldn't have because it was actually an outcome variable (which happens more often than you think). A small sketch follows this list.
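Here is a minimal sketch of that kind of exploratory tree; it assumes PROC HPSPLIT is available and uses hypothetical data set and variable names (MYDATA, TARGET, CAT1, CAT2, NUM1, NUM2).

/* Minimal sketch (hypothetical names): a shallow classification tree used   */
/* only to explore structure and spot suspicious "too good" predictors.      */
proc hpsplit data=mydata maxdepth=4 seed=20240101;
   class target cat1 cat2;               /* binary target plus categorical inputs */
   model target = cat1 cat2 num1 num2;
   grow entropy;                         /* entropy splitting criterion            */
   prune costcomplexity;                 /* cost-complexity pruning                */
run;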
If you need to build a model on the limited data you have, you should consider monitoring model performance. The less signal the data has, the fewer models are truly viable. Regardless of what model you fit, the confidence you can have in its usefulness will typically be greater when you have a larger number of target events, but that should not stop you from trying different methods and evaluating them with regard to your analysis objective. It might also be that you have good and useful predictors and you don't actually need a lot more data. The nice thing about modern software is that we can fit lots of different models and compare the results. I never start by choosing a single strategy.
Hope this helps!
Cordially,
Doug
Hi Doug,
Thanks a ton for your inputs. Just trying to summarize your thoughts:
1) PROC LOGISTIC (conventional MLE) and PROC LOGISTIC (Firth's penalized maximum likelihood): Not a viable option, as per your comment quoted below, with which I fully agree. At the same time, I just wanted to know: if I go with this approach, is a 70:30 split between TRAIN and IN-TIME VALIDATION advisable considering the low response rate of 0.6%, or is TRAIN with OUT-OF-TIME VALIDATION the only recommendation?
"I tend to think in terms of how many actual events I have rather than how many total records I have. If I have 100 events and 100 non-events, I have 200 observations. If I have 100 events and 99,900 non-events, do I really have that much more information? The signal in that case is so low (0.1%) that it would be difficult to have much confidence in any fitted model."
2) PROC LOGISTIC (oversampled rate of 5.77%): Splitting into TRAIN and IN-TIME VALIDATION is not recommended, as per your comment quoted below, with which I fully agree. Since in-time validation is not recommended, is OUT-OF-TIME VALIDATION the only option for model testing in this scenario?
"* The total number of events (374) is so low that I would consider not even splitting the data in this situation. Data Mining methods like partitioning assume that there are sufficient observations to represent the population in every partitioned data set. Splitting 70/30 leaves barely over 100 events in validation. Your data set might be better handled by classical statistical approaches given the limited data available."
3) PROC LOGISTIC (oversampled rate of 5.77%): Apart from using a decision tree to understand more about the data, do you have any other suggestions with regard to the use of classical statistical approaches given the limited data available?
Thanks
Surajit