Hi,
I am developing a cross-sell model based on response to a previous campaign. The issue I have is that I am modelling a rare event where the proportion of responders is only 6%, compared to 94% non-responders. Having done some research, I oversampled to get a 50/50 distribution (keeping all responders). The PROC LOGISTIC model runs well and produces probability scores that look good. But I know these need to be adjusted because they are not representative of the overall population of non-responders. I have tried different methods suggested in SAS forums (weighting/offset/prior probabilities), and all of them dramatically reduce the predicted probabilities. Is there a way to maintain the predicted scores as they are with oversampling? Any help is much appreciated.
Thanks..
There are quite a lot of different but related issues raised by your post. I'll try to handle them one by one.
Having done some research, I oversampled to get a 50/50 distribution (keeping all responders).
This is something I typically recommend against. The notion of oversampling to 50/50 is a misapplication of the fact that, for a fixed sample size, the most power for discriminating between two groups comes from having an equal number of each group in the sample. This gets misinterpreted as needing to oversample to 50/50 for more power, but in this case you are reducing your sample size, so the principle does not apply. Depending on how rare your event is, you might end up with poor representation of your non-events. For example, if you have a 1% response rate and oversample to 50/50 while keeping all responders, you keep only about 1% of your non-events to represent the 99% of the population that did not respond.
The PROC LOGISTIC model runs well and produces probability scores that look good. But I know these need to be adjusted because they are not representative of the overall population of non-responders.
The probability scores only look 'good' because the data has been oversampled to 50/50. Were those probabilities estimated from a data set that looked like the population, they would be more realistic and not 'look' as good. You can do posterior adjustment on these probabilities to make them look more realistic but that by definition will make them not 'look' as good.
Consider the following arguments:
* posterior adjustment methods simply adjust the original probabilities to be centered closer to the expected population rate, so the impact on individual probability estimate will vary
* posterior adjustment will not change the sort order of your scores for a given model, so the adjusted probabilities will be ordinally equivalent to the unadjusted scores (see the sketch after this list)
* models typically fit better on their training data than on data they are used to score, so the real evaluation of model performance comes from performance on new data
* given the above, it might not be necessary to 'adjust' the probabilities at all since even the 'adjusted' probabilities might not be meaningful when you consider that a probability is something calculated for a group and not an individual that either has the event of interest or does not have the event of interest
* you might be better off using more of your non-events to make sure that you are adequately modeling the features of the non-respondents and (hopefully) obtain a more discriminating model despite probabilities that don't 'look' as good.
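To make the adjustment concrete, here is a minimal sketch of the standard prior-probability correction in a DATA step. It assumes the scored data set is called SCORED and the oversampled-model probability is in a variable named P_1 (both placeholder names); the 6% population rate and 50% sample rate come from the original post.

data scored_adj;
   set scored;              /* scored output with predicted probability P_1 */
   rho1 = 0.06;             /* event rate in the population                 */
   pi1  = 0.50;             /* event rate in the oversampled training data  */
   num = P_1 * (rho1 / pi1);
   den = num + (1 - P_1) * ((1 - rho1) / (1 - pi1));
   p_adjusted = num / den;  /* posterior-adjusted probability               */
run;

Because the correction is a monotone transformation of P_1, the ranking of the scored observations is unchanged; only the magnitudes shrink toward the population rate.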
I have tried different methods suggested in SAS forums (weighting/offset/prior probabilities), and all of them dramatically reduce the predicted probabilities. Is there a way to maintain the predicted scores as they are with oversampling?
As I mentioned above, that is to be expected when adjusting to the more realistic probabilities that would come from using a less oversampled data set (or from viewing the 'adjusted' probabilities). The artificially high probabilities from the oversampled data are simply not realistic, so they look 'good' but are not meaningful except for the way in which they allow you to sort the scored observations from most likely to least likely.
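If you would rather have PROC LOGISTIC make the adjustment for you, the SCORE statement accepts a PRIOREVENT= option. A minimal sketch, assuming an oversampled training set named OVERSAMPLED, a representative data set named FULL_POP, a binary response RESPOND, and placeholder predictors X1-X3:

proc logistic data=oversampled descending;
   model respond = x1 x2 x3;
   /* score the representative data and adjust the posterior */
   /* probabilities back to the 6% population event rate     */
   score data=full_pop out=scored priorevent=0.06;
run;

The adjusted probabilities in SCORED sort the observations exactly as the unadjusted ones would; they are simply centered near the 6% rate instead of near 50%.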
I hope this helps!
Doug
Hi,
I truly appreciate the point that an oversampled data set with a 50/50 ratio of events to non-events is not recommended in real-world scenarios, because it basically eliminates a lot of non-events and the model is unable to learn about them.
If a 50/50 ratio of events to non-events is not the right approach for modelling rare events because of the above, are there any guidelines on what the right approach is, considering the total number of records in the (non-oversampled) data set?
Thanks
Surajit
As with many things in life, you must balance business needs vs. computing time vs. model performance. Using more observations takes more time but can increase model performance up to a point, yet using more time to build a model means fewer candidate models can be built. In many cases, it might be better to build more models (e.g. using different methods of data preparation and/or prediction) rather than training a smaller set of models for a longer time on a larger number of observations. For any given problem, there is a point of diminishing returns where increasing training observations/time gets you little or no increase in model performance. There is no easy way to determine where that line should be drawn, however, since business needs, time available to model, data available to model, and interpretability requirements (some models are more interpretable than others) can vary wildly, making this a business decision as well as an analyst decision.
For myself (not speaking for SAS), I prefer to see at least 5% of the observations have the event when the percentage in the population is smaller, and I consider letting that percentage of rare events in the sample range up to 20% if I am trying to manage model run time due to the large number of observations available to model. Note: SAS Rapid Predictive Modeler oversamples to a 10% response rate, which is nicely within that range. Suppose I have 5,000,000 observations but the response rate is only 1%, which corresponds to 50,000 events. If I take all the rare events into the sample, then they would be paired with 950,000 non-events if oversampled to 5% (1,000,000 observations modeled), but those same events would be matched with only 200,000 non-events if oversampled to 20% (250,000 observations modeled). Modeling time depends in part on the hardware available to do the computation/data management, and modeling the smaller number of observations (the 20% sample) might lead to far more candidate models than modeling on the larger number of observations (the 5% sample). This sometimes leads to a strategy of doing preliminary modeling on a data set with a higher response rate (fewer non-events in a rare-event scenario) to identify key predictors and useful modeling strategies, and then doing further modeling on a more representative data set limited to a smaller set of input variables and/or modeling strategies to meet the time requirements imposed by the business objectives. A good answer today is sometimes better than a (perhaps) slightly better answer tomorrow or next week.
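As an illustration of how such a sample could be drawn, here is a minimal sketch using PROC SURVEYSELECT for the 5% scenario above. The data set name POPULATION, the response variable RESPOND, and the seed are placeholder assumptions.

/* keep every event and draw a simple random sample of non-events   */
/* so that the 50,000 events make up about 5% of the modeling data  */
proc surveyselect data=population(where=(respond=0))
                  out=nonevent_sample
                  method=srs sampsize=950000 seed=20240101;
run;

data modeling;
   set population(where=(respond=1))    /* all 50,000 events          */
       nonevent_sample;                 /* 950,000 sampled non-events */
run;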
Hope this helps!
Cordially,
Doug
Hi Doug,
Thanks a ton for the very good explanation. Let me explain my situation:
1) I have a data set where the response rate is 0.6% (374 events in a total of 61279 records), and I need to build a logistic regression model on it.
2) Option 1: I can go with PROC LOGISTIC (conventional maximum likelihood), as the rule of thumb that you should have at least 10 events for each parameter estimated should hold, considering that I start my model-build iterations with no more than 35 variables and finalize the model with fewer than 10 variables.
Please let me know: if I have more than 35 predictors initially to start the model-build process, is it still recommended to use PROC LOGISTIC (conventional ML), with the understanding that I may have to collapse certain categorical levels to rule out cases of quasi-complete or complete separation, and considering the rule of thumb that you should have at least 10 events for each parameter estimated?
3) Option 2: I can go with PROC LOGISTIC (Firth's method, using penalized likelihood). The Firth method could be helpful in reducing any small-sample bias of the estimators.
Please let me know: if I have more than 35 predictors initially to start the model-build process, is it recommended to use PROC LOGISTIC (Firth's method, using penalized likelihood), with the understanding that I do NOT have to collapse any categorical levels to rule out cases of quasi-complete or complete separation?
4) Option 3: If the above two options are not recommended, then the last option is to use the oversampling strategy for rare events. As the total number of events (374) and the total number of records (61279) are both quite small with regard to posing any challenge on computing time or hardware, I would go with an oversampled event rate of close to 5% (5.77%, giving 6487 records to be modelled), because I want to consider as many non-event records as possible; any oversampled rate above that would reduce the total number of records that can be modeled below 6487.
My thoughts on this process are given below:
-- With a 5.77% oversampled event rate (374 events and 6113 non-events, a total of 6487 records) and a 70:30 split between TRAIN and VALIDATION, I can build my model on 4541 records and perform in-time validation on 1946 records.
-- In comparison, with Option 1 or Option 2 and a 70:30 split between TRAIN and VALIDATION, I can build my model on 42896 records and perform in-time validation on 18383 records.
Please do help me with which option is recommended in my case: Option 1, Option 2, or Option 3? If Option 3, is it recommended to use an oversampled event rate of 2% or 3% instead, in order to increase the number of records to be modeled above 6487?
Thanks
Surajit
Hi Doug,
I am working for a client, and my situation does not allow me to go with Option 3. I say so because, with Option 3, out-of-sample validation becomes the only option for validating the model, and the client is not okay with providing data for out-of-time validation testing.
Hence, with Option 3 ruled out, I need to go with either Option 1 or Option 2, as listed below:
Option 1: I can go with PROC LOGISTIC (conventional maximum likelihood), as the rule of thumb that you should have at least 10 events for each parameter estimated should hold, considering that I start my model-build iterations with no more than 35 variables and finalize the model with fewer than 10 variables.
Option 2: I can go with PROC LOGISTIC (Firth's method, using penalized likelihood). The Firth method could be helpful in reducing any small-sample bias of the estimators.
Questions for Option 1:
-- With a response rate of 0.6% (374 events in a total of 61279 records), is it advisable to split the 61279 records in a 70:30 ratio for model build and in-time validation? I know that if I do this, I am reducing the number of responses in the model-build data set from 374 to 262. I am forced to think this way because I am unable to obtain any data for out-of-time validation testing, and if I do not split the 61279 records 70:30 for model build and in-time validation, I cannot validate my model in any way.
-- Please let me know: if I have more than 35 predictors initially to start the model-build process, is it recommended to use PROC LOGISTIC (conventional ML), with the understanding that I may have to collapse certain categorical levels to rule out cases of quasi-complete or complete separation, and considering the rule of thumb that you should have at least 10 events for each parameter estimated?
Questions for Option 2:
-- As above: with a response rate of 0.6% (374 events in a total of 61279 records), is it advisable to split the 61279 records in a 70:30 ratio for model build and in-time validation, given that this reduces the responses in the model-build data set from 374 to 262 and that, without out-of-time data, it is the only way I can validate the model? (A sketch of how I would draw such a stratified 70:30 split follows these questions.)
-- Please let me know: if I have more than 35 predictors initially to start the model-build process, is it recommended to use PROC LOGISTIC (Firth's method, using penalized likelihood), with the understanding that I do NOT have to collapse any categorical levels to rule out cases of quasi-complete or complete separation?
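For reference, a minimal sketch of how I would draw the stratified 70:30 split so that both parts keep the 0.6% event rate; the data set name FULL and the response variable RESPOND below are placeholders.

/* 70:30 stratified split that preserves the event rate in both */
/* the model-build and the in-time validation data sets         */
proc sort data=full;
   by respond;
run;

proc surveyselect data=full out=full_flagged
                  method=srs samprate=0.70 seed=20240101
                  outall;                  /* keep all rows, flag the selected ones */
   strata respond;
run;

data train validate;
   set full_flagged;
   if selected = 1 then output train;      /* ~70%: model build        */
   else output validate;                   /* ~30%: in-time validation */
run;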
Thanks
Surajit
Please do help me with which option is recommended in my case: Option 1, Option 2, or Option 3? If Option 3, is it recommended to use an oversampled event rate of 2% or 3% instead, in order to increase the number of records to be modeled above 6487?
I don't know that one could say in most situations, let alone in a specific situation, which approach will be most effective, since this can differ depending on the data and on the way the relationships among the variables might be changing over time. The model that performs best on the data you train on matters less than the one that works best in practice. It is common to build multiple models and to pick the champion and the challengers based on their performance in practice. I would typically say: build multiple models, weigh your business judgement, your needs for interpretation, and the statistical metrics of interest (each modeler might prefer different ones for different reasons), and go with it. Over time, you should find yourself knowing which approach is best for your modeling situation, and refitting a new model can then happen more quickly. Given the speed at which models can be built with modern software, you really shouldn't have to choose in advance.
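As one concrete way to follow that advice, here is a minimal sketch that fits both of your candidates on the same training data and scores the same validation data, so the two approaches can be compared on held-out performance. The data set names TRAIN and VALIDATE, the response RESPOND, and the predictors X1-X3 are placeholder assumptions.

proc logistic data=train descending;
   model respond = x1 x2 x3;                  /* Option 1: conventional ML    */
   score data=validate out=scored_ml fitstat;
run;

proc logistic data=train descending;
   model respond = x1 x2 x3 / firth;          /* Option 2: Firth penalization */
   score data=validate out=scored_firth fitstat;
run;

Comparing the validation fit statistics (and, for example, the lift in the top deciles) between the two scored data sets is usually more informative than anything computed on the training data alone.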
Hope this helps!
Doug