Using proc logistic for rare events-low probabilit...


08-17-2012 11:13 AM

Hi,

I am developing a cross-sell model based on response to a previous campaign. The issue is that I am modelling a rare event: the proportion of responders is only 6%, compared to 94% non-responders. Having done some research, I oversampled to get a 50-50 distribution (keeping all responders). The proc logistic model runs well and produces probability scores that look good. However, I know these need to be adjusted, as they are not representative of the overall population of non-responders. I have tried different methods suggested in SAS forums (weighting/offset/prior probabilities), and all of them dramatically reduce the predicted probabilities. Is there a way to maintain the predicted scores as they are with oversampling? Any help is much appreciated.

Thanks..

Accepted Solutions

Solution

08-17-2017
10:39 AM


Posted in reply to venkatm

08-17-2017 10:38 AM

There are quite a lot of different but related issues raised by your post. I'll try to handle them one by one.

Having done some research, I oversampled to get a 50-50 distribution (keeping all responders).

This is something that I typically recommend against. The notion of oversampling to 50/50 is a misapplication of the fact that, for a fixed sample size, the most power for discriminating between two groups comes from having an equal number of each group in the sample. This gets misinterpreted as needing to oversample to 50/50 for more power, but since downsampling to 50/50 reduces your sample size, the principle does not apply. Depending on how rare your event is, you might end up with poor representation of your non-events. For example, if you have a 1% response rate and downsample to 50/50, only about 1% of your original non-events remain to represent the 99% majority class.
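To make that arithmetic concrete, here is a small Python sketch (the 100,000 population is a made-up figure; only the 1% response rate comes from the example above):

```python
# Hypothetical illustration: how a 50/50 downsample shrinks the
# non-event sample when the response rate is 1%.
population = 100_000
event_rate = 0.01

events = int(population * event_rate)     # 1,000 responders (all kept)
non_events = population - events          # 99,000 non-responders
sampled_non_events = events               # matched 1:1 for a 50/50 split

share_retained = sampled_non_events / non_events
print(f"{share_retained:.1%} of non-events represent the 99% majority class")
# -> 1.0% of non-events represent the 99% majority class
```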

The proc logistic model runs well and produces probability scores which look good. But I know that these need to be adjusted, as they are not representative of the overall population of non-responders.

The probability scores only look 'good' because the data has been oversampled to 50/50. Were those probabilities estimated from a data set that looked like the population, they would be more realistic and not 'look' as good. You can do posterior adjustment on these probabilities to make them look more realistic but that by definition will make them not 'look' as good.

Consider the following arguments:

* posterior adjustment methods simply rescale the original probabilities to be centered closer to the expected population rate, so the impact on individual probability estimates will vary

* posterior adjustment will not change the sort order of scores for a given model, so the adjusted probabilities will be ordinally equivalent to the unadjusted scores

* models typically fit better on their training data than on data they are used to score, so the real evaluation of model performance comes from performance on new data

* given the above, it might not be necessary to 'adjust' the probabilities at all since even the 'adjusted' probabilities might not be meaningful when you consider that a probability is something calculated for a group and not an individual that either has the event of interest or does not have the event of interest

* you might be better off using more of your non-events to make sure that you are adequately modeling the features of the non-respondents and (hopefully) obtain a more discriminating model despite probabilities that don't 'look' as good.
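The prior-correction arithmetic behind the first two points can be sketched in Python. The 6% population rate comes from the original question; the three example scores are made up:

```python
# Sketch of posterior (prior-correction) adjustment: rescale a probability
# estimated on a 50/50 oversample back to the 6% population prior.
pop_prior, sample_prior = 0.06, 0.50   # population event rate; training mix

def adjust(p):
    """Rescale an oversampled probability to the population prior."""
    num = p * pop_prior / sample_prior
    den = num + (1 - p) * (1 - pop_prior) / (1 - sample_prior)
    return num / den

scores = [0.9, 0.6, 0.3]               # hypothetical oversampled scores
adjusted = [adjust(p) for p in scores]

# Adjusted scores are much smaller, but the sort order is unchanged.
print([round(a, 3) for a in adjusted])
# -> [0.365, 0.087, 0.027]
assert adjusted == sorted(adjusted, reverse=True)
```

Note that a score of 0.5 on the balanced sample maps to exactly the 6% population rate, which is why the adjusted probabilities stop 'looking' as good while carrying the same ordinal information.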

I have tried different methods suggested in SAS forums (weighting/offset/prior probabilities), and all of them dramatically reduce the predicted probabilities. Is there a way to maintain the predicted scores as they are with oversampling?

As I mentioned above, that is to be expected: the adjusted scores reflect the more realistic probabilities you would get from a less over-sampled data set. The artificially high probabilities from the oversampled data are simply not realistic, so they look 'good' but are not meaningful, except for the way they allow you to sort the scored observations from most likely to least likely.
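For reference, the offset method you tried amounts to shifting the model's logit by a constant, which is algebraically equivalent to posterior prior-correction; a small Python sketch (assumed 6% population rate, 50/50 sample, hypothetical score):

```python
import math

# The offset method subtracts ln(rho1*(1-pi1) / ((1-rho1)*pi1)) from the
# model's logit, where pi1 is the population event rate (6%) and rho1 is
# the sample event rate (50%). The result is the same rescaling that
# posterior prior-correction applies to the probabilities directly.
pop_prior, sample_prior = 0.06, 0.50
offset = math.log(sample_prior * (1 - pop_prior)
                  / ((1 - sample_prior) * pop_prior))

def adjust_via_offset(p):
    """Correct an oversampled probability by shifting its logit."""
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-(logit - offset)))

print(round(adjust_via_offset(0.9), 3))
# -> 0.365
```

So whichever mechanism you use (weighting, offset, or prior probabilities), the large drop in the predicted probabilities is the correction working as intended, not a flaw to be undone.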

I hope this helps!

Doug
