Modelling rare events with logistic regression

07-12-2011 01:49 PM

Hi, I have finished modelling binary response data using logistic regression. Given the unusual nature of the data, I used two methods and compared the results.

In the dataset, the binary dependent variable y has a very low probability of 3% for y=1, so this can be classified as a rare-events problem. The data sample is very large, with 550,000 responses. Most of the variables are nominal (only two are numeric). The two methods I used are:

1) Use PROC GENMOD to run the model on the entire dataset. At each step, the least significant variable (Type 3 p-value above 1%) was excluded manually. This is like stepwise selection, but based on personal judgement.

2) Randomly resample the original dataset into 32 sub-samples. Each sub-sample contains all the cases with y=1 (i.e. the rare events) and 3 times more cases with y=0, so the ratio is about 1:4. A logistic regression was fitted to each of the 32 sub-samples. The final result of method 2) is a composite of all 32 individual models: the intercept and coefficients are the averages of the individual estimates. Finally, the intercept is adjusted using the formula b' = b - log(f1/f2), where f1 and f2 are the sampling fractions for y=1 and y=0 respectively. The regression coefficients need no adjustment, because in a logit model sampling on the outcome affects only the intercept.
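For readers who want to see the arithmetic of method 2), here is a minimal Python sketch of the averaging and intercept-adjustment steps. It is not the code used in the post: the function names and the three sets of estimates are hypothetical, and the sampling fraction f2 assumes that each sub-sample keeps every event plus three times as many non-events.

```python
import math


def average_coefficients(fits):
    """Average each coefficient across the per-sub-sample fits.

    `fits` is a list of dicts mapping a term name to its estimate;
    every dict is assumed to contain the same terms.
    """
    terms = fits[0].keys()
    return {t: sum(f[t] for f in fits) / len(fits) for t in terms}


def corrected_intercept(b, f1, f2):
    """Apply b' = b - log(f1 / f2), where f1 and f2 are the sampling
    fractions for y=1 and y=0 respectively."""
    return b - math.log(f1 / f2)


# Hypothetical estimates from three sub-sample fits (not real results).
fits = [
    {"Intercept": -1.10, "x1": 0.42},
    {"Intercept": -1.05, "x1": 0.40},
    {"Intercept": -1.15, "x1": 0.44},
]
pooled = average_coefficients(fits)

# Sampling fractions implied by the post's figures, ASSUMING 3% of the
# 550,000 rows are events and each sub-sample keeps every event plus
# three times as many non-events.
n_events = 0.03 * 550_000             # about 16,500 events
n_nonevents = 550_000 - n_events      # about 533,500 non-events
f1 = 1.0                              # every event is kept
f2 = 3 * n_events / n_nonevents       # share of non-events kept

# Only the intercept is shifted; the slope for x1 is left as averaged.
pooled["Intercept"] = corrected_intercept(pooled["Intercept"], f1, f2)
print(pooled)
```

Because events are over-represented in each sub-sample (f1 > f2), the adjustment lowers the intercept, which pulls predicted probabilities back towards the event rate in the full data.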

Comparing the results of 1) and 2) showed that method 1) retained 4 more variables in the final model. Personally, I prefer method 2), which is recommended for dealing with rare events. But for method 1), I am trying to understand why it found more variables to be significant.

Thanks for sharing your knowledge

Posted in reply to Ruth

07-12-2011 03:55 PM

Noise.

The additional parameters probably fit the residual noise after fitting all of the other variables.

Posted in reply to SteveDenham

07-13-2011 04:31 AM

Hi Steve, so the additional parameters are really nonsense?

Posted in reply to Ruth

07-13-2011 07:28 AM

That is how I would view them, at least as compared to method 2.