07-12-2011 01:49 PM
Hi, I have finished modelling binary response data using logistic regression. Given the unusual nature of the data, I used two methods and compared the results.
In the dataset, the binary dependent variable y has a very low probability of 3% for y=1, so this can be classified as a rare-events problem. The data sample is very large, with 550,000 responses. Most of the variables are nominal (only two are numeric). The two methods I used are:
1) Use PROC GENMOD to run the model on the entire dataset. At each step, the least significant variable (judged by its Type 3 p-value against a 1% significance level) was excluded manually. This is like stepwise selection, but based on personal judgement.
2) Randomly resample the original dataset into 32 sub-samples. Each sub-sample contains all the cases with y=1 (i.e. the rare events) and 3 times more cases with y=0, so the ratio is about 1:4. For each of the 32 sub-samples, a logistic regression analysis was performed. The final result of method 2) is a composite of the results from all 32 individual models: the intercept and coefficients are the averages of the individual estimates. Finally, the intercept is adjusted using the formula b' = b - log(f1/f2), where f1 and f2 are the sampling fractions for y=1 and y=0 respectively. The regression coefficients need no adjustment because, in a logit model, sampling on the outcome shifts only the intercept.
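To make the compositing and intercept adjustment in method 2) concrete, here is a minimal Python sketch (not the SAS code I actually ran). The function names, the variable name x1, and all the numbers in the example are made up for illustration; only the formula b' = b - log(f1/f2) comes from the description above.

```python
import math
from statistics import mean


def average_coefficients(fits):
    """Average each parameter across the sub-sample fits.

    `fits` is a list of dicts mapping parameter name -> estimate,
    one dict per sub-sample model (hypothetical input format).
    """
    names = fits[0].keys()
    return {name: mean(fit[name] for fit in fits) for name in names}


def adjust_intercept(b, f1, f2):
    """Apply b' = b - log(f1/f2).

    f1 = sampling fraction for y=1, f2 = sampling fraction for y=0.
    Only the intercept is corrected; the slopes are left unchanged.
    """
    return b - math.log(f1 / f2)


# Illustrative estimates from two sub-sample models (invented values).
fits = [
    {"Intercept": -1.10, "x1": 0.50},
    {"Intercept": -1.00, "x1": 0.54},
]
composite = average_coefficients(fits)

# Assumed sampling fractions: all events kept, 10% of non-events kept.
f1, f2 = 1.0, 0.10
composite["Intercept"] = adjust_intercept(composite["Intercept"], f1, f2)
print(composite)
```

Because f1 is larger than f2 whenever the non-events are down-sampled, log(f1/f2) is positive and the adjusted intercept is lower than the averaged one, which pulls the predicted probabilities back towards the rare-event rate in the full data.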
Comparing the results of 1) and 2), method 1) retained 4 more variables in the final model. Personally, I prefer method 2), which is recommended for dealing with rare events. But for method 1), I am trying to understand why it found more variables to be significant.
Thanks for sharing your knowledge