Ruth
Fluorite | Level 6

Hi, I have finished modelling binary response data using logistic regression. Because the data are unusual, I used two methods and compared the results.

In the dataset, the binary dependent variable y has a very low probability of 3% for y=1, so this can be classified as a rare-events problem. The sample is very large, with 550,000 responses. Most of the variables are nominal (only two are numeric). The two methods I used are:

1) Use PROC GENMOD to fit the model on the entire dataset. At each step, the least significant variable (judged by its Type 3 p-value against a 1% significance level) was excluded manually. This is like stepwise selection, but based on personal judgement.

2) Randomly resample the original dataset into 32 sub-samples. Each sub-sample contains all the cases with y=1 (i.e. the rare events), and 3 times more cases with y=0. So the ratio is about 1:4. For each of the 32 sub-samples, a logistic regression analysis was performed. The final result of method 2) is the composite of the results from all 32 individual models. That is, the intercept and coefficients are the averages of the individual results. Finally, the intercept is adjusted using the formula b' = b - log(f1/f2), where f1 and f2 are the sampling fractions for y=1 and y=0 respectively. The regression coefficients need no adjustment because of the special properties of logit models.
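
As a minimal sketch of the intercept adjustment described above (not the code used for the actual analysis), the following Python snippet applies b' = b - log(f1/f2) to an intercept-only model. The counts are assumptions derived from the figures in the post: exactly 3% events among 550,000 rows, every event kept (f1 = 1), and three non-events drawn per event in each sub-sample. The function names are hypothetical.

```python
import math


def corrected_intercept(b_sub, f1, f2):
    """Apply b' = b - log(f1/f2) to an intercept fitted on a sub-sample.

    f1 and f2 are the sampling fractions for y=1 and y=0 respectively.
    """
    return b_sub - math.log(f1 / f2)


def inv_logit(x):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumed counts, taken from the 3% rate and 550,000 rows in the post.
n_total = 550_000
n_events = n_total * 3 // 100        # 16,500 cases with y=1
n_nonevents = n_total - n_events     # 533,500 cases with y=0

# Assumption: each sub-sample keeps every event and draws 3 non-events per event.
n_sub_nonevents = 3 * n_events
f1 = 1.0
f2 = n_sub_nonevents / n_nonevents

# Intercept-only model on the sub-sample: the intercept is the log-odds
# of the sub-sample event rate (events : non-events = 1 : 3).
b_sub = math.log(n_events / n_sub_nonevents)

b_adj = corrected_intercept(b_sub, f1, f2)

print("sub-sample event rate:", inv_logit(b_sub))
print("corrected event rate: ", inv_logit(b_adj))
```

With these assumed counts, the sub-sample event rate of 25% maps back to the population rate of 3% after the correction. Only the intercept moves because sampling on y multiplies the odds by the constant f1/f2 at every covariate value, which adds log(f1/f2) to the intercept and leaves the slopes unchanged.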

Comparing the results of 1) and 2), method 1) retained 4 more variables in the final model. Personally, I prefer method 2), which is recommended for dealing with rare events. But I am trying to understand why method 1) found more variables to be significant.

Thanks for sharing your knowledge.

3 REPLIES
SteveDenham
Jade | Level 19

Noise.

The additional parameters probably fit the residual noise after fitting all of the other variables.

Ruth
Fluorite | Level 6

Hi Steve, so the additional parameters are really nonsense?

SteveDenham
Jade | Level 19

That is how I would view them, at least as compared to method 2.
