Hello, sorry to bother you again, but I have run into a problem with the results. I am using logistic regression to predict cancellation (y = 0 or 1). Cancellation (y=1) is rare: in the data only about 3% of observations have y=1 (16,694 out of a sample of 551,346).

The logistic regression identified a set of variables that are significant for y. But when I checked the residuals, the results looked questionable. Every observation with a residual larger than 50 has y=1, and 340 observations have residuals larger than 1000. In this case, is it still appropriate to use logistic regression for this data?

I have an idea and would like to discuss whether it is feasible. Instead of using all the non-cancellation observations (i.e. y=0), I could take a small random sample of them (say 3%). A logistic regression fitted on the reduced sample (the 3% sample of y=0 together with the y=1 observations, which are about 3% of the data) will produce a model whose parameters need to be adjusted. By the properties of logit models, the slope estimates stay essentially the same under this kind of sampling, but the intercept changes. With the intercept adjusted, the model estimates should become reasonable.

Thanks a lot for your advice.
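To make the intercept adjustment concrete, here is a minimal sketch in Python. It assumes that all y=1 observations are kept (sampling fraction 1.0) and that a 3% random sample of the y=0 observations is kept; those fractions, the function name `adjust_intercept`, and the intercept-only check are illustrative assumptions, not something taken from the fitted model. Under outcome-dependent sampling, the usual correction subtracts ln(r1/r0) from the fitted intercept, where r1 and r0 are the sampling fractions for y=1 and y=0.

```python
import math

def adjust_intercept(intercept_sampled, frac_events, frac_nonevents):
    """Shift an intercept fitted on the reduced sample back to the
    full-data scale. frac_events and frac_nonevents are the fractions
    of y=1 and y=0 observations kept (assumed known)."""
    return intercept_sampled - math.log(frac_events / frac_nonevents)

# Counts from the post.
n_total = 551346
n_events = 16694
n_nonevents = n_total - n_events

# Assumed sampling scheme: keep every y=1, keep 3% of y=0.
frac_events = 1.0
frac_nonevents = 0.03

# Sanity check with an intercept-only model, where the fitted
# intercept is just the log odds of y=1 in the data being used.
intercept_full = math.log(n_events / n_nonevents)
intercept_sampled = math.log(
    (n_events * frac_events) / (n_nonevents * frac_nonevents)
)
intercept_adjusted = adjust_intercept(
    intercept_sampled, frac_events, frac_nonevents
)

# The adjusted value should match the full-data intercept.
print(intercept_full, intercept_sampled, intercept_adjusted)
```

In the intercept-only case the adjusted value recovers the full-data log odds exactly. With covariates in the model, the same shift would be applied to the fitted intercept while the slope estimates are left as they are.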