07-08-2011 10:51 AM
Hello, sorry to bother you again, but I have run into a problem with my results.
Basically I am using logistic regression to predict the event of cancellation (0, 1). The probability of cancellation (y=1) is very small. In the data, the probability that y=1 is only 3% (16694 out of the sample size 551346).
The logistic regression analysis identified a set of variables that are significant for y. But when I checked the residuals, the results looked highly questionable. Every observation with a residual larger than 50 falls into the category y=1. In fact, there are 340 observations with residuals larger than 1000.
In this case, is it still appropriate to use logistic regression for this data?
I have an idea and would like to discuss its feasibility. Instead of using all the non-cancellation observations (i.e. y=0), I can take a small sample of them (say 3%). The logistic regression analysis based on the reduced sample (3% of y=0, 3% of y=1) will produce a model whose parameters need to be adjusted. By the properties of logit models, the slope estimates stay the same, but the intercept changes because of the sampling. With the intercept adjusted, the model estimates can become reasonable.
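The intercept adjustment described here can be sketched as follows. This is only an illustration, assuming that all y=1 observations are kept and 3% of the y=0 observations are drawn at random; the function names and the sampling fractions are hypothetical. Under that kind of outcome-dependent sampling, the intercept fitted on the subsample is shifted by log(fraction of events kept / fraction of non-events kept), so subtracting that term restores the population scale. An intercept-only model is used so the result can be checked against the counts given above.

```python
import math

def corrected_intercept(sample_intercept, frac_events_kept, frac_nonevents_kept):
    """Undo the intercept shift caused by sampling y=1 and y=0 at different rates."""
    return sample_intercept - math.log(frac_events_kept / frac_nonevents_kept)

def predicted_probability(intercept, slopes, x):
    """Logistic prediction; the slopes are used unchanged after the correction."""
    z = intercept + sum(b * v for b, v in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-z))

# Counts taken from the post above.
n_events, n_total = 16694, 551346
n_nonevents = n_total - n_events

# Assumed sampling scheme: keep every y=1, keep 3% of y=0.
keep_events, keep_nonevents = 1.0, 0.03

# For an intercept-only model, the fitted intercept is the log-odds of y=1
# in the subsample (expected counts are used here instead of a real draw).
sample_b0 = math.log((keep_events * n_events) / (keep_nonevents * n_nonevents))

b0 = corrected_intercept(sample_b0, keep_events, keep_nonevents)
p = predicted_probability(b0, [], [])  # recovers the full-data event rate, about 3%
```

With predictors in the model, the same corrected intercept would be passed to predicted_probability together with the slopes estimated on the subsample.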
Thanks a lot for your advice.
07-08-2011 01:56 PM
3% on the outcome variable is not too small to use for a LOGISTIC model. It sounds like you might have some very rare events (or combinations of rare events) for predictors and that is upsetting your results.
Frank Harrell, in his book "Regression Modeling Strategies", has suggestions for how to deal with these.
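One way to look for such rare predictor levels is to tabulate the outcome within each level of a categorical predictor and flag levels that contain only 0s, only 1s, or very few observations. A level with no variation in the outcome pushes its coefficient toward plus or minus infinity, which can show up as unstable estimates. The sketch below is a minimal illustration; the function name, the min_count threshold, and the toy data are all hypothetical.

```python
from collections import defaultdict

def flag_sparse_levels(levels, outcomes, min_count=5):
    """Return {level: (n_y0, n_y1)} for levels with no 0s, no 1s, or too few rows."""
    counts = defaultdict(lambda: [0, 0])  # level -> [count of y=0, count of y=1]
    for level, y in zip(levels, outcomes):
        counts[level][y] += 1
    flagged = {}
    for level, (n0, n1) in counts.items():
        if n0 == 0 or n1 == 0 or n0 + n1 < min_count:
            flagged[level] = (n0, n1)
    return flagged

# Toy data: level "C" has only cancellations (y=1), so it is flagged.
plan = ["A"] * 6 + ["B"] * 6 + ["C"] * 2
cancelled = [0, 0, 0, 0, 0, 1] + [0, 0, 0, 1, 0, 0] + [1, 1]

flagged = flag_sparse_levels(plan, cancelled)
```

The same check can be run on each predictor in turn, or on a concatenated key (for example, a tuple of two predictors) to examine combinations of levels.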
07-09-2011 06:08 PM
Many thanks for the idea. Sorry, I don't understand why rare events can upset the results. Could you please explain it a little more? I guess that I need to identify the combinations of rare events that have all 1s or all 0s as the outcome. But how? The data have 20 variables, each of which has at least 5 levels.
I am happy to have the book Regression Modeling Strategies in hand. I checked but am not quite sure which section deals with this problem.
07-08-2011 02:39 PM
Rare events are par for the course in business applications like acquisition, attrition, cross-sell, credit risk, fraud, etc. All of them have low incidence rates, often below 1%. Becoming comfortable with these situations and developing a strategy to deal with them is quite important.
The reference suggested by Doc is a good starting point. They typically revolve around
In practice, some people favor one method over another; for business, though, that argument is moot. What is worth money to businesses is, above all else, a consistent ability to provide a high degree of differentiation.
Often it is also necessary to modify the performance criteria. For example, there may only be capacity to investigate a small number of cases. What matters then is not a global performance metric such as AUROC or KS; instead, attention is paid only to performance at the very tip of the model.
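Performance at the tip of the model can be sketched like this: rank the cases by predicted score, keep only as many as can be investigated, and measure how many actual events land in that group. This is an illustration only; the function name, the capacity parameter, and the toy scores are hypothetical.

```python
def top_k_performance(scores, outcomes, capacity):
    """Return (precision, capture rate) among the `capacity` highest-scored cases."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    hits = sum(y for _, y in ranked[:capacity])
    total_events = sum(outcomes)
    precision = hits / capacity  # share of investigated cases that are events
    capture = hits / total_events if total_events else 0.0  # share of events found
    return precision, capture

# Toy example: 10 scored cases, 3 actual events, capacity to investigate 2.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
outcomes = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

precision, capture = top_k_performance(scores, outcomes, capacity=2)
```

Two models with similar AUROC can differ noticeably on these two numbers, which is why the criterion is chosen to match the available capacity.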
At the end of the day, you use statistics to get what you want, rather than letting statistics control what you do. Real business decisions often extend beyond statistics.
07-09-2011 06:29 PM
I appreciate your knowledge. Thanks! I think I need to find a book on logit models to read.
Please allow me to discuss this real-life problem. You mean that even when the event y has a low probability, p(y=1)<1%, we can still apply a logit model to it. But even for an excellent model, the prediction success rate will be lower than that of predicting that y is always equal to 0. That is, given that the event has a very low probability of 1% in the observed data, I can simply guess that y is always false (y=0), which has a successful guess rate of 99%.
Nevertheless, it is still valuable to apply a logit model to identify significant predictors (x's). Suppose the observed probability of y=1 is 1%. I can take another 1% from the y=0 observations and combine the two into a sample. After fitting the logit model, I simply adjust the intercept, not the coefficients, which are assumed to be unaffected by the sampling. A potential problem with this approach is that some rare categories (or combinations) can be lost, because only 1% of the y=0 observations are drawn.
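The arithmetic behind the "always guess y=0" rule can be written out with the counts from the first post (16694 events out of 551346). This is a small sketch with a hypothetical function name; it shows that the rule is right about 97% of the time on those data while identifying none of the cancellations.

```python
def always_zero_metrics(n_events, n_total):
    """Accuracy and event capture rate for the rule 'predict y=0 for everyone'."""
    accuracy = (n_total - n_events) / n_total  # every non-event is guessed correctly
    events_found = 0                           # no case is ever predicted as y=1
    capture = events_found / n_events if n_events else 0.0
    return accuracy, capture

accuracy, capture = always_zero_metrics(16694, 551346)
```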