11-16-2011 10:17 AM
This is a very tricky but interesting problem. I hope some statistical expert can really help. Below is the description of the data and analysis.
1) Data: 1 million cases, 40 categorical variables with levels ranging from 2 to 50. The dependent variable is continuous and approximately normally distributed.
2) Analysis: linear regression with effect coding for all categorical variables.
Step 1: With all of the finally significant variables in the model, the intercept (i.e., the overall mean) was estimated to be 570.
Step 2: After removing 1,307 exceptional cases (leverage > 2p/n and standardized residuals > 2), the intercept dropped to 270.
For this analysis, the intercept changed dramatically, from 570 to 270, even though only 1,307 cases were deleted out of a sample of 1 million. Because the intercept stands for the overall mean under effect coding, this creates a potential problem for interpreting the model: how can the overall mean change so much when only 1,307 cases are deleted?
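A back-of-the-envelope check shows this is arithmetically possible. Treating the intercept as if it were a plain mean (a simplifying assumption here; under effect coding with unbalanced data the intercept is only approximately the overall mean), the 1,307 deleted cases would need an enormous average value to account for the shift:

```python
# All numbers come from the post; the calculation assumes the intercept
# behaves like a simple weighted mean, which is only an approximation
# under effect coding with unbalanced cells.
n_total = 1_000_000
n_removed = 1_307
intercept_all = 570.0      # with every case in the model
intercept_trimmed = 270.0  # after the 1307 exceptional cases are removed

n_kept = n_total - n_removed
# Mean the removed cases would need so the two weighted means reconcile:
implied = (n_total * intercept_all - n_kept * intercept_trimmed) / n_removed
print(round(implied))  # → 229803
```

So the removed cases would need an average response around 230,000 on a scale where the bulk of the data averages a few hundred, i.e., they must be wildly extreme, which is exactly what a leverage/residual screen selects for.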
I don't know how the intercept (overall mean) is estimated in the general linear model. Can anyone recommend a reference book?
Please help. Thanks.
11-17-2011 10:15 AM
Quite possibly a mixture problem. It looks like 0.13% of the data significantly elevate the intercept. That can happen. Consider the mean net worth of a very rural county where 200 people live, each worth say $50,000. All of a sudden Bill Gates moves in, worth say $20B. The mean net worth is now 1,991 times as large, from a change in only 0.5% of the data.
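The arithmetic in that example checks out:

```python
# The rural-county example above, worked through with its own numbers.
n = 200
mean_before = 50_000.0  # per-person net worth before the move
gates = 20e9            # one extreme newcomer

mean_after = (n * mean_before + gates) / (n + 1)
print(round(mean_after / mean_before))  # → 1991
```

One person out of 201 (≈0.5% of the data) multiplies the mean by roughly 1,991, the same mechanism suspected for the 1,307 cases above.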
I would wager that those 1307 exceptional cases, when examined separately, tell you something quite interesting.