bncoxuk
Obsidian | Level 7

This is a very tricky but interesting problem. I hope some statistical expert can really help. Below is the description of the data and analysis.

1) Data

1 million cases and 40 categorical variables, each with between 2 and 50 levels. The dependent variable is continuous and approximately normally distributed.

2) Analysis: linear regression with effect coding for all categorical variables.

Step 1): The intercept (i.e., overall mean) was estimated to be 570, with all the final significant variables in the model.

Step 2): After removing 1307 exceptional cases (leverage > 2p/n and standardized residuals > 2), the intercept became 270.
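For readers who want to see the screening rule in Step 2 spelled out, here is a minimal NumPy sketch on made-up data. The sizes, coefficients, and the planted unusual case are all hypothetical, and it assumes "standardized residual" means the internally studentized residual and that the cutoff is applied to its absolute value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3  # hypothetical sizes; p counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([5.0, 1.0, -2.0]) + rng.normal(size=n)

# Plant one unusual case: extreme predictor values and a response far off the trend.
X[0, 1:] = [8.0, -8.0]
y[0] = 100.0

# Ordinary least squares fit and raw residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage is the diagonal of the hat matrix H = X (X'X)^-1 X'.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Internally studentized residuals (assumed meaning of "standardized").
mse = resid @ resid / (n - p)
std_resid = resid / np.sqrt(mse * (1.0 - leverage))

# The rule from Step 2: high leverage AND a large standardized residual.
flagged = (leverage > 2 * p / n) & (np.abs(std_resid) > 2)
print("flagged cases:", np.flatnonzero(flagged))
```

Note that this rule only flags cases that are both far out in the predictor space and poorly fit, which is exactly the kind of case that can pull coefficients a long way.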

The intercept changed drastically, from 570 to 270, even though only 1307 cases were deleted out of a sample of 1 million. Because the intercept stands for the overall mean, this causes a problem for interpreting the model: how can the overall mean change so much when only 1307 cases are deleted?

I don't know how the intercept (overall mean) is estimated in a general linear model. Can anyone recommend a reference book?
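On how the intercept is estimated: with effect (deviation) coding and a single factor, the least-squares intercept is the unweighted average of the level means, which differs from the sample mean when the levels have unequal counts. The sketch below illustrates this on hypothetical data with three levels; the counts and level means are invented for illustration, and the 40-variable model in the question is more complicated than this.

```python
import numpy as np

# Hypothetical one-factor data: three levels with very unequal counts.
levels = np.array([0] * 900 + [1] * 90 + [2] * 10)
rng = np.random.default_rng(2)
true_means = np.array([100.0, 400.0, 1000.0])
y = true_means[levels] + rng.normal(scale=5.0, size=levels.size)

# Effect coding: one column per level except the last,
# and the last level is coded -1 in every column.
k = 3
E = np.zeros((y.size, k - 1))
for j in range(k - 1):
    E[levels == j, j] = 1.0
E[levels == k - 1, :] = -1.0
X = np.column_stack([np.ones(y.size), E])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = beta[0]

level_means = np.array([y[levels == j].mean() for j in range(k)])
unweighted_mean = level_means.mean()  # each level counts equally
sample_mean = y.mean()                # each case counts equally

print(f"intercept              = {intercept:.2f}")
print(f"mean of level means    = {unweighted_mean:.2f}")
print(f"sample mean of y       = {sample_mean:.2f}")
```

Because each level counts equally in that average, a sparsely populated level has as much say in the intercept as a heavily populated one, so deleting a handful of cases from a small level can move the intercept a long way.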

Please help. Thanks.

3 REPLIES
Ksharp
Super User

Those deleted observations may be valuable.

Did you check Cook's distance to see the contribution of these observations to your model?
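As a rough illustration of what Cook's distance measures, here is a minimal NumPy sketch using the standard closed form D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2. The function name `cooks_distance` and the toy data are hypothetical; it is not the SAS implementation.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each case of an OLS fit (X includes the intercept column)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage: diagonal of the hat matrix X (X'X)^-1 X'.
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    mse = resid @ resid / (n - p)
    return resid**2 / (p * mse) * h / (1.0 - h) ** 2

# Toy data with one planted influential case (all values hypothetical).
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(size=n)
X[0, 1], y[0] = 6.0, -20.0

d = cooks_distance(X, y)
most_influential = int(np.argmax(d))
print("most influential case:", most_influential)
```

A large value means that dropping that single case would shift the whole set of fitted coefficients noticeably, which is the contribution the question above is asking about.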

Ksharp

bncoxuk
Obsidian | Level 7

Thanks, Ksharp. Cook's distance helped with the model.

SteveDenham
Jade | Level 19

Quite possibly a mixture problem. It looks like 0.13% of the data significantly elevate the intercept. That can happen. Consider a very rural county where 200 people live, with a mean net worth of, say, $50,000 per person. All of a sudden Bill Gates moves in, worth, say, $20B. The mean net worth is now about 1991 times as large, with a change in only 0.5% of the data.
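The arithmetic behind that example can be checked in a few lines; the variable names are just for illustration.

```python
residents = 200
mean_before = 50_000          # dollars per person
newcomer = 20_000_000_000     # one very wealthy arrival, $20B

# New mean after one person joins the county.
mean_after = (residents * mean_before + newcomer) / (residents + 1)
ratio = mean_after / mean_before
share = 1 / (residents + 1)   # fraction of the data that changed

print(f"mean after  = {mean_after:,.0f}")   # 99,552,239
print(f"ratio       = {ratio:.1f}")          # 1991.0
print(f"data share  = {share:.2%}")          # 0.50%
```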

I would wager that those 1307 exceptional cases, when examined separately, tell you something quite interesting.

Steve Denham
