Checking ANOVA assumptions visually using residual plots

ANOVA assumes that residuals (errors) are normally distributed and terms have equal variance (homoscedasticity, antonym heteroscedasticity). Professional statisticians frequently check ANOVA assumptions visually.

We use a dataset that formed the basis of a paper describing the response of Calluna (heath) plants to nitrogen and drought. Nitrogen, plant source (heathland), and drought were applied in a 2×2×2 factorial design. Researchers randomized plants in a greenhouse, with 10 plant pots per treatment unit (n=10), tested over two years.

This dataset holds some interesting clues about nitrogen and drought effects on heath plants. But before relying too much on the output, we should test the assumptions. How is that done visually?

ods graphics on;
/* The ODS GRAPHICS statement enables plot output. */
proc mixed data=Heath.data plots(only)=(studentpanel(conditional) boxplot(conditional));
/* The PLOTS option requests two types of plots: first the studentized residual panel, and second the treatment-specific boxplots of residuals. The CONDITIONAL suboption bases the residuals on the full model specification, i.e. the factorial treatment structure plus the random Year effect. */
class Year Heathland Nitrogen Drought Replicate;
model 'dry weight above (g)'n = Drought Nitrogen Drought*Nitrogen Heathland Heathland*Drought Heathland*Nitrogen Heathland*Drought*Nitrogen;
random 'Year'n;
run;

Studentized Residuals Including Q-Q plot

The studentized residuals are clearly bimodal rather than normally distributed: the histogram shows two peaks, and the Q-Q plot departs from the reference line.

Bimodal distribution of variance
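
If you want to look at the residuals outside the built-in panel, one option is to write them to a data set and plot them yourself. The sketch below is a minimal example, not part of the original analysis. It assumes the same Heath.data table as above; work.heath_resid is a hypothetical output data set name, and StudentResid is the variable PROC MIXED is expected to add when the RESIDUAL option is specified (check the output data set if your release names it differently).

```sas
/* Assumption: name literals such as 'dry weight above (g)'n need this option */
options validvarname=any;

/* Refit the model and save conditional residuals to a data set.
   work.heath_resid is a hypothetical name chosen for this sketch. */
proc mixed data=Heath.data;
  class Year Heathland Nitrogen Drought Replicate;
  model 'dry weight above (g)'n = Drought Nitrogen Drought*Nitrogen
        Heathland Heathland*Drought Heathland*Nitrogen
        Heathland*Drought*Nitrogen
        / residual outp=work.heath_resid;
  random Year;
run;

/* Histogram with a fitted normal curve and kernel density, plus a Q-Q plot.
   Two peaks in the histogram or an S-shaped Q-Q plot suggest non-normality. */
proc univariate data=work.heath_resid;
  var StudentResid;
  histogram StudentResid / normal kernel;
  qqplot StudentResid / normal(mu=est sigma=est);
run;
```

Having the residuals in a data set also makes it easy to enlarge a single plot or to subset by treatment.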

By-Treatment Boxplots

Let’s take a look at the boxplots to try to understand trends of unexplained variance.

Unequal variance among watering treatments

Non-Homogenous Residual Variance

By far the widest range of residuals in the boxplots comes from the well-watered treatment, which appears to be the culprit for the unequal variance. The data points associated with the well-watered treatment spread both high and low. Perhaps individual plants responded to plentiful water either well or poorly. Next time, it might be useful to keep this in mind and capture watering response as an explanatory variable.
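
When one treatment level is clearly more variable than the others, a possible follow-up is to let the model estimate a separate residual variance for each level. The sketch below is an assumption about how one might do that here, not something the original analysis did. It uses the GROUP= option of the REPEATED statement in PROC MIXED with the same Heath.data table.

```sas
/* Assumption: name literals such as 'dry weight above (g)'n need this option */
options validvarname=any;

ods graphics on;
proc mixed data=Heath.data plots(only)=(studentpanel(conditional));
  class Year Heathland Nitrogen Drought Replicate;
  model 'dry weight above (g)'n = Drought Nitrogen Drought*Nitrogen
        Heathland Heathland*Drought Heathland*Nitrogen
        Heathland*Drought*Nitrogen;
  random Year;
  /* Assumption for this sketch: one residual variance per Drought level
     instead of a single pooled residual variance */
  repeated / group=Drought;
run;
```

The Covariance Parameter Estimates table then lists one residual variance per watering treatment, and the fit statistics (for example AIC) can be compared with those of the equal-variance model.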

Non-Normal Residual Variance

While the watering treatment represents a departure from equal variance, it was not the cause of the non-normal distribution. We can see this by reviewing the median residuals, which are similar between the two watering treatments. The non-normality was due to another factor: notice the skew in the boxplot medians for year and nitrogen. Digging into the data, the results point to the two years producing different drought and nitrogen treatment effects for above-ground dry weight. For this reason, it could be advisable to analyze each experiment independently by year.
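
A by-year analysis could look roughly like the sketch below. It is a minimal example under stated assumptions: work.heath_sorted is a hypothetical name for the sorted copy of Heath.data, and the random Year effect is dropped because each BY group contains only one year.

```sas
/* Assumption: name literals such as 'dry weight above (g)'n need this option */
options validvarname=any;

/* BY-group processing requires the data to be sorted by Year.
   work.heath_sorted is a hypothetical name chosen for this sketch. */
proc sort data=Heath.data out=work.heath_sorted;
  by Year;
run;

ods graphics on;
proc mixed data=work.heath_sorted plots(only)=(studentpanel boxplot);
  by Year;   /* fit the factorial model separately for each year */
  class Heathland Nitrogen Drought;
  model 'dry weight above (g)'n = Drought Nitrogen Drought*Nitrogen
        Heathland Heathland*Drought Heathland*Nitrogen
        Heathland*Drought*Nitrogen;
run;
```

Each year then gets its own residual panel and boxplots, so you can check whether the bimodal pattern disappears once the years are separated.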

Testing ANOVA assumptions need not be a checkbox exercise. The visual review of residuals allows researchers to make the most of their experiments and data models.

PaigeMiller · ‎06-15-2020

Very good article for beginners.

I think the first sentence has an omission. I think it should say "ANOVA assumes that residuals (errors) are independent and normally distributed and terms have equal variance (homoscedasticity, antonym heteroscedasticity)."

I would like to show this article to people at some point in time, but the graphics appear too small to really be useful. Can this be fixed by the author?

@ChrisHemedinger there is no author's name shown on this page, I believe that is also an omission.

ChrisHemedinger · ‎06-15-2020

The author is John Gottula, a SAS employee who focuses on AgTech (a renewed focus area for SAS). I'll reach out to see if he has a better version of these graphics. Thanks for the comments!

JackHamilton · ‎06-15-2020

The Statistical Analysis System's roots in agriculture are mostly unknown nowadays. It would be interesting to see a presentation on SAS's use in Ag now vs. then. Is there a completely different set of users, perhaps different crops or different farm sizes?

Graphics are much better now, and there's much more variety and power in modeling procedures, but I think box plots have been around for a long time.

PaigeMiller · ‎06-26-2020

@ChrisHemedinger your reply does not address my concern, or perhaps I didn't state it clearly enough.

There should be a by-line underneath the article title near the top of the page for these posts in the SAS Communities Library. The by-line can use the author's SAS Communities id, in this case jozgot, but it should be up there.

ChrisHemedinger · ‎06-26-2020

@PaigeMiller I see what you mean. Articles in the library can have multiple contributors, so they are listed on the side in the "Contributors" widget. But I take your meaning -- we could have the "primary" author list at top so as to provide a by-line appearance.

PaigeMiller · ‎06-26-2020

Yes, that would be useful. Although I don't see why you couldn't list all contributors in a by-line.

It is unnatural (and did not occur to me) to scroll down and look in the right-side column to find the name of the author. In almost every other type of publication (newspaper, magazine, blog, internet forum) the author's name is immediately under or immediately next to the title, or even in the case of the rest of SAS Communities, the author's name is directly above the article's title.

jozgot · ‎06-26-2020

@hacla Your note and other encouraging people inspired me to research and write a blog about SAS early history! To me the most striking difference of now vs then is just how many different types of people in different industries, countries, levels of math or coding knowledge use SAS. In the early early days it was agriculture statisticians in the Southeast US.

SAS Communities Library