A question appeared on a discussion board that has come up many times before. Here is the situation: a scientist has performed a study in which she wishes to understand the relationship between two continuous variables under three different conditions, A, B, and C. She performed a separate regression analysis for each condition. Upon seeing the results, she had another question: do the differing conditions affect the slope of the regression line? In other words, are the three slopes statistically different? This is a common statistical analysis called analysis of covariance (ANCOVA), but the question comes up often from scientists who have not been exposed to this type of analysis. This post will show how the analysis is completed and how to interpret the parameter estimates of the model.
Let’s start with some fictional data that matches the description of this problem. There are three variables in the dataset: group, x, and y. The group variable has three levels, A, B, and C, representing the different conditions. The scientist wants to explore the relationship between x and y. Here is the DATA step that creates the data.
data groupdata;
input group$ x y @@;
datalines;
A 25 22.2 A 32 22.4 A 22 21.9 A 15 21.3 A 20 21.8
A 28 22.1 A 40 23.1 A 41 22.7 A 30 22.4 A 13 21.4
B 43 26.6 B 30 25.8 B 43 26.5 B 13 25.0 B 18 25.4
B 25 25.6 B 30 25.8 B 28 25.8 B 30 26.0 B 29 25.8
C 32 19.5 C 34 18.5 C 38 21.6 C 25 15.8 C 37 20.5
C 42 21.8 C 32 17.1 C 47 22.3 C 35 19.5 C 20 15.5
;
The scientist performed three separate regressions, one for each level of the group variable. We can reproduce the results using PROC REG.
proc reg data=groupdata;
model y=x;
by group;
run;
quit;
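For readers without SAS at hand, the three per-group fits can be sketched in Python as ordinary least-squares regressions. This is my own illustration, not part of the original analysis; numpy is assumed to be available and the variable names are mine.

```python
import numpy as np

# The fictional data from the DATA step above, keyed by group
data = {
    "A": ([25, 32, 22, 15, 20, 28, 40, 41, 30, 13],
          [22.2, 22.4, 21.9, 21.3, 21.8, 22.1, 23.1, 22.7, 22.4, 21.4]),
    "B": ([43, 30, 43, 13, 18, 25, 30, 28, 30, 29],
          [26.6, 25.8, 26.5, 25.0, 25.4, 25.6, 25.8, 25.8, 26.0, 25.8]),
    "C": ([32, 34, 38, 25, 37, 42, 32, 47, 35, 20],
          [19.5, 18.5, 21.6, 15.8, 20.5, 21.8, 17.1, 22.3, 19.5, 15.5]),
}

slopes = {}
for group, (x, y) in data.items():
    # Degree-1 polyfit returns (slope, intercept) for a simple linear regression
    slope, intercept = np.polyfit(x, y, 1)
    slopes[group] = slope
    print(f"group {group}: y = {intercept:.5f} + {slope:.8f}*x")
```

Running this reproduces the three slope estimates that PROC REG reports for groups A, B, and C.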
The results for group A show a statistically significant relationship, and the slope of the line is estimated as 0.05661.
Similarly, the group B results are statistically significant with a slope of 0.04973. And finally, the group C results are also significant with a slope of 0.29595.
Comparing all three of the slope estimates, there is certainly some difference, but is that difference just due to sampling error? How can we statistically compare the slopes? This is where ANCOVA can help. What is ANCOVA? Well, ANCOVA fits a single model to this data rather than separate models. This has several advantages: it is less work (which I really like!); it is more powerful, since it controls the variability from all covariates simultaneously, which reduces error (especially if the intercepts differ between the groups); and it allows direct comparison and testing of the differing slopes and intercepts. Fitting a single model can be done in multiple ways, but the group variable is a class variable, so PROC REG cannot be used since it does not allow a CLASS statement. We will switch to an old favorite: PROC GLM.
proc glm data=groupdata;
class group;
model y=group x group*x / solution;
run;
quit;
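To see what PROC GLM is doing under the hood, the same single model can be sketched in Python by building the design matrix by hand with reference-cell coding (indicator columns for groups A and B, with C as the zeroed baseline, matching GLM's default). Again, numpy and my column ordering are assumptions, not SAS internals.

```python
import numpy as np

# Same fictional data as in the DATA step, stacked into single vectors
groups = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
x = np.array([25, 32, 22, 15, 20, 28, 40, 41, 30, 13,
              43, 30, 43, 13, 18, 25, 30, 28, 30, 29,
              32, 34, 38, 25, 37, 42, 32, 47, 35, 20], dtype=float)
y = np.array([22.2, 22.4, 21.9, 21.3, 21.8, 22.1, 23.1, 22.7, 22.4, 21.4,
              26.6, 25.8, 26.5, 25.0, 25.4, 25.6, 25.8, 25.8, 26.0, 25.8,
              19.5, 18.5, 21.6, 15.8, 20.5, 21.8, 17.1, 22.3, 19.5, 15.5])

# Reference-cell coding: indicators for A and B; group C is the zeroed baseline
a = np.array([g == "A" for g in groups], dtype=float)
b = np.array([g == "B" for g in groups], dtype=float)

# Columns: intercept, group A, group B, x, x*group A, x*group B
X = np.column_stack([np.ones_like(x), a, b, x, x * a, x * b])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

labels = ["Intercept", "group A", "group B", "x", "x*group A", "x*group B"]
for name, est in zip(labels, beta):
    print(f"{name:>10}: {est:12.8f}")
```

The six solved coefficients line up, term for term, with the parameter estimates that the SOLUTION option prints, which is discussed next.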
The SOLUTION option on the MODEL statement displays the parameter estimates of the model. An advantage of PROC GLM is that it recognizes the ANCOVA setup and provides a plot of the three regression lines overlaid in a single graph.
The plot makes it a bit more obvious that the slope for group C is different. It also seems clear that the three intercepts differ. We will consider the statistical significance of the model terms first.
The testing shows that all of the model terms are significant. The test on the group term tells us that there are differences between the groups. The test on the x term indicates that there is a statistically significant relationship between the x and y variables. Finally, the significant interaction tells us that the relationship between x and y is different for at least one group.
Let’s look at the parameter estimates of the analysis to see where the differences really are.
The parameter estimates have a column of B’s, indicating that those estimates are not uniquely estimable (not, as is sometimes assumed, that they are biased). You will also notice that group C does not have an estimate and neither does x*group C. This is explained in the note below the parameter estimates table.
What this note is saying is that, due to the coding that was done for you behind the scenes (changing the class variable into numeric indicator columns), one of the levels needed to be “zeroed”, or essentially removed from the model. By default, the last level, group C, is removed. That is why there are zeros for those parameter estimates.
Now that we see these estimates, let’s interpret this model. We will start with group C. Because the parameter estimates for group C are all 0, the model simplifies to an intercept of 9.08841060 and the slope for x is 0.29595291. These results match the PROC REG results for just the group C data!
For group A, the estimate is 11.53564624. This is an adjustment to the intercept for group A. The x*group A estimate is -0.23933850, which is an adjustment to the slope of the x-y relationship. So the regression equation for group A is (9.08841060 + 11.53564624) + (0.29595291 - 0.23933850)*x, which simplifies to 20.62405684 + 0.05661441*x. Notice that this exactly matches the regression equation from PROC REG for the group A data. The significance of these terms tells us that both the intercept and the slope for group A are different from those of group C.
Finally, the group B and x*group B terms are significant, indicating that group B is also different from group C. The model is found in a similar fashion: (9.08841060 + 15.30446556) + (0.29595291 - 0.24622544)*x, which simplifies to 24.39287616 + 0.04972747*x and matches the group B results from PROC REG.
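The bookkeeping above is just addition. A few lines of Python (variable names mine) confirm that baseline plus adjustment reproduces each group's own regression equation:

```python
# Parameter estimates from the PROC GLM output above
intercept_c = 9.08841060   # baseline intercept (group C)
slope_c = 0.29595291       # baseline slope for x (group C)
adj_int = {"A": 11.53564624, "B": 15.30446556, "C": 0.0}
adj_slope = {"A": -0.23933850, "B": -0.24622544, "C": 0.0}

for g in "ABC":
    intercept = intercept_c + adj_int[g]   # baseline + intercept adjustment
    slope = slope_c + adj_slope[g]         # baseline + slope adjustment
    print(f"group {g}: y = {intercept:.8f} + {slope:.8f}*x")
```

Group C simply prints the baseline equation, since both of its adjustments are zero.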
To summarize, the intercept and slope estimates are for the baseline group, in this case group C. The non-zero estimates for the grouping variable are adjustments to the intercept of the model, while the non-zero interaction estimates are adjustments to the slope of the relationship we are studying.
Thus, the ANCOVA approach is easier to perform (one model versus many), more powerful (it accounts for all of the variability simultaneously), and leads to exactly the same fitted models as if we had fit each group separately!
If you want to learn more about ANCOVA and some of the additional analyses related to this, you can take the course Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression.
Find more articles from SAS Global Enablement and Learning here.