
The What and the Why of Model Hierarchy


A question about model hierarchy came up recently while I was teaching the SAS class Predictive Modeling Using Logistic Regression.  In class, we were using the SELECTION=SCORE and BEST= options of PROC LOGISTIC to investigate all 1.12 quadrillion possible statistical models that can be formed from 50 effects.  These options produce a list of models of varying sizes, from 1 effect to 50 effects, showing the single best model of each size ("best" meaning the highest score chi-square statistic). The 3-effect model with the highest score chi-square statistic was y= X1 + X2 + X3*X4, but we ignored it because it is not a hierarchically well-formulated model.  Not everyone was familiar with model hierarchy and why it is important, so in this post I'll explain why it's best practice to maintain model hierarchy in your regression models.
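
For reference, here is a minimal sketch of that kind of best-subsets run. The data set name, response, and effect names below are hypothetical placeholders; the point is the SELECTION=SCORE, BEST=, START=, and STOP= options:

*best-subsets (score) selection: show the single best model of each size, ranked by score chi-square;
proc logistic data=train;
 model y(event='1') = x1-x50 / selection=score best=1 start=1 stop=50;
run;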

 

What does it mean to maintain model hierarchy?

 

Maintaining model hierarchy means regression models with interactions (e.g., X1*X2) include the corresponding main effects (e.g., X1 and X2) even if they are nonsignificant.  Model hierarchy also applies to polynomial models in that a model with higher order polynomial terms (e.g., X³) includes all lower order terms (e.g., X² and X) even if these are nonsignificant. Models that maintain model hierarchy are sometimes referred to as "hierarchically well-formulated models".

 

So model hierarchy is maintained in these models:

 

Y= A + B + C + D + C*D

 

Y= A + B + C + B*C

 

Y= X³ + X² + X

 

Y= X² + X

 

but not these models:

 

Y= C*D

 

Y= A + B + B*C

 

Y= X⁴

 

Y= X³ + X²
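
In SAS procedures such as PROC GLM, the bar operator is a convenient way to keep a model hierarchically well formulated, because it automatically expands to all lower order effects: C|D becomes C, D, and C*D. Here is a minimal sketch of the first model above, with hypothetical data set and variable names:

*the bar operator expands C|D to C, D, and C*D, so hierarchy is maintained;
proc glm data=mydata;
 model y = a b c|d;
run;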

 

 

Why is it best practice to maintain model hierarchy?

 

There are several reasons to use hierarchically well-formulated models in your business or research.  In brief: (1) they allow fine-tuning of the fit of the model to your data, (2) omitting the lower order terms makes strong (often incorrect!) assumptions about relationships in your data, (3) they keep inferences about the interaction effect invariant to how the predictors are coded, and (4) not maintaining hierarchy opens the door to absurdly complex models.

 

Maintaining hierarchy allows fine-tuning the fit of the model to your data

 

Often, we’re trying to fit a regression model to our data without having strong theory telling us what the model should be.  We see a Y vs X scatter plot showing a straight pattern, so we include the variable X. If we see a bend in the pattern, we include X². But while X² captures the curvature, including X provides additional useful information about the shape of the response function.  For example, in the picture below, we can see how changing the coefficient for X in y = X² + bX changes the shape of the X-Y relationship:

 

[Figure: curves of y = X² + bX for several values of the coefficient b. Modified from Wikipedia: https://commons.wikimedia.org/wiki/File:Quadratic_equation_coefficients.png]

 


 

Keeping bX in the model gives the model shape greater flexibility. So, including the lower order terms will likely let your models fit better than leaving them out.
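
If you want to reproduce a picture like the one above, here is a minimal sketch; the grid of b and x values is arbitrary:

*plot y = x**2 + b*x for several values of the coefficient b;
data curves;
 do b = -3 to 3 by 1.5;
  do x = -4 to 4 by 0.1;
   y = x**2 + b*x;
   output;
  end;
 end;
run;

proc sgplot data=curves;
 series x=x y=y / group=b;
run;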

 

Omitting the lower order terms makes strong assumptions about relationships in your data

 

Let’s start off with a regression model with 2 predictors and their interaction:

 

y = b0 + b1X1 + b2X2 + b3X1X2 + e

 

When dealing with an interaction model such as the one above, we are often taught not to interpret the coefficients b1 and b2 when the interaction is significant.  This is because a significant 2-way interaction means the effect of one predictor on the response depends on the other predictor, so it doesn’t make sense to interpret either of the interacting predictors alone. While this is true, the coefficients b1 and b2 do have a meaning in this interaction model, and it is not the average effect of the predictors. The meanings of the coefficients are:

 

b1 = the effect of a unit increase in X1 when X2=0

 

b2 = the effect of a unit increase in X2 when X1=0.

 

[Note: the above interpretation only makes sense when X1 and X2 have true zero points (i.e., are on ratio scale) as in measures of size or distance. Interval scaled continuous variables have arbitrary zero points, so it would not be meaningful to interpret the effect of one variable while another variable was set to its arbitrary zero point.]
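
One practical way to make these coefficients interpretable is to recode a predictor so that zero corresponds to a meaningful value. Here is a minimal sketch with hypothetical data set and variable names (the reference value 10 is arbitrary):

*shift x2 so that zero corresponds to a meaningful reference value;
data recoded;
 set mydata;
 x2_ref = x2 - 10;
run;

*after the shift, b1 estimates the effect of a unit increase in x1 when x2 = 10;
proc glm data=recoded;
 model y = x1|x2_ref;
run;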

 

Leaving X1 and X2 out of your model is equivalent to insisting that b1 and b2 both equal zero.  That is a strong assumption: it says X1 has zero effect on Y when X2 equals zero (and vice versa), even though X1 and X2 are still important to the interaction.  Even if an effect is not statistically different from zero, if it is not exactly zero, leaving it out will bias your other parameter estimates. Remember that a nonsignificant statistical test does not mean that the null hypothesis is true.  In short, this assumption shouldn’t be made without a justification for it.

 

When is it justified? I found it difficult to think of an example when this would be reasonable but here goes: you know those epoxy adhesives that come in two tubes?  When the two liquids are combined, they form a strong epoxy adhesive that will glue practically anything to anything.  A model of adhesive strength (y) that included the amount of epoxy components (X1 and X2) could have zero coefficients of the main effects (since they are inert alone) but still have a strong interaction effect.  Without strong justification, you’re better off maintaining model hierarchy and keeping the main effects in your model.
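
To see the bias in action, here is a small simulation sketch. The data, coefficients, and seed are all made up for illustration: the true interaction coefficient is 0.5, but the interaction-only fit forces the main effects to zero, so its interaction estimate absorbs them and lands well away from 0.5.

*simulate data in which the main effects are truly nonzero;
data sim;
 call streaminit(1234);
 do i = 1 to 1000;
  x1 = rand('normal', 5, 1);
  x2 = rand('normal', 5, 1);
  y = 1 + 2*x1 + 3*x2 + 0.5*x1*x2 + rand('normal', 0, 1);
  output;
 end;
run;

*interaction-only model: the x1*x2 estimate absorbs the omitted main effects;
proc glm data=sim;
 model y = x1*x2;
run;

*hierarchically well-formulated model: estimates are close to the true values;
proc glm data=sim;
 model y = x1|x2;
run;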

 

Keeping lower order terms makes interaction effect inferences invariant to coding

 

Model hierarchy is desirable because models that are hierarchically well formulated have inferences that are invariant to the coding that you choose for your predictors.  One common way to recode predictors is to center them.  Centering means subtracting the average from each value so that the transformed variable has a mean of zero. There are a variety of reasons to center data including reducing collinearity when interactions are present and making the intercept term more meaningful.  Let’s see how centering affects inferences for models that do and don’t maintain model hierarchy.

 

For this example, I’ll use the data set sashelp.cars.  I’ll use PROC STDIZE to create centered versions of the predictors Cylinders and MPG_City and save them to an output data set called centered. Then I’ll create 4 regression models, each using the response variable MSRP.  The predictors for the first model are Cylinders, MPG_City, and their interaction.  The second model is a version of the first model that uses centered variables.  Model three is not a hierarchically well-formulated model—it only has the Cylinders * MPG_City interaction and is missing main effects. The fourth model is the same as model three, except that centered variables are used.

 

The code used is shown below.  What happens to the coefficient of determination (R²) and the p-values for the interactions when the variables are centered?

 

*create centered versions of Cylinders and MPG_City (prefixed centered_) alongside the originals;
proc stdize data=sashelp.cars out=centered (keep=msrp Cylinders MPG_City centered:)
   sprefix=centered_ oprefix;
 var Cylinders MPG_City;
run;

*hierarchically well-formulated model;
proc glm data=centered;
 model msrp=cylinders|mpg_city;
run;
 
*hierarchically well-formulated model (centered variables);
proc glm data=centered;
 model msrp=centered_cylinders|centered_mpg_city;
run;

*no model hierarchy;
proc glm data=centered;
 model msrp=cylinders*mpg_city;
run;

*no model hierarchy (centered variables);
proc glm data=centered;
 model msrp=centered_cylinders*centered_mpg_city;
run;

 

Here is the output from models 1 through 4.  Focus your attention on the R² and the interaction p-values from the Type 3 SS tables:

 

Model 1

 

[PROC GLM output for Model 1]

 

Model 2

 

[PROC GLM output for Model 2]

 

Model 3

 

[PROC GLM output for Model 3]

 

Model 4

 

[PROC GLM output for Model 4]

 

Centering the variables does not affect the interaction p-value or the model R² for the hierarchically well-formulated model.  But in the interaction models without main effects, centering the predictors changes the R² from 0.07 to 0.005 and changes the interaction p-value from p=0.01 to p=0.14. This is not a desirable outcome!

 

We can see why omitting constituent terms from an interaction is problematic if we can tolerate a little algebra.  Let’s start with the model y = b0 + b1X1 + b2X2 + b3X1X2, then add constant c to X1 and constant d to X2.

 

y = b0 + b1(X1 + c) + b2(X2 + d) + b3(X1 + c)(X2 + d)

 

  = b0 + b1X1 + b1c + b2X2 + b2d + b3X1X2 + b3dX1 + b3cX2 + b3cd

 

Next, collect all the constant terms (everything that does not involve an X), then all the functions of X1, then functions of X2, then functions of X1X2:

 

  = b0 + b1c + b3cd + b2d + b1X1 + b3dX1 + b2X2 + b3cX2 + b3X1X2

 

and simplify:

 

  = (b0 + b1c + b3cd + b2d) + (b1 + b3d)X1 + (b2 + b3c)X2 + (b3)X1X2

 

The sums in parentheses are an intercept, the coefficient for X1, the coefficient for X2, and the coefficient for the X1X2 interaction using the transformed variables. We can fit this model the same way we fit the original model y = b0 + b1X1 + b2X2 + b3X1X2, since it has two main effects and their interaction. Compare this to the model y = b3X1X2 (with no main effects) when we add the constants c and d to X1 and X2, respectively:

 

y= b3(X1 + c)(X2 + d)

 

  = (b3cd) + (b3d)X1 + (b3c)X2 + (b3)X1X2

 

The parentheses were added in the last step above to make it easier to see that this is a regression with an intercept, a main effect of X1, a main effect of X2, and an X1X2 interaction effect.  But we can’t estimate this model if we try to fit y = b3X1X2, because that model has no main effects.  Adding a constant in the interaction model without hierarchy creates linear effects of the predictors that aren’t captured by the model being fit. These main effects aren’t being estimated and will get lumped in with the interaction.  As we’ve already seen, adding a constant changes the R² and p-values for the non-hierarchical model.  We don’t want this!

 

The above explanation can be found in several papers such as Paul D. Allison’s (1977) “Testing for Interaction in Multiple Regression”. This is the same Allison who wrote my favorite books on logistic regression and survival analysis (Logistic Regression Using SAS: Theory and Application, Second Edition and Survival Analysis Using SAS: A Practical Guide, Second Edition).  Check out Allison’s paper for a more thorough explanation of model hierarchy and invariance to coding.

 

Not maintaining model hierarchy opens the door to absurdly complex models

 

Let’s look back at the p-value for the model 1 interaction from the Type 3 sums of squares table.  The p-value is 0.0119, and if we’re using alpha=0.05 as our significance threshold, this tells us that the interaction is statistically significant.  But remember what Type 3 (aka partial) sums of squares show: the additional sums of squares explained by an effect when it is added to a model containing all the other model effects.  In other words, the p-value tells us whether the effect is significant after explaining as much as possible using all the remaining variables. So, the significant Cylinders*MPG_City interaction is telling us that after accounting for the main effects, the interaction is still important.

 

What if we leave out the main effects, as in model 3? The significance of the interaction is then no longer assessed after accounting for the main effects.  If model 3 is all we look at, we never learn whether a simpler, main-effects-only model could have explained the data.
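
If we want that comparison, we can fit the simpler model ourselves. Here is a quick sketch that reuses the centered data set created earlier:

*main-effects-only model, for comparison with the interaction models;
proc glm data=centered;
 model msrp = cylinders mpg_city;
run;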

 

Researchers normally follow the principle of parsimony, or Occam’s Razor, which basically says to use the simplest model that gets the job done. There’s no sense in using the model msrp = MPG_City⁵, even though it is statistically significant (p=0.029), when the hierarchically well-formulated quadratic model msrp = MPG_City + MPG_City² (p<0.001) gets the job done and leaves all higher order terms nonsignificant.  And if we’re willing to use a 5th-degree polynomial without the lower order terms, why not use y = X¹⁷? The model y = X¹⁷ is needlessly, even absurdly, complex when a much simpler model does the job.  The same logic applies to modeling an interaction without the main effects.
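
Here is a sketch of that polynomial comparison using the same sashelp.cars data as before; the variable names mpg2 and mpg5 are created here just for illustration:

*create squared and fifth-power terms for MPG_City;
data cars2;
 set sashelp.cars;
 mpg2 = mpg_city**2;
 mpg5 = mpg_city**5;
run;

*hierarchically well-formulated quadratic model;
proc glm data=cars2;
 model msrp = mpg_city mpg2;
run;

*fifth-power term alone: no model hierarchy;
proc glm data=cars2;
 model msrp = mpg5;
run;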

 

Now you know why to use hierarchically well-formulated models. If what you read here was intriguing, there are several places to go for more information. For learning effective model selection approaches with PROC LOGISTIC, check out the course Predictive Modeling Using Logistic Regression. For a deeper understanding of the do’s and don’ts of statistical inferences in regression modeling try the course Statistics 2: ANOVA and Regression. For a foundational course on many of the same topics, try Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression.  If you are interested in a foundational statistics course geared towards predictive modeling and machine learning, try out Statistics You Need to Know for Machine Learning. Live Instructor Dates for these classes, and more, can be found by using the links provided.

 

See you in the next SAS class!

 

References

 

Allison, Paul D. 1977. "Testing for Interaction in Multiple Regression." American Journal of Sociology 83(1): 144–153.

 

Allison, Paul D. 2010. Survival Analysis Using SAS®: A Practical Guide, Second Edition. Cary, NC: SAS Institute Inc.

 

Allison, Paul D. 2012. Logistic Regression Using SAS®: Theory and Application, Second Edition. Cary, NC: SAS Institute Inc.

 

 

Find more articles from SAS Global Enablement and Learning here.
