
The What and the Why of Model Hierarchy


A question about model hierarchy came up recently while I was teaching the SAS class Predictive Modeling Using Logistic Regression.  In class, we were using the SELECTION=SCORE and BEST= options of PROC LOGISTIC to investigate all 1.12 quadrillion possible statistical models that can be formed from 50 effects.  These options produce a list of models of varying sizes, from 1 effect to 50 effects, showing the single best model of each size ("best" meaning the highest score chi-square statistic). The 3-effect model with the highest score chi-square statistic was y= X1 + X2 + X3*X4, but we ignored it because it is not a hierarchically well-formulated model.  Not everyone was familiar with model hierarchy and why it is important, so in this post I'll explain why it's best practice to maintain model hierarchy in your regression models.
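
For reference, here is a minimal sketch of that kind of best-subsets run. The data set name, response, and effect names below are hypothetical placeholders; the point is the SELECTION=SCORE, BEST=, START=, and STOP= options:

*best-subsets (score) selection: show the single best model of each size, ranked by score chi-square;
proc logistic data=train;
 model y(event='1') = x1-x50 / selection=score best=1 start=1 stop=50;
run;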

 

What does it mean to maintain model hierarchy?

 

Maintaining model hierarchy means regression models with interactions (e.g., X1*X2) include the corresponding main effects (e.g., X1 and X2) even if they are nonsignificant.  Model hierarchy also applies to polynomial models in that a model with higher order polynomial terms (e.g., X³) includes all lower order terms (e.g., X² and X) even if these are nonsignificant. Models that maintain model hierarchy are sometimes referred to as "hierarchically well-formulated models".

 

So model hierarchy is maintained in these models:

 

Y= A + B + C + D + C*D

 

Y= A + B + C + B*C

 

Y= X³ + X² + X

 

Y= X² + X

 

but not these models:

 

Y= C*D

 

Y= A + B + B*C

 

Y= X⁴

 

Y= X³ + X²
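
In SAS procedures such as PROC GLM, the bar operator is a convenient way to keep a model hierarchically well formulated, because it automatically expands to all lower order effects: C|D becomes C, D, and C*D. Here is a minimal sketch of the first model above, with hypothetical data set and variable names:

*the bar operator expands C|D to C, D, and C*D, so hierarchy is maintained;
proc glm data=mydata;
 model y = a b c|d;
run;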

 

 

Why is it best practice to maintain model hierarchy?

 

There are several reasons to use hierarchically well-formulated models in your business or research.  In brief: (1) they allow fine-tuning of the fit of the model to your data, (2) omitting the lower order terms makes strong (often incorrect!) assumptions about relationships in your data, (3) they keep inferences about the interaction effect invariant to how the predictors are coded, and (4) not maintaining hierarchy opens the door to absurdly complex models.

 

Maintaining hierarchy allows fine-tuning the fit of the model to your data

 

Often, we’re trying to fit a regression model to our data without having strong theory telling us what the model should be.  We see a Y vs X scatter plot showing a straight pattern, so we include the variable X. If we see a bend in the pattern, we include X². But while X² captures the curvature, including X provides additional useful information about the shape of the response function.  For example, in the picture below, we can see how changing the coefficient for X in y = X² + bX changes the shape of the X-Y relationship:

 

[Figure: curves of y = X² + bX for several values of the coefficient b. Modified from Wikipedia: https://commons.wikimedia.org/wiki/File:Quadratic_equation_coefficients.png]

 


 

Keeping bX in the model gives the model shape greater flexibility. So, including the lower order terms will likely let your models fit better than leaving them out.
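
If you want to reproduce a picture like the one above, here is a minimal sketch; the grid of b and x values is arbitrary:

*plot y = x**2 + b*x for several values of the coefficient b;
data curves;
 do b = -3 to 3 by 1.5;
  do x = -4 to 4 by 0.1;
   y = x**2 + b*x;
   output;
  end;
 end;
run;

proc sgplot data=curves;
 series x=x y=y / group=b;
run;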

 

Omitting the lower order terms makes strong assumptions about relationships in your data

 

Let’s start off with a regression model with 2 predictors and their interaction:

 

y = b0 + b1X1 + b2X2 + b3X1X2 + e

 

When dealing with an interaction model such as the one above, we are often taught not to interpret the coefficients b1 and b2 when the interaction is significant.  This is because a significant 2-way interaction means the effect of one predictor on the response depends on the other predictor, so it doesn’t make sense to interpret either of the interacting predictors alone. While this is true, the coefficients b1 and b2 do have a meaning in this interaction model, and it is not the average effect of the predictors. The meanings of the coefficients are:

 

b1 = the effect of a unit increase in X1 when X2=0

 

b2 = the effect of a unit increase in X2 when X1=0.

 

[Note: the above interpretation only makes sense when X1 and X2 have true zero points (i.e., are on ratio scale) as in measures of size or distance. Interval scaled continuous variables have arbitrary zero points, so it would not be meaningful to interpret the effect of one variable while another variable was set to its arbitrary zero point.]
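
One practical way to make these coefficients interpretable is to recode a predictor so that zero corresponds to a meaningful value. Here is a minimal sketch with hypothetical data set and variable names (the reference value 10 is arbitrary):

*shift x2 so that zero corresponds to a meaningful reference value;
data recoded;
 set mydata;
 x2_ref = x2 - 10;
run;

*after the shift, b1 estimates the effect of a unit increase in x1 when x2 = 10;
proc glm data=recoded;
 model y = x1|x2_ref;
run;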

 

Leaving X1 and X2 out of your model is equivalent to insisting that b1 and b2 both equal zero.  That is a strong assumption: it says X1 has zero effect on Y when X2 equals zero (and vice versa), even though X1 and X2 are still important to the interaction.  Even if an effect is not statistically different from zero, if it is not exactly zero, leaving it out will bias your other parameter estimates. Remember that a nonsignificant statistical test does not mean that the null hypothesis is true.  In short, this assumption shouldn’t be made without a justification for it.

 

When is it justified? I found it difficult to think of an example when this would be reasonable but here goes: you know those epoxy adhesives that come in two tubes?  When the two liquids are combined, they form a strong epoxy adhesive that will glue practically anything to anything.  A model of adhesive strength (y) that included the amount of epoxy components (X1 and X2) could have zero coefficients of the main effects (since they are inert alone) but still have a strong interaction effect.  Without strong justification, you’re better off maintaining model hierarchy and keeping the main effects in your model.
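
To see the bias in action, here is a small simulation sketch. The data, coefficients, and seed are all made up for illustration: the true interaction coefficient is 0.5, but the interaction-only fit forces the main effects to zero, so its interaction estimate absorbs them and lands well away from 0.5.

*simulate data in which the main effects are truly nonzero;
data sim;
 call streaminit(1234);
 do i = 1 to 1000;
  x1 = rand('normal', 5, 1);
  x2 = rand('normal', 5, 1);
  y = 1 + 2*x1 + 3*x2 + 0.5*x1*x2 + rand('normal', 0, 1);
  output;
 end;
run;

*interaction-only model: the x1*x2 estimate absorbs the omitted main effects;
proc glm data=sim;
 model y = x1*x2;
run;

*hierarchically well-formulated model: estimates are close to the true values;
proc glm data=sim;
 model y = x1|x2;
run;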

 

Keeping lower order terms makes interaction effect inferences invariant to coding

 

Model hierarchy is desirable because models that are hierarchically well formulated have inferences that are invariant to the coding that you choose for your predictors.  One common way to recode predictors is to center them.  Centering means subtracting the average from each value so that the transformed variable has a mean of zero. There are a variety of reasons to center data including reducing collinearity when interactions are present and making the intercept term more meaningful.  Let’s see how centering affects inferences for models that do and don’t maintain model hierarchy.

 

For this example, I’ll use the data set sashelp.cars.  I’ll use PROC STDIZE to create centered versions of the predictors Cylinders and MPG_City and save them to an output data set called centered. Then I’ll create 4 regression models, each using the response variable MSRP.  The predictors for the first model are Cylinders, MPG_City, and their interaction.  The second model is a version of the first model that uses centered variables.  Model three is not a hierarchically well-formulated model—it only has the Cylinders * MPG_City interaction and is missing main effects. The fourth model is the same as model three, except that centered variables are used.

 

The code used is shown below.  What happens to the coefficient of determination (R²) and the p-values for the interactions when the variables are centered?

 

*create centered versions of Cylinders and MPG_City (prefixed centered_) alongside the originals;
proc stdize data=sashelp.cars out=centered (keep=msrp Cylinders MPG_City centered:)
   sprefix=centered_ oprefix;
 var Cylinders MPG_City;
run;

*hierarchically well-formulated model;
proc glm data=centered;
 model msrp=cylinders|mpg_city;
run;
 
*hierarchically well-formulated model (centered variables);
proc glm data=centered;
 model msrp=centered_cylinders|centered_mpg_city;
run;

*no model hierarchy;
proc glm data=centered;
 model msrp=cylinders*mpg_city;
run;

*no model hierarchy (centered variables);
proc glm data=centered;
 model msrp=centered_cylinders*centered_mpg_city;
run;

 

Here is the output from models 1 through 4.  Focus your attention on the R² and the interaction p-values from the Type 3 SS tables:

 

Model 1

 

[PROC GLM output for Model 1]

 

Model 2

 

[PROC GLM output for Model 2]

 

Model 3

 

[PROC GLM output for Model 3]

 

Model 4

 

[PROC GLM output for Model 4]

 

Centering the variables does not affect the interaction p-value or the model R² for the hierarchically well-formulated model.  But in the interaction models without main effects, centering the predictors changes the R² from 0.07 to 0.005 and changes the interaction p-value from p=0.01 to p=0.14. This is not a desirable outcome!

 

We can see why omitting constituent terms from an interaction is problematic if we can tolerate a little algebra.  Let’s start with the model y = b0 + b1X1 + b2X2 + b3X1X2, then add constant c to X1 and constant d to X2.

 

y = b0 + b1(X1 + c) + b2(X2 + d) + b3(X1 + c)(X2 + d)

 

  = b0 + b1X1 + b1c + b2X2 + b2d + b3X1X2 + b3dX1 + b3cX2 + b3cd

 

Next, collect all the constant terms (everything that does not involve an X), then all the functions of X1, then functions of X2, then functions of X1X2:

 

  = b0 + b1c + b3cd + b2d + b1X1 + b3dX1 + b2X2 + b3cX2 + b3X1X2

 

and simplify:

 

  = (b0 + b1c + b3cd + b2d) + (b1 + b3d)X1 + (b2 + b3c)X2 + (b3)X1X2

 

The sums in parentheses are an intercept, the coefficient for X1, the coefficient for X2, and the coefficient for the X1X2 interaction using the transformed variables. We can fit this model the same way we fit the original model y = b0 + b1X1 + b2X2 + b3X1X2, since it has two main effects and their interaction. Compare this to the model y = b3X1X2 (with no main effects) when we add the constants c and d to X1 and X2, respectively:

 

y= b3(X1 + c)(X2 + d)

 

  = (b3cd) + (b3d)X1 + (b3c)X2 + (b3)X1X2

 

The parentheses were added in the last step above to make it easier to see that this is a regression with an intercept, a main effect of X1, a main effect of X2, and an X1X2 interaction effect.  But we can’t estimate this model if we try to fit y = b3X1X2, because that model has no main effects.  Adding a constant in the interaction model without hierarchy creates linear effects of the predictors that aren’t captured by the model being fit. These main effects aren’t being estimated and will get lumped in with the interaction.  As we’ve already seen, adding a constant changes the R² and p-values for the non-hierarchical model.  We don’t want this!

 

The above explanation can be found in several papers such as Paul D. Allison’s (1977) “Testing for Interaction in Multiple Regression”. This is the same Allison who wrote my favorite books on logistic regression and survival analysis (Logistic Regression Using SAS: Theory and Application, Second Edition and Survival Analysis Using SAS: A Practical Guide, Second Edition).  Check out Allison’s paper for a more thorough explanation of model hierarchy and invariance to coding.

 

Not maintaining model hierarchy opens the door to absurdly complex models

 

Let’s look back at the p-value for the model 1 interaction from the Type 3 sums of squares table.  The p-value is 0.0119, and if we’re using alpha=0.05 as our significance threshold, this tells us that the interaction is statistically significant.  But remember what Type 3 (aka partial) sums of squares show: the additional sums of squares explained by an effect when it is added to a model containing all the other model effects.  In other words, the p-value tells us whether the effect is significant after explaining as much as possible using all the remaining variables. So, the significant Cylinders*MPG_City interaction is telling us that after accounting for the main effects, the interaction is still important.

 

What if we leave out the main effects, as in model 3? The significance of the interaction is then no longer assessed after accounting for the main effects.  If model 3 is all we look at, we never learn whether a simpler, main-effects-only model could have explained the data.
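
If we want that comparison, we can fit the simpler model ourselves. Here is a quick sketch that reuses the centered data set created earlier:

*main-effects-only model, for comparison with the interaction models;
proc glm data=centered;
 model msrp = cylinders mpg_city;
run;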

 

Researchers normally follow the principle of parsimony, or Occam’s Razor, which basically says to use the simplest model that gets the job done. There’s no sense in using the model msrp = MPG_City⁵, even though it is statistically significant (p=0.029), when the hierarchically well-formulated quadratic model msrp = MPG_City + MPG_City² (p<0.001) gets the job done and leaves all higher order terms nonsignificant.  And if we’re willing to use a 5th-degree polynomial without the lower order terms, why not use y = X¹⁷? The model y = X¹⁷ is needlessly, even absurdly, complex when a much simpler model does the job.  The same logic applies to modeling an interaction without the main effects.
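
Here is a sketch of that polynomial comparison using the same sashelp.cars data as before; the variable names mpg2 and mpg5 are created here just for illustration:

*create squared and fifth-power terms for MPG_City;
data cars2;
 set sashelp.cars;
 mpg2 = mpg_city**2;
 mpg5 = mpg_city**5;
run;

*hierarchically well-formulated quadratic model;
proc glm data=cars2;
 model msrp = mpg_city mpg2;
run;

*fifth-power term alone: no model hierarchy;
proc glm data=cars2;
 model msrp = mpg5;
run;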

 

Now you know why to use hierarchically well-formulated models. If what you read here was intriguing, there are several places to go for more information. For learning effective model selection approaches with PROC LOGISTIC, check out the course Predictive Modeling Using Logistic Regression. For a deeper understanding of the do’s and don’ts of statistical inferences in regression modeling try the course Statistics 2: ANOVA and Regression. For a foundational course on many of the same topics, try Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression.  If you are interested in a foundational statistics course geared towards predictive modeling and machine learning, try out Statistics You Need to Know for Machine Learning. Live Instructor Dates for these classes, and more, can be found by using the links provided.

 

See you in the next SAS class!

 

References

 

Allison, Paul D. 1977. "Testing for Interaction in Multiple Regression." American Journal of Sociology 83(1): 144–153.

 

Allison, Paul D. 2010. Survival Analysis Using SAS®: A Practical Guide, Second Edition. Cary, NC: SAS Institute Inc.

 

Allison, Paul D. 2012. Logistic Regression Using SAS®: Theory and Application, Second Edition. Cary, NC: SAS Institute Inc.

 

 

Find more articles from SAS Global Enablement and Learning here.
