
A Gentle Introduction to Structural Equation Models, Part 2: Linear Regression


 

This is the second in a multi-part series on structural equation models, or SEMs.

 

How does SEM work?

 

A structural equation model (SEM) is a modeling technique for explaining and testing hypotheses about complex relationships among variables (observed and unobserved) that make up a system or phenomenon. SEMs trace their lineage back to psychometrics, econometrics, and biometrics, and they are especially useful for directly testing a complex hypothesis of interest in one go.

 

If you are new to SEM or to PROC CALIS, I suggest going back to read Part 1. It’s cool, I’ll wait until you come back.

 

In this post, you’ll learn to reframe regression models as path diagrams, and then add constraints to your working hypothesis or theory, going beyond what OLS regression can easily do. Finally, you will learn to use the PATH language in PROC CALIS to express a model as a set of graphical components that represent your new-and-improved theory specification.

 

A linear regression example

 

Let’s consider linear regression with one predictor:

 

[Image: the simple linear regression equation, Y = β0 + β1X1 + ε]

 


 

If you wanted to represent this in a picture, you could put each variable in a box, and use a single-headed arrow to show the predictive direction.

 

[Image: path diagram with X1 in a box and a single-headed arrow pointing to Y]

 

Now, let’s make it a multiple regression, extending to 2 predictors:

 

[Image: the multiple regression equation with two predictors, Y = β0 + β1X1 + β2X2 + ε]

 

Expressing a regression model as a diagram

 

We can make our picture a little richer. Let’s represent the random error in the model as a circle pointing to the response, Y. We can also use a double-headed arrow to represent the error variance. For completeness, we’ll add double-headed arrows to represent the variance of the predictors, like this:

 

[Image: path diagram of the two-predictor regression, with an error circle pointing to Y and double-headed arrows for the error variance and the variances of X1 and X2]

 

That’s a diagram of a multiple linear regression with 2 predictors. Extending the example to a third predictor:

 

[Image: the multiple regression equation with three predictors, Y = β0 + β1X1 + β2X2 + β3X3 + ε]

 

Can you envision what this diagram will look like? Well, we can make the picture even more informative. In multiple linear regression, the predictors are often correlated (as you well know if you’re someone whose data suffers from collinearity!). This time, let’s show the covariances among the predictors on our diagram by adding some more double-headed arrows:

 

[Image: path diagram of the three-predictor regression, with double-headed arrows showing the covariances among X1, X2, and X3]

 

Writing the code for linear regression

 

These simple examples are easily fitted with any linear regression software, such as PROC REG:

 

PROC REG;
    MODEL y = x1;
    MODEL y = x1 x2;
    MODEL y = x1 x2 x3;
RUN;
QUIT;

You can also specify these models in PROC CALIS. For example, here’s the code for the model with 3 predictors:

 

PROC CALIS;
        PATH y <=== X1 X2 X3;
        PATHDIAGRAM DIAGRAM = initial EXOGCOV USEERR;
RUN;
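
A quick note on that PATHDIAGRAM statement: DIAGRAM=INITIAL requests a diagram of the initial model specification (rather than one annotated with fitted estimates), EXOGCOV displays the covariances among the exogenous predictors, and USEERR displays the error terms explicitly, so the rendered diagram looks like the pictures above.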

 

But there’s so much more that you can do with SEM and a regression model.

 

Constraints on a regression model

 

Suppose, for the model above, you have a hypothesis that Y is predicted by all 3 variables, X1, X2, and X3, but X1 and X2 are, in theory, completely independent of X3. Regression will certainly let predictors be correlated or uncorrelated and the analysis isn’t bothered either way. But, if independence of the predictors is central to your theory, then you will want to specify a model with 2 constraints:

 

[Image: path diagram of the three-predictor regression, with zeroes fixed on the covariances between X1 and X3 and between X2 and X3]

 

See those zeroes on the left in the diagram? They represent constraints you place on the mathematical representation of your theory: that X1 and X2 are each uncorrelated with X3. Now you can test the hypothesis that this constrained model is a reasonably good fit to your data. To put it a little more accurately but more confusingly, you can test the hypothesis that the constrained model is not significantly different from the unconstrained covariance matrix of the variables.

 

SEM tests hypotheses by operating on a variance-covariance matrix of all the variables in the analysis. Every constraint you place on the model, whether a structural zero, an equality constraint, or something else, puts a fixed value into that variance-covariance matrix.

 

But first, let’s talk about information units.

 

In linear regression, the fundamental unit of information is a record (or row, case, observation) of the raw data set. This might be a patient, a customer, an account, a plant, etc. In SEM, however, the fundamental unit of information is a non-redundant element of the variance-covariance matrix of the data.

 

Take a variance-covariance matrix of 4 variables, X1-X4.

 

[Image: the 4x4 variance-covariance matrix of X1-X4]

 

There are 4x4 = 16 elements in that matrix. But since the matrix is symmetric, only 10 of those elements are unique: there are 4 variances and 6 unique covariances in this matrix. (In general, a covariance matrix of p variables has p(p+1)/2 non-redundant elements; here, 4(5)/2 = 10.) That means we have 10 information units for analysis.
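
By the way, if you'd like to see this matrix for your own data, PROC CORR will print it. Here's a minimal sketch, assuming a data set named mydata (a hypothetical name) containing X1-X4:

PROC CORR DATA=mydata COV NOCORR;
    /* COV prints the variance-covariance matrix; NOCORR suppresses the correlations */
    VAR X1-X4;
RUN;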

 

SEM analysis compares the full covariance matrix, with 10 information units, to the hypothesized covariance matrix, which has fewer than 10 because we put in fixed values, equality constraints, structural zeroes, etc.

 

Let’s return to the previously hypothesized model:

 

[Image: the same constrained path diagram, with zeroes on the covariances of X1 with X3 and X2 with X3]

 

If I hypothesize that X1 and X3 are unrelated, then I have a structural zero at Cov(1,3) in the hypothesized matrix. The hypothesis that X2 and X3 are unrelated results in another structural zero at Cov(2,3).

 

[Image: the hypothesized covariance matrix, with structural zeroes at Cov(1,3) and Cov(2,3)]

 

How to test goodness of fit

 

Comparing observed and hypothesized matrices probably puts you in the mood for a Chi-square test, and that’s one of the top tools in assessing SEM fit.

 

The degrees of freedom (df) for the Chi-square is the number of information units in the full, unrestricted matrix minus the number of estimated (not fixed) parameters in the hypothesized matrix. In this case, the hypothesized model estimates 8 parameters (3 path coefficients, 3 predictor variances, the one remaining covariance between X1 and X2, and the error variance of Y), so that's 10 - 8 = 2 df. The more constraints you place on the hypothesized model, the larger your Chi-square df will be, and therefore the more powerful that test will be. But with greater power comes greater responsibility: the responsibility to constrain things only in ways that are reasonably well supported by the data and by the real phenomena you have theorized. This is because the more constraints you specify, the harder it is for data, with its random variation, to show good fit. You can get better fit, of course, by constraining fewer parameters.

 

But there’s still no free lunch: the best-fitting model has 0 degrees of freedom, because that’s when the full and hypothesized matrices are identical. A perfect Chi-square of 0! But is it statistically significant? Nope! With 0 df, that test is undefined.

 

An example of code to fit this model is:

 

PROC CALIS;
        PATH y <=== X1 X2 X3;
        PCOV X1 X3 = 0, X2 X3 = 0;
        PATHDIAGRAM DIAGRAM = initial EXOGCOV USEERR;
RUN;

 

The Fit Summary table from PROC CALIS includes dozens of fit statistics for the model, and the results are rich with information. But I’m going to focus on the Chi-Square test of fit here.

 

[Image: Fit Summary table from PROC CALIS, showing the Chi-Square test of model fit]

 

This is a hypothesis that you want to fail to reject! That’s because the overall Chi-Square test hypothesizes that your observed covariance matrix is equal to the hypothesized covariance matrix, with the constraints. If you reject that hypothesis, it means you have a poorly fitting model. In this case, we fail to reject the null hypothesis, which supports your hypothesized model.

 

Assessing the fit of an SEM entails much more than just looking at the Chi-Square test for overall goodness of fit. There are so many different fit statistics that it can be overwhelming trying to determine what to look at. When I assess SEMs, I usually consult a minimum of 4 different fit indices. Each fit index tells you something different about your model fit. It is part of what makes SEMs so flexible.
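
If you'd rather not wade through the full Fit Summary table, the FITINDEX statement lets you trim the output down to a chosen handful of indices. Here's a sketch for the constrained model above; the keywords in the ON(ONLY)= list are just a sample, and the FITINDEX documentation for PROC CALIS lists them all:

PROC CALIS;
        PATH y <=== X1 X2 X3;
        PCOV X1 X3 = 0, X2 X3 = 0;
        /* NOINDEXTYPE drops the index-type column; ON(ONLY)= limits which indices print */
        FITINDEX NOINDEXTYPE ON(ONLY) = [CHISQ DF PROBCH RMSEA BENTLERCFI];
RUN;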

 

There are many other interesting models you could consider in the area of constrained linear regression. For example, perhaps you want to test a hypothesis of equal effects: that the effect of X1 on Y is the same as the effect of X2 and of X3 on Y. This could be specified by giving the parameters the same name, like this:

 

PROC CALIS;
      PATH y <=== X1 = beta1,
           y <=== X2 = beta1,
           y <=== X3 = beta1;
      PATHDIAGRAM DIAGRAM = initial EXOGCOV USEERR;
RUN;
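
If you then want to test the equal-effects constraint itself, one standard approach is a Chi-square difference test: fit the unconstrained and constrained models in separate runs and subtract. Because the models are nested, the difference between the two model Chi-squares is itself approximately Chi-square distributed, with df equal to the difference in the models' df (here, 2, because three free coefficients collapse into one). A sketch:

PROC CALIS;
      /* Unconstrained model: 3 freely estimated coefficients */
      PATH y <=== X1 X2 X3;
RUN;

PROC CALIS;
      /* Constrained model: one common coefficient for all 3 predictors */
      PATH y <=== X1 = beta1,
           y <=== X2 = beta1,
           y <=== X3 = beta1;
RUN;

/* Compare by hand: (constrained Chi-square - unconstrained Chi-square)
   is approximately Chi-square with 2 df */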

 

SEMs are commonly represented as path diagrams and are typically composed of observed variables, latent variables, and error terms whose relationships are characterized by direct and indirect paths, means, covariances, equality constraints, and more. In a future blog post, I’ll show you how to make some path diagrams from increasingly complex models.

 

I hope you enjoyed this, and thanks for reading!

 

 

Find more articles from SAS Global Enablement and Learning here.
