
A Gentle Intro to SEM, Part 4: Overall Chi-Square Test of Model Fit


The purpose of this post is to explain and illustrate the use of the Chi-square test of absolute model fit for a structural equation model. This is Part 4 in a series about structural equation models (SEM). If SEM is new to you, or you want to catch up with the storyline, then please check out the earlier posts.

 

Assessing your model’s fit is the climactic cinematic moment in every modeler’s day. It’s when you get to see which model will reign supreme, and which will fall. You learn whether the model holds water at all, or is just a washout of random noise. Let’s face it: the only reason most of us go through the painful drudgery of designing a study, collecting data, and then getting the messy data into shape for analysis is so that we can enjoy that brief, thrilling moment when we get to run the modeling code and see WHAT HAPPENED! You know what I’m talking about, don’t you?

 

Assessing Model Fit Depends on the Model

 

Assessing the fit of a model looks different depending on the type of model you are using and the purpose of the model. Continuous responses call for different metrics than categorical responses, and models designed for correct inference require different assessment than models intended for explanation or for prediction.

 

For example, with a linear mixed model where inference is key, I usually start by fitting a relatively full model, and look at the covariance parameter estimates. Then I refine the covariance part of the model using REML-based penalized likelihood statistics, or construct chi-square tests of full and reduced models. Once I am happy with the covariance part of the model, I evaluate the fixed effect tests using F-tests or, if I’m feeling frisky, using ML-based penalized likelihood statistics comparing full and reduced models.

 

Alternatively, if I’m assessing model fit for a predictive model using logistic regression, I might use penalized fit statistics to compare candidate models, and I will likely also look at accuracy-related statistics such as the c-statistic, cumulative lift, F1 score, and so on.

 

Testing Overall Model Fit for SEM

 

When it comes to SEM, there are many types of models you might fit, and many different ways that you can evaluate model fit. In this post, I’ll be dealing with overall model fit, although there are also many statistics for assessing one parameter at a time in your SEM. But one common thread in SEM is that we are analyzing the structure of the variables – the interrelationships among exogenous and endogenous, latent and manifest variables. That means that the assessment metrics won’t look the same as for a model like logistic regression.

 

Testing overall model fit in SEM generally entails comparing the full (sample-based) covariance matrix to the restricted (hypothesized) covariance matrix. You hypothesize a covariance structure by placing restrictions on the unrestricted covariance matrix (and optionally the mean vector), as you’ve seen in earlier posts with PROC CALIS. The most straightforward question we can ask about model fit is, “How well does the estimated covariance matrix, with its structural zeroes, equality constraints, and boundary constraints, reproduce the sample-based covariance matrix of variables in the analysis?”
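For reference, with the default maximum likelihood estimation, the discrepancy between the two matrices is measured by the ML fit function, commonly written as follows, where S is the sample covariance matrix, Sigma(theta) is the model-implied covariance matrix, p is the number of manifest variables, and N is the number of records:

F_ML = ln|Sigma(theta)| - ln|S| + trace( S * Sigma(theta)^(-1) ) - p

The overall test statistic is (N - 1) times the minimized value of F_ML, and it is asymptotically Chi-square distributed when the hypothesized structure holds.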

 

Remember the Chi-square Test for a Frequency Table?

 

Taking a little side-path for a second, this is a bit similar to how we test for bivariate categorical associations. In that case, we create a crosstabulation of observed frequencies from the data, and then build a second table of expected frequencies computed from the marginal counts under the assumption that the two categorical variables are independent. Next, we measure the discrepancy between those two tables and perform a Chi-square test.
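In SAS, that test is one option away. Here is a minimal sketch with PROC FREQ; the data set work.survey and the variables gender and major are hypothetical stand-ins:

proc freq data=work.survey;                /* hypothetical data set                    */
   tables gender*major / chisq expected;   /* Chi-square test, plus expected counts    */
run;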

 

The approach for SEM is pretty similar! There is the sample covariance matrix of the variables, and then the (estimated) model-implied covariance matrix. Take the discrepancy between those, and count the constraints to get the df for a Chi-square test.

 

The df for the Chi-square is an interesting detail. In SEM, each parameter being estimated must have at least one algebraic solution expressing the parameter as a function of the sample covariance estimates. But to test the fit of the model, we want more than that: the constraints on the model must leave more pieces of known information than free parameters, so that there is known information left over for testing this hypothesis of model fit.

 

Model Identification

 

Another side path -- to illustrate this df concept, let’s consider solving linear equations:

 

x+y=15

 

In the single equation above, it is not possible to uniquely solve for x and y given only that their sum is 15. There are 2 unknowns and one known.

 

However, if you are given two equations:

 

x+y=15

3x-y=25

 

then it is possible to uniquely and exactly solve for x and y. In terms of our SEM analogy, that’s the equivalent of a Chi-square with 0 df and is known as a just-identified model. That means we have no restrictions on the model. It’s also equivalent to estimating a regression line with 2 data points, known as a saturated model. It is satisfying from a mathematical perspective because there is only one possible answer, and it perfectly fits the data. However, it’s very unsatisfying in statistical analysis. That’s because in statistics, the parameters are generally not truly known, but are estimated by sampling from a population of interest. Just as a saturated model has a perfect fit to the raw data, a just-identified SEM has a perfect fit to the covariance matrix of input variables. It is preferable to have some uncertainty in the estimates for testing hypotheses.

 

What if we add one more equation?

 

x+y=15

3x-y=25

x+4y=8

 

It is possible to solve, inexactly, for x and y. In the example above, there are two unknowns and three knowns, and the system is overdetermined: no single (x, y) satisfies all three equations, so the best we can do is a least-squares solution. This is akin to estimating a regression line with 3 data points. The solution is not a perfect fit to the data. Although this might be unsatisfying mathematically, it is exactly what is needed for statistical modeling. There is one degree of freedom for testing hypotheses about the goodness of fit of the model to the data.
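If you want to see that inexactness for yourself, here is a minimal SAS/IML sketch (assuming SAS/IML is licensed) that computes the least-squares solution to the three equations and shows that the residuals are not zero:

proc iml;
   A = {1  1,
        3 -1,
        1  4};                     /* coefficients of the three equations   */
   b = {15, 25, 8};                /* right-hand sides                      */
   xhat = inv(A`*A) * A` * b;      /* ordinary least-squares solution       */
   resid = b - A*xhat;             /* nonzero residuals: the fit is inexact */
   print xhat resid;
quit;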

 

So, as we have seen, one condition of being able to construct the overall Chi-square test with df > 0 is that there are more units of information than there are parameters being estimated in the model. The analogy of solving linear equations, and of solving regression, is a useful tool for understanding model identification in SEM. To complete the analogy, consider that the input data for SEM is the sample covariance matrix of the manifest variables. Your hypothesized model is translated into constraints placed on that matrix (usually by placing structural zeros in the hypothesized matrix). Assessment of model fit entails determining whether the sample covariance matrix is close enough to the hypothesized (constrained) covariance matrix that you fail to reject the null hypothesis that the constrained matrix is the covariance matrix for the population.

 

In SEM, the units of information are the nonredundant elements of the sample covariance matrix. If we placed one constraint on a 2x2 covariance matrix (such as fixing the covariance between x1 and x2 to 0), we would have 3 pieces of known information (two variances and one covariance) for estimating 2 parameters (the variances), hence a Chi-square with 1 df.
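For the curious, here is a sketch of how you might fit that tiny model directly using the MSTRUCT modeling language in PROC CALIS, which lets you structure the covariance matrix itself. The data set work.tiny and the variables x1 and x2 are hypothetical, and elements of _COV_ that you do not specify are fixed at zero:

proc calis data=work.tiny;
   mstruct var = x1 x2;        /* directly model the 2x2 covariance matrix */
   matrix _COV_
      [1,1] = v1,              /* free variance of x1                      */
      [2,2] = v2;              /* free variance of x2                      */
   /* [2,1] unspecified: covariance fixed at 0, yielding a 1-df test */
run;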

 

Counting Parameters in SEM

 

To determine the number of parameters being estimated, it is necessary to count all direct paths, variances, and covariances. If you perform mean structure analysis, it is also necessary to count the number of means and intercepts that are estimated. The number of unique elements in the sample covariance matrix is also called the number of observations for the model, and is p(p+1)/2, where p is the number of variables. For a model to be overidentified, the number of estimated parameters must be less than p(p+1)/2.
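Here is a quick DATA step check of that arithmetic, using the counts from the CFA example below (10 manifest variables and 20 estimated parameters):

data _null_;
   p = 10;                  /* number of manifest variables                */
   obs = p*(p+1)/2;         /* nonredundant elements = "observations"      */
   parms = 20;              /* paths, variances, and covariances estimated */
   df = obs - parms;        /* df for the overall Chi-square test          */
   put obs= parms= df=;
run;

The log shows obs=55 parms=20 df=35, matching the output we will see from PROC CALIS.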

 

Example: Testing for Independent Factors in CFA

 

In an earlier post, I showed you how to set up a confirmatory factor analysis using PROC CALIS. Let’s take a quick look at that example, modifying the code to fit a 2-factor model with uncorrelated factors.

 

proc calis data=bse1.stem;
 path
   math_imp ---> I1 I2 I3 I4 I5,
   parents ---> p1 p2 p3 p4 p5;
 pvar
   math_imp = 1,
   parents = 1;
 pcov math_imp parents = 0;
 pathdiagram exogcov label=[math_imp="Importance of Math"
                            parents="Parental Supp. of Math Ed."];
run;

 

In this model, we have 20 estimated parameters: 10 factor loadings and 10 error variance parameters. The output from PROC CALIS also says that we have 55 observations. Do not confuse the number of observations with the number of records used in the analysis, which for this data set is 5,934. In this case, observations is p(p+1)/2. With 10 variables, that’s 10 * 11 / 2 = 55.

 

In the output is a path diagram labeled with parameter estimates and an inset box with selected fit statistics.

 

01_CT_SEMBlogOriginal4a.png


 

The overall Chi-square statistic is 689.59 with 55 - 20 = 35 degrees of freedom. At any reasonable value of alpha, we reject the null hypothesis that the model-implied (restricted) covariance matrix equals the sample covariance matrix.

Since the Chi-square rejects the null hypothesis at alpha = 0.05, can we improve the fit of the model by allowing the factors to be correlated? Removing the PCOV statement will also remove the restriction of uncorrelated factors.

 

proc calis data=bse1.stem;
 path
   math_imp ---> I1 I2 I3 I4 I5,
   parents ---> p1 p2 p3 p4 p5;
 pvar
   math_imp = 1,
   parents = 1;
 pathdiagram exogcov label=[math_imp="Importance of Math"
                            parents="Parental Supp. of Math Ed."];
run;

 

02_CT_SEMblogOriginal4b.png

 

With the addition of the covariance between the factors, we now have 21 parameter estimates with 55 observations, and the Chi-square test has 34 degrees of freedom. The null hypothesis is still rejected.

 

I’ve spent a lot of time telling you about the overall Chi-square test of model fit, but in practice, many researchers consider this test to be less useful than other fit statistics. This is because, with a reasonably large number of records, the standard errors of the estimates in the full covariance matrix are very small, making this Chi-square a very, very powerful test, even in the presence of trivial deviations from the null hypothesis. There are many other fit statistics that evaluate model fit in different ways and that should also be used. Essentially, these fit statistics measure the degree of model fit, instead of testing the absolute fit of the model like the Chi-square test does. That post is coming along soon!

 

What is really nice about the Chi-square is that we can use it for testing hypotheses about models that are hierarchically nested, meaning that one model can be obtained by placing constraints on the other. In this case, we want to know whether we significantly improved the model fit by allowing correlated factors. The difference between the Chi-squares for the 2 models is itself a Chi-square, with df equal to the difference between the models’ df (here, 35 - 34 = 1).

 

data _null_;
ChiFull=280.17;     /* Chi-square for the full model (correlated factors)         */
ChiRest=689.59;     /* Chi-square for the restricted model (uncorrelated factors) */
dfFull=34;
dfRest=35;
ChiSq=ChiRest-ChiFull;    /* difference-test statistic */
df=dfRest-dfFull;         /* difference-test df        */
pvalue=1-probchi(ChiSq,df);
put _ALL_;
run;

 

In the log, we see the result from the PUT statement:

 

ChiFull=280.17 ChiRest=689.59 dfFull=34 dfRest=35 ChiSq=409.42 df=1 pvalue=0

 

Allowing correlated factors improved the model significantly. That is something I can use! There are lots of other fit statistics that you should know about. We’ll talk about those next time!

 

In summary, the Chi-square test of model fit for SEM is one of many useful tools for assessing a model. To learn more about structural equation modeling, sign up for the SAS Learning Subscription where you will find courses about SEM and so much more. Drop me a comment if you find this post helpful. Thanks for reading and see you soon!  

 

 

Find more articles from SAS Global Enablement and Learning here.

