Repeat After Me: Understanding Correlation Matrices in Repeated Measures Models

Back in December 2024, I posted about mixed model analysis. In that post, I mentioned correlation and covariance matrices and noted that the topic would best be discussed in its own post. Here is that post.

In longitudinal or repeated measures analysis, controlling for dependence among observations through correlation or covariance matrices quickly becomes a central concern. In this post, we will discuss why these matrices matter and walk through examples of how they are typically used in analyses.

Why a covariance/correlation matrix?

In longitudinal or repeated measures problems, data are collected on the same experimental unit (or subject) over time. For example, a researcher is interested in comparing the efficacy of a new drug for moderating blood pressure versus a placebo. The researcher randomly assigns each participant (subject) to the drug or the placebo. After the assigned treatment is administered, the blood pressure of each participant is recorded every hour for the next 8 hours.

In this setup, the observations cannot be considered independent of one another. First, measures taken on the same subject will tend to be more similar than measures taken on different subjects. Second, measures taken closer together in time on the same subject can be more highly correlated than measures taken farther apart in time.

To account for this departure from full independence, we introduce covariance/correlation matrices into the analysis. Either form describes how observations within the same subject are related. Because the model then accounts for that within-subject dependence instead of ignoring it, the standard errors and hypothesis tests it produces are more trustworthy.

What is the difference between covariance and correlation?

Sometimes you will hear the terms covariance matrix and correlation matrix used interchangeably. What is the difference between these matrices?

Covariance indicates the extent to which two variables vary together linearly. Its values can range from negative infinity to positive infinity, and its sign gives the direction of the relationship. The downside of covariance is that its magnitude depends on the units of the variables, so changing the scale of the values changes the covariance. Because covariance is not a unitless measurement, a larger value does not by itself indicate a stronger relationship.

Correlation is a measure that indicates how strongly two variables are linearly related. Its values range from -1 to +1. Unlike covariance, correlation is not affected by changing the scale of the values. The correlation matrix is a standardized form of the covariance matrix, which makes it a unitless measurement.
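The standardization described above can be written out explicitly. The symbols here (X, Y, Sigma for the covariance matrix, D for its diagonal, R for the correlation matrix) are notation assumed for this sketch and do not come from the post itself.

```latex
\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y},
\qquad
\mathbf{R} = \mathbf{D}^{-1/2} \, \boldsymbol{\Sigma} \, \mathbf{D}^{-1/2},
\qquad
\mathbf{D} = \operatorname{diag}(\boldsymbol{\Sigma})
```

If X is rescaled by a positive constant a, both the covariance and the standard deviation of X are multiplied by a, so the ratio is unchanged. That cancellation is why correlation is unitless.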

When choosing between covariance and correlation, most people prefer correlation because it is not affected by changes in scale. Also, more people are accustomed to the word correlation than covariance.

Types of Covariance/Correlation Matrices

When we perform our statistical analysis, we will include what we perceive as the working covariance/correlation structure for the data. To start, we will still assume that observations from two different subjects are uncorrelated; however, what happens to observations within the same subject? Below is an example of a block diagonal correlation matrix. In our example, assume that each subject is observed four times. We place each observation, organized by subject, into a large matrix. The R1 zone will be the four observations for the first subject. Zone R2 will be the four observations for the second subject. This pattern continues diagonally for each subject.

Note that observations from subject one are not related to observations from any other subject. This is indicated by the zeros. What we will try to determine is the structure that will reside within each of the Ri zones. There is one rule. Once we decide on a structure, that structure must remain consistent across all subjects (zones).
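The block diagonal layout described above can be sketched in matrix notation. The symbol n for the number of subjects is an assumption introduced here. Each R_i is a 4 x 4 block because each subject is observed four times, and each 0 is a 4 x 4 block of zeros.

```latex
\mathbf{R} =
\begin{bmatrix}
\mathbf{R}_1 & \mathbf{0}   & \cdots & \mathbf{0}   \\
\mathbf{0}   & \mathbf{R}_2 & \cdots & \mathbf{0}   \\
\vdots       & \vdots       & \ddots & \vdots       \\
\mathbf{0}   & \mathbf{0}   & \cdots & \mathbf{R}_n
\end{bmatrix}
```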

What types of structures are there?

Let’s begin with the simplest structure. We define simplest by the number of parameters that have to be estimated. The simplest structure is called variance components, or VC. Some people also call this the simple structure. It assumes that observations are independent and share an equal variance, so there is only one parameter to be estimated. This is likely not a reasonable structure for repeated measures data because observations within a subject are correlated.
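For a subject observed four times, this structure could be written as follows. The symbol sigma squared for the common variance is notation assumed for illustration.

```latex
\mathbf{R}_i = \sigma^2
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
```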

The most complex structure is unstructured. Each time point has its own variance, and each pairing of observations within a subject has its own unique covariance. No patterns are assumed, and each parameter must be estimated. This vastly increases the number of parameters: with t time points there are t(t+1)/2 of them, which is 36 for eight time points.
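As a sketch for four observations per subject, the unstructured block has ten distinct parameters: four variances and six covariances. The subscripted sigma symbols are assumed notation, and only the lower triangle is labeled because the matrix is symmetric.

```latex
\mathbf{R}_i =
\begin{bmatrix}
\sigma_1^2  &             &             &            \\
\sigma_{21} & \sigma_2^2  &             &            \\
\sigma_{31} & \sigma_{32} & \sigma_3^2  &            \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{bmatrix}
```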

Another simple correlation structure is compound symmetry. Some also refer to this as the exchangeable correlation structure. It assumes that the correlation is the same regardless of the distance between time points. This is also a good structure when the repeated measures are not collected over time (for example, students in a classroom).
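Written in terms of a common variance and a common correlation (symbols assumed here for illustration), the compound symmetry block for four observations looks like this. Software may use an equivalent parameterization, so treat this as one possible way to write it.

```latex
\mathbf{R}_i = \sigma^2
\begin{bmatrix}
1    & \rho & \rho & \rho \\
\rho & 1    & \rho & \rho \\
\rho & \rho & 1    & \rho \\
\rho & \rho & \rho & 1
\end{bmatrix}
```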

The AR(1) model assumes that the correlation between adjacent observations in time is a value rho regardless of whether the pair of observations is the first and second, second and third, and so on. It also assumes that the correlation of any pair of observations two units apart in time is rho squared. Observations that are d units apart in time have correlation of rho raised to the d power.
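For four equally spaced observations, the pattern just described can be sketched as the matrix below. The common variance sigma squared is an added assumption of this sketch. Only two parameters are estimated, no matter how many time points there are.

```latex
\mathbf{R}_i = \sigma^2
\begin{bmatrix}
1      & \rho   & \rho^2 & \rho^3 \\
\rho   & 1      & \rho   & \rho^2 \\
\rho^2 & \rho   & 1      & \rho   \\
\rho^3 & \rho^2 & \rho   & 1
\end{bmatrix}
```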

Toeplitz structures are similar to AR(1) but more general. They still assume that observations separated by a common distance in time share the same correlation. The difference is that the correlation at each distance is estimated freely rather than being constrained to a power of a single rho.
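A sketch of the Toeplitz block for four equally spaced observations is shown below. The symbols rho_1, rho_2, and rho_3 for the correlations at distances one, two, and three, and sigma squared for the common variance, are assumed notation. Each band parallel to the diagonal has its own value.

```latex
\mathbf{R}_i = \sigma^2
\begin{bmatrix}
1      & \rho_1 & \rho_2 & \rho_3 \\
\rho_1 & 1      & \rho_1 & \rho_2 \\
\rho_2 & \rho_1 & 1      & \rho_1 \\
\rho_3 & \rho_2 & \rho_1 & 1
\end{bmatrix}
```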

The time-based structures above, AR(1) and Toeplitz, rely on one aspect of the data being true: the repeated measures must be equally spaced. But what if our observations are not equally spaced? For example, the first measure is taken one hour after receiving a treatment. The second measure is taken 2 hours after the first. The third is taken 1 hour after the second. The repeated measures are not equally spaced in this case. When this happens, we can turn to the spatial structures (spatial power and spatial exponential). These do not require equally spaced measures.
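As a minimal sketch, a spatial power structure could be requested in PROC MIXED along the following lines. The data set name bp and the variables subject, treatment, hours, and sbp are hypothetical names chosen to match the blood pressure scenario, and hours is assumed to be a numeric variable holding the actual measurement times (1, 3, and 4 in the example above).

```sas
/* Sketch only: BP is a hypothetical data set with one row per subject */
/* per measurement. All variable names below are assumed.              */
proc mixed data=bp;
   class subject treatment;
   model sbp = treatment hours treatment*hours;
   /* SP(POW): the correlation between two measures on the same subject */
   /* is rho raised to the absolute difference in their HOURS values    */
   repeated / type=sp(pow)(hours) subject=subject;
run;
```

With measurement times of 1, 3, and 4 hours, this structure implies correlations of rho squared between the first and second measures, rho between the second and third, and rho cubed between the first and third.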

How do you decide which to use?

These are just a few of the possible choices for working covariance/correlation matrices. With all these choices, which one should I use? Is there any statistic that can help me compare them?

The best tip I can offer to help you decide is to play the game of "who am I not" before playing "who am I." Aspects of the problem will push you away from certain choices and thereby reduce the number of possibilities that you need to compare. For example, if the spacing of the measures is equal, I would not consider the spatial types. If the measures are not repeated over time, I would not consider AR(1). Any reduction in the number of candidate structures helps.

But what about comparing across possible structures? If you have kept the fixed effects unchanged across your model runs and you are using REML as the estimation method, you can use statistics like AIC, AICC, and BIC to compare. These statistics work just as they did when you used them to compare candidate models in linear regression. Remember, smaller is better.
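For reference, the three criteria are commonly written as below. This sketch follows the formulas as they are understood to appear in the PROC MIXED documentation, and the symbols are assumptions rather than anything defined in this post: l is the maximized restricted log likelihood, d is the number of covariance parameters, n is the number of subjects, and n* is the number of observations minus the number of fixed-effect columns.

```latex
\mathrm{AIC}  = -2\ell + 2d,
\qquad
\mathrm{AICC} = -2\ell + \frac{2 d \, n^{*}}{n^{*} - d - 1},
\qquad
\mathrm{BIC}  = -2\ell + d \log n
```

With eight equally spaced time points, d is 2 for AR(1), 8 for Toeplitz, and 36 for unstructured, which is why the penalty term matters so much when comparing these structures.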

Example Time

Let’s look at an example of how we can perform this comparison. A pharmaceutical company wants to examine the effects of three drugs on the respiratory system of asthma patients. Each of the three drugs is randomly assigned to 24 patients. Measurements of the patients’ respiratory ability are taken hourly for eight hours after treatment.

In this example, we will compare AR(1), Toeplitz, and unstructured. Each model keeps the same fixed effects, and REML is the default estimation method. In each REPEATED statement, the TYPE= option specifies the structure. The SUBJECT= option tells SAS the smallest experimental unit on which measurements are repeated. In this case, it is patient. For each model, we save tables that contain the AIC, AICC, and BIC statistics. We then merge and print these values across all models.

/* Suppress printed output while the three models are fit */
ods exclude all;

/* Model 1: AR(1) within-patient structure */
proc mixed data=asthma;
   class drug patient hour;
   model fev1=drug basefev1 drug*basefev1 hour drug*hour / ddfm=kr2;
   repeated hour / type=ar(1) subject=patient;
   /* Save the fit statistics (value column renamed for merging) and dimensions */
   ods output FitStatistics=FitAR1(rename=(value=AR1)) FitStatistics=FitAR1p
              Dimensions=ParmAR1(rename=(value=NumAR1));
run;

/* Model 2: Toeplitz within-patient structure */
proc mixed data=asthma;
   class drug patient hour;
   model fev1=drug basefev1 drug*basefev1 hour drug*hour / ddfm=kr2;
   repeated hour / type=toep subject=patient;
   ods output FitStatistics=FitToep(rename=(value=Toep)) FitStatistics=FitToepp
              Dimensions=ParmToep(rename=(value=NumToep));
run;

/* Model 3: unstructured within-patient structure */
proc mixed data=asthma;
   class drug patient hour;
   model fev1=drug basefev1 drug*basefev1 hour drug*hour / ddfm=kr2;
   repeated hour / type=un subject=patient;
   ods output FitStatistics=FitUn(rename=(value=UN)) FitStatistics=FitUNp
              Dimensions=ParmUN(rename=(value=NumUN));
run;

/* Combine the fit statistics into one table, one column per structure */
data fits;
   merge FitAR1 FitToep FitUn;
   by descr;
run;

/* Restore printed output */
ods exclude none;

proc print data=fits;
run;

Based on the AICC, unstructured is preferred over AR(1) and Toeplitz. Toeplitz is better than AR(1) and unstructured when using BIC. Remember that each of these statistics applies a different penalty for model complexity and then adds that penalty to the fit diagnostic. BIC tends to lean toward less complex models, whereas AIC and AICC allow more complexity. I suggest checking the conventions in your research area when deciding which of these statistics to use in your decision making.

Longitudinal or repeated measures data can easily surface in many different areas of research. Ignoring the breach of independence in such data will result in statistical analyses that are not trustworthy and do not reflect the true nature of the data. With all the different structures that can be used to model the relationship among observations within subjects, it can be a daunting task to determine which one to use for your situation. I hope that this post has helped organize the options and provided some guidance in selecting a structure.

TobbeNord · ‎09-03-2025

Really great explanation!
