Hi, I am a new SAS user. I am trying to replicate a paper that uses a mixed-effects model to analyze the effect of genetic risk on cognitive function in a longitudinal data set. In the article I would like to replicate, the authors state:
"Genetic associations with cognitive function or AD-related biomarkers were tested using linear mixed-effects regression models implemented in the SAS MIXED procedure to account for the within-family and within-subject (due to repeated measures) correlations while allowing for missing data[28]. Each mixed-effects model included random intercepts for both family and participant. A random effect for time (age centered) was included when its inclusion led to a better model fit."
I know how to account for within-subject correlation and give random intercept to each participant using the proc mixed procedure. Still, I don't know how to account for within-subject and within-family correlation at the same time while allowing the model to include random intercepts for both family and participant.
I am currently trying the following code, but I don't know whether it is consistent with the method description quoted above. It also takes a very long time to run, and I don't know whether there is a more efficient approach. In this code, DBID is the unique individual ID, FAMILY is the unique family ID, VISNO is numeric time, and TIME is character time. Thank you in advance for any help.
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY TIME;
MODEL Z_STAY_RCL = PGS_APP_DOLD AGEATVISIT GENDER VISNO / CHISQ S DDFM = BW;
RANDOM INTERCEPT VISNO FAMILY / TYPE = UN SUBJECT = FAMILY G;
REPEATED TIME / TYPE = UN SUBJECT = DBID R;
RUN;
Your model looks okay to me, except that you might not want to include Family in the RANDOM statement. There might be other ways to model the correlations in your data, for example, use a different correlation structure in the REPEATED statement.
About the program taking a long time to run: how many levels does TIME have? And are they the same across individuals? I suspect TYPE=UN might be one of the reasons for the long run time, depending on the answers to my questions above. What happens if you try TYPE=CS? Also, using SUBJECT=DBID(FAMILY) in the REPEATED statement might help with the run time.
See if the following program runs faster:
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY TIME;
MODEL Z_STAY_RCL = PGS_APP_DOLD AGEATVISIT GENDER VISNO / CHISQ S DDFM = BW;
RANDOM INTERCEPT VISNO / TYPE = UN SUBJECT = FAMILY G;
REPEATED TIME / TYPE = CS SUBJECT = DBID(FAMILY) R;
RUN;
There are a couple of things to consider here. First, how many levels are included in the variable TIME? The unstructured covariance matrix requires estimating N*(N+1)/2 parameters where N is the number of levels. Next, you should not have FAMILY as the subject in your RANDOM statement when it is included as a random effect. These two things are almost certainly the cause of the long run-time. Try the following:
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY TIME GENDER;
MODEL Z_STAY_RCL = PGS_APP_DOLD AGEATVISIT GENDER VISNO / CHISQ S DDFM = BW;
RANDOM INTERCEPT FAMILY / TYPE = UN SUBJECT = DBID G;
REPEATED TIME / TYPE = AR(1) SUBJECT = DBID R;
RUN;
I changed the REPEATED type to an autoregressive error, which is appropriate if the levels of TIME are equally spaced. If they are not equally spaced, a spatial power structure or a spline structure may be more appropriate. The latter is only available in GLIMMIX, and only as a G side structure. But first try the code above and see what happens.
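To make the run-time point concrete, here is a rough count of covariance parameters in the REPEATED structure. The value N = 8 is only an assumed number of TIME levels for illustration, since the actual number has not been stated in this thread.

```latex
\begin{align*}
\text{TYPE=UN:}    &\quad \frac{N(N+1)}{2} = \frac{8 \cdot 9}{2} = 36 \text{ parameters} \\
\text{TYPE=CS:}    &\quad 2 \text{ parameters (common covariance, residual variance)} \\
\text{TYPE=AR(1):} &\quad 2 \text{ parameters (residual variance, correlation } \rho\text{)}
\end{align*}
```

The UN count grows quadratically with N, while CS and AR(1) stay at two parameters regardless of how many time points there are.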
One other thing to think about. By using VISNO in the model, you are assuming a linear relationship between time and the dependent variable. Shifting to TIME in the model would allow for a more flexible relationship, at the cost of increased model complexity and run time.
SteveDenham
Hi Steve,
I tried your code, but an error message popped up: "ERROR: Model is too large to be fit by PROC MIXED in a reasonable amount of time on this system. Consider changing your model and/or using a different procedure. For example, you can fit large, sparse linear models using PROC HPMIXED".
I googled and found that PROC HPMIXED might be faster than PROC MIXED, but I still want to use the MIXED procedure because I am using the same data as the authors of the paper I want to replicate, and I want to check whether I obtain the same results with exactly the same method.
Are there any other suggestions on how to improve the model specification?
Thanks very much!
Hi jiltao,
Thank you for your answer. This code works well on my computer, and it is really fast. My only question is whether a model specified like this is enough to account for both the within-subject correlation (due to repeated measurements) and the within-family correlation. Does this model include random intercepts for both family and subject?
Thank you very much!
The RANDOM statement specifies INTERCEPT with subject=family - so that estimates the variance component due to family. By including VISNO in the statement, you also estimate a component for family*visno and the covariance between these.
The REPEATED statement specifies the within-subject variability over time, with subjects nested within family (this is what SUBJECT=DBID(FAMILY) provides).
This is equivalent to a random intercept and random slope for family, plus (through the compound-symmetry structure in the REPEATED statement) a random intercept for subject within family.
If the time points are evenly spaced, I would be inclined to make one change - from type=cs to type=ar(1) in the REPEATED statement. I find it difficult to imagine that the covariances between the residuals at all time points are equal. Most temporal repeated measures have a greater covariance for time points that are closer together than those that are more separated in time.
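If the time points are not evenly spaced, a spatial power structure is one option. The following is only a sketch built on the program discussed above, under two assumptions: that VISNO (numeric) reflects the actual spacing between visits, and that no participant has two records with the same VISNO. A variable such as AGEATVISIT may describe the spacing better in your data.

```sas
/* Sketch only: spatial power structure for unequally spaced visits.    */
/* SP(POW)(VISNO) models the correlation between two residuals of the   */
/* same participant as rho raised to the distance |VISNO_j - VISNO_k|.  */
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY;
MODEL Z_STAY_RCL = PGS_APP_DOLD AGEATVISIT GENDER VISNO / CHISQ S DDFM = BW;
/* Family-level random intercept and slope, as in the program above */
RANDOM INTERCEPT VISNO / TYPE = UN SUBJECT = FAMILY G;
/* The coordinate variable in the SP(POW) list must be numeric */
REPEATED / TYPE = SP(POW)(VISNO) SUBJECT = DBID(FAMILY) R;
RUN;
```

With equally spaced integer visit numbers, this structure gives the same correlations as AR(1).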
SteveDenham
The within-subject correlations are modeled by the REPEATED statement.
The within-family correlations are modeled (indirectly) by the RANDOM statement (plus the REPEATED statement for certain observations within a family).
The random intercept for family is modeled in your RANDOM statement.
The random intercept for subject is not explicitly modeled in your program. However, with TYPE=CS in the REPEATED statement, the program fits essentially the same model as RANDOM INT / SUBJECT=DBID(FAMILY), so the random intercept for subject can be considered modeled. You could replace the REPEATED statement with TYPE=CS by this RANDOM statement to make it explicit. With other types, such as TYPE=AR(1) or TYPE=UN, no random intercept for subject is modeled. Instead, the REPEATED statement models the within-subject correlations directly, which is an alternative to a random intercept model for subject.
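A short derivation may help show why compound symmetry and a subject-level random intercept lead to the same covariance pattern. The notation is introduced here for illustration only: f indexes family, i participant within family, j visit, a_f is the family effect, b_{fi} the participant intercept, and e_{fij} the residual.

```latex
\begin{align*}
y_{fij} &= \mathbf{x}_{fij}^{\top}\boldsymbol{\beta} + a_f + b_{fi} + e_{fij},
\qquad b_{fi} \sim N(0, \sigma_b^2), \quad e_{fij} \sim N(0, \sigma_e^2) \text{ independent} \\
\operatorname{Var}(b_{fi} + e_{fij}) &= \sigma_b^2 + \sigma_e^2 \\
\operatorname{Cov}(b_{fi} + e_{fij},\; b_{fi} + e_{fik}) &= \sigma_b^2 \qquad (j \neq k)
\end{align*}
```

Every pair of visits for the same participant has the same covariance, which is exactly the compound-symmetry pattern. One difference remains: TYPE=CS in the REPEATED statement allows the common covariance to be negative, whereas a variance component for a random intercept cannot be.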
Hi jiltao,
Thanks for the heads up! Just one clarification question. Do you mean I must use the compound symmetry covariance structure to model the random intercepts for the subject? Suppose I specify the random intercept in a RANDOM statement with SUBJECT = DBID(FAMILY) instead of relying on TYPE = CS in the REPEATED statement (see the code below). Can I then use a different covariance structure and be confident the model still has random intercepts for subjects?
An example code is shown below. Can I say this code gives random intercepts for both subject and family while using an unstructured covariance structure?
Thank you very much!
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY TIME GENDER;
MODEL Z_IMM_MEM = PGS_APP_DOLD AGEATVISIT GENDER VISNO / CHISQ S DDFM = BW;
RANDOM INTERCEPT VISNO / TYPE = UN SUBJECT = DBID(FAMILY) G;
REPEATED TIME / TYPE = UN SUBJECT = DBID(FAMILY) R;
RUN;
You have two levels of correlation, family and subject. Your program above does not model the correlations at the family level, and it over-specifies the correlations at the subject level.
Random intercept refers to a model such as random int / subject=A.
If you must fit a random intercept model, and also must have a UN structure somewhere in there, then you might consider a model similar to the following:
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY TIME GENDER;
MODEL Z_IMM_MEM = PGS_APP_DOLD AGEATVISIT GENDER VISNO / CHISQ S DDFM = BW;
RANDOM INTERCEPT VISNO / TYPE = UN SUBJECT = FAMILY G;
RANDOM INTERCEPT VISNO / TYPE = UN SUBJECT = DBID(FAMILY);
RUN;
If you must fit a UN covariance structure in the REPEATED statement, then there is no way I am aware of to make it a typical random intercept model.
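For the part of the paper's description that adds a random effect for time only when it improves the fit, one possible way to check is sketched below. It rests on several assumptions: VISNO stands in for the paper's centered age, the random slope is placed at the participant level, and the data set names FIT_INT, FIT_SLOPE, and FIT_COMPARE are hypothetical.

```sas
/* Model A (sketch): random intercepts only, for family and for participant */
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY GENDER;
MODEL Z_IMM_MEM = PGS_APP_DOLD AGEATVISIT GENDER VISNO / S DDFM = BW;
RANDOM INTERCEPT / SUBJECT = FAMILY;
RANDOM INTERCEPT / SUBJECT = DBID(FAMILY);
ODS OUTPUT FITSTATISTICS = FIT_INT;   /* save the Fit Statistics table */
RUN;

/* Model B (sketch): same fixed effects, plus a participant-level random slope */
PROC MIXED DATA = PHENOTYPE_PGS_STD;
CLASS DBID FAMILY GENDER;
MODEL Z_IMM_MEM = PGS_APP_DOLD AGEATVISIT GENDER VISNO / S DDFM = BW;
RANDOM INTERCEPT / SUBJECT = FAMILY;
RANDOM INTERCEPT VISNO / TYPE = UN SUBJECT = DBID(FAMILY);
ODS OUTPUT FITSTATISTICS = FIT_SLOPE;
RUN;

/* Stack the two Fit Statistics tables and label which model each row came from */
DATA FIT_COMPARE;
LENGTH MODEL $ 20;
SET FIT_INT (IN = A) FIT_SLOPE;
IF A THEN MODEL = 'INTERCEPTS ONLY';
ELSE MODEL = 'PLUS RANDOM SLOPE';
RUN;

PROC PRINT DATA = FIT_COMPARE;
RUN;
```

Because both models share the same fixed effects, their -2 Res Log Likelihood, AIC, and BIC values under the default REML estimation can be compared directly, with smaller values indicating better fit.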
I hope that makes sense.
Jill