## Statistical programming, matrix languages, and more

Contributor
Posts: 43

Hi all,

I'm currently conducting a Monte Carlo simulation involving repeated measures data with 4 time points. I have successfully generated data however, when I create 1000 repetitions of the data and use PROC MIXED to fit a linear mixed model to each repetition, I find that the variances of the random slope and intercept are essentially 0. Also, I find that the fixed effects and the residual variance is correct, it is just that the random effects are 0 even though I specified them in the simulation. Below is the code I used to read in the data and then generate the data and fit the models, etc. Any help is appreciated.

Best,

Tim B.

data dental;

input ID Gender \$ Age8 Age10 Age12 Age14;

cards;

1        F       21.0    20.0    21.5    23.0

2        F       21.0    21.5    24.0    25.5

3        F       20.5    24.0    24.5    26.0

4        F       23.5    24.5    25.0    26.5

5        F       21.5    23.0    22.5    23.5

6        F       20.0    21.0    21.0    22.5

7        F       21.5    22.5    23.0    25.0

8        F       23.0    23.0    23.5    24.0

9        F       20.0    21.0    22.0    21.5

10        F       16.5    19.0    19.0    19.5

11        F       24.5    25.0    28.0    28.0

12        M       26.0    25.0    29.0    31.0

13        M       21.5    22.5    23.0    26.5

14        M       23.0    22.5    24.0    27.5

15        M       25.5    27.5    26.5    27.0

16        M       20.0    23.5    22.5    26.0

17        M       24.5    25.5    27.0    28.5

18        M       22.0    22.0    24.5    26.5

19        M       24.0    21.5    24.5    25.5

20        M       23.0    20.5    31.0    26.0

21        M       27.5    28.0    31.0    31.5

22        M       23.0    23.0    23.5    25.0

23        M       21.5    23.5    24.0    28.0

24        M       17.0    24.5    26.0    29.5

25        M       22.5    25.5    25.5    26.0

26        M       23.0    24.5    26.0    30.0

27        M       22.0    21.5    23.5    25.0

;

run;

*converting to long format*;

data long_dental;

set dental;

Age = 8;

measurement = Age8;

output;

Age = 10;

measurement = Age10;

output;

Age = 12;

measurement = Age12;

output;

Age = 14;

measurement = Age14;

output;

drop Age8-Age14;

run;

proc iml;

k = 4; *four time points*;

s = 200; *200 individuals*;

sigma2_int = 1.9211; *variance in random intercept*;

sigma2_slope = 0.02228; *variance in random slope*;

G = diag( sigma2_int // sigma2_slope ); *G matrix of random variances*;

sigma2 = 1.8787; *residual variance*;

beta = {16.7611, 0.6602}; *parameters for the fixed effects*;

Age = repeat(T(do(8, 14, 2)), s);

X = j(nrow(Age), 1, 1) || Age;

Z = j(nrow(Age), 1, 1) || Age;

ID = colvec(repeat(T(1:s), 1, k));

call randseed(1234);

create MixVC var {'SampleID' 'Measurement' 'Age' 'ID' 'eta' 'random' 'eps'};

*generate 1000 repetitions of the above data*;

do i = 1 to 1000;

zero = repeat(0, nrow(G));

gamma = RandNormal(1, zero, G);

eps = j(nrow(X), 1);

eta = X*beta;

random = Z*gamma`;

call randgen(eps, 'Normal', 0, sqrt(sigma2));

Measurement = eta + random + eps;

SampleID = repeat(i, k*s);

append;

end;

close;

quit;

%macro ODSOff; *macro to turn off ODS*;

ods graphics off;

ods exclude all;

ods noresults;

%mend;

*fit model to the simulated data*;

%ODSOff

proc mixed data = MixVC; *using proc mixed to fit linear mixed model to each repetition of data*;

by SampleID;

class ID;

ods output CovParms = cp SolutionF = fixed_effects;

model Measurement = Age / s;

random int Age / subject = ID type = vc;

repeated / type = vc subject = ID;

run;

*call back output*;

%macro ODSON;

ods graphics on;

ods exclude none;

ods results;

%mend;

*check sampling distribution of parameters*;

%ODSON

proc means data = cp;

var Estimate;

class CovParm;

run;

proc means data = fixed_effects;

var Estimate;

class Effect;

run;

Posts: 2,655

Based on the presented data, I think the following PROC MIXED code might be a more accurate description.  Try:

proc mixed data = MixVC; *using proc mixed to fit linear mixed model to each repetition of data*;

by SampleID;

class ID age;

ods output CovParms = cp SolutionF = fixed_effects;

model Measurement = Age / s;

random int  / subject = ID;;

repeated age / type = un subject = ID;

run;

This has both a G side and an R side component, and it is quite likely that the R side variance will dominate to the point that the G matrix is not positive definite (variance component is zero).

It appears that the simulation puts all of the variability into a G matrix (random slope and intercept), but also tries to fit an additional repeated measures term that is likely, again, dominating the G matrix.  What happens to the estimates if you drop the REPEATED statement?  Do the estimates now look something like the prior values used in the simulation?

Steve Denham

Contributor
Posts: 43

Hi Steve,

Thanks for the reply. In the random statement, why is Age dropped? I believe that in the way I specified the simulation, I had Age as a random effect. Furthermore, in the repeated statement, why is type = un? The R matrix is specified as variance components.

Thanks,

Tim

Posts: 2,655

For the first attempt at an analysis, age is moved from the random statement to the repeated statement and considered as a class variable, as that is the design used.  I recognize that you are attempting to simulate a random intercept-random slope model, and I can see that as an appropriate analysis.  However, my real reason for doing so is to model the correlation of observations over time--something not captured in your original analysis, and consequently my primary suspect in guessing why the G side estimates are coming out at zero when analyzing the simulated data.  It is also why I suggested switching the type=vc to type=un, as the type=vc assumes no autocorrelation in the data.  Since the source data is obviously autocorrelated, that needs to be captured in the analysis, and then in the generation of the simulated data.  I hope that makes some kind of sense.

Steve Denham

SAS Super FREQ
Posts: 4,175

Said differently, you might consider changing the computation of 'eps'. Currently you have 'eps' as uncorrelated (independent) errors.  In my book Simulating Data with SAS  (which I assume you have, since your notation follows the book), I show how to produce autocorrelated errors by simulating 'eps' from MVN(0, R), where R defines the autocorrelation in the errors. See pp 232-236.

The example in the book uses 5 subjects, so I construct a big block diagonal matrix for R.  You have 200 subjects, so your block-diagonal matrix will be fairly big. Alternatively, you can simulate each subject within a loop and eliminate the need for such large matrices.

Contributor
Posts: 43

Thanks Steve and Rick for the responses. The reason I made the errors uncorrelated was because in my simulation study, I have several conditions for the R matrix: VC, unstructured, compound symmetry, AR(1), Toeplitz, and ANTE(1). Although the empirical dataset may not have the errors as uncorrelated, I still wanted to fit a model using PROC MIXED where the R matrix is VC (uncorrelated errors) and then generate data according to the parameters specified by PROC MIXED. I hope I am going about this the right way. Let me know if there is any confusion regarding the simulation and how I'm explaining it.

Again, thank you Steve and Rick, for the help.

Contributor
Posts: 43

Below is code I have composed that correctly generates data. I am assuming that the G and R matrices are VC and then I use PROC MIXED to fit a model to each iteration. If you use the code, you can see that when I check the means of the random effects, they are 1.9211 and 0.02228 for the variance in the intercept and slope, respectively. For the next part of my simulation, I am going to assume that the errors are compound symmetric. However, I am having trouble trying to specify a compound symmetric error structure. Does anyone have an idea?

DATA VCdental;

gamma_00 = 16.7611; *intercept*;

gamma_01 = 0.6602; *slope*;

sigma2_0 = 1.9211; *variance in the intercept*;

sigma2_1 = 0.02228; *variance in the slope*;

sigma2_e = 1.8787; *error variance*;

N = 200;

seed = 20140501;

T = 4;

DO d = 1 to 1000; *1000 repetitions*;

DO I = 1 to N; *1 to N individuals*;

b_0n = gamma_00 + sqrt(sigma2_0) * RANNOR(seed); *intercept with added variance*;

b_1n = gamma_01 + sqrt(sigma2_1) * RANNOR(seed); *slope with added variance*;

DO Age = 8 to 14 by 2;

y_nt = b_0n + b_1n * Age + sqrt(sigma2_e) * RANNOR(seed); <---how would I specify a compound symmetric error structure in this statement?

OUTPUT;

END;

END;

END;

RUN;

%macro ODSOff;

ods graphics off;

ods exclude all;

ods noresults;

%mend;

*fit model to the simulated data*;

%ODSOff

proc mixed data = VCdental;

class i;

model y_nt = Age / s;

random int age / subject = i g;

repeated / subject = i r;

ODS OUTPUT SOLUTIONF = fixedE;

ODS OUTPUT CovParms = randomE;

by d;

run;

*call back output*;

%macro ODSON;

ods graphics on;

ods exclude none;

ods results;

%mend;

%ODSON

PROC MEANS DATA = fixedE;

CLASS effect;

VAR estimate;

RUN;

PROC MEANS DATA = randomE;

CLASS CovParm;

VAR estimate;

RUN;

SAS Super FREQ
Posts: 4,175

First, don't use RANNOR for simulation. Second, you can use SAS/IML to specify many covariance structures.  This article describes compound symmetry and other structures.. In my opinion, you should use the SAS/IML language if you are interested in correlated variables or correlated errors.

Contributor
Posts: 43

Hi Rick,

Thanks for the advice. I didn't use PROC IML and used the DATA step because the latter option led to correct simulation results for me. If I could get the PROC IML simulation to work I would gladly do it that way. However, I have followed the examples used in your book and toyed with the code for days now and I'm still running into the problem where my random variances average out to zero. Furthermore, I have tried your advice on simulating correlated errors. However, the simulation is still not running correctly. Below is the code I have used to generate data according to a compound symmetric error structure:

proc iml;

k = 4;

s = 200;

sigma2_int = 1.8430;

sigma2_slope = 0.02228;

G = diag( sigma2_int // sigma2_slope );

sigma2_R = 0.07809;

sigma2_CS = 1.8787;

B = sigma2_R*j(k, k, 1) + sigma2_CS*I(k);

R = B;

do i = 2 to s;

R = block(R, B);

end;

beta = {16.7611, 0.6602}; *parameters for the fixed effects*;

Age = repeat(T(do(8, 14, 2)), s);

X = j(nrow(Age), 1, 1) || Age;

Z = j(nrow(Age), 1, 1) || Age;

ID = colvec(repeat(T(1:s), 1, k));

call randseed(1234);

create MixCS var {'SampleID' 'Measurement' 'Age' 'ID' 'gamma' 'eps'};

*generate 1000 repetitions of the above data*;

do i = 1 to 1000;

zero = repeat(0, nrow(G));

gamma = RandNormal(1, zero, G);

eps = j(nrow(X), 1);

call randgen(eps, 'Normal', 0, R);

Measurement = X*beta + Z*gamma` + eps;

SampleID = repeat(i, k*s);

append;

end;

close;

quit;

Do you see anything wrong with the code that might be creating my problem?

Thanks again,

Tim

SAS Employee
Posts: 19

Hi Tim,

You specified G to be a diagonal matrix (no correlation between the random intercept and random slope). This is fine, but make sure that it is what you intended. Also, you drew a single sample for the random effects (rather than 200 samples). This means that everyone in the data set has the same intercept and slope, hence the zero variances. Try the code below. I replicated what you were trying to do, but without a compound symmetric structure for R. I left that off because you should generally avoid specifying a compound symmetric structure for R while also including random effects in the model (the random intercept and the residual within-cluster covariance end up being redundant). An AR(1) structure might be preferable if you want both R-side random effects and G-side random effects in the same model. Also, I switched the sampling of the random effects and the error to be performed in blocks. This decreases the memory requirements for the simulation, which will be helpful if you want to test the simulation for a large number of subjects.

Best,

Steve

proc iml;

/***************************************************************************

Specify model parameters

***************************************************************************/

n_l2 = 200;   *Number of people;

n_l1 = 4;     *Number of time points per person;

n = n_l1 * n_l2;

seed = 314159;   *Random value seed;

call randseed(seed); *Specifies a seed for all random number generation;

*Specify rixed intercept and slope;

fixed = {16.7611 0.6602};

*Specify block of G matrix;

*Treating G as block-diagonal reduces memory requirements;

g_block = {1.8430 0,

0      0.02228};

*Specify block of R matrix;

*Treating R as block-diagonal reduces memory requirements;

*R is currently set to have a block-diagonal structure;

r_block = 1.8787 # i(n_l1);

/***************************************************************************

Set up matrices and draw random values

***************************************************************************/

*Draw intercepts and slopes from a multivariate normal distribution;

l2_random = randnormal(n_l2, fixed, g_block);

l1_random = j(n,2,0);

l2_id = j(n,1,0);

do i = 1 to (n_l2);

*Generate cluster ID variable;

l2_id[n_l1*(i-1)+1:n_l1*(i-1)+n_l1,1]=i;

*Copy random values from level-2 matrix to level-1 matrix;

do k = 1 to n_l1;

l1_random[n_l1*(i-1)+k,]=l2_random[i,];

end;

end;

*Generate level-1 ID variable;

l1_id = (1:n)`;

*Generate age variable;

age = repeat(do(8,14,2)`,n_l2,1);

*Draw level-1 errors from a multivariate normal distribution;

errors = colvec(randnormal(n_l2, j(n_l1,1,0), r_block));

*Generate dependant variable;

dv = l1_random[,1] + l1_random[,2] # age + errors;

output = l2_id || l1_id || age || dv;

names = {l2_id l1_id age dv};

create multilevelData from output [colname = names];

append from output;

close multilevelData;

quit;

proc mixed data=multilevelData;

model dv = age / s;

random int age / subject=l2_id type=un;

run;

Posts: 2,655

Steve--this is some sweet code.  Thanks for sharing it here.

Steve Denham

SAS Employee
Posts: 19

Hi Steve,

I'm glad you like it! If we want to get really fancy, we can simplify the code further with Kronecker products. All of this:

l1_random = j(n,2,0);

l2_id = j(n,1,0);

do i = 1 to (n_l2);

*Generate cluster ID variable;

l2_id[n_l1*(i-1)+1:n_l1*(i-1)+n_l1,1]=i;

*Copy random values from level-2 matrix to level-1 matrix;

do k = 1 to n_l1;

l1_random[n_l1*(i-1)+k,]=l2_random[i,];

end;

end;

can be replaced with this:

*Convert l2_random to a level-1 vector (same values repeat within cluster);

l1_random = l2_random @ j(n_l1,1,1);

*Generate level-2 ID variable;

l2_id = (1:n_l2)` @ j(n_l1,1,1);

Because the new code has no DO loops, it will run even faster.

Best,

Steve

Discussion stats
• 11 replies
• 953 views
• 2 likes
• 4 in conversation