BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
WillTheKiwi
Pyrite | Level 9

Apologies for this long posting. I hope some of the mixed-model mavens will take the time to read and respond. I posted this request for help on the datamethods forum, but so far no useful replies, hence trying this SAS community.

 

I am using Proc Glimmix to account for repeated measurement (of counts of things representing performance in multiple matches of different sports teams, but that is not particularly relevant). The random-effect solution for the identity of the subjects (the teams) represents relative magnitudes of the subject means of the dependent variable. The covariance parameter for subject identity is provided as a variance, and I have always interpreted the square root of that covparm as the between-subject standard deviation, after adjustment for everything else in the model. Residual error has been partitioned out of it, so I call it the true between-subject SD, as opposed to the observed between-subject SD, which includes residual error.

 

Fine (or at least I hope so), but here’s my problem. The SD of the random-effect solution gives another estimate of the pure between-subject SD. I realize it isn’t right to start with, because of a degrees-of-freedom issue: if I have n subjects, the SD of the random-effect solution is calculated as if there are n-1 degrees of freedom, but the SD provided by the variance for the subjects is calculated, or at least should be consistent with, the actual degrees of freedom for the variance. An estimate of the degrees of freedom is given by 2*Z^2, where Z is the variance divided by its standard error. Well, I’ve done the calculation to correct the SD of the random-effect solution using the degrees of freedom, and the correspondence is not exact, but it’s near enough, so let’s assume that the mismatch between the SDs is just a degrees-of-freedom issue. With these data the degrees of freedom of the subject variance are small–I am getting values of around 1 or even less, sometimes–so the SD of the random-effect solution is a lot less, by a factor of ~10, than the SD given by the square root of the covparm variance. So according to the random-effect solution, there are small differences between teams, but according to the SD of the covparm variance, the differences between teams are 10x greater. I actually want to use the random-effect solution to assess individual teams, but I am reluctant to, because things don’t add up. I am also using the random-effect solution as a linear within-subject predictor in another linear model with a different dependent variable, and I use 2 SD of linear predictors to assess the magnitude of the effect of such predictors. I started off using twice the SD derived from the square root of the covparm, but it gave ridiculous answers.  When I use twice the SD of the random-effect solution, the answers are sensible. Hmmm...

 

Immediately after I posted the above, I realized I should have added another related question. I use nobound to allow negative variance, and SAS provides a random-effect solution even when the variance is negative. I checked that the solution is included additively in the linear model to give predicted values, in exactly the same manner as when the variance is positive. But now how do I interpret the random-effect solution? It’s like there are negative real differences between the subjects.

 

Someone on the datamethods forum wondered whether I had a convergence or parameterization problem.  Here's my reply... My question about negative variance is not a convergence or parameterization issue. In Proc Mixed and Proc Glimmix you can state “nobound” to allow negative variance. (Proc Nlmixed doesn’t allow it, by the look of the documentation.) Nobound increases the risk of failure to converge, but I can usually get around that by stating initial values of the covparms and/or by relaxing the convergence criteria, then relaxing them even further to check that there is no substantial change in the estimates. Negative variance is pretty-much essential when you have a random effect representing individual responses, because it’s the only way to get sensible compatibility (confidence) intervals. And it’s pretty obvious that sampling variation can result in negative variance, when the sample size and/or true variance is small. I’ve used simulation to check that the intervals include true values at the chosen level of the interval.

1 ACCEPTED SOLUTION

Accepted Solutions
WillTheKiwi
Pyrite | Level 9

In case anyone is still interested in this topic, I have resolved the problem of the variance of the random-effect solution (as given by ods output solutionr= in Proc Mixed) being somewhat (and sometimes a lot) less than the variance given by its corresponding covparm (as given by ods output covparms=). I did it by developing another simulation for a study of repeated measurements, where each subject (athletes) has several tests in one season followed by several tests in the next season. The random-effect structure is given by random int Season/subject=Athlete;. This model estimates differences between athletes (the solution for the random Intercept/subject=Athlete, changes within athletes between seasons (the solution for random Season/subject=Athlete), and changes within athletes within seasons (the residuals). The individual solution values for the random effects all have standard errors, and it turns out that the variance of the solution plus the mean standard error squared is equal to the variance given by the covparms. Even the residuals must have standard errors, but I don't know how to estimate those and I assume they account for the difference between the variance of the residuals and residual variance given by the covparms.

 

See below for the code for this simulation. I included lots of code to check that the random-effect solution produces unbiased estimates, and that the 90% compatibility intervals for each solution value did indeed include the true (simulated) value 90% of the time. I could have done it all with the previous code for individual responses, but I wanted to simulate repeated testing of athletes for a project I am helping with at Olympiatoppen (the Norwegian Olympic Sports Center). I'll be using the residuals and possibly the random-effect solution for Season as linear predictors in another mixed model. Their linear effects will be attenuated by the standard errors in each value. These standard errors vary a little bit, depending on the number of repeated measurements each athlete has, but the mean will be given by the difference in the variances described above. I don't have to calculate it, because all I am interested in is the correction for attenuation of a linear predictor that has error; the correction factor is given by 1/ICC (ICC = intraclass correlation coefficient), and the ICC is given by (pure variance)/(observed variance), which here is the variance of the random-effect solution divided by the covparm variance.

 

I won't be allowing negative variances with the upcoming data, so there won't be any problem with interpreting the random-effect solution.

 

Will

 

*simulates repeated tests within each of two seasons;
*some of these macro variables are irrelevant here, copied from another program;

%let Ssize=100; *number of subjects;
%let BtwnSD=4; *true differences between subjects, to be estimated by a random effect;
%let Mean=100; *true grand mean, irrelevant here;
%let WthnErr=2; *error within subjects between tests within seasons, to be estimated by the residual;
%let BtwnErr=1; *extra error within subjects between seasons, to be estimated by a random effect;
%let MaxTestsWthn=4; *maximum number of tests in each season;

%let alpha=0.1;
%let nob=; *make it nob=bound to allow negative variance;
%let convcrit=CONVH=1E-8 convf=1E-8; *make the values 1E-7, 1E-6 or smaller if fail to converge;
%let tvalue=3.5; *residual standardized threshold for outliers;
%let logflag=0; *set to 1 for log transformation of the dependent;
%let deceff=1; *decimal places for effects and SDs;

data dat1;
TestID=0;
do Athlete=1 to &Ssize;
  YtrueRand=&BtwnSD*rannor(0);
  do Season=1;
    BtwnSeasErr=&BtwnErr*rannor(0);
    do TestNo=1 to 2+int((&MaxTestsWthn-1)*ranuni(0));
      WthnSeasErr=rannor(0)*&WthnErr;
      Yobsvd=&Mean+YtrueRand+BtwnSeasErr+WthnSeasErr;
      TestID=TestID+1;
      output;
      end;
    end;
  do Season=2;
    BtwnSeasErr=&BtwnErr*rannor(0); 
    if ranuni(0)>0.5 then do; *this gives 50% chance of any values in Season 2;
    *if ranuni(0)>0 then do; *always have values in Season 2;
      *do TestNo=1 to 2+int((&MaxTestsWthn-1)*ranuni(0)); *on average, as many as in Season 1;
      do TestNo=1 to 1+int((&MaxTestsWthn-1)*ranuni(0)); *sometimes only one test in Season 2;
        WthnSeasErr=rannor(0)*&WthnErr;
        Yobsvd=&Mean+YtrueRand+BtwnSeasErr+WthnSeasErr;
        TestID=TestID+1;
        output;
        end;
      end;
    end;
  end;

title "Values for first three athletes";
proc print data=dat1;
where Athlete<4;
format YtrueRand BtwnSeasErr WthnSeasErr Yobsvd 5.2;
run;

ods select none;
title "Repeated tests over two seasons, mixed model";
proc mixed data=dat1 covtest cl alpha=&alpha  &nob &convcrit;
class Athlete Season;
model Yobsvd=/s outp=pred residual alphap=&alpha ddfm=sat ;
random int Season/subject=Athlete s cl alpha=&alpha; * type=un;*random intercepts and slopes model;
*lsmeans Mean Gender/cl alpha=&alpha;
*estimate "Male/female" Gender 1 -1/cl alpha=&alpha;
ods output covparms=cov;
ods output estimates=est;
ods output lsmeans=lsm;
ods output solutionr=solr; 
ods output solutionf=solf; 
ods output classlevels=clev;
*by Sport;
*where Sport="Endur";
run;
ods select all;

data cov1;
set cov;
DegFree=2*Zvalue**2;

title2 "Random effects as variances";
title3 "True values: Intercept=&BtwnSD**2;  Season=&BtwnErr**2;  Residual=&WthnErr**2";
proc print data=cov1;
format _numeric_ 5.&deceff DegFree 5.0;
run;

title2 "Variance of residuals";
proc means data=pred noprint;
var resid;
output out=pred1(drop=_type_ _freq_) n=NoOfObs var=Variance;
run;


proc sort data=solr;
by Effect;

title2 "First 10 observations of Intercept random-effect solution";
proc print data=solr(obs=10);
where Effect="Intercept";
run;

title2 "First 10 observations of Season random-effect solution";
proc print data=solr(obs=10);
where Effect="Season";
run;

data solr0;
set solr;
SEsq=StdErrPred**2;

title2 "Variance of random-effect solutions";
proc means data=solr0 noprint;
var estimate SEsq;
by Effect;
output out=solr1(drop=_type_ _freq_ d) n=NoOfObs var=Variance mean=d SEsq;
where estimate ne 0;
run;

*proc print;run;

data solresid;
set solr1 pred1;
if Effect="" then Effect="Residual";

data all;
merge solresid cov1(keep=estimate degfree rename=(Estimate=CovParmVariance));
VarianceAdjustDF=Variance*(NoOfObs-1)/DegFree;
VarPlusSEsq=Variance+SEsq; 
EstimatedSEsq=CovParmVariance-Variance;
CorrAttenFactor=CovParmVariance/Variance;

title2 "Variance of random-effect solutions and residuals, and covparm variance";
title3 "SEsq is the mean of StdErrPred squared of the solutions";
title4 "The covparm variance is evidently the sum of the variance and SEsq";
title5 "Hence the estimate of SEsq for the residual variance (I can't estimate error in each residual)";
title6 "When a random-effect solution or residuals are used as linear predictors, their effects";
title7 "will need to be corrected upwards by CorrAttenFactor = CovParmVariance/Variance = 1/ICC.";
title8 "Variance adjusted for DF does not work: it's always greater than the covparm variance";
proc print;
var Effect NoOfObs Variance SEsq VarPlusSEsq CovParmVariance EstimatedSEsq
  DegFree VarianceAdjustDF CorrAttenFactor;
format _numeric_ 5.&deceff CorrAttenFactor 5.2 DegFree 5.0;
run;


title2 "Check individual true values of Season against random-effect solution";
*proc print data=solr;run;
data season;
set solr;
if Effect="Season";
keep Athlete Estimate Season Lower Upper;

*proc print;run;

proc sort data=dat1;
by Athlete Season;

data dat2;
set dat1;
if lag(Athlete) ne Athlete or lag(Season) ne Season;
keep Athlete Season BtwnSeasErr;

*proc print;
run;

data season1;
merge season dat2;
by Athlete Season;
if BtwnSeasErr ne .; *needed when a subject has no test in Season 2;
if BtwnSeasErr<Lower or BtwnSeasErr>Upper then Type0error="***";

*proc print;run;
title3 "Type0 error (i.e., when 90%CI for individual estimates does not include true value)";
title4 "Expected value is 10%";
proc freq;
tables Type0error/missing nocum;
run;

title3 "Expected true BtwnSeasErr: StdDev=&BtwnErr; variance=&BtwnErr**2";
proc means n mean std var maxdec=&deceff;
var Estimate BtwnSeasErr;
run;

title3 "True value (BtwnSeasErr) vs solution (Estimate)";
title4 "Line of identity shown in black, indicating no bias in the estimated value";
title5 "(Rerun the analysis to show that any apparent bias is just sampling variation.)";
ods graphics / reset width=14cm height=12cm imagemap attrpriority=none;
proc sgplot data=season1; *uniform=all;
styleattrs  
  datacolors=(red blue white)
  datacontrastcolors=(black)
  datasymbols=(circlefilled); *needs a group= for styleattrs to work;
*scatter x=pred y=Resid/markerattrs=(size=6 symbol=circlefilled) filledoutlinedmarkers
  group=PossessionType;
scatter x=Estimate y=BtwnSeasErr/markerattrs=(size=8 color=black symbol=circlefilled) filledoutlinedmarkers MARKERFILLATTRS=(color=lightgreen); * group=Gender;
scatter x=BtwnSeasErr y=BtwnSeasErr/markerattrs=(size=3 color=black symbol=circlefilled);
*scatter x=pred y=Resid/markerattrs=(size=8 color=black symbol=circlefilled) filledoutlinedmarkers MARKERFILLATTRS=(color=lightgreen);
*scatter x=StrokeNo y=Resid/markerattrs=(size=3 color=black symbol=circlefilled) filledoutlinedmarkers MARKERFILLATTRS=(color=black);
refline 0/axis=y lineattrs=(pattern=dot thickness=1 color=black);
refline 0/axis=x lineattrs=(pattern=dot thickness=1 color=black);
reg x=Estimate y=BtwnSeasErr/degree=1 lineattrs=(pattrn=solid thickness=1 color=blue) legendlabel="linear" nomarkers;
*reg x=pred y=Resid/degree=2 lineattrs=(thickness=1 color=red) legendlabel="quadratic" nomarkers;
*by Sport;
run;
ods graphics / reset;

 

View solution in original post

13 REPLIES 13
SteveDenham
Jade | Level 19

This is a great question, and I'm sorry the datamethods group wasn't really helpful (need to remember that the group is R oriented for a lot of things).

Without sounding cruel, could you please tell me how a negative value for a variance has meaning in the context of real data? That question, by itself, led to the development of REML methods, as the Henderson type I methods (which can give negative values) were severely biased in estimating breeding values (animal breeding was the first big adopter of mixed model methods, so far as I know). We know ML estimators are biased, and REML estimators less biased, and are asymptotically unbiased. So this is a learning experience for me - I can't wrap my head around how to interpret a negative variance. Either the true value is a boundary condition (=0) or the model is misspecified in some sense. OK, I'll put the soapbox away for a while and try answering the first part of your post.

 

A copy of the output (at least for the random effects and the model) would be really helpful in addressing your questions.  In return, I refer you to the following papers in hopes that it will help:  

Stroup&Claassen Pseudolikelihood vs Quadrature 

 

Kiernan SGF GLIMMIX categorical outcomes with random effects 

 

Note that there is a clear contradiction between Kiernan and Stroup&Claassen regarding bias of estimates, when comparing REML (pseudo-likelihood) to ML (.Laplace or quadrature methods).  Personally, if my interest was primarily in fixed effects, I would use REML methods, and if primarily interested in random effects, I would use ML methods, and accept the inherent bias of MLE's as compared to REML estimates.  I suppose that is where the negative variances arise.  You might consider using some of the empirical shrinkage methods in that case, to see if it "improves" those estimates.

 

SteveDenham

WillTheKiwi
Pyrite | Level 9

Steve, thanks heaps for the detailed reply. I haven't followed those links yet, because they may not be relevant. What I have done instead is write a simulation (shown below) for a controlled trial in which there are individual responses in an experimental group and in which the change scores in the control and experimental groups are analyzed with two simple mixed models: the first estimates extra variance in the experimental group, representing the individual responses in that group; the second estimates extra variance in the control group, in which case the estimate of the variance is expected to be negative, equal and opposite to the estimate in the first model. It all hangs together nicely. Both models give exactly the same values for the mean effect of the treatment, the residual variance, and the random-effect variance (with a change of sign). So I conclude there is no problem with allowing negative variance when the data need it, and we should allow negative variance anyway, to get unbiased estimates and compatibility limits. (Note that, if the estimate of the variance is positive, allowing only positive variance makes no difference to the estimate and its standard error, but the compatibility limits provided by SAS are unrealistic, because they are based on an unrealistic assumption about the sampling distribution of the variance. The assumption of normality when variance is allowed to be negative produces correct coverage of the interval. Bias would arise when only positive variance is allowed, because negative values of variance arising from sampling variation--or indeed if the true value is negative because of truly less variance in the experimental group--are set to zero.)

 

However, there is a bit of a problem with the random-effect solution when the corresponding variance is negative: SAS doesn't want to give it a standard error (and therefore compatibility limits). This problem occurred occasionally with some but not all values of the random-effect solution in my generalized mixed model with much more data and more complex fixed- and random-effects models, but it occurs every time with every value of the solution in this simulation.

 

I figured that, because of the way the random-effect solution is added to the other effects to get predicted values, there would have to be a negative correlation between the random-effect solution and the residuals, when the variance is negative (the second model), because the residual variance is the variance in the experimental group, so the random-effect solution in the control group has to add to and reduce the values of the residuals in the control group to give the lower SD in the control group. Yes, the correlation was negative alright, but unexpectedly it was perfect, -1.00. In the first model, I didn't expect a correlation, but it was also perfect, this time positive, 1.00. These correlations must arise from the way the variance is partitioned between the residuals and the random effect.

 

I conclude that, at least in this simple individual-responses data and model, the random-effect solution for a negative variance has to be interpreted in the light of the residuals and any other random effects that it is competing with. When the true extra variance is in the experimental group, but the extra variance is estimated in the control group (the second model), the random-effect solution for the negative variance represents individual responses that reduce the residual variance (which represents the variance of the experimental group) rather than increase it, as in the first model. So the random-effect solution for negative variance does not have any immediate useful interpretation, and the lack of compatibility limits is even more problematic for the interpretation of the values.

 

Not shown is the estimation of the variance representing individual responses derived by squaring the SD of the random-effect solution and multiplying by a degrees-of-freedom correction factor, but I did it in a spreadsheet, and the correspondence is good, so the mismatch between the two estimates is just a degrees-of-freedom issue. I am still undecided about whether I should use the random-effect solution as is, or whether I should inflate the values so that their simple SD squared is the same as the covparm variance. 

 

Will

 

/*
Simulation of a controlled trial, with individual responses in an exptal group.
Analysis of the change scores is done with a general linear mixed model allowing for extra variance, 
first in the exptal group, then in the control group. */ %let SDdelta=1; *SD of change scores in control group (error of measurement is 1/sqrt(2) times this value); %let MeanResp=5; *mean response to a treatment in exptal group; %let IndResp=2; *SD of individual responses in exptal group; %let GrpSampSize=10; *sample size in control and exptal groups; %let alpha=0.1; *alpha for compatibility limits; %let seed=2; *set to 0 for a new random sample every time; data dat1; do SubjectID=1 to &GrpSampSize; Group="Control"; IndResp=0; Y=&SDdelta*rannor(&seed); xVarCont=1; *dummy variable to estimate extra variance representing individual responses in control group; xVarExpt=0; *dummy variable to estimate extra variance representing individual responses in exptal group; output; end; do SubjectID=&GrpSampSize+1 to 2*&GrpSampSize; Group="Exptal"; IndResp=&IndResp*rannor(&seed); Y=&SDdelta*rannor(&seed)+&MeanResp+IndResp; xVarCont=0; xVarExpt=1; output; end; format _numeric_ 5.2; title "The data";
proc print; run; proc means maxdec=2; class Group; run; title1 "Individual responses estimated in Exptal group"; ods select none; proc mixed data=dat1 covtest cl nobound alpha=0.1; class SubjectID Group; model Y=Group/noint outp=pred ddfm=sat alphap=&alpha ddfm=sat; random xVarExpt/subject=SubjectID s cl alpha=&alpha; estimate "Treatment effect" Group -1 1/cl alpha=&alpha; ods output covparms=cov; ods output solutionr=solr; ods output estimates=est; run; ods select all; title2 "Treatment mean effect (Expt-Control); expected value = &MeanResp"; proc print data=est; run; title2 "Covparms, expected values: xVarExpt = &IndResp**2; Residual = &SDdelta**2"; data cov1; set cov; DegFree=2*Zvalue**2; proc print data=cov1; run; title2 "Random-effect solution for xVarExpt"; proc print data=solr; run; title2 "Predicted and residual values"; proc print data=pred; run; title2 "Merge residuals and random-effect solution and get correlation"; data residrand; merge pred solr(rename=(Alpha=AlphaEst Lower=LowerEst Upper=UpperEst)); by SubjectID; proc print; var SubjectID Group IndResp Resid Estimate LowerEst UpperEst AlphaEst; run; proc corr; var Resid; with Estimate; by Group; run; title1 "Individual responses estimated in Control group"; ods select none; proc mixed data=dat1 covtest cl nobound alpha=0.1; class SubjectID Group; model Y=Group/noint outp=pred ddfm=sat alphap=&alpha ddfm=sat; random xVarCont/subject=SubjectID s cl alpha=&alpha; estimate "Treatment effect" Group -1 1/cl alpha=&alpha; ods output covparms=cov; ods output solutionr=solr; ods output estimates=est; run; ods select all; title2 "Treatment mean effect (Expt-Control); expected value = &MeanResp"; proc print data=est; run; title2 "Covparms, expected values: xVarCont = -&IndResp**2; Residual = &SDdelta**2+&IndResp**2"; data cov1; set cov; DegFree=2*Zvalue**2; proc print data=cov1; run; title2 "Random-effect solution for xVarCont"; proc print data=solr; run; title2 "Predicted and residual values"; proc print data=pred; run; title2 "Merge residuals and random-effect solution and get correlation"; data residrand; merge pred solr(rename=(Alpha=AlphaEst Lower=LowerEst Upper=UpperEst)); by SubjectID; proc print; var SubjectID Group IndResp Resid Estimate LowerEst UpperEst AlphaEst; run; proc corr; var Resid; with Estimate; by Group; run;

   

SteveDenham
Jade | Level 19

@WillTheKiwi - very interesting.  Just out of curiosity, I ran the data through PROC GLIMMIX with no separation of the two groups as in your method to get a likelihood ratio test of variance homogeneity for the two groups using this code:

 

proc glimmix data=dat1 chol;
class subjectid group;
model y=group/ddfm=sat solution noint;
random _residual_/group=group solution;
lsmeans group/diff;
covtest homogeneity;
run;

The test had a chisquared value of 2.46 and an associate p value of 0.1165 for the analysis with 10 per group.  Residual VC's for the two groups were control = 0.9428 and exptal = 2.7497.

 

Then I cranked this up to 100 per treatment group.  Now the heterogeneity was obvious and significant (p<0.0001).  The least squares means were much closer to the simulated base values of 0 and 5, while the variance components went to ~1 and ~5.  This approach always yielded positive VC's.  So my feeling here is that I have been approaching this issue in a substantially different way, where 0 or negative values lead to a nonpositive definite G matrix, at which point I consider re-parameterizing the random effects.  I will keep this work in my "I learned some more about mixed models" folder for future reference.  Thanks for the effort and the code!

 

SteveDenham

 

WillTheKiwi
Pyrite | Level 9

Thanks for your continuing engagement, Steve. The group= approach is the easier way to account for different variance in the control and exptal groups, but it doesn't give compatibility limits for their difference, which in this case represent individual responses. By the way, I never test for homogeneity of variance. If the variances could be different, I specify different variances. If the resulting observed values are similar, I haven't lost anything, and I get a chance to see how different they could be via the compatibility limits.  Besides, such tests are too dependent on sample size: if the sample size is small, you can have substantial heterogeneity that is not significant; if the sample size is large, you can have trivial heterogeneity that is significant. The same goes for tests of normality, which I also never do. In any case, it's difficult to know what is a small vs a large sample size, because that depends in a complicated way on smallest important values for effects. It's always better to get compatibility intervals and interpret the lower and upper limits.

WillTheKiwi
Pyrite | Level 9

Sorry, forgot to add that "I am still undecided about whether I should use the random-effect solution as is, or whether I should inflate the values so that their simple SD squared is the same as the covparm variance." It comes back the issue that, if I interpret the SD given by the square root of the covparm variance, I get a value with an expected value that is correct: the SD I have used to generate the individual responses, and it's substantial, if I make it large enough. (Let's leave aside the issue of halving the usual magnitude thresholds when you are interpreting the magnitude of an SD.)  But, if that SD has only a few degrees of freedom, then the SD of the random-effect solution will be much smaller, and it could be trivial. So from the random-effect solution, I get the impression that the individual values of the individual responses are trivial, yet their SD via the covparms is substantial. It seems I have no choice but to inflate the values of the random-effect solution so that their SD is the same as that given by the covparm.

SteveDenham
Jade | Level 19

Your approach to homogeneity is exactly what I wish was done by validated programs, but it seems that there is a great attraction somehow to keep doing this and using it to decide whether a parametric or a nonparametric testwill be implemented.  The same goes for normality tests.  At least for linear mixed models (not so much for generalized linear mixed models, where it is just, well, stupid), the problem with tests for normality being so dependent on sample size.

 

So what I hear through all of this is that what the mixed model procedures need is some decent interval estimates for the variance components.  I know that I just ran something yesterday with the "cl" option in the RANDOM statement in an attempt to get the limits you are talking about, and it was seemingly ignored by PROC GLIMMIX, with no error or warning in the log or message in the output.

 

SteveDenham

WillTheKiwi
Pyrite | Level 9

Looks like we are on the same page about not testing for whatever. I never do non-parametric tests, by the way. The whole rationale is weird: you're supposed to do them because the data are not normal, yet the distribution of the rank is uniform, not normal. Log transformation usually reduces non-uniformity. Here's a though: what about treating the rank as a count, after subtracting 1 off it, so that you don't predict impossible ranks? Allow for overdispersion. This might work well for rank representing performance, where rank=1 is the best. Residuals vs predicteds would determine if you can achieve reasonable uniformity that way.

 

I can get compatibility limits for the random-effect solution in Glimmix. Here's the code I just tried:

random int/subject=GameID s cl alpha=&alpha;

 

Glimmix won't give CLs for the variances, but it gives standard errors, so I just assume normality, as SAS assumes with nobound in Mixed.

WillTheKiwi
Pyrite | Level 9

It looks like I am not going to get any further help on my problem of whether I rescale the random-effect solution to increase the SD of the solution to make it the same as the square root of the corresponding covparm, so I've done some more simulations to try to figure it out. I found that the compatibility intervals for the individual responses provided by the random-effect solution appear to be accurate; that is, with 90% intervals, on average the intervals contain the sample true value of the individual response 90% of the time, even though the SD of the random-effect solution is way less than the square root of the covparm (which happens when there are only a few degrees of freedom for the covparm). So although the individual estimated values are shrunk, the compatibility intervals capture the true values correctly. I therefore think it's safe to rescale the random-effect solution and the compatibility limits. It amounts to a correction for shrinkage.

SteveDenham
Jade | Level 19

@WillTheKiwi  Could you help me out here?  I need a good definition of compatibility limits, especially as compared to confidence limits and tolerance limits.  Without that, I have some hesitancy about assuming symmetric bounds on the variance components, as I would expect some kind of scaled chi-squared distribution rather than a t distribution.

 

SteveDenham

WillTheKiwi
Pyrite | Level 9

Compatibility is Sander Greenland's ("retire statistical significance") preferred term for confidence. The values of the effect in the interval are most compatible with the data and the model (and all the assumptions underlying the model and the sampling). It's his way of keeping the model and the assumptions up front. Which is fine, but in the end practitioners still have to interpret the interval as the likely range of the true effect, by implicitly adopting a Bayesian assessment with a minimally or sufficiently weakly informative prior; in other words, they have to interpret the interval as uncertainty in the true value, or as values that they are most confident the true effect could have. Without those interpretations, the effect has no practical application, in my view, anyway. In plain language, the interval represents how small or big the effect could be, and you interpret that uncertainty with reference to magnitude thresholds: trivial, small +ive and -ive, moderate +ive and -ive, and so on.

WillTheKiwi
Pyrite | Level 9

When you allow negative variance in Mixed, SAS assumes the variances estimated by the random statement are normally distributed. If you don't allow negative variance, SAS assumes they are chi-squared distributed. The residual in Mixed is always chi-squared distributed. I guess in Glimmix you have to decide on the distribution of the random effects yourself, and that's why SAS provides only the standard error. I have always assumed normality. The distribution of the residuals is, of course, given by dist= (and link=), without or with overdispersion given by random _residual_. I hope I have got that right.

WillTheKiwi
Pyrite | Level 9

In case anyone is still interested in this topic, I have resolved the problem of the variance of the random-effect solution (as given by ods output solutionr= in Proc Mixed) being somewhat (and sometimes a lot) less than the variance given by its corresponding covparm (as given by ods output covparms=). I did it by developing another simulation for a study of repeated measurements, where each subject (athletes) has several tests in one season followed by several tests in the next season. The random-effect structure is given by random int Season/subject=Athlete;. This model estimates differences between athletes (the solution for the random Intercept/subject=Athlete, changes within athletes between seasons (the solution for random Season/subject=Athlete), and changes within athletes within seasons (the residuals). The individual solution values for the random effects all have standard errors, and it turns out that the variance of the solution plus the mean standard error squared is equal to the variance given by the covparms. Even the residuals must have standard errors, but I don't know how to estimate those and I assume they account for the difference between the variance of the residuals and residual variance given by the covparms.

 

See below for the code for this simulation. I included lots of code to check that the random-effect solution produces unbiased estimates, and that the 90% compatibility intervals for each solution value did indeed include the true (simulated) value 90% of the time. I could have done it all with the previous code for individual responses, but I wanted to simulate repeated testing of athletes for a project I am helping with at Olympiatoppen (the Norwegian Olympic Sports Center). I'll be using the residuals and possibly the random-effect solution for Season as linear predictors in another mixed model. Their linear effects will be attenuated by the standard errors in each value. These standard errors vary a little bit, depending on the number of repeated measurements each athlete has, but the mean will be given by the difference in the variances described above. I don't have to calculate it, because all I am interested in is the correction for attenuation of a linear predictor that has error; the correction factor is given by 1/ICC (ICC = intraclass correlation coefficient), and the ICC is given by (pure variance)/(observed variance), which here is the variance of the random-effect solution divided by the covparm variance.

 

I won't be allowing negative variances with the upcoming data, so there won't be any problem with interpreting the random-effect solution.

 

Will

 

*simulates repeated tests within each of two seasons;
*some of these macro variables are irrelevant here, copied from another program;

%let Ssize=100; *number of subjects;
%let BtwnSD=4; *true differences between subjects, to be estimated by a random effect;
%let Mean=100; *true grand mean, irrelevant here;
%let WthnErr=2; *error within subjects between tests within seasons, to be estimated by the residual;
%let BtwnErr=1; *extra error within subjects between seasons, to be estimated by a random effect;
%let MaxTestsWthn=4; *maximum number of tests in each season;

%let alpha=0.1;
%let nob=; *make it nob=bound to allow negative variance;
%let convcrit=CONVH=1E-8 convf=1E-8; *make the values 1E-7, 1E-6 or smaller if fail to converge;
%let tvalue=3.5; *residual standardized threshold for outliers;
%let logflag=0; *set to 1 for log transformation of the dependent;
%let deceff=1; *decimal places for effects and SDs;

data dat1;
TestID=0;
do Athlete=1 to &Ssize;
  YtrueRand=&BtwnSD*rannor(0);
  do Season=1;
    BtwnSeasErr=&BtwnErr*rannor(0);
    do TestNo=1 to 2+int((&MaxTestsWthn-1)*ranuni(0));
      WthnSeasErr=rannor(0)*&WthnErr;
      Yobsvd=&Mean+YtrueRand+BtwnSeasErr+WthnSeasErr;
      TestID=TestID+1;
      output;
      end;
    end;
  do Season=2;
    BtwnSeasErr=&BtwnErr*rannor(0); 
    if ranuni(0)>0.5 then do; *this gives 50% chance of any values in Season 2;
    *if ranuni(0)>0 then do; *always have values in Season 2;
      *do TestNo=1 to 2+int((&MaxTestsWthn-1)*ranuni(0)); *on average, as many as in Season 1;
      do TestNo=1 to 1+int((&MaxTestsWthn-1)*ranuni(0)); *sometimes only one test in Season 2;
        WthnSeasErr=rannor(0)*&WthnErr;
        Yobsvd=&Mean+YtrueRand+BtwnSeasErr+WthnSeasErr;
        TestID=TestID+1;
        output;
        end;
      end;
    end;
  end;

title "Values for first three athletes";
proc print data=dat1;
where Athlete<4;
format YtrueRand BtwnSeasErr WthnSeasErr Yobsvd 5.2;
run;

ods select none;
title "Repeated tests over two seasons, mixed model";
proc mixed data=dat1 covtest cl alpha=&alpha  &nob &convcrit;
class Athlete Season;
model Yobsvd=/s outp=pred residual alphap=&alpha ddfm=sat ;
random int Season/subject=Athlete s cl alpha=&alpha; * type=un;*random intercepts and slopes model;
*lsmeans Mean Gender/cl alpha=&alpha;
*estimate "Male/female" Gender 1 -1/cl alpha=&alpha;
ods output covparms=cov;
ods output estimates=est;
ods output lsmeans=lsm;
ods output solutionr=solr; 
ods output solutionf=solf; 
ods output classlevels=clev;
*by Sport;
*where Sport="Endur";
run;
ods select all;

data cov1;
set cov;
DegFree=2*Zvalue**2;

title2 "Random effects as variances";
title3 "True values: Intercept=&BtwnSD**2;  Season=&BtwnErr**2;  Residual=&WthnErr**2";
proc print data=cov1;
format _numeric_ 5.&deceff DegFree 5.0;
run;

title2 "Variance of residuals";
proc means data=pred noprint;
var resid;
output out=pred1(drop=_type_ _freq_) n=NoOfObs var=Variance;
run;


proc sort data=solr;
by Effect;

title2 "First 10 observations of Intercept random-effect solution";
proc print data=solr(obs=10);
where Effect="Intercept";
run;

title2 "First 10 observations of Season random-effect solution";
proc print data=solr(obs=10);
where Effect="Season";
run;

data solr0;
set solr;
SEsq=StdErrPred**2;

title2 "Variance of random-effect solutions";
proc means data=solr0 noprint;
var estimate SEsq;
by Effect;
output out=solr1(drop=_type_ _freq_ d) n=NoOfObs var=Variance mean=d SEsq;
where estimate ne 0;
run;

*proc print;run;

data solresid;
set solr1 pred1;
if Effect="" then Effect="Residual";

data all;
merge solresid cov1(keep=estimate degfree rename=(Estimate=CovParmVariance));
VarianceAdjustDF=Variance*(NoOfObs-1)/DegFree;
VarPlusSEsq=Variance+SEsq; 
EstimatedSEsq=CovParmVariance-Variance;
CorrAttenFactor=CovParmVariance/Variance;

title2 "Variance of random-effect solutions and residuals, and covparm variance";
title3 "SEsq is the mean of StdErrPred squared of the solutions";
title4 "The covparm variance is evidently the sum of the variance and SEsq";
title5 "Hence the estimate of SEsq for the residual variance (I can't estimate error in each residual)";
title6 "When a random-effect solution or residuals are used as linear predictors, their effects";
title7 "will need to be corrected upwards by CorrAttenFactor = CovParmVariance/Variance = 1/ICC.";
title8 "Variance adjusted for DF does not work: it's always greater than the covparm variance";
proc print;
var Effect NoOfObs Variance SEsq VarPlusSEsq CovParmVariance EstimatedSEsq
  DegFree VarianceAdjustDF CorrAttenFactor;
format _numeric_ 5.&deceff CorrAttenFactor 5.2 DegFree 5.0;
run;


title2 "Check individual true values of Season against random-effect solution";
*proc print data=solr;run;
data season;
set solr;
if Effect="Season";
keep Athlete Estimate Season Lower Upper;

*proc print;run;

proc sort data=dat1;
by Athlete Season;

data dat2;
set dat1;
if lag(Athlete) ne Athlete or lag(Season) ne Season;
keep Athlete Season BtwnSeasErr;

*proc print;
run;

data season1;
merge season dat2;
by Athlete Season;
if BtwnSeasErr ne .; *needed when a subject has no test in Season 2;
if BtwnSeasErr<Lower or BtwnSeasErr>Upper then Type0error="***";

*proc print;run;
title3 "Type0 error (i.e., when 90%CI for individual estimates does not include true value)";
title4 "Expected value is 10%";
proc freq;
tables Type0error/missing nocum;
run;

title3 "Expected true BtwnSeasErr: StdDev=&BtwnErr; variance=&BtwnErr**2";
proc means n mean std var maxdec=&deceff;
var Estimate BtwnSeasErr;
run;

title3 "True value (BtwnSeasErr) vs solution (Estimate)";
title4 "Line of identity shown in black, indicating no bias in the estimated value";
title5 "(Rerun the analysis to show that any apparent bias is just sampling variation.)";
ods graphics / reset width=14cm height=12cm imagemap attrpriority=none;
proc sgplot data=season1; *uniform=all;
styleattrs  
  datacolors=(red blue white)
  datacontrastcolors=(black)
  datasymbols=(circlefilled); *needs a group= for styleattrs to work;
*scatter x=pred y=Resid/markerattrs=(size=6 symbol=circlefilled) filledoutlinedmarkers
  group=PossessionType;
scatter x=Estimate y=BtwnSeasErr/markerattrs=(size=8 color=black symbol=circlefilled) filledoutlinedmarkers MARKERFILLATTRS=(color=lightgreen); * group=Gender;
scatter x=BtwnSeasErr y=BtwnSeasErr/markerattrs=(size=3 color=black symbol=circlefilled);
*scatter x=pred y=Resid/markerattrs=(size=8 color=black symbol=circlefilled) filledoutlinedmarkers MARKERFILLATTRS=(color=lightgreen);
*scatter x=StrokeNo y=Resid/markerattrs=(size=3 color=black symbol=circlefilled) filledoutlinedmarkers MARKERFILLATTRS=(color=black);
refline 0/axis=y lineattrs=(pattern=dot thickness=1 color=black);
refline 0/axis=x lineattrs=(pattern=dot thickness=1 color=black);
reg x=Estimate y=BtwnSeasErr/degree=1 lineattrs=(pattrn=solid thickness=1 color=blue) legendlabel="linear" nomarkers;
*reg x=pred y=Resid/degree=2 lineattrs=(thickness=1 color=red) legendlabel="quadratic" nomarkers;
*by Sport;
run;
ods graphics / reset;

 

WillTheKiwi
Pyrite | Level 9

I have done further simulations to check what happens when I have another variable that tracks the change scores. The simulations are for real data, where the sport scientists at Olympiatoppen are monitoring changes in biomechanical measures in one kind of test (on a dynamometer) in their athletes, and they want to know the extent to which the changes in those measures tests track changes in the more usual fitness or performance tests (jump height, sprint speed and so on). As I noted previously, the random-effect solutions provide measures of change scores, and their SD were smaller (sometimes by quite a lot) than the SD coming from the covparms. But surprisingly, when I used the random-effect solutions as predictors in a model with the performance test measure as the dependent, I got unbiased estimates of the relationships that I had set up in the data between the biomech and performance measures. So I was wrong about needing to correct for attenuation.

SAS Innovate 2025: Register Today!

 

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 4279 views
  • 1 like
  • 2 in conversation