Hi, my data have a heteroscedasticity issue among genotypes. The design is an RCBD. I tried two approaches but got different results. Please advise which way is better.
1. Model the heteroscedasticity directly.
data yield;
input genotype $ rep yield bad_fruit purple_fruit;
datalines;
Krewer 1 14.34 72 566
Krewer 2 11.49 53 521
Krewer 3 7.75 37 500
Titan 1 10.44 72 470
Titan 2 9.92 79 673
Titan 3 5.97 71 318
T-3557 1 10.51 217 1088
T-3557 2 6.22 132 612
T-3557 3 3.53 90 417
T-2604 1 11.10 13 344
T-2604 2 8.95 10 276
T-2604 3 6.85 14 269
T-3858 1 10.79 29 635
T-3858 2 10.01 35 500
T-3858 3 11.41 21 660
T-3826 1 12.67 77 711
T-3826 2 12.81 59 492
T-3826 3 10.38 24 357
T-3835 1 10.05 65 1021
T-3835 2 10.95 97 1042
T-3835 3 11.73 72 1050
;
run;
proc sgplot data=yield;
vbox yield/category=genotype;
run;
proc sgplot data=yield;
vbox bad_fruit/category=genotype;
run;
proc sgplot data=yield;
vbox purple_fruit/category=genotype;
run;
title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = genotype;
random rep;
random int/group=genotype residual; **allow different variance by genotype;
lsmeans genotype/adj=Tukey pdiff;
ods output diffs=ppp lsmeans=mmm;
run;
%include '...\SAS Macro\pdmix800.sas'; /* the directory where pdmix800.sas is saved */
%pdmix800(ppp,mmm,alpha=.05,sort=yes)
The key results are:
Covariance Parameter Estimates

| Cov Parm  | Group            | Estimate | Standard Error |
|-----------|------------------|----------|----------------|
| rep       |                  | 1.9078   | 2.2163         |
| Intercept | genotype Krewer  | 4.6730   | 5.8042         |
| Intercept | genotype T-2604  | 1.4080   | 2.6157         |
| Intercept | genotype T-3557  | 6.6610   | 8.4516         |
| Intercept | genotype T-3826  | 0.04554  | 0.8463         |
| Intercept | genotype T-3835  | 4.5581   | 5.1839         |
| Intercept | genotype T-3858  | 3.9914   | 4.0433         |
| Intercept | genotype Titan   | 1.2386   | 1.3169         |
Type III Tests of Fixed Effects

| Effect   | Num DF | Den DF | F Value | Pr > F |
|----------|--------|--------|---------|--------|
| genotype | 6      | 12     | 8.86    | 0.0008 |
genotype Least Squares Means

| genotype | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|----------|----------|----------------|----|---------|------------|
| Krewer   | 11.1933  | 1.4811         | 12 | 7.56    | <.0001     |
| T-2604   | 8.9667   | 1.0513         | 12 | 8.53    | <.0001     |
| T-3557   | 6.7533   | 1.6900         | 12 | 4.00    | 0.0018     |
| T-3826   | 11.9533  | 0.8069         | 12 | 14.81   | <.0001     |
| T-3835   | 10.9100  | 1.4681         | 12 | 7.43    | <.0001     |
| T-3858   | 10.7367  | 1.4023         | 12 | 7.66    | <.0001     |
| Titan    | 8.7767   | 1.0241         | 12 | 8.57    | <.0001     |
| Obs | genotype | Estimate | Standard Error | Letter Group |
|-----|----------|----------|----------------|--------------|
| 1   | T-3826   | 11.9533  | 0.8069         | A            |
| 2   | Krewer   | 11.1933  | 1.4811         | AB           |
| 3   | T-3835   | 10.9100  | 1.4681         | AB           |
| 4   | T-3858   | 10.7367  | 1.4023         | AB           |
| 5   | T-2604   | 8.9667   | 1.0513         | B            |
| 6   | Titan    | 8.7767   | 1.0241         | B            |
| 7   | T-3557   | 6.7533   | 1.6900         | AB           |
The results from the log-transformed data are:
title "Log transformed yield data";
data yield_tran;
set yield;
log_yield=log(yield);
log_bad_fruit=log(bad_fruit);
log_purple_fruit=log(purple_fruit);
run;
proc glimmix data=yield_tran plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model log_yield = genotype;
random rep;
lsmeans genotype/adj=Tukey lines;
run;
Covariance Parameter Estimates

| Cov Parm | Estimate | Standard Error |
|----------|----------|----------------|
| rep      | 0.03202  | 0.03924        |
| Residual | 0.04978  | 0.02032        |
Type III Tests of Fixed Effects

| Effect   | Num DF | Den DF | F Value | Pr > F |
|----------|--------|--------|---------|--------|
| Genotype | 6      | 12     | 3.12    | 0.0443 |
Genotype Least Squares Means

| Genotype | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|----------|----------|----------------|----|---------|------------|
| Krewer   | 2.3842   | 0.1651         | 12 | 14.44   | <.0001     |
| T-2604   | 2.1743   | 0.1651         | 12 | 13.17   | <.0001     |
| T-3557   | 1.8140   | 0.1651         | 12 | 10.99   | <.0001     |
| T-3826   | 2.4761   | 0.1651         | 12 | 14.99   | <.0001     |
| T-3835   | 2.3877   | 0.1651         | 12 | 14.46   | <.0001     |
| T-3858   | 2.3724   | 0.1651         | 12 | 14.37   | <.0001     |
| Titan    | 2.1424   | 0.1651         | 12 | 12.97   | <.0001     |
Tukey-Kramer Grouping for Genotype Least Squares Means (Alpha=0.05)

LS-means with the same letter are not significantly different.

| Genotype | Estimate | Grouping |
|----------|----------|----------|
| T-3826   | 2.4761   | A        |
| T-3835   | 2.3877   | A B      |
| Krewer   | 2.3842   | A B      |
| T-3858   | 2.3724   | A B      |
| T-2604   | 2.1743   | A B      |
| Titan    | 2.1424   | A B      |
| T-3557   | 1.8140   | B        |
You can see the mean grouping results differ. Which one should I use? Do you have a better method? Both results seem reasonable to me: greater variance in one group makes it harder to find significant differences among the other groups. A log transformation stabilizes the variances and also shrinks large values toward the center, but it weakens the impact of unequal variances compared with modeling the heteroscedasticity directly.
Furthermore, when modeling the heteroscedasticity directly, convergence issues sometimes occur, and the residual plots are generally not as good as those from the log-transformed data. It looks like the transformation is easier to apply.
Any thought is appreciated!
Log is not the only, or necessarily the best, transformation to use. You can use the Box-Cox method to select a transformation. This can be done using PROC TRANSREG, as suggested in the "Box-Cox Transformation" item in the list of Frequently Asked-for Statistics (FASTats) at http://support.sas.com/kb/30333 . See the section in the TRANSREG documentation on the Box-Cox transformation.
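As a concrete illustration (a minimal sketch using the yield dataset posted above; the lambda search grid is my own arbitrary choice, not a recommendation), PROC TRANSREG can estimate the Box-Cox λ with genotype as a CLASS effect:

```sas
/* Box-Cox selection via PROC TRANSREG.                        */
/* LAMBDA= sets the search grid (an assumption here);          */
/* CONVENIENT reports a "nice" lambda (e.g. 0 = log,           */
/* 0.5 = square root) if it falls inside the confidence        */
/* interval for the optimal lambda.                            */
proc transreg data=yield details;
   model boxcox(yield / lambda=-2 to 2 by 0.25 convenient) = class(genotype);
run;
```

If the reported convenient λ is 0, that supports the log transformation you already tried.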
@RosieSAS wrote:
Thanks for your suggestion. I'm more concerned about which way should I go, data transformation or modelling the heteroscedasticity?
Actually, transforming the dependent variable is one way of dealing with heteroscedasticity. In that context, the Box-Cox transformation belongs to a class of remedies called variance-stabilizing transformations (VSTs). As the name suggests, a VST "stabilizes" the variance of the errors and can, on many occasions, mitigate heteroscedasticity. Another transformation in this class is the signed modulus power transformation. As documented in *Heteroskedasticity in Regression: Detection and Correction*, the signed modulus power transformation has an advantage over Box-Cox when the dependent variable does not take strictly positive values. Although this issue can be worked around by adding a small positive constant to the dependent variable before applying Box-Cox, the arbitrariness of the constant raises concerns, since different constants may lead to different λ's in the Box-Cox transformation.
For a more in-depth appreciation of heteroscedasticity, you can refer to the book I mentioned in the preceding paragraph, which is the only monograph on heteroscedasticity that I can find. However, it should be pointed out that methods dealing with heteroscedasticity documented therein are not exhaustive. Other methods, including joint mean and variance models, have been developed but are not introduced in this book.
By the way, section 9.3 of *SAS for Mixed Models, Second Edition* (Littell, Milliken, et al.) provides demonstrations and examples of dealing with heteroscedasticity in mixed models. In short, it is handled with two approaches that in essence both belong to the joint mean and variance modeling family. They are termed "Power-of-X" and "Power-of-the-Mean" models.
For Power-of-X models, the variance of the residuals is modeled as Var(e_i) = σ² · exp(xγ), with γ being a regression coefficient. In SAS, this can be modeled with code like:
proc mixed data=xxx;
/*other statements omitted*/
repeated /local=exp(x);
run;
Or
proc nlmixed data=xxx;
/*other statements omitted*/
model y ~ normal(mean,sig2*exp(gamma*x));
run;
The Power-of-the-Mean model assumes that the residual variance is proportional to a power of y_hat. Let yi_hat = β0 + β1·xi1 + ... + βk·xik be the predicted dependent variable for the ith observation. The residual variance is assumed to take the form Var(ei) = σ² · |yi_hat|^θ, where θ is an unknown power parameter. SAS code for this modeling approach is also documented in the book but is not displayed here because of its complexity. Refer to the book for more details.
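That said, here is a hedged sketch of how a Power-of-the-Mean variance function might be written directly in PROC NLMIXED (my own illustration with a hypothetical dataset `xxx`, response `y`, and single predictor `x`; this is not the book's code):

```sas
/* Power-of-the-Mean sketch: Var(e_i) = sig2 * |mean_i|**theta */
proc nlmixed data=xxx;
   /* linear predictor for the mean (hypothetical parameters) */
   mean = b0 + b1*x;
   /* residual variance proportional to a power of the mean */
   model y ~ normal(mean, sig2*abs(mean)**theta);
run;
```

Reasonable starting values (PARMS statement) are usually needed for θ and sig2 to help convergence.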
By the way, this book now has a newer edition: *SAS for Mixed Models: Introduction and Basic Applications* (Stroup, Milliken, et al.). However, the preface of the newer edition states that the content on heterogeneous variance models was removed from that edition and reserved for a later publication. For the time being, I have not found that publication.
Based on your sample data, I think you should take "genotype" as the G-side random effect and "rep" as the R-side random effect. Like this:
data yield;
input genotype $ rep yield bad_fruit purple_fruit;
datalines;
Krewer 1 14.34 72 566
Krewer 2 11.49 53 521
Krewer 3 7.75 37 500
Titan 1 10.44 72 470
Titan 2 9.92 79 673
Titan 3 5.97 71 318
T-3557 1 10.51 217 1088
T-3557 2 6.22 132 612
T-3557 3 3.53 90 417
T-2604 1 11.10 13 344
T-2604 2 8.95 10 276
T-2604 3 6.85 14 269
T-3858 1 10.79 29 635
T-3858 2 10.01 35 500
T-3858 3 11.41 21 660
T-3826 1 12.67 77 711
T-3826 2 12.81 59 492
T-3826 3 10.38 24 357
T-3835 1 10.05 65 1021
T-3835 2 10.95 97 1042
T-3835 3 11.73 72 1050
;
run;
title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = bad_fruit purple_fruit;
random rep/subject=genotype residual;
random int/subject=genotype; **allow different variance by genotype;
/* lsmeans genotype/adj=Tukey pdiff;*/
/* ods output diffs=ppp lsmeans=mmm;*/
run;
1)
Then build a model with only an intercept term?
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = ;
random rep/subject=genotype residual;
random int/subject=genotype ; **allow different variance by genotype;
run;
2)
or a model with only an intercept and a REP term?
model yield = rep;
I would model the heterogeneous variances, at least for yield. The QQ plot is nearly straight, with only a bit of curvature at the low end; that looks acceptable to me. Just my opinion. For me, the main issue in this case is the nested nature of the random effects in this model. A question: are all genotypes present in each rep, so that this is an RCB design (3 blocks, each containing all 7 genotypes)? If that is the case, consider this heterogeneous variance approach:
title "Consider REP as a block, model heteroscedasticity due to genotype at REP level";
proc glimmix data=yield plots=residualpanel noprofile;
ods exclude diffplot linesplot;
nloptions maxiter=1000;
*lnyield=log(yield); /* in case you still want to try a log transformation */
class genotype rep;
model yield /*lnyield*/ = genotype;
random intercept/subject=rep group=genotype;
lsmeans genotype/adj=simulate(seed=111) diff adjdfe=row;
covtest homogeneity;
ods output diffs=ppp lsmeans=mmm;
run;
SteveDenham
With only 3 values per genotype, there just isn't enough information to estimate and compare the variances with confidence. But, for whatever it's worth, and as discussed in this note, you can use Levene's test to compare the variances and Welch's test to compare the means while adjusting for the differing variances. These suggest no difference in variances, and no difference in means whether or not equal variances are assumed.
proc glm data=yield;
class genotype;
model yield=genotype;
means genotype/hovtest=levene welch;
quit;