RosieSAS
Quartz | Level 8

Hi, my data have a heteroscedasticity issue among genotypes. The design is an RCBD. I tried two approaches but got different results. Please advise which way is better.

1. Model the heteroscedasticity directly.

data yield;
  input genotype $ rep yield bad_fruit purple_fruit;
  datalines;
Krewer	1	14.34	72	566
Krewer	2	11.49	53	521
Krewer	3	7.75	37	500
Titan	1	10.44	72	470
Titan	2	9.92	79	673
Titan	3	5.97	71	318
T-3557	1	10.51	217	1088
T-3557	2	6.22	132	612
T-3557	3	3.53	90	417
T-2604	1	11.10	13	344
T-2604	2	8.95	10	276
T-2604	3	6.85	14	269
T-3858	1	10.79	29	635
T-3858	2	10.01	35	500
T-3858	3	11.41	21	660
T-3826	1	12.67	77	711
T-3826	2	12.81	59	492
T-3826	3	10.38	24	357
T-3835	1	10.05	65	1021
T-3835	2	10.95	97	1042
T-3835	3	11.73	72	1050
;
run;

proc sgplot data=yield;
  vbox yield/category=genotype;
run;
proc sgplot data=yield;
  vbox bad_fruit/category=genotype;
run;
proc sgplot data=yield;
  vbox purple_fruit/category=genotype;
run;

title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
  ods exclude diffplot linesplot;
  class genotype rep;
  model yield = genotype;
  random rep;
  random int/group=genotype residual; **allow different variance by genotype;
  lsmeans genotype/adj=Tukey pdiff;
  ods output diffs=ppp lsmeans=mmm;
run;

%include '...\SAS Macro\pdmix800.sas'; * the directory where pdmix800.sas is saved;
%pdmix800(ppp,mmm,alpha=.05,sort=yes)

The key results are: 


Covariance Parameter Estimates
Cov Parm    Group              Estimate   Standard Error
rep                             1.9078     2.2163
Intercept   genotype Krewer     4.6730     5.8042
Intercept   genotype T-2604     1.4080     2.6157
Intercept   genotype T-3557     6.6610     8.4516
Intercept   genotype T-3826     0.04554    0.8463
Intercept   genotype T-3835     4.5581     5.1839
Intercept   genotype T-3858     3.9914     4.0433
Intercept   genotype Titan      1.2386     1.3169

Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
genotype 6 12 8.86 0.0008

genotype Least Squares Means
genotype   Estimate   Standard Error   DF   t Value   Pr > |t|
Krewer     11.1933    1.4811           12    7.56     <.0001
T-2604      8.9667    1.0513           12    8.53     <.0001
T-3557      6.7533    1.6900           12    4.00     0.0018
T-3826     11.9533    0.8069           12   14.81     <.0001
T-3835     10.9100    1.4681           12    7.43     <.0001
T-3858     10.7367    1.4023           12    7.66     <.0001
Titan       8.7767    1.0241           12    8.57     <.0001

 

Obs   genotype   Estimate   Standard Error   Letter Group
1     T-3826     11.9533    0.8069           A
2     Krewer     11.1933    1.4811           AB
3     T-3835     10.9100    1.4681           AB
4     T-3858     10.7367    1.4023           AB
5     T-2604      8.9667    1.0513           B
6     Titan       8.7767    1.0241           B
7     T-3557      6.7533    1.6900           AB

 

The results from log transformed data are:

title "Log transformed yield data";
data yield_tran;
  set yield;
  log_yield=log(yield);
  log_bad_fruit=log(bad_fruit);
  log_purple_fruit=log(purple_fruit);
run;

proc glimmix data=yield_tran plots=residualpanel;
  ods exclude diffplot linesplot;
  class genotype rep;
  model log_yield = genotype;
  random rep;
  lsmeans genotype/adj=Tukey lines;
run;
Covariance Parameter Estimates
Cov Parm   Estimate   Standard Error
rep        0.03202    0.03924
Residual   0.04978    0.02032

Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Genotype 6 12 3.12 0.0443

Genotype Least Squares Means
Genotype   Estimate   Standard Error   DF   t Value   Pr > |t|
Krewer     2.3842     0.1651           12   14.44     <.0001
T-2604     2.1743     0.1651           12   13.17     <.0001
T-3557     1.8140     0.1651           12   10.99     <.0001
T-3826     2.4761     0.1651           12   14.99     <.0001
T-3835     2.3877     0.1651           12   14.46     <.0001
T-3858     2.3724     0.1651           12   14.37     <.0001
Titan      2.1424     0.1651           12   12.97     <.0001

Tukey-Kramer Grouping for Genotype Least Squares Means (Alpha=0.05)
LS-means with the same letter are not significantly different.
Genotype   Estimate   Letter Group
T-3826     2.4761     A
T-3835     2.3877     AB
Krewer     2.3842     AB
T-3858     2.3724     AB
T-2604     2.1743     AB
Titan      2.1424     AB
T-3557     1.8140     B

You can see the mean grouping results differ. Which one should I use? Do you have better methods? Both results seem reasonable to me: a group with a large variance makes it harder to find significant differences involving that group. The log transformation stabilizes the variances and also pulls large values toward the center, but it weakens the impact of the unequal variances compared with modelling the heteroscedasticity directly.

Furthermore, when modelling the heteroscedasticity directly, convergence problems sometimes occur, and the residual plots are generally not as good as those from the log-transformed data. The transformation looks easier to apply.
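The intuition that a log transform evens out variances that grow with the mean can be checked with a quick simulation outside SAS. Below is a Python sketch with made-up group means (not this thread's data): each group's standard deviation is proportional to its mean, so the raw-scale group sds differ widely while the log-scale sds are nearly equal.

```python
import math
import random

random.seed(1)

# Simulate 7 hypothetical "genotypes" whose values have sd proportional to
# the mean (multiplicative error), then compare the spread of the group sds
# before and after a log transform.
means = [6.8, 8.8, 9.0, 10.7, 10.9, 11.2, 12.0]
groups = [[m * math.exp(random.gauss(0, 0.2)) for _ in range(200)]
          for m in means]

def sd(xs):
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))

raw_sds = [sd(g) for g in groups]
log_sds = [sd([math.log(x) for x in g]) for g in groups]

# On the raw scale the largest group sd is a clear multiple of the smallest;
# on the log scale the group sds are nearly equal.
print(max(raw_sds) / min(raw_sds), max(log_sds) / min(log_sds))
```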

Any thought is appreciated!

 

15 REPLIES
StatDave
SAS Super FREQ

Log is not the only, or necessarily best, transformation to use. You can use the Box-Cox method to select a transformation. This can be done using PROC TRANSREG as suggested in the "Box-Cox Transformation" item in the list of Frequently Asked for Statistics (FASTats) at http://support.sas.com/kb/30333 . See the section in the TRANSREG documentation on the Box-Cox transformation.
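For illustration, the marginal Box-Cox λ for the yield values posted in this thread can also be computed in Python with scipy. Note this fits λ to the raw distribution only, whereas PROC TRANSREG's Box-Cox selects λ conditional on the model effects, so the two need not agree:

```python
import numpy as np
from scipy import stats

# The 21 yield values from the thread's DATA step.
yield_vals = np.array([
    14.34, 11.49, 7.75,    # Krewer
    10.44, 9.92, 5.97,     # Titan
    10.51, 6.22, 3.53,     # T-3557
    11.10, 8.95, 6.85,     # T-2604
    10.79, 10.01, 11.41,   # T-3858
    12.67, 12.81, 10.38,   # T-3826
    10.05, 10.95, 11.73,   # T-3835
])

# Maximum-likelihood lambda: a value near 0 supports a log transform,
# a value near 1 suggests no transformation is needed.
transformed, lam = stats.boxcox(yield_vals)
print(lam)
```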

RosieSAS
Quartz | Level 8
Thanks for your suggestion. I'm more concerned about which way should I go, data transformation or modelling the heteroscedasticity?
Season
Barite | Level 11

@RosieSAS wrote:
Thanks for your suggestion. I'm more concerned about which way should I go, data transformation or modelling the heteroscedasticity?

Actually, transforming the dependent variable is one way of dealing with heteroscedasticity. The Box-Cox transformation belongs to a class of remedies called variance-stabilizing transformations (VST). As the name suggests, a VST "stabilizes" the variance of the errors and can, on many occasions, mitigate heteroscedasticity. Another transformation in this class is the signed modulus power transformation. As documented in Heteroskedasticity in Regression: Detection and Correction (Quantitative Applications in the Social ..., the signed modulus power transformation has an advantage over Box-Cox when the dependent variable does not take strictly positive values, since Box-Cox requires a strictly positive response. Although this can be worked around by adding a small positive constant to the dependent variable before a Box-Cox transformation, the arbitrariness of that constant raises concerns, in that different constants might lead to different λ's in the Box-Cox transformation.
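For concreteness, the modulus power transformation of John and Draper (1980) can be written in a few lines. This is a sketch only: in practice λ is chosen by maximum likelihood, which is omitted here.

```python
import math

def signed_modulus_power(y, lam):
    """John & Draper's (1980) modulus power transformation.

    Defined for any real y (negative, zero, or positive), unlike the
    Box-Cox transformation, which requires y > 0.  For lam == 0 the
    transform reduces to sign(y) * log(|y| + 1).
    """
    s = math.copysign(1.0, y)
    a = abs(y) + 1.0
    if lam == 0:
        return s * math.log(a)
    return s * (a ** lam - 1.0) / lam

# Works for negative, zero and positive values alike, and is antisymmetric:
print(signed_modulus_power(-5.0, 0.5),
      signed_modulus_power(0.0, 0.5),
      signed_modulus_power(5.0, 0.5))
```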

For a more in-depth appreciation of heteroscedasticity, you can refer to the book I mentioned in the preceding paragraph, which is the only monograph on heteroscedasticity that I can find. However, it should be pointed out that methods dealing with heteroscedasticity documented therein are not exhaustive. Other methods, including joint mean and variance models, have been developed but are not introduced in this book.

RosieSAS
Quartz | Level 8
Thanks for your informative reply.
Season
Barite | Level 11

By the way, section 9.3 of Amazon.com: SAS for Mixed Models, Second Edition: 9781590475003: Littell, Ramon C., Milliken, George... provides demonstrations and examples of dealing with heteroscedasticity in mixed models. In short, it is dealt with via two approaches, both of which in essence belong to the joint mean and variance modelling approach. They are termed "Power-of-X" and "Power-of-the-Mean" models.

For Power-of-X models, the variance of the residuals is modeled in this manner: Var(e_i) = σ²·exp(x_i γ), with γ being a regression coefficient. In SAS, this can be fitted with code like:

proc mixed data=xxx;
  /*other statements omitted*/
  repeated / local=exp(x);
run;

Or

proc nlmixed data=xxx;
  /*other statements omitted*/
  model y ~ normal(mean, sig2*exp(gamma*x));
run;

The Power-of-the-Mean model assumes that the residual variance is proportional to a power of the predicted mean. Let ŷ_i = β0 + β1·x_i1 + ... + βk·x_ik be the predicted value for the ith observation. The residual variances are assumed to take the form Var(e_i) = σ²·|ŷ_i|^θ, where θ is an unknown power parameter. SAS code for this modelling approach is also documented in the book but is not displayed here because of its complexity; refer to the book for more details.
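As a rough illustration of the Power-of-the-Mean idea outside SAS, a common two-stage recipe is: fit OLS, estimate θ from a regression of the log squared residuals on the log absolute fitted values (whose slope estimates θ), then refit by weighted least squares with weights 1/|ŷ_i|^θ̂. A Python sketch on simulated data follows; all names and numbers are made up for the example, and this is not the likelihood-based fit used by PROC MIXED.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a power-of-the-mean model: Var(e_i) = sigma^2 * |mu_i|^theta.
n = 500
x = rng.uniform(1.0, 10.0, n)
mu = 2.0 + 3.0 * x
theta_true = 1.5
y = mu + rng.normal(0.0, 1.0, n) * np.sqrt(mu ** theta_true)

X = np.column_stack([np.ones(n), x])

# Stage 1: OLS fit, then estimate theta as the slope of a regression of
# log(residual^2) on log(|fitted value|).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_ols
resid = y - mu_hat
Z = np.column_stack([np.ones(n), np.log(np.abs(mu_hat))])
gamma, *_ = np.linalg.lstsq(Z, np.log(resid ** 2), rcond=None)
theta_hat = gamma[1]

# Stage 2: weighted least squares with weights 1 / |mu_hat|^theta_hat.
sw = np.sqrt(1.0 / np.abs(mu_hat) ** theta_hat)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(theta_hat, beta_wls)
```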

By the way, this book now has a newer edition: SAS for Mixed Models: Introduction and Basic Applications: Stroup PH D, Walter W, Milliken PhD, Geor.... However, the preface of the newer edition states that the content on heterogeneous variance models was removed from that edition and deferred to a later publication. For the time being, I have not found that publication.

Ksharp
Super User

I think you should take "genotype" as G-side random effect and "rep" as R-side random effect according to your sample data. Like:

 

data yield;
  input genotype $ rep yield bad_fruit purple_fruit;
  datalines;
Krewer	1	14.34	72	566
Krewer	2	11.49	53	521
Krewer	3	7.75	37	500
Titan	1	10.44	72	470
Titan	2	9.92	79	673
Titan	3	5.97	71	318
T-3557	1	10.51	217	1088
T-3557	2	6.22	132	612
T-3557	3	3.53	90	417
T-2604	1	11.10	13	344
T-2604	2	8.95	10	276
T-2604	3	6.85	14	269
T-3858	1	10.79	29	635
T-3858	2	10.01	35	500
T-3858	3	11.41	21	660
T-3826	1	12.67	77	711
T-3826	2	12.81	59	492
T-3826	3	10.38	24	357
T-3835	1	10.05	65	1021
T-3835	2	10.95	97	1042
T-3835	3	11.73	72	1050
;
run;


title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
  ods exclude diffplot linesplot;
  class genotype rep;
  model yield =  bad_fruit purple_fruit ;
  random rep/subject=genotype  residual;
  random int/subject=genotype ; **allow different variance by genotype;
/*  lsmeans genotype/adj=Tukey pdiff;*/
/*  ods output diffs=ppp lsmeans=mmm;*/
run;
RosieSAS
Quartz | Level 8
Thanks for your reply. Here yield, bad_fruit and purple_fruit are all response variables. Genotype is the factor whose effect on these responses we want to test. That's how the experiment was designed.
Ksharp
Super User

1)

Then build a model with only an intercept term?
proc glimmix data=yield plots=residualpanel;
  ods exclude diffplot linesplot;
  class genotype rep;
  model yield = ;
  random rep/subject=genotype residual;
  random int/subject=genotype; **allow different variance by genotype;
run;

 

2)

Or a model with only an intercept and a REP term?

model yield = rep;

RosieSAS
Quartz | Level 8
But genotype is a fixed effect; it should be in the MODEL statement. The purpose of this analysis is to test the effect of genotype. It is not a random effect, so there are no random intercepts for genotype.
Ksharp
Super User
But from your data structure, it looks like classic longitudinal data for a mixed model: you have a SUBJECT variable (genotype) and a repeated-measures variable (rep).
SteveDenham
Jade | Level 19

I would model the heterogeneous variances, at least for yield. The QQ plot is nearly straight, with only a bit of curvature at the low end; that looks acceptable to me. Just my opinion. For me, the main issue in this case is the nested nature of the random effects in this model. A question: are all genotypes present in each rep, so that this is an RCB design (3 blocks, each having all 7 genotypes present)? If that is the case, consider this heterogeneous variance approach:

 

title "Consider REP as a block, model heteroscedasticity due to genotype at REP level";
proc glimmix data=yield plots=residualpanel noprofile;
  ods exclude diffplot linesplot;
  nloptions maxiter=1000;
  *lnyield=log(yield); /* in case you still want to try a log transformation */
  class genotype rep;
  model yield /*lnyield*/ = genotype;
  random intercept/subject=rep group=genotype;
  lsmeans genotype/adj=simulate(seed=111) diff adjdfe=row;
  covtest homogeneity;
  ods output diffs=ppp lsmeans=mmm;
run;

SteveDenham

 

 

RosieSAS
Quartz | Level 8
Thanks for your reply. Yes, REP serves as the block in the design, just as you understood. However, I don't fully understand why the heteroscedasticity was modelled this way: genotype is no longer significant, and the results are very different. What I knew is that REP is a random effect and that the variances among genotypes are very different. Could you please explain more about the variance-covariance structure? Thanks very much!
RosieSAS
Quartz | Level 8
When I add VC to the RANDOM statement, it only outputs the variance matrix for one subject, here rep 1. How can I get the variance matrix for all subjects?
StatDave
SAS Super FREQ

With only 3 values in each genotype, there just isn't enough information to confidently estimate and compare the variances. But, for whatever it's worth, and as discussed in this note, you can use Levene's test to compare the variances and Welch's test to compare the means while adjusting for the differing variances. These suggest no difference in variances, and no difference in means whether or not equal variances are assumed.

proc glm data=yield;
  class genotype;
  model yield = genotype;
  means genotype / hovtest=levene welch;
quit;
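For readers without SAS at hand, the same two checks can be reproduced in Python on the thread's yield data: scipy ships Levene's test, and Welch's one-way ANOVA is short enough to code directly from Welch's (1951) formulas (the function below is a hand-rolled sketch, not a library call):

```python
import numpy as np
from scipy import stats

# The 21 yield values from the thread, grouped by genotype.
groups = {
    "Krewer": [14.34, 11.49, 7.75],
    "Titan":  [10.44, 9.92, 5.97],
    "T-3557": [10.51, 6.22, 3.53],
    "T-2604": [11.10, 8.95, 6.85],
    "T-3858": [10.79, 10.01, 11.41],
    "T-3826": [12.67, 12.81, 10.38],
    "T-3835": [10.05, 10.95, 11.73],
}
samples = [np.array(v) for v in groups.values()]

# Levene's test for homogeneity of variances (scipy's default centers
# on the median, i.e. the Brown-Forsythe variant).
lev_stat, lev_p = stats.levene(*samples)

def welch_anova(samples):
    """Welch's (1951) one-way ANOVA for unequal group variances."""
    k = len(samples)
    n = np.array([len(s) for s in samples])
    m = np.array([s.mean() for s in samples])
    v = np.array([s.var(ddof=1) for s in samples])
    w = n / v                                  # precision weights
    mw = (w * m).sum() / w.sum()               # weighted grand mean
    a = (w * (m - mw) ** 2).sum() / (k - 1)
    c = (((1 - w / w.sum()) ** 2) / (n - 1)).sum()
    f = a / (1 + 2 * (k - 2) / (k ** 2 - 1) * c)
    df2 = (k ** 2 - 1) / (3 * c)               # Welch denominator df
    return f, k - 1, df2, stats.f.sf(f, k - 1, df2)

f, df1, df2, p = welch_anova(samples)
print(lev_p, f, df2, p)
```

With only n = 3 per genotype both tests have very little power, which is consistent with the caveat above.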

