Hi, my data have a heteroscedasticity issue among genotypes. The design is an RCBD. I tried two approaches but got different results. Please advise which way is better.
1. Model the heteroscedasticity directly.
data yield;
input genotype $ rep yield bad_fruit purple_fruit;
datalines;
Krewer 1 14.34 72 566
Krewer 2 11.49 53 521
Krewer 3 7.75 37 500
Titan 1 10.44 72 470
Titan 2 9.92 79 673
Titan 3 5.97 71 318
T-3557 1 10.51 217 1088
T-3557 2 6.22 132 612
T-3557 3 3.53 90 417
T-2604 1 11.10 13 344
T-2604 2 8.95 10 276
T-2604 3 6.85 14 269
T-3858 1 10.79 29 635
T-3858 2 10.01 35 500
T-3858 3 11.41 21 660
T-3826 1 12.67 77 711
T-3826 2 12.81 59 492
T-3826 3 10.38 24 357
T-3835 1 10.05 65 1021
T-3835 2 10.95 97 1042
T-3835 3 11.73 72 1050
;
run;
proc sgplot data=yield;
vbox yield/category=genotype;
run;
proc sgplot data=yield;
vbox bad_fruit/category=genotype;
run;
proc sgplot data=yield;
vbox purple_fruit/category=genotype;
run;
title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = genotype;
random rep;
random int/group=genotype residual; **allow different variance by genotype;
lsmeans genotype/adj=Tukey pdiff;
ods output diffs=ppp lsmeans=mmm;
run;
%include '...\SAS Macro\pdmix800.sas'; /* the directory where pdmix800.sas is saved */
%pdmix800(ppp,mmm,alpha=.05,sort=yes)
The key results are:
Covariance Parameter Estimates

| Cov Parm  | Group            | Estimate | Standard Error |
|-----------|------------------|----------|----------------|
| rep       |                  | 1.9078   | 2.2163         |
| Intercept | genotype Krewer  | 4.6730   | 5.8042         |
| Intercept | genotype T-2604  | 1.4080   | 2.6157         |
| Intercept | genotype T-3557  | 6.6610   | 8.4516         |
| Intercept | genotype T-3826  | 0.04554  | 0.8463         |
| Intercept | genotype T-3835  | 4.5581   | 5.1839         |
| Intercept | genotype T-3858  | 3.9914   | 4.0433         |
| Intercept | genotype Titan   | 1.2386   | 1.3169         |
Type III Tests of Fixed Effects

| Effect   | Num DF | Den DF | F Value | Pr > F |
|----------|--------|--------|---------|--------|
| genotype | 6      | 12     | 8.86    | 0.0008 |
genotype Least Squares Means

| genotype | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|----------|----------|----------------|----|---------|------------|
| Krewer   | 11.1933  | 1.4811         | 12 | 7.56    | <.0001     |
| T-2604   | 8.9667   | 1.0513         | 12 | 8.53    | <.0001     |
| T-3557   | 6.7533   | 1.6900         | 12 | 4.00    | 0.0018     |
| T-3826   | 11.9533  | 0.8069         | 12 | 14.81   | <.0001     |
| T-3835   | 10.9100  | 1.4681         | 12 | 7.43    | <.0001     |
| T-3858   | 10.7367  | 1.4023         | 12 | 7.66    | <.0001     |
| Titan    | 8.7767   | 1.0241         | 12 | 8.57    | <.0001     |
| Obs | genotype | Estimate | Standard Error | Letter Group |
|-----|----------|----------|----------------|--------------|
| 1   | T-3826   | 11.9533  | 0.8069         | A            |
| 2   | Krewer   | 11.1933  | 1.4811         | AB           |
| 3   | T-3835   | 10.9100  | 1.4681         | AB           |
| 4   | T-3858   | 10.7367  | 1.4023         | AB           |
| 5   | T-2604   | 8.9667   | 1.0513         | B            |
| 6   | Titan    | 8.7767   | 1.0241         | B            |
| 7   | T-3557   | 6.7533   | 1.6900         | AB           |
The results from the log-transformed data are:
title "Log transformed yield data";
data yield_tran;
set yield;
log_yield=log(yield);
log_bad_fruit=log(bad_fruit);
log_purple_fruit=log(purple_fruit);
run;
proc glimmix data=yield_tran plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model log_yield = genotype;
random rep;
lsmeans genotype/adj=Tukey lines;
run;
Covariance Parameter Estimates

| Cov Parm | Estimate | Standard Error |
|----------|----------|----------------|
| rep      | 0.03202  | 0.03924        |
| Residual | 0.04978  | 0.02032        |
Type III Tests of Fixed Effects

| Effect   | Num DF | Den DF | F Value | Pr > F |
|----------|--------|--------|---------|--------|
| Genotype | 6      | 12     | 3.12    | 0.0443 |
Genotype Least Squares Means

| Genotype | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|----------|----------|----------------|----|---------|------------|
| Krewer   | 2.3842   | 0.1651         | 12 | 14.44   | <.0001     |
| T-2604   | 2.1743   | 0.1651         | 12 | 13.17   | <.0001     |
| T-3557   | 1.8140   | 0.1651         | 12 | 10.99   | <.0001     |
| T-3826   | 2.4761   | 0.1651         | 12 | 14.99   | <.0001     |
| T-3835   | 2.3877   | 0.1651         | 12 | 14.46   | <.0001     |
| T-3858   | 2.3724   | 0.1651         | 12 | 14.37   | <.0001     |
| Titan    | 2.1424   | 0.1651         | 12 | 12.97   | <.0001     |
Tukey-Kramer Grouping for Genotype Least Squares Means (Alpha=0.05)

LS-means with the same letter are not significantly different.

| Genotype | Estimate | Grouping |
|----------|----------|----------|
| T-3826   | 2.4761   | A        |
| T-3835   | 2.3877   | A B      |
| Krewer   | 2.3842   | A B      |
| T-3858   | 2.3724   | A B      |
| T-2604   | 2.1743   | A B      |
| Titan    | 2.1424   | A B      |
| T-3557   | 1.8140   | B        |
You can see the mean grouping results differ. Which one should I use? Do you have a better method? Both results seem reasonable to me: greater variance in one group makes it harder to find significant differences among the other groups. A log transformation stabilizes the variances and also shrinks large values toward the center, but it weakens the impact of unequal variances compared with modeling the heteroscedasticity directly.
Furthermore, when modeling the heteroscedasticity directly, convergence issues sometimes occur, and the residual plots are generally not as good as those from the log-transformed data. It looks like the transformation is easier to apply.
Any thought is appreciated!
Log is not the only, or necessarily the best, transformation to use. You can use the Box-Cox method to select a transformation. This can be done using PROC TRANSREG, as suggested in the "Box-Cox Transformation" item in the list of Frequently Asked-for Statistics (FASTats) at http://support.sas.com/kb/30333 . See the section in the TRANSREG documentation on the Box-Cox transformation.
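As a concrete illustration (a minimal sketch using the yield dataset posted above; the lambda search grid is my own arbitrary choice, not a recommendation), PROC TRANSREG can estimate the Box-Cox λ with genotype as a CLASS effect:

```sas
/* Box-Cox selection via PROC TRANSREG.                        */
/* LAMBDA= sets the search grid (an assumption here);          */
/* CONVENIENT reports a "nice" lambda (e.g. 0 = log,           */
/* 0.5 = square root) if it falls inside the confidence        */
/* interval for the optimal lambda.                            */
proc transreg data=yield details;
   model boxcox(yield / lambda=-2 to 2 by 0.25 convenient) = class(genotype);
run;
```

If the reported convenient λ is 0, that supports the log transformation you already tried.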
@RosieSAS wrote:
Thanks for your suggestion. I'm more concerned about which way should I go, data transformation or modelling the heteroscedasticity?
Actually, transforming the dependent variable is one way of dealing with heteroscedasticity. In that context, the Box-Cox transformation belongs to a class of remedies called variance-stabilizing transformations (VSTs). As the name suggests, a VST "stabilizes" the variance of the errors and can, on many occasions, mitigate heteroscedasticity. Another transformation in this class is the signed modulus power transformation. As documented in *Heteroskedasticity in Regression: Detection and Correction*, the signed modulus power transformation has an advantage over Box-Cox when the dependent variable does not take strictly positive values. Although this issue can be worked around by adding a small positive constant to the dependent variable before applying Box-Cox, the arbitrariness of the constant raises concerns, since different constants may lead to different λ's in the Box-Cox transformation.
For a more in-depth appreciation of heteroscedasticity, you can refer to the book I mentioned in the preceding paragraph, which is the only monograph on heteroscedasticity that I can find. However, it should be pointed out that methods dealing with heteroscedasticity documented therein are not exhaustive. Other methods, including joint mean and variance models, have been developed but are not introduced in this book.
By the way, section 9.3 of *SAS for Mixed Models, Second Edition* (Littell, Milliken, et al.) provides demonstrations and examples of dealing with heteroscedasticity in mixed models. In short, it is handled with two approaches that in essence both belong to the joint mean and variance modeling family. They are termed "Power-of-X" and "Power-of-the-Mean" models.
For Power-of-X models, the variance of the residuals is modeled as Var(e_i) = σ² · exp(xγ), with γ being a regression coefficient. In SAS, this can be modeled with code like:
proc mixed data=xxx;
/*other statements omitted*/
repeated /local=exp(x);
run;
Or
proc nlmixed data=xxx;
/*other statements omitted*/
model y ~ normal(mean,sig2*exp(gamma*x));
run;
The Power-of-the-Mean model assumes that the residual variance is proportional to a power of y_hat. Let yi_hat = β0 + β1·xi1 + ... + βk·xik be the predicted dependent variable for the ith observation. The residual variance is assumed to take the form Var(ei) = σ² · |yi_hat|^θ, where θ is an unknown power parameter. SAS code for this modeling approach is also documented in the book but is not displayed here because of its complexity. Refer to the book for more details.
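That said, here is a hedged sketch of how a Power-of-the-Mean variance function might be written directly in PROC NLMIXED (my own illustration with a hypothetical dataset `xxx`, response `y`, and single predictor `x`; this is not the book's code):

```sas
/* Power-of-the-Mean sketch: Var(e_i) = sig2 * |mean_i|**theta */
proc nlmixed data=xxx;
   /* linear predictor for the mean (hypothetical parameters) */
   mean = b0 + b1*x;
   /* residual variance proportional to a power of the mean */
   model y ~ normal(mean, sig2*abs(mean)**theta);
run;
```

Reasonable starting values (PARMS statement) are usually needed for θ and sig2 to help convergence.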
By the way, this book now has a newer edition: *SAS for Mixed Models: Introduction and Basic Applications* (Stroup, Milliken, et al.). However, the preface of the newer edition states that the content on heterogeneous variance models was removed from that edition and reserved for a later publication. For the time being, I have not found that publication.
Based on your sample data, I think you should take "genotype" as the G-side random effect and "rep" as the R-side random effect. Like this:
data yield;
input genotype $ rep yield bad_fruit purple_fruit;
datalines;
Krewer 1 14.34 72 566
Krewer 2 11.49 53 521
Krewer 3 7.75 37 500
Titan 1 10.44 72 470
Titan 2 9.92 79 673
Titan 3 5.97 71 318
T-3557 1 10.51 217 1088
T-3557 2 6.22 132 612
T-3557 3 3.53 90 417
T-2604 1 11.10 13 344
T-2604 2 8.95 10 276
T-2604 3 6.85 14 269
T-3858 1 10.79 29 635
T-3858 2 10.01 35 500
T-3858 3 11.41 21 660
T-3826 1 12.67 77 711
T-3826 2 12.81 59 492
T-3826 3 10.38 24 357
T-3835 1 10.05 65 1021
T-3835 2 10.95 97 1042
T-3835 3 11.73 72 1050
;
run;
title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = bad_fruit purple_fruit;
random rep/subject=genotype residual;
random int/subject=genotype; **allow different variance by genotype;
/* lsmeans genotype/adj=Tukey pdiff;*/
/* ods output diffs=ppp lsmeans=mmm;*/
run;
1)
Then build a model with only an intercept term?
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = ;
random rep/subject=genotype residual;
random int/subject=genotype ; **allow different variance by genotype;
run;
2)
or a model with only an intercept and a REP term?
model yield = rep;
I would model the heterogeneous variances, at least for yield. The QQ plot is nearly straight, with only a bit of curvature at the low end; that looks acceptable to me. Just my opinion. For me, the main issue in this case is the nested nature of the random effects in this model. A question: are all genotypes present in each rep, so that this is an RCB design (3 blocks, each containing all 7 genotypes)? If that is the case, consider this heterogeneous variance approach:
title "Consider REP as a block, model heteroscedasticity due to genotype at REP level";
proc glimmix data=yield plots=residualpanel noprofile;
ods exclude diffplot linesplot;
nloptions maxiter=1000;
*lnyield=log(yield); /* in case you still want to try a log transformation */
class genotype rep;
model yield /*lnyield*/ = genotype;
random intercept/subject=rep group=genotype;
lsmeans genotype/adj=simulate(seed=111) diff adjdfe=row;
covtest homogeneity;
ods output diffs=ppp lsmeans=mmm;
run;
SteveDenham
With only 3 values per genotype, there just isn't enough information to estimate and compare the variances with confidence. But, for whatever it's worth, and as discussed in this note, you can use Levene's test to compare the variances and Welch's test to compare the means while adjusting for the differing variances. These suggest no difference in variances, and no difference in means whether or not equal variances are assumed.
proc glm data=yield;
class genotype;
model yield=genotype;
means genotype/hovtest=levene welch;
quit;