Hi, my data has a heteroscedasticity issue among genotypes. It is a RCBD. I tried two approaches but get different results. Please advise which way is better.
1. Model the heteroscedasticity directly.
data yield;
input genotype $ rep yield bad_fruit purple_fruit;
datalines;
Krewer 1 14.34 72 566
Krewer 2 11.49 53 521
Krewer 3 7.75 37 500Titan 1 10.44 72 470
Titan 2 9.92 79 673
Titan 3 5.97 71 318
T-3557 1 10.51 217 1088
T-3557 2 6.22 132 612
T-3557 3 3.53 90 417
T-2604 1 11.10 13 344
T-2604 2 8.95 10 276
T-2604 3 6.85 14 269
T-3858 1 10.79 29 635
T-3858 2 10.01 35 500
T-3858 3 11.41 21 660
T-3826 1 12.67 77 711
T-3826 2 12.81 59 492
T-3826 3 10.38 24 357
T-3835 1 10.05 65 1021
T-3835 2 10.95 97 1042
T-3835 3 11.73 72 1050
;
run;
proc sgplot data=yield;
vbox yield/category=genotype;
run;
proc sgplot data=yield;
vbox bad_fruit/category=genotype;
run;
proc sgplot data=yield;
vbox purple_fruit/category=genotype;
run;
title "Model heteroscedasticity directly";
proc glimmix data=yield plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model yield = genotype;
random rep;
random int/group=genotype residual; **allow different variance by genotype;
lsmeans genotype/adj=Tukey pdiff;
ods output diffs=ppp lsmeans=mmm;
run;
%include '...\SAS Macro\pdmix800.sas'; # the directory where pdmix800 saved;
%pdmix800(ppp,mmm,alpha=.05,sort=yes)
The key results are:
Covariance Parameter Estimates
Cov Parm
Group
Estimate
Standard Error
rep
1.9078
2.2163
Intercept
genotype Krewer
4.6730
5.8042
Intercept
genotype T-2604
1.4080
2.6157
Intercept
genotype T-3557
6.6610
8.4516
Intercept
genotype T-3826
0.04554
0.8463
Intercept
genotype T-3835
4.5581
5.1839
Intercept
genotype T-3858
3.9914
4.0433
Intercept
genotype Titan
1.2386
1.3169
Type III Tests of Fixed Effects
Effect
Num DF
Den DF
F Value
Pr > F
genotype
6
12
8.86
0.0008
genotype Least Squares Means
genotype
Estimate
Standard Error
DF
t Value
Pr > |t|
Krewer
11.1933
1.4811
12
7.56
<.0001
T-2604
8.9667
1.0513
12
8.53
<.0001
T-3557
6.7533
1.6900
12
4.00
0.0018
T-3826
11.9533
0.8069
12
14.81
<.0001
T-3835
10.9100
1.4681
12
7.43
<.0001
T-3858
10.7367
1.4023
12
7.66
<.0001
Titan
8.7767
1.0241
12
8.57
<.0001
Effect=genotype Method=Tukey-Kramer(P<.05) Set=1
Obs
genotype
Estimate
Standard Error
Letter Group
1
T-3826
11.9533
0.8069
A
2
Krewer
11.1933
1.4811
AB
3
T-3835
10.9100
1.4681
AB
4
T-3858
10.7367
1.4023
AB
5
T-2604
8.9667
1.0513
B
6
Titan
8.7767
1.0241
B
7
T-3557
6.7533
1.6900
AB
The results from log transformed data are:
title "Log transformed yield data";
data yield_tran;
set yield;
log_yield=log(yield);
log_bad_fruit=log(bad_fruit);
log_purple_fruit=log(purple_fruit);
run;
proc glimmix data=yield_tran plots=residualpanel;
ods exclude diffplot linesplot;
class genotype rep;
model log_yield = genotype;
random rep;
lsmeans genotype/adj=Tukey lines;
run;
Covariance Parameter Estimates
Cov Parm
Estimate
Standard Error
rep
0.03202
0.03924
Residual
0.04978
0.02032
Type III Tests of Fixed Effects
Effect
Num DF
Den DF
F Value
Pr > F
Genotype
6
12
3.12
0.0443
Genotype Least Squares Means
Genotype
Estimate
Standard Error
DF
t Value
Pr > |t|
Krewer
2.3842
0.1651
12
14.44
<.0001
T-2604
2.1743
0.1651
12
13.17
<.0001
T-3557
1.8140
0.1651
12
10.99
<.0001
T-3826
2.4761
0.1651
12
14.99
<.0001
T-3835
2.3877
0.1651
12
14.46
<.0001
T-3858
2.3724
0.1651
12
14.37
<.0001
Titan
2.1424
0.1651
12
12.97
<.0001
Tukey-Kramer Grouping for Genotype Least Squares Means (Alpha=0.05)
LS-means with the same letter are not significantly different.
Genotype
Estimate
T-3826
2.4761
A
A
T-3835
2.3877
B
A
B
A
Krewer
2.3842
B
A
B
A
T-3858
2.3724
B
A
B
A
T-2604
2.1743
B
A
B
A
Titan
2.1424
B
A
B
T-3557
1.8140
B
You can see the mean grouping results are different. Which one should I use? Do you have better methods? Both results are reasonable to me. Greater variance of a group makes it harder to find significant difference between other groups. Log transformation stabilizes the variances and also shrinks big values to the central, but weakens the impact of unequal variances compared to modelling the heteroscedasticity directly.
Furthermore, when modeling the heteroscedasticity, sometimes convergency issue would occur, and the residual plots generally not as good as the log transformed data. Looks like transformation is easier to apply.
Any thought is appreciated!
... View more