Solved: which transformation to choose if the data is not normal distributed ...

kellychan84 · Posted 08-20-2021 03:07 PM

proc glimmix data=DF_digestibility; 
  class treatment block;
  model fIDFdigestibility = treatment/dist=lognormal; 
  random block; 
  lsmeans treatment/tdiff lines;
  ods output lsmeans=third;
  covtest "block=0" 0 .;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
  title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;
data BackTransformedData; 
  set third;
  btlsmean = exp(estimate) ;
  btse_mean = exp(estimate)*stderr;
run;
proc print data=second;
run;
proc sgplot data=second;
  scatter y=smresid x=pred;
  refline 0;
run;
proc sgplot data=second;
  scatter y=smresid x=treatment;
  refline 0;
run;
proc sgplot data=second;
  vbox smresid/group=treatment datalabel;
run;
proc sgscatter data=second;
  plot sresid*(pred treatment block);
run;
proc univariate data=second normal plot;
  var sresid;
  histogram sresid / normal kernel;
run;

SteveDenham · Posted 08-23-2021 09:36 AM

What you learned in class is appropriate for linear models, but not for mixed models. This code that you ran:

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment;
random block; 
random _residual_/group=treatment; 
lsmeans treatment/tdiff lines;
covtest homogeneity; 
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

is spot-on. You have accounted for the lack of homogeneity in the two groups with the random _residual_/group=treatment. You don't need to transform if heterogeneity was your only concern. But there is something else happening I think. Your response is a percent, based on the ratio of two continuous variables and is contained in the open interval from 0 to 1. I would recommend a fit with a beta distribution, such as this:

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment/dist=beta;
random block; 
random _residual_/group=treatment; 
lsmeans treatment/diff lines ilink;
covtest homogeneity; 
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

That should take care of your concerns (at least based on the shared box plot). The default link for a beta distribution is logistic (so I would suppose that the transformation to choose is logistic). However, it makes more sense to talk about the distribution in this case.

I wouldn't be surprised if the COVTEST was not significant for your data, assuming a beta distribution..

SteveDenham

View solution in original post

PaigeMiller · Posted 08-20-2021 03:30 PM

The variable fIDFdigestibility does not have to be normally distributed in order to be modeled properly by PROC GLIMMIX. If you want to use a normal distribution when fitting the model, only the RESIDUALS have to be normally distributed. Naturally, if you are fitting the model based upon some other (non-normal) distribution, then you need to have fIDFdigestibility follow that distribution. It's really not clear if you intend to fit the model assuming a normal distribution.

--
Paige Miller

kellychan84 · Posted 08-20-2021 03:48 PM

@PaigeMiller thank you for your reply. You are right! It is the RESIDUALS that have to be normally distributed. I think the normality result from SAS I attached is for the RESIDUALS.

Let me explain my question in another way. I want to transform my data here because the RESIDUALS are showing unequal variance between two treatment groups when I run the homogeneity test (code below).

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment;
random block;
random _residual_/group=treatment;
lsmeans treatment/tdiff lines;
covtest homogeneity;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

jiltao · Posted 08-20-2021 04:10 PM

The COVTEST result tells you that the variances are different among the treatment groups. So your RANDOM _RESIDUAL_ / GROUP=TREATMENT; statement is necessary, because it fits an unequal variance model to your data. You do not need to do data transformations. Constant variance is never one of the model assumptions in PROC MIXED / PROC GLIMMIX. You just model it, which is what your RANDOM _RESIDUAL_ statement does.

Hope this helps,

Jill

kellychan84 · Posted 08-20-2021 04:38 PM

@jiltao Thank you for your answer. I am still confusing.

If I find out the variances between the two treatment groups are unequal, am I supposed to transform them before I compare them by F test? The unequal covariances are also indicated in the below plots as data are showing some patterns or skewed to one side.

sbxkoenk · Posted 08-20-2021 05:00 PM

Hello @kellychan84 ,

I cannot say it better than @jiltao 2 posts above (unequal variance model), but maybe you understand it better if @SteveDenham tells you the same (heterogeneous variance model). 😉😉

See here:

https://communities.sas.com/t5/Statistical-Procedures/Is-there-a-way-to-test-for-equal-variances-in-...

Kind regards,

Koen

SteveDenham · Posted 08-23-2021 09:05 AM

The key to all of this is in @jiltao 's post:

Constant variance is never one of the model assumptions in PROC MIXED / PROC GLIMMIX.

You model the heterogeneity, and stop testing for it, or even really examining it. The assumption of homogeneous variance is NOT important in mixed model analyses.

SteveDenham

kellychan84 · Posted 08-23-2021 09:21 AM

@SteveDenham Thank you very much for your answer, Steve! I was told in my statistical class that you should always check your residuals (random, independent of treatment and design effects, homogenous) before we look at the results. And if you can see some patterns in your residuals, you should transform your data to a better model before you can compare the treatment effect. I did see some patterns in the residuals in this specific data set. What are your suggestions on this? Thanks!

SteveDenham · Posted 08-23-2021 09:36 AM

What you learned in class is appropriate for linear models, but not for mixed models. This code that you ran:

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment;
random block; 
random _residual_/group=treatment; 
lsmeans treatment/tdiff lines;
covtest homogeneity; 
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

is spot-on. You have accounted for the lack of homogeneity in the two groups with the random _residual_/group=treatment. You don't need to transform if heterogeneity was your only concern. But there is something else happening I think. Your response is a percent, based on the ratio of two continuous variables and is contained in the open interval from 0 to 1. I would recommend a fit with a beta distribution, such as this:

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment/dist=beta;
random block; 
random _residual_/group=treatment; 
lsmeans treatment/diff lines ilink;
covtest homogeneity; 
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

That should take care of your concerns (at least based on the shared box plot). The default link for a beta distribution is logistic (so I would suppose that the transformation to choose is logistic). However, it makes more sense to talk about the distribution in this case.

I wouldn't be surprised if the COVTEST was not significant for your data, assuming a beta distribution..

SteveDenham

kellychan84 · Posted 08-23-2021 10:18 AM

Thank you very much @SteveDenham! very clear explanation!

Can I consult you with one more question? If the result is showing equal variance, in this case, can I use the following code without " random _residual_ / group =treatment" and " covtest homogeneity ".

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment /dist=beta; ;
random block; 
lsmeans treatment/diff lines ilink;
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

SteveDenham · Posted 08-23-2021 10:29 AM

Yes, that is often the direction taken.

Just a note, I would imagine that no substantial differences in the results are noticeable whether those lines are kept in or removed. My feeling here is to always assume that there is some heterogeneity in variance between groups, and it is worth keeping the code in to get a more empirical estimate of the variability within groups. On the other hand, dropping that code almost always results in a more stable system of the mixed model equations, eliminating Hessian issues and nonpositive G matrix concerns in many cases.

SteveDenham

kellychan84 · Posted 08-23-2021 11:54 AM

Thank you so much @SteveDenham!

You are right, adding or removing that coding did not change the results much!

By the way, when I run the code with " dist=beta", it turned out an error: "ERROR: Invalid or missing data." I don't know how to fix this. My response is digestibility which is listed as 1-100% instead of 0-1, does it matter?

proc glimmix data=DF_digestibility;
class treatment block;
model fIDFdigestibility = treatment/dist=beta;
random block; 
random _residual_/group=treatment; 
lsmeans treatment/diff lines ilink;
covtest homogeneity; 
output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
title "1-Fecal Insoluble Dietary Fiber Digestibility (%) ANOVA Results";
run;

sbxkoenk · Posted 08-23-2021 12:22 PM

Hello @kellychan84 ,

Yes, it matters!

Beta regression is used for proportions and therefore the response variable should be between 0 and 1, noninclusive. Values outside this range are not used in the analysis in PROC GLIMMIX.

Good luck with your analysis,

Koen

SteveDenham · Posted 08-23-2021 12:36 PM

Following up on @sbxkoenk 's remark - divide all of your response values by 100 to get them out of percentage units and into grams digested/grams consumed units (kg digested/kg consumed for large animals). You can do this inside PROC GLIMMIX, but I recommend doing it in a DATA step prior to running PROC GLIMMIX.

SteveDenham

kellychan84 · Posted 08-23-2021 12:39 PM

Sounds great, thank you so much for all your help! I really appreciate it!!

which transformation to choose if the data is not normal distributed and lognormal does not help

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

Re: which transformation to choose if the data is not normal distributed and lognormal does not hel

SAS Innovate 2025: Call for Content