Nebulus
Calcite | Level 5

I know that, given the assumptions of GLMMs, data transformations to meet normality and so on aren't necessary. However, I couldn't find an answer as to whether variance-stabilizing transformations are inappropriate.

 

I'm using PROC GLIMMIX for insect count data to look at the effects of Climate and Season on counts. I sampled counts on leaves (with leaf area used as an offset) over 26 sample intervals, spaced 2 weeks apart. Intervals are categorized into three separate seasons based on mean temperatures. Different sites are used as replicates of Climate type. Sample intervals are a repeated measure. Parasitism rates from the previous interval were used as a covariate for observed host densities in the current interval.

 

My code looks something like this (let me know if you see an issue here as well; I'm new to GLMMs and the syntax was difficult).

proc glimmix data=final;
class site interval season climate;
model counts = interval(season) season climate previous_para / offset=leafarea  dist=poi link=log s;
random intercept / subject=site;
random interval / subject=climate residual; /* Using this line and the next partitions variability between interval and sites*/
random site(interval) / subject=climate residual;
random _residual_ /* Corrects observed overdispersion*/;
lsmeans season climate  / cl pdiff plot=meanplot adjust=tukey;
run;

 

Using the untransformed data, my model fit statistics look like this.

 

SAS Output

Fit Statistics
-2 Res Log Pseudo-Likelihood 481.96
Pseudo-AIC 487.96
Pseudo-AICC 488.13
Pseudo-BIC 481.96
Pseudo-CAIC 484.96
Pseudo-HQIC 481.96
Generalized Chi-Square 4454.47
Gener. Chi-Square / DF 31.15

 

After square-root transforming, I get these fit statistics. The Pearson chi-square / DF looks much better.

 

-2 Res Log Pseudo-Likelihood 362.94
Pseudo-AIC 368.94
Pseudo-AICC 369.11
Pseudo-BIC 362.94
Pseudo-CAIC 365.94
Pseudo-HQIC 362.94
Generalized Chi-Square 169.92
Gener. Chi-Square / DF 1.19
 

 

 

 

So my question is: is square-root transforming acceptable or not, and why?

 

Many Thanks!

 

sld
Rhodochrosite | Level 12

If I were to answer only your last question, I would say, No. But I think there are many additional considerations.

 

Because the variance is a function of the mean for the Poisson distribution (specifically, the variance is equal to the mean), there may be no need for "variance stabilization" as in models assuming a normal distribution; the variance will "naturally" increase as the mean increases. However, it is possible that overdispersion exists for a particular data set in a particular model; there are different ways to deal with overdispersion, including using the negative binomial distribution rather than the Poisson or fitting a quasi-Poisson using random _residual_.
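To make those two routes concrete, here is a minimal sketch in GLIMMIX syntax, using the data set and variable names from the code you posted; log_leafarea is assumed to be a pre-computed natural log of leaf area (see the offset point below), and the random-effects structure is deliberately stripped down:

/* Route 1: negative binomial instead of Poisson */
proc glimmix data=final;
  class site season climate;
  model counts = season climate / offset=log_leafarea dist=negbin link=log s;
  random intercept / subject=site;
run;

/* Route 2: quasi-Poisson, i.e., a multiplicative overdispersion scale */
proc glimmix data=final;
  class site season climate;
  model counts = season climate / offset=log_leafarea dist=poisson link=log s;
  random intercept / subject=site;
  random _residual_;  /* scale parameter on the Poisson variance */
run;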

 

I doubt that your current model is correct. We first need clarification on the design: did you measure count on the same leaf in each of the 13 intervals, or did you measure count on different leaves in different intervals? I presume leaves are nested within sites, but are leaves also clustered by plants (which are nested within sites) or subject to some other design constraint? 

 

Other thoughts:

 

To achieve count per unit leaf area, you need to use the ln(leafarea) as the offset, not plain leafarea.
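A minimal DATA-step sketch of that, assuming your input data set is named final and leafarea is strictly positive (final2 and log_leafarea are just illustrative names):

data final2;
  set final;
  /* natural log of leaf area; the MODEL statement would then use
     offset=log_leafarea rather than offset=leafarea */
  log_leafarea = log(leafarea);
run;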

 

When you incorporate previous_para as a simple covariate in the model, you are assuming that previous_para is not affected by either climate or season. That seems unlikely, and if so it is a nontrivial modeling problem. You may want to consider a structural equation model.

 

"subject=climate" in RANDOM statements is wrong if climate is a fixed effects factor.

 

The assignment of 13 (consecutive?) intervals to 3 seasons could be arbitrary and hard to justify. Or it could be fine, as long as you can adequately justify how and why you categorized what might otherwise be seen as a continuous-scale variable. You may want to consider regressing on interval and dropping season.
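As a rough sketch of the regression idea (interval is left out of the CLASS statement so that it enters as a continuous regressor, season is dropped, log_leafarea is the assumed log offset, and the random-effects part is kept minimal):

proc glimmix data=final;
  class site climate;
  /* interval as a continuous trend rather than a CLASS factor */
  model counts = climate interval climate*interval
    / offset=log_leafarea dist=poisson link=log s;
  random intercept / subject=site(climate);
run;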

 

You may want to consider including interactions among climate, season, and interval(season).

 

Many of the problems that I see in your model specification are not specific to using a Poisson distribution rather than a normal distribution. So I recommend careful study of SAS® for Mixed Models, Second Edition, which deals largely with normal-error models but is still a valuable resource for you. Other model considerations are specific to GL(M)Ms, for which Generalized Linear Mixed Models: Modern Concepts, Methods and Applications is an excellent (if dense) resource. It's not only the syntax that is difficult--the concepts can be as well 🙂

 

I hope this helps.

 

 

 

 

 

 

 

Nebulus
Calcite | Level 5

This is my first time really dealing with complex mixed models, so I appreciate the feedback. It's taken me hundreds of hours of reading to get to this point, but I know the model still isn't perfect.

 

Just to add/answer some of your comments.

 

Yes, right now I am using random _residual_ in the model, but it still seems rather overdispersed. The chi-square/DF goes from about 90 to 32 when that term is included, but only ever gets close to 1 when I square-root transform the data. I've tried a negative binomial model, but I have convergence issues.

 

30 new leaves are sampled from plants every interval. Site = plant in my case because I only have one study plant per site. I was taking leaves from three separate sections of the canopy, but we decided to eliminate that from the model due to a list of issues, including sections not being true replicates.

 

1) Yes I am using the log of leaf area.

 

2) I've never heard of a structural equation model, so I will do some research into it.

 

3) Yes, I will remove subject=climate. I originally had it as "random interval / subject=site residual". The intent was to treat interval as a repeated measure on site. Was that way correct?

 

4) There is a lot more going on here with seasons that I didn't really get into. I showed that climate types vary by temperature and so do intervals, so if I see seasonal/climate differences they can in part be attributed to temperature differences. I can't directly model temperature because temperature is confounded with site, since a single temperature reading corresponds to a site. Intervals are assigned to seasons based on mean monthly temperatures (0-15 degrees, 15.01-20, and 20.01-25). Assigning intervals to arbitrary seasons such as winter, fall, spring, and summer doesn't help us interpret seasonal effects in this case, and southern CA doesn't have traditional seasons.

 

5) I got rid of a climate*season interaction a while ago because its p-values were huge. I'll have to consider an interval interaction, but I think with 26 intervals it would be difficult to interpret.

 

sld
Rhodochrosite | Level 12

Yes, right now I am using random _residual_ in the model, but it still seems rather overdispersed. The chi-square/DF goes from about 90 to 32 when that term is included, but only ever gets close to 1 when I square-root transform the data. I've tried a negative binomial model, but I have convergence issues.

 

 

For various problems, including non-convergence, read Tips and Strategies for Mixed Modeling with SAS/STAT® Procedures and Advanced Techniques for Fitting Mixed Models Using SAS/STAT® Software. But first, you'll want to be sure you are using the correct syntax; there is scant point in worrying about non-convergence in a wrong model.

30 new leaves are sampled from plants every interval. Site = plant in my case because I only have one study plant per site. I was taking leaves from three separate sections of the canopy, but we decided to eliminate that from the model due to a list of issues, including sections not being true replicates.

  

I would think that you could include section as a fixed effects factor, although it is true that levels of section cannot be randomly assigned to the 3 plant "divisions". (We generally don't randomly assign gender to animals either, and yet we still use gender as a predictor variable.) But perhaps you are thinking of section as a random effects factor, in which case section is a subsample, not a true replicate, as are individual leaves.

 

Do you have 10 leaves from each of the 3 sections? If not, and if section matters, then you could have a bias problem.

2) I've never heard of a structural equation model, so I will do some research into it.

 

 


For a biology-oriented introduction, see Ch. 8 by Grace, Scheiner, and Schoolmaster in Ecological Statistics: Contemporary Theory and Application.

3) Yes, I will remove subject=climate. I originally had it as "random interval / subject=site residual". The intent was to treat interval as a repeated measure on site. Was that way correct?

 


Yes, interval is a factor associated with repeated measurements on sites. That would be closer to correct, but you are not using the residual option correctly. Only RANDOM statements that specify elements of the R matrix should include the residual option; RANDOM statements that specify elements of the G matrix should not. Site and repeated measures on sites are specified in G; leaves are specified in R.
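Translated into GLIMMIX statements for leaf-level data, that division might look roughly like this; it is only a sketch, with log_leafarea as the assumed log offset and ar(1) as just one possible covariance type:

proc glimmix data=final;
  class site climate season interval;
  model counts = season climate / offset=log_leafarea dist=poisson link=log;
  /* G side (no RESIDUAL option): site, and repeated measurements over
     intervals within a site */
  random intercept / subject=site(climate);
  random interval  / subject=site(climate) type=ar(1);
  /* R side: leaf-to-leaf variation, here just an overdispersion scale */
  random _residual_;
run;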


4) There is a lot more going on here with seasons that I didn't really get into. I showed that climate types vary by temperature and so do intervals, so if I see seasonal/climate differences they can in part be attributed to temperature differences. I can't directly model temperature because temperature is confounded with site, since a single temperature reading corresponds to a site. Intervals are assigned to seasons based on mean monthly temperatures (0-15 degrees, 15.01-20, and 20.01-25). Assigning intervals to arbitrary seasons such as winter, fall, spring, and summer doesn't help us interpret seasonal effects in this case, and southern CA doesn't have traditional seasons.

 

 


If temperature is a site-level fixed effects factor, you can model it (that's what mixed models are good at) as a regression. If climate is just a categorized version of temperature, then I would consider using temperature instead. But climate as a factor may be a more complex concept, involving something more than just temperature. 
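If you did go the regression route, it might be sketched like this, where temp is a hypothetical continuous site-level mean temperature standing in for climate (other names follow the thread):

proc glimmix data=final;
  class site interval;
  /* temp is continuous and constant within a site */
  model counts = temp interval temp*interval
    / offset=log_leafarea dist=poisson link=log s;
  random intercept / subject=site;
run;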

 

I do not have a full understanding of how and why you are defining "season"; it still strikes me as an arbitrary categorization. But moving forward, consider a model with only interval (omitting season) for the moment because it is simpler than a model  with intervals nested within seasons:

 

proc glimmix data=final;
  class site climate interval;
  ln_leafarea = log(leafarea);
  model counts = climate interval climate*interval 
    / offset = ln_leafarea dist=negbin;
  random intercept / subject=site(climate);
  random interval / subject=site(climate); 
run;

You could explore different forms of temporal autocorrelation among the repeated measurements by replacing the two RANDOM statements above with something like

 

  random interval / subject=site(climate) type=<whatever>;

As you note, interpretation of a factor with 26 levels is a challenge. Rather than incorporate interval as a CLASS factor, you might regress on interval, possibly with a curvilinear model, possibly with a spline model (see https://blogs.sas.com/content/iml/2017/04/19/restricted-cubic-splines-sas.html) or other smoother (see https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_glimmix_sec...).
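One hedged sketch of the spline route, using the EFFECT statement available in GLIMMIX (sp_interval is an illustrative name, the knot settings are arbitrary, and log_leafarea is the assumed log offset):

proc glimmix data=final;
  class site climate;
  /* restricted (natural) cubic spline in interval, knots at percentiles */
  effect sp_interval = spline(interval / naturalcubic basis=tpf(noint)
                                         knotmethod=percentiles(5));
  model counts = climate sp_interval / offset=log_leafarea dist=negbin link=log;
  random intercept / subject=site(climate);
run;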

 

Lots of different ways to envision the analysis. The challenge is knowing what possibilities exist, identifying the ones that validly represent the structure of the experiment, and then selecting a "best" among those while maintaining objectivity. (Frank Harrell has written, "Using the data to guide the data analysis is almost as dangerous as not doing so.")

 

Is previous_para the value for counts in the previous interval? Or does counts measure one insect species and previous_para a different insect species? 

 

I hope this helps. If you have access to a statistician at your institution, take advantage of that opportunity.

 

Nebulus
Calcite | Level 5

Thanks a lot for your in depth reply and suggestions.

 

Sections are determined by canopy height: 0-33%, 34-66%, and 67-100%. I then took 10 leaves randomly selected from each section based on a direction (0-360 degrees). This is part of my dissertation, and my PI wasn't too keen on including section, but I feel a lot of the variation in counts can be captured by section because the insect likes the shaded parts of the plants and densities decline as canopy height increases. I've been using averages of 7,020 observations (30 leaves x 9 sites x 26 intervals) for counts across the study period. This breaks down into 234 means that go into the model, and a subset from one interval looks like this.

 

 

proc sort;  /* sorts the most recently created (leaf-level) data set */
  by interval season climate site logarea previous_para;
run;
proc means mean n noprint;  /* averages the 30 leaves per site x interval */
  var total logarea;
  by interval season climate site previous_para;
  output out=new mean=mnqtot mnarea n=n;
run;
proc sort data=new;
  by interval season climate site previous_para;
run;
proc means data=new mean n noprint;  /* second pass over the first set of means */
  var mnqtot mnarea;
  by interval season climate site previous_para;
  output out=final mean=totalfin areafin n=n;
run;
proc sort data=final;
  by interval season climate site previous_para;
run;
proc print data=final;
run;

 

 

Obs  interval  season  climate  site  previous_para  totalfin  areafin  n
 10      2     Warm    Coastal    1       0.660        10.67   3.09633  1
 11      2     Warm    Coastal    2       0.912       306.97   4.24529  1
 12      2     Warm    Coastal    3       0.495       514.60   3.05256  1
 13      2     Warm    Coastal    4       0.722       352.17   2.79876  1
 14      2     Warm    Coastal    5       0.426       506.83   3.52447  1
 15      2     Warm    Inland     8       0.648       139.67   3.43376  1
 16      2     Warm    Inland     9       0.801        73.43   2.91858  1
 17      2     Warm    Interm     6       0.680       214.57   3.32450  1
 18      2     Warm    Interm     7       0.429        87.10   2.69907  1

 


@sld wrote:

 


Yes, interval is a factor associated with repeated measurements on sites. That would be closer to correct, but you are not using the residual option correctly. Only RANDOM statements that specify elements of the R matrix should include the residual option; RANDOM statements that specify elements of the G matrix should not. Site and repeated measures on sites are specified in G; leaves are specified in R.


 

This point may well be the issue. I was interpreting the RANDOM statement (with RESIDUAL) in PROC GLIMMIX as equivalent to the REPEATED statement in PROC MIXED, so that anytime there was a repeated measurement it had to be included.

For example, on page 9 of the Advanced Techniques for Fitting Mixed Models paper you cited, it says:

"Or, suppose you have the following REPEATED statement in PROC MIXED with a repeated effect of Time:

repeated Time / subject=Block type=ar(1);

You can replace that statement with the following RANDOM statement in PROC GLIMMIX:
random Time / subject=Block type=ar(1) residual;"

 

Removing the RANDOM statement prevents even the most simplified version of this model from converging. So you are saying that if I'm not using leaf counts, but site averages, I have no R-side effects with my current model?

 

Yes, the climate designations aren't just temperature. They are based on southern California climate zones: coastal (USDA Hardiness Zone 10b; Sunset Climate Zone 24), intermediate (USDA Hardiness Zone 10a; Sunset Climate Zones 20–22), and inland (USDA Hardiness Zones 9a/9b; Sunset Climate Zones 18–19).

 

Counts are of the pest insect (a whitefly). Previous parasitism is the total parasitism of the whitefly by 3 parasitic wasps, observed in the previous interval, that could be affecting whitefly densities in the subsequent interval.

 

It looks like I still have a lot of reading to do on adding regressions/splines to a mixed model.

 

Thanks!

 

 

sld
Rhodochrosite | Level 12

Seems like section could be appropriate to include....

 

Please describe the input data set. I can guess, based on your code, but knowing for sure would be preferable. I've been thinking that each observation in the data set was a leaf, with integer counts and continuous leafarea and integer previous_para, and that counts would suggest a Poisson (or negative binomial) distribution, being both integer and presumably small-ish. But now I'm thinking that each observation is compiled over multiple leaves (i.e., 30), with mean counts and mean leafarea (and mean previous_para?). (Digression: are you sure that you want to compute means by previous_para? I would think you would want the mean of previous_para, to match with counts and leafarea. And doesn't your second proc means compute means over n=1, and hence is unnecessary?)
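To make the digression concrete, a single pass along these lines would average previous_para along with the other variables instead of treating it as a grouping variable (a sketch only: leafdata is a stand-in name for the leaf-level data set, and parafin is an illustrative name for the new mean):

proc means data=leafdata noprint nway;
  class interval season climate site;
  var total logarea previous_para;
  /* per site x interval means of count, log leaf area, and previous parasitism */
  output out=final mean=totalfin areafin parafin;
run;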

 

Theoretically, the Poisson distribution is appropriate for integer values. Your mean counts are not integers, but GLIMMIX will run with them anyway. (Some R packages will not.) As the Poisson mean increases, the distribution can be approximated by the normal distribution; your means are large, at least in the small table you've included, so you could ponder trying a model that assumes normality.

 

In proc MIXED, you have RANDOM and REPEATED statements. In proc GLIMMIX you have RANDOM and RANDOM/RESIDUAL statements, correspondingly. A model always has an R matrix; it may (if it is a mixed model) have a G matrix, or it may not.

 

The structure of the data set needs to be reflected in the RANDOM statements in GLIMMIX, so it's critical that the person that suggests code (i.e., I) get the data set structure right. For example, if my supposition about the actual data set structure is correct, then I would add RESIDUAL to the second RANDOM statement in my suggested code (or to the suggested RANDOM statement for temporal autocorrelation). 

 

I'm surprised that removing all RANDOM statements results in non-convergence. But I'd have to see the actual code to have any insight. Regardless, your experimental design requires a mixed model so it could be a moot point.

 

There is always something more to learn about mixed models, particularly generalized. And the goalposts keep moving.

 

 

 

Nebulus
Calcite | Level 5

Here is a sample of the input data for site 9 (which is an inland climate type) at interval 26. The log of leaf area is taken after the datalines step.

input site leaf climate$ interval count leafarea previous_para;

if interval = 1 then season="Warm";
else if interval =2 then season="Warm";
else if interval =3 then season="Warm";
else if interval =4 then season="Warm";
else if interval =5 then season="Warm";
else if interval =6 then season="Warm";
else if interval =7 then season="Warm";
else if interval =8 then season="Mod";
else if interval =9 then season="Cool";
else if interval =10 then season="Cool";
/* and so on for the remaining intervals */

---------------------------------------------------------

9    6992    Inland    26    0    10.767    0.708
9    6993    Inland    26    14    8.985    0.708
9    6994    Inland    26    26    10.001    0.708
9    6995    Inland    26    0    23.441    0.708
9    6996    Inland    26    67    24.972    0.708
9    6997    Inland    26    5    20.839    0.708
9    6998    Inland    26    23    19.180    0.708
9    6999    Inland    26    27    17.171    0.708
9    7000    Inland    26    96    28.089    0.708
9    7001    Inland    26    0    11.887    0.708
9    7002    Inland    26    63    18.145    0.708
9    7003    Inland    26    0    26.068    0.708
9    7004    Inland    26    0    26.864    0.708
9    7005    Inland    26    0    13.516    0.708
9    7006    Inland    26    0    15.965    0.708
9    7007    Inland    26    0    14.251    0.708
9    7008    Inland    26    26    15.475    0.708
9    7009    Inland    26    0    25.854    0.708
9    7010    Inland    26    57    42.570    0.708
9    7011    Inland    26    8    19.700    0.708
9    7012    Inland    26    35    29.558    0.708
9    7013    Inland    26    54    28.970    0.708
9    7014    Inland    26    9    20.282    0.708
9    7015    Inland    26    13    17.331    0.708
9    7016    Inland    26    42    17.435    0.708
9    7017    Inland    26    3    20.557    0.708
9    7018    Inland    26    0    27.391    0.708
9    7019    Inland    26    0    29.681    0.708
9    7020    Inland    26    0    24.170    0.708

 

The mean previous parasitism for each site is already calculated for the subsequent interval (in this case 70.8%, from site 9 during interval 25) and included in the data input, so there is no need to average it again in PROC MEANS. Yes, the second PROC MEANS is unnecessary.

 

PROC MEANS would spit out this observation for that specific site/interval to run in the model.

Obs  interval  season  climate  site  previous_para  totalfin  areafin
232     26     Warm    Inland     9       0.708        18.93   2.95681

 

This is the current code that works.

 

proc glimmix data=final;
  class site interval season climate;
  model totalfin = interval(season) season climate
    / offset=areafin dist=poi link=log solution;
  random intercept / subject=site(climate);                 /* G-side site effect */
  random interval(season) / subject=site(climate) residual; /* repeated measure on site */
  random _residual_;                                        /* overdispersion scale */
  lsmeans season climate / cl pdiff plot=meanplot adjust=tukey;
run;

 

 

P.S. I've tried stats counseling on campus, but all they ever provide are stats Ph.D. students who, I've found, don't yet have the expertise to really solve these difficult questions, especially when programming is involved. So this is more progress than I've been able to achieve in the last month 🙂

               
               
sld
Rhodochrosite | Level 12

If you want to use the means over leaves in the model fitting process, then I would take it a step further: compute density = totalfin/areafin, make density the response, drop the offset, and assume a normal distribution (possibly after a log transformation).
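A minimal sketch of that suggestion, reusing the variable names from your PROC MEANS output and the random-effects structure of your working code (final2 and density are illustrative names; a log transform could be slotted into the DATA step if needed):

data final2;
  set final;
  /* mean count divided by the area variable, as described above */
  density = totalfin / areafin;
run;

proc glimmix data=final2;
  class site interval season climate;
  /* same fixed effects as the working model, but no offset and a
     normal distribution for the density response */
  model density = interval(season) season climate / dist=normal solution;
  random intercept / subject=site(climate);
  random interval(season) / subject=site(climate) residual;
  lsmeans season climate / cl pdiff adjust=tukey;
run;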

 

The point of the offset is to obtain density with count data. We use the offset because we can't use density per se with a Poisson distribution (which is a distribution for integer counts). Except that GLIMMIX does, in fact, allow us to have a non-integer response with a Poisson distribution. Hmm. So it runs, and it may work OK, or it may not, but it does not sit well with me. I would use leaf-level observations with a Poisson/negative binomial distribution, or means-over-leaves-level observations with a normal distribution. You would expect that, if both analyses ran well and correctly, you would get the same story from each.

 

The advantage of the means-over-leaves-level observations with a normal distribution approach is that the design structure of the model is simpler (there's no hierarchical level for leaves) and that the normal distribution generally behaves better (with respect to convergence, etc.). If the number of leaves is very unbalanced, then I'd prefer the leaf-level with Poisson/NB approach. 
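For comparison, the leaf-level alternative might be sketched like this, using the leaf-level variable names from the data sample posted above (leaves is a stand-in data set name, log_leafarea is assumed to be log(leafarea), and negbin is just one way to absorb the extra-Poisson variation):

proc glimmix data=leaves;
  class site climate season interval;
  model count = season climate / offset=log_leafarea dist=negbin link=log s;
  /* G side: site, and repeated measurements over intervals within site;
     leaf-to-leaf variation is left to the negative binomial residual */
  random intercept / subject=site(climate);
  random interval  / subject=site(climate);
run;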

 

I believe in across-the-table statistical consultation, so I do like to make that suggestion. But I also know that what you really need is an applied statistician with expertise in your general field of study (for example, biology rather than sociology). Consulting centers staffed by stat grad students are excellent for the stat grad students, often not so great for the clients; it depends on the extent to which stat faculty are involved when the grad student has hit his/her knowledge limit. At my institution, there is a staff statistician in biological sciences, a staff statistician in education, and a staff statistician in agricultural sciences, supported by those administrative units and serving clients from only those units, not the institution as a whole. Researchers in those units are very happy.

 

If you are at a land grant institution with an agricultural experiment station, see whether the AES has a statistical staff (many do) and whether they might be able to help.

 

I hope this continues to help.

 

Nebulus
Calcite | Level 5

Yeah, I was originally not using an offset but using densities as my dependent variable. I will try the normal distribution after seeing whether I can normalize the data. I think I was too fixated on using the Poisson even though my data aren't integers.

 

 

Thanks again for being so generous with your time; I think this information is more than enough for me to troubleshoot my issues. I'm currently at a smaller UC school, but I'll be moving to Cornell to do a post-doc in September, so if I'm still having issues, hopefully the consulting there is better than where I currently am.

sld
Rhodochrosite | Level 12

Great! And congratulations on your post-doc. 
