Because of the assumptions of GLMMs, I know that data transformations to meet normality etc. aren't necessary. However, I couldn't find an answer on whether variance-stabilizing transformations are inappropriate.
I'm using PROC GLIMMIX with insect count data to look at the effects of Climate and Season on counts. I sampled counts on leaves (leaf area used as an offset) over the course of 26 sample intervals (2 weeks apart). Intervals are categorized into three separate seasons based on mean temperatures. Different sites are used as replicates of climate type. Sample interval is a repeated measure. The parasitism rate from the previous interval was used as a covariate for observed host densities in the current interval.
My code looks something like this (let me know if you see an issue here as well; I'm new to GLMMs and the syntax was difficult).
proc glimmix data=final;
  class site interval season climate;
  model counts = interval(season) season climate previous_para
        / offset=leafarea dist=poi link=log s;
  random intercept / subject=site;
  random interval / subject=climate residual;        /* this line and the next partition variability between interval and sites */
  random site(interval) / subject=climate residual;
  random _residual_;                                 /* corrects observed overdispersion */
  lsmeans season climate / cl pdiff plot=meanplot adjust=tukey;
run;
Using the untransformed data, my fit statistics look like this:

| Fit Statistics | |
|---|---|
| -2 Res Log Pseudo-Likelihood | 481.96 |
| Pseudo-AIC | 487.96 |
| Pseudo-AICC | 488.13 |
| Pseudo-BIC | 481.96 |
| Pseudo-CAIC | 484.96 |
| Pseudo-HQIC | 481.96 |
| Generalized Chi-Square | 4454.47 |
| Gener. Chi-Square / DF | 31.15 |
After square-root transforming, I get these fit statistics. The Pearson chi-square/DF looks much better.

| Fit Statistics | |
|---|---|
| -2 Res Log Pseudo-Likelihood | 362.94 |
| Pseudo-AIC | 368.94 |
| Pseudo-AICC | 369.11 |
| Pseudo-BIC | 362.94 |
| Pseudo-CAIC | 365.94 |
| Pseudo-HQIC | 362.94 |
| Generalized Chi-Square | 169.92 |
| Gener. Chi-Square / DF | 1.19 |
So my question is: is square-root transforming acceptable or not, and why?
Many Thanks!
If I were to answer only your last question, I would say, No. But I think there are many additional considerations.
Because the variance is a function of the mean for the Poisson distribution (specifically, the variance is equal to the mean), there may be no need for "variance stabilization" as in models assuming a normal distribution; the variance will "naturally" increase as the mean increases. However, it is possible that overdispersion exists for a particular data set in a particular model; there are different ways to deal with overdispersion, including using the negative binomial distribution rather than the Poisson or fitting a quasi-Poisson using random _residual_.
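For example (a sketch only; the effects shown are placeholders, not your full model):

/* Option 1: negative binomial in place of Poisson */
proc glimmix data=final;
  class site climate season;
  model counts = climate season / dist=negbin link=log;
  random intercept / subject=site;
run;

/* Option 2: quasi-Poisson-style overdispersion via a multiplicative scale parameter */
proc glimmix data=final;
  class site climate season;
  model counts = climate season / dist=poisson link=log;
  random intercept / subject=site;
  random _residual_;
run;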
I doubt that your current model is correct. We first need clarification on the design: did you measure counts on the same leaf in each of the 26 intervals, or did you measure counts on different leaves in different intervals? I presume leaves are nested within sites, but are leaves also clustered by plants (which are nested within sites) or subject to some other design constraint?
Other thoughts:
1) To achieve count per unit leaf area, you need to use ln(leafarea) as the offset, not plain leafarea (see the data-step sketch just after this list).
2) When you incorporate previous_para as a simple covariate in the model, you are assuming that previous_para is not affected by either climate or season. That seems unlikely, and if so it is a nontrivial modeling problem. You may want to consider a structural equation model (a rough sketch follows the list).
3) "subject=climate" in RANDOM statements is wrong if climate is a fixed-effects factor.
4) The assignment of 26 (consecutive?) intervals to 3 seasons could be arbitrary and hard to justify. Or it could be fine, as long as you can adequately justify how and why you categorized what might otherwise be treated as a continuous-scale variable. You may want to consider regressing on interval and dropping season.
5) You may want to consider including interactions among climate, season, and interval(season).
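For point 1, the offset can be computed in a data step beforehand (a minimal sketch; names illustrative):

data final;
  set final;
  ln_leafarea = log(leafarea);  /* log scale, so the model estimates count per unit leaf area */
run;

and then use offset=ln_leafarea in the MODEL statement.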
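For point 2, SAS fits SEMs with PROC CALIS. A very rough sketch of the idea only (CALIS has no CLASS statement, so climate_num and season_num are hypothetical numeric codings, and this ignores the random effects and the count distribution entirely):

proc calis data=final;
  path
    previous_para <- climate_num season_num,                /* parasitism itself responds to climate/season */
    counts        <- previous_para climate_num season_num;  /* counts respond to parasitism as well */
run;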
Many of the problems that I see in your model specification are not specific to using a Poisson distribution rather than a normal distribution. So I recommend careful study of SAS® for Mixed Models, Second Edition, which deals largely with normal-errors models but is still a valuable resource for you. Other model considerations are specific to GL(M)Ms, for which Generalized Linear Mixed Models: Modern Concepts, Methods and Applications is an excellent (if dense) resource. It's not only the syntax that is difficult; the concepts can be as well 🙂
I hope this helps.
This is my first time really dealing with complex mixed models, so I appreciate the feedback. It's taken me hundreds of hours of reading to get to this point, but I know the model still isn't perfect.
Just to add/answer some of your comments.
Yes, right now I am using random _residual_ in the model, but it still seems rather overdispersed. The chi-square/df goes from ~90 down to 32 when that term is included, but I can only get close to 1 when I square-root transform the data. I've tried a negbin model, but have convergence issues.
30 new leaves are sampled from plants every interval. Site = plant in my case because I only have one study plant per site. I was taking leaves from three separate sections of the canopy, but we decided to eliminate that from the model due to a list of issues, including sections not being true replicates.
1) Yes, I am using the log of leaf area.
2) I've never heard of a structural equation model, so I will do some research into it.
3) Yes, I will remove subject=climate. I originally had it as "random interval / subject=site residual". The intent was to treat interval as a repeated measure on sites; was that correct?
4) There is a lot more going on here with seasons that I didn't really get into. I showed that climate types vary by temperature and so do intervals, so if I see seasonal/climate differences they can in part be attributed to temperature differences. I can't directly model temperature because temperature is confounded with site, since a single temperature reading corresponds to a site. Intervals are assigned to seasons based on mean monthly temperatures (0-15 degrees, 15.01-20, and 20.01-25). Assigning intervals to conventional seasons such as winter/fall/spring/summer wouldn't help us interpret seasonal effects in this case, and southern CA doesn't have traditional seasons.
5) I got rid of a climate*season interaction a while ago because its p-values were huge. I'll have to consider an interval interaction, but I think with 26 intervals it would be difficult to interpret.
For various problems, including non-convergence, read Tips and Strategies for Mixed Modeling with SAS/STAT® Procedures and Advanced Techniques for Fitting Mixed Models Using SAS/STAT® Software. But first you'll want to be sure you are using the correct syntax; there is scant point in worrying about non-convergence in a wrong model.
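A couple of generic knobs that sometimes help with convergence (illustrative only, and no substitute for a correctly specified model):

proc glimmix data=final method=laplace;  /* a true-likelihood method; the default is pseudo-likelihood */
  class site climate season;
  model counts = climate season / dist=negbin link=log;
  random intercept / subject=site;
  nloptions maxiter=500;                 /* allow the optimizer more iterations */
run;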
I would think that you could include section as a fixed effects factor, although it is true that levels of section cannot be randomly assigned to the 3 plant "divisions". (We generally don't randomly assign gender to animals either, and yet we still use gender as a predictor variable.) But perhaps you are thinking of section as a random effects factor, in which case section is a subsample, not a true replicate, as are individual leaves.
Do you have 10 leaves from each of the 3 sections? If not, and if section matters, then you could have a bias problem.
For a biology-oriented introduction to structural equation modeling, see Ch. 8 by Grace, Scheiner, and Schoolmaster in Ecological Statistics: Contemporary Theory and Application.
Yes, interval is a factor associated with repeated measurements on sites. That would be closer to correct, but you are not using the residual option correctly. Only RANDOM statements that specify elements of the R matrix should include the residual option; RANDOM statements that specify elements of the G matrix should not. Site and repeated measures on sites are specified in G; leaves are specified in R.
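To make the distinction concrete (a sketch; the subject structure here is illustrative):

random intercept / subject=site(climate);   /* G-side: site effects; no RESIDUAL option */
random interval  / subject=site(climate);   /* G-side: repeated measures on sites */
random _residual_;                          /* R-side: residual variation among leaf-level observations */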
If temperature is a site-level fixed effect, you can model it (that's what mixed models are good at) as a regression. If climate is just a categorized version of temperature, then I would consider using temperature instead. But climate as a factor may be a more complex concept, involving something more than just temperature.
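For example (a sketch only; temperature is an assumed numeric site-level variable, and ln_leafarea is the log-scale offset from point 1 above):

proc glimmix data=final;
  class site interval;
  model counts = temperature interval / offset=ln_leafarea dist=negbin link=log;  /* temperature as a continuous covariate */
  random intercept / subject=site;   /* site-to-site variation beyond temperature */
run;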
I do not have a full understanding of how and why you are defining "season"; it still strikes me as an arbitrary categorization. But moving forward, consider a model with only interval (omitting season) for the moment because it is simpler than a model with intervals nested within seasons:
data final;
  set final;
  ln_leafarea = log(leafarea);  /* GLIMMIX has no programming statements, so compute the offset beforehand */
run;

proc glimmix data=final;
  class site climate interval;
  model counts = climate interval climate*interval
        / offset=ln_leafarea dist=negbin;
  random intercept / subject=site(climate);
  random interval / subject=site(climate);
run;
You could explore different forms of temporal autocorrelation among the repeated measurements by replacing the two RANDOM statements above with something like
random interval / subject=site(climate) type=<whatever>;
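For example, type=ar(1) specifies a first-order autoregressive structure, often a reasonable starting point for equally spaced intervals:

random interval / subject=site(climate) type=ar(1);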
As you note, interpretation of a factor with 26 levels is a challenge. Rather than incorporate interval as a CLASS factor, you might regress on interval, possibly with a curvilinear model, possibly with a spline model (see https://blogs.sas.com/content/iml/2017/04/19/restricted-cubic-splines-sas.html) or another smoother (see https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_glimmix_sec...).
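One way to sketch the spline idea is with the EFFECT statement (interval_num is a hypothetical numeric copy of interval, which must stay out of the CLASS statement):

proc glimmix data=final;
  class site climate;
  effect sp_interval = spline(interval_num / naturalcubic knotmethod=percentiles(5));
  model counts = climate sp_interval / offset=ln_leafarea dist=negbin link=log;
  random intercept / subject=site(climate);
run;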
Lots of different ways to envision the analysis. The challenge is knowing what possibilities exist, identifying the ones that validly represent the structure of the experiment, and then selecting a "best" among those while maintaining objectivity. (Frank Harrell has written, "Using the data to guide the data analysis is almost as dangerous as not doing so.")
Is previous_para the value for counts in the previous interval? Or does counts measure one insect species and previous_para a different insect species?
I hope this helps. If you have access to a statistician at your institution, take advantage of that opportunity.
Thanks a lot for your in depth reply and suggestions.
Sections are determined by canopy height: 0-33%, 34-66%, and 67-100%. I then took 10 leaves randomly selected from each section based on a direction (0-360 degrees). This is part of my dissertation, and my PI wasn't too keen on including section, but I feel a lot of the variation in counts can be captured by section because the insect likes the shaded parts of the plants and densities decline as canopy height increases. I've been using averages of 7020 observations (30 leaves x 9 sites x 26 intervals) for counts across the study period. This breaks down into 234 means that go into the model and looks like this (subset shown) for each interval.
proc sort;
  by interval season climate site logarea previous_para;
run;

proc means mean n noprint;              /* means over the 30 leaves within each site x interval */
  var total logarea;
  by interval season climate site previous_para;
  output out=new mean=mnqtot mnarea n=n;
run;

proc sort data=new;
  by interval season climate site previous_para;
run;

proc means data=new mean n noprint;     /* second pass: means of the means */
  var mnqtot mnarea;
  by interval season climate site previous_para;
  output out=final mean=totalfin areafin n=n;
run;

proc sort data=final;
  by interval season climate site previous_para;
run;

proc print data=final;
run;
| Obs | interval | season | climate | site | previous_para | totalfin | areafin | n |
|---|---|---|---|---|---|---|---|---|
| 10 | 2 | Warm | Coastal | 1 | 0.660 | 10.67 | 3.09633 | 1 |
| 11 | 2 | Warm | Coastal | 2 | 0.912 | 306.97 | 4.24529 | 1 |
| 12 | 2 | Warm | Coastal | 3 | 0.495 | 514.60 | 3.05256 | 1 |
| 13 | 2 | Warm | Coastal | 4 | 0.722 | 352.17 | 2.79876 | 1 |
| 14 | 2 | Warm | Coastal | 5 | 0.426 | 506.83 | 3.52447 | 1 |
| 15 | 2 | Warm | Inland | 8 | 0.648 | 139.67 | 3.43376 | 1 |
| 16 | 2 | Warm | Inland | 9 | 0.801 | 73.43 | 2.91858 | 1 |
| 17 | 2 | Warm | Interm | 6 | 0.680 | 214.57 | 3.32450 | 1 |
| 18 | 2 | Warm | Interm | 7 | 0.429 | 87.10 | 2.69907 | 1 |
@sld wrote:
Yes, interval is a factor associated with repeated measurements on sites. That would be closer to correct, but you are not using the residual option correctly. Only RANDOM statements that specify elements of the R matrix should include the residual option; RANDOM statements that specify elements of the G matrix should not. Site and repeated measures on sites are specified in G; leaves are specified in R.
This point may well be the issue. I was interpreting the RANDOM statement in PROC GLIMMIX as equivalent to the REPEATED statement in PROC MIXED, so that anytime you had a repeated measurement it had to be included.
For example, on page 9 of the Advanced Techniques for Fitting Mixed Models paper you cited, it says:
"Or, suppose you have the following REPEATED statement in PROC MIXED with a repeated effect of Time:
repeated Time / subject=Block type=ar(1);
Removing the RANDOM statement prevents even the most simplified version of this model from converging. So you are saying that if I'm using site averages rather than leaf counts, I have no R-side effects in my current model?
Yes, climate designations aren't just temperature. They are based on southern California climate zones: coastal (USDA Hardiness Zone 10b; Sunset Climate Zone 24), intermediate (USDA Hardiness Zone 10a; Sunset Climate Zones 20–22), and inland (USDA Hardiness Zones 9a/9b; Sunset Climate Zones 18–19).
Counts are of the pest insect (whitefly). Previous parasitism is the total parasitism of the whitefly by 3 parasitic wasps observed in the previous interval, which could be affecting whitefly densities in the subsequent interval.
It looks like I still have a lot of reading to do on adding regressions/splines to a mixed model.
Thanks!
Seems like section could be appropriate to include....
Please describe the input data set. I can guess, based on your code, but knowing for sure would be preferable. I've been thinking that each observation in the data set was a leaf, with integer counts, continuous leafarea, and integer previous_para, and that counts would suggest a Poisson (or negative binomial) distribution, being both integer and presumably small-ish. But now I'm thinking that each observation is compiled over multiple leaves (e.g., 30), with mean counts and mean leafarea (and mean previous_para?). (Digression: are you sure that you want to compute means BY previous_para? I would think you would want the mean OF previous_para, to match counts and leafarea. And doesn't your second PROC MEANS compute means over groups of n=1, and hence is unnecessary?)
Theoretically, the Poisson distribution is appropriate for integer values. Your mean counts are not integers, but GLIMMIX will run with them anyway. (Some R packages will not.) As the Poisson mean increases, the distribution can be approximated by the normal distribution; your means are large, at least in the small table you've included, so you could ponder trying a model that assumes normality.
In PROC MIXED, you have RANDOM and REPEATED statements. In PROC GLIMMIX you have RANDOM and RANDOM/RESIDUAL statements, respectively. A model always has an R matrix; it may (if it is a mixed model) have a G matrix, or it may not.
The structure of the data set needs to be reflected in the RANDOM statements in GLIMMIX, so it's critical that the person who suggests code (i.e., me) gets the data set structure right. For example, if my supposition about the actual data set structure is correct, then I would add RESIDUAL to the second RANDOM statement in my suggested code (or to the suggested RANDOM statement for temporal autocorrelation).
I'm surprised that removing all RANDOM statements results in non-convergence. But I'd have to see the actual code to have any insight. Regardless, your experimental design requires a mixed model so it could be a moot point.
There is always something more to learn about mixed models, particularly generalized. And the goalposts keep moving.
Here is a sample of the input data for site 9 (which is an inland climate type) at interval 26. The log of leaf area is taken after the datalines step.
input site leaf climate$ interval count leafarea previous_para;
if interval <= 7 then season = "Warm";             /* intervals 1-7 */
else if interval = 8 then season = "Mod";
else if interval in (9, 10) then season = "Cool";
/* etc. for the remaining intervals */
---------------------------------------------------------
9 6992 Inland 26 0 10.767 0.708
9 6993 Inland 26 14 8.985 0.708
9 6994 Inland 26 26 10.001 0.708
9 6995 Inland 26 0 23.441 0.708
9 6996 Inland 26 67 24.972 0.708
9 6997 Inland 26 5 20.839 0.708
9 6998 Inland 26 23 19.180 0.708
9 6999 Inland 26 27 17.171 0.708
9 7000 Inland 26 96 28.089 0.708
9 7001 Inland 26 0 11.887 0.708
9 7002 Inland 26 63 18.145 0.708
9 7003 Inland 26 0 26.068 0.708
9 7004 Inland 26 0 26.864 0.708
9 7005 Inland 26 0 13.516 0.708
9 7006 Inland 26 0 15.965 0.708
9 7007 Inland 26 0 14.251 0.708
9 7008 Inland 26 26 15.475 0.708
9 7009 Inland 26 0 25.854 0.708
9 7010 Inland 26 57 42.570 0.708
9 7011 Inland 26 8 19.700 0.708
9 7012 Inland 26 35 29.558 0.708
9 7013 Inland 26 54 28.970 0.708
9 7014 Inland 26 9 20.282 0.708
9 7015 Inland 26 13 17.331 0.708
9 7016 Inland 26 42 17.435 0.708
9 7017 Inland 26 3 20.557 0.708
9 7018 Inland 26 0 27.391 0.708
9 7019 Inland 26 0 29.681 0.708
9 7020 Inland 26 0 24.170 0.708
The mean previous parasitism for each site is already calculated at a specific subsequent interval (in this case 70.8%, from site 9 during interval 25) and included in the data input, so there is no need to average it again in PROC MEANS. Yes, the second PROC MEANS is unnecessary.
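So a single pass would do it; something like this (with have standing in for the leaf-level data set name):

proc sort data=have;
  by interval season climate site previous_para;
run;

proc means data=have mean noprint;    /* one pass: means over leaves within each site x interval */
  var total logarea;
  by interval season climate site previous_para;
  output out=final mean=totalfin areafin;
run;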
Proc Means would spit out this observation for that specific site/interval to run in the model.
| Obs | interval | season | climate | site | previous_para | totalfin | areafin |
|---|---|---|---|---|---|---|---|
| 232 | 26 | Warm | Inland | 9 | 0.708 | 18.93 | 2.95681 |
This is the current code that works.
proc glimmix data=final;
  class site interval season climate;
  model totalfin = interval(season) season climate
        / offset=areafin dist=poi link=log solution;
  random intercept / subject=site(climate);
  random interval(season) / subject=site(climate) residual;
  random _residual_;
  lsmeans season climate / cl pdiff plot=meanplot adjust=tukey;
run;
P.S. I've tried stats counseling on campus, but all they ever provide is stats Ph.D. students who, I've found, don't yet have the expertise to really solve these difficult questions, especially when programming is involved. So this is more progress than I've been able to achieve in the last month 🙂
If you want to use the means over leaves in the model fitting process, then I would take it a step further: compute density = totalfin/areafin, make density the response, drop the offset, and assume a normal distribution (possibly after a log transformation).
The point of the offset is to obtain a density with count data. We use the offset because we can't use density per se with a Poisson distribution (which is a distribution for integer counts). Except that GLIMMIX does, in fact, allow a non-integer response with a Poisson distribution. Hmm. So it runs, and it may work OK, or it may not, but it does not sit well with me. I would use leaf-level observations with a Poisson/negative binomial distribution, or means-over-leaves observations with a normal distribution. You would expect that, if both analyses ran well and correctly, you would get the same story from each.
The advantage of the means-over-leaves-level observations with a normal distribution approach is that the design structure of the model is simpler (there's no hierarchical level for leaves) and that the normal distribution generally behaves better (with respect to convergence, etc.). If the number of leaves is very unbalanced, then I'd prefer the leaf-level with Poisson/NB approach.
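A sketch of the means-with-normal route (this assumes areafin holds the mean leaf area on the original, unlogged scale; data set and variable names are illustrative):

data final2;
  set final;
  density = totalfin / areafin;   /* mean count per unit leaf area */
run;

proc glimmix data=final2;
  class site climate interval;
  model density = climate interval climate*interval / dist=normal s;
  random intercept / subject=site(climate);
  random interval / subject=site(climate);
run;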
I believe in across-the-table statistical consultation, so I do like to make that suggestion. But I also know that what you really need is an applied statistician with expertise in your general field of study (for example, biology rather than sociology). Consulting centers staffed by stat grad students are excellent for the stat grad students, often not so great for the clients; it depends on the extent to which stat faculty are involved when the grad student has hit his/her knowledge limit. At my institution, there is a staff statistician in biological sciences, a staff statistician in education, and a staff statistician in agricultural sciences, supported by those administrative units and serving clients from only those units, not the institution as a whole. Researchers in those units are very happy.
If you are at a land grant institution with an agricultural experiment station, see whether the AES has a statistical staff (many do) and whether they might be able to help.
I hope this continues to help.
Yeah, I was originally not using an offset but using densities as my dependent variable. I will try the normal distribution after seeing whether I can normalize the data. I think I was fixated too much on using the Poisson even though my data aren't integers.
Thanks again for being so generous with your time; I think this information is more than enough for me to troubleshoot my issues. I'm currently at a smaller UC school, but I'll be moving to Cornell for a post-doc in September, so if I'm still having issues hopefully the consulting there is superior to where I currently am.
Great! And congratulations on your post-doc.