Hello all,
I am an environmental science graduate student attempting to use PROC GLIMMIX per the suggestion of my statistics professor. I have a few general questions about using this model due to my pretty underdeveloped stats understanding.
Study Background: I vacuum sampled insects off of 10 walls covered in vines and 10 adjacent blank walls during three separate months last summer. The study mimics an RCBD. The single treatment factor has two levels: green and not green. Each site is treated as a block containing both a blank and a green wall. Insect abundance data from the walls follow a non-normal distribution and lack equality of variance. I could solve this with a simple log transformation but was hoping to avoid losing the ability to discuss my original values, thus the PROC GLIMMIX. An example of my code is shown below involving data from one month of sampling (repeated measures were not used in this study).
data abundancevisit3;
input blk trt $ y;
lines;
1 g 5.867
1 b 0
2 g 6.933
2 b 0.444
3 g 0.8
3 b 0
4 g 5.2
4 b 0.667
5 g 56.267
5 b 1.333
6 g 14.933
6 b 0
7 g 54.133
7 b 0.444
8 g 5.026
8 b 0
9 g 4.8
9 b 1.333
10 g 11.733
10 b 0
;
options nocenter;
proc print data=abundancevisit3;
run;
proc glimmix data=abundancevisit3;
class trt blk;
model y = trt / dist=poisson link=log;
random blk;
lsmeans trt / cl ilink;
run;
Questions
From the output it looks as though my data, or at least my parameters, are being transformed. Does this limit my ability to talk about my original dataset? Or are my data left untransformed and only the parameters transformed? Can I use back-transformed means when discussing results even though they are not the same as my original data means? I guess I just want a simplistic idea of what SAS is doing to my data vs. my parameters. What is being transformed exactly?
My other question involves assessing the fit of the Poisson model. I have seen some code using the Pearson chi-square/DF to determine whether the Poisson distribution is overdispersed, but my output only contains the generalized chi-square/DF. I gather I could change this by inserting a MODEL= statement but have no real idea of what I would put there… any tips on what I need to examine to make sure this model is an OK fit for the analysis? Thank you so much in advance for any tips!
Serena
Did you sample a constant wall surface or did you have to standardize by the sampled area (hence the decimals)?
PG
As PGStats stated, the Poisson distribution is for counts. But your "counts" are decimal values. You must have converted the raw values to some unit scale (volume, etc.). One can do the analysis on these decimal values, but first you want to make sure you really are analyzing counts.
In the procedure code you supplied, you took the fairly common approach (and generally recommended approach) of modeling the log(ExpectedCount) as a linear function of the treatment factor. That is, you told the program to use the log link (which is actually the default for the Poisson). Note that in generalized linear (mixed) models, one models the log of the expected value, not the expected value of a log of the variable (as one would do with transformed Y and normal distributions). The statistical results are all on the log-link scale. That is appropriate. But you also gave the inverse link option; this means that you are also getting the mean counts and confidence intervals on the original scale (just to the right of the log scale results in the LSMEANS output table).
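For what it's worth, the distinction LVM draws here can be seen with a quick numeric sketch outside SAS. This Python snippet uses made-up whole counts (not Serena's actual data): a Poisson GLM with log link models the log of the expected count, whereas a log-transform-then-ANOVA approach models the mean of the logged values, and back-transforming the two gives different answers on the count scale.

```python
import math

# Hypothetical insect counts for one treatment group (invented numbers).
y = [5, 0, 7, 1, 56, 15, 54, 5, 5, 12]

# What a Poisson GLM with log link models: the log of the expected count.
log_of_mean = math.log(sum(y) / len(y))

# What a "log-transform then analyze" approach models: the mean of log(y + 1)
# (the +1 is needed because some counts are zero).
mean_of_logs = sum(math.log(v + 1) for v in y) / len(y)

# Back-transforming each gives different "means" on the count scale:
print(math.exp(log_of_mean))       # the arithmetic mean: 16.0
print(math.exp(mean_of_logs) - 1)  # a geometric-style mean: smaller
```

The first back-transform recovers the ordinary arithmetic mean, which is why GLIMMIX's ilink results stay interpretable on the original count scale; the second does not.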
The default model fitting approach in GLIMMIX with random effects is known as pseudo-likelihood or pseudo-restricted-likelihood. This prevents you from easily evaluating goodness of fit. (I can't get into the technical details here, but in short, the model is being approximated, which prevents direct interpretation of the likelihood obtained.) There are some work-arounds by saving certain results and doing some post-model-fitting analyses. For your model, however, there is a better and more direct approach: use a model fitting method that works without approximating the model. I suggest you use the quadrature method (method=quad). You must write the RANDOM statement in a different way (but equivalent in terms of meaning and interpretation). That is, use
random int / sub=blk;
instead of
random blk;
Full code:
proc glimmix data=abundancevisit3 method=quad;
class trt blk;
model y = trt / dist=poisson link=log;
random int / sub=blk;
lsmeans trt / cl ilink;
run;
Now you get a proper Pearson chi-square statistic and df (and the ratio) to help in evaluating the model fit. For your example, the ratio is under 1; thus, no evidence (from this statistic) of a poor fit. The model is still log(Exp.Count) as a function of treatment. The ilink option gives the means and other statistics on the original scale.
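As an aside, the Pearson chi-square/DF diagnostic itself can be illustrated outside SAS. Here is a hedged Python sketch with simulated data (the `rpois` helper, group sizes, and parameter values are mine, not from the thread): for genuinely Poisson counts the ratio sits near 1, while overdispersed counts (here a Poisson-gamma mixture, i.e. negative binomial) push it well above 1.

```python
import math
import random

random.seed(1)

def rpois(lam):
    """Poisson draw via Knuth's method (fine for modest lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def pearson_ratio(y, mu, n_params):
    """Pearson chi-square divided by residual df for a fitted Poisson model."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Genuinely Poisson counts for two hypothetical treatment groups:
g1 = [rpois(3.0) for _ in range(200)]
g2 = [rpois(12.0) for _ in range(200)]
y = g1 + g2
# For a log-link Poisson model with one binary factor, the fitted
# means are simply the two group sample means:
mu = [sum(g1) / len(g1)] * 200 + [sum(g2) / len(g2)] * 200
ratio_ok = pearson_ratio(y, mu, n_params=2)

# Overdispersed counts: Poisson with a gamma-distributed mean:
g3 = [rpois(random.gammavariate(2.0, 6.0)) for _ in range(200)]
mu3 = [sum(g3) / len(g3)] * 200
ratio_over = pearson_ratio(g3, mu3, n_params=1)

print(round(ratio_ok, 2))    # near 1: consistent with Poisson
print(round(ratio_over, 2))  # well above 1: overdispersion
```

This is only a sketch of the statistic GLIMMIX reports with method=quad, but it shows what the ratio is measuring: excess variability relative to what the Poisson distribution allows.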
Good luck.
LVM
Every so often I read a response and think, "Wow! What a great answer!" This is one of those times. Thanks, lvm!
My feeling, exactly, Rick. LVM, you're a true expert. Thanks.
PG
Hello,
I would like to get back to the question of the nature of the dependent variable, the counts. Scaled counts are not counts: they do not follow a Poisson distribution. I think it is preferable to use real counts and to model the sampling unit size (time, area, volume, whatever) explicitly with an offset. For example, in SMATT1's data, if y = count / area, then LVM's analysis would become:
proc glimmix data=abundancevisit3 method=quad;
class trt blk;
LogArea = log(area);
model count = trt / dist=poisson link=log offset=LogArea;
random int / sub=blk;
lsmeans trt / cl ilink;
run;
I simulated different counts corresponding to a few areas that could have produced SMATT1's data. The assessment of a reasonable Poisson fit (chi-square/DF) holds only up to about an area of 5 (i.e. up to counts = 5y). It thus seems important to use the true scale of the counts to assess the fit. The same holds for estimating the significance of effects.
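PGStats's point that scaled counts are not counts can be checked numerically. A small Python sketch (the area and density values below are hypothetical, chosen to resemble the quadrat setup): dividing Poisson counts by the sampled area breaks the variance = mean property that the Poisson fit relies on.

```python
import math
import random

random.seed(2)

def rpois(lam):
    """Poisson draw via Knuth's method."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

area = 0.75      # hypothetical sampled area per unit, in m^2
density = 8.0    # hypothetical true insects per m^2
counts = [rpois(density * area) for _ in range(5000)]
scaled = [c / area for c in counts]   # "counts per m^2"

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

mc, vc = mean_var(counts)
ms, vs = mean_var(scaled)

# Raw counts satisfy the Poisson property variance = mean ...
print(round(vc / mc, 2))   # close to 1
# ... but scaled counts do not: their variance/mean is about 1/area.
print(round(vs / ms, 2))   # close to 1 / 0.75, i.e. about 1.33
```

The offset approach sidesteps this by keeping the response as a true count and moving the area adjustment into the linear predictor as log(area).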
PG
Good points made by PGStats. Use of offset is a good practice.
Thanks for the compliments. However, I just consider myself a serious user of linear, generalized, and nonlinear mixed models.
Thank you for these helpful comments!!
LVM: In summary, I want to make sure I understand. "Modeling the log(ExpectedCount) as a linear function of the treatment factor" means that, based on my data, GLIMMIX generates expected values on a log scale. It then uses these to determine if the treatment significantly explains differences in expected OR actual counts OR both? I guess I did not realize that GLIMMIX generates expected data.
PGStats: The data contain decimals because the data were normalized by sample size. Ten 0.75 m^2 quadrats were sampled at each wall (10 subsamples). I divided the average abundance of insects per quadrat by 0.75 m^2 to obtain the average per m^2. From my understanding it seems like the 0.75 m^2 value falls within that up-to-count=5y limit you discussed? I am a little hesitant to include the offset mainly b/c I am not totally clear on what that is doing. If I use the non-normalized data, why include area at all? I am guessing the offset somehow normalizes for sample area within the model. Area is log'd b/c expected counts are presented on a log scale. Or something!
Thank you all, so informative and helpful!
You are correct, Serena: if your sample size was always the same, there is no need to include an offset in your analysis. Simply use raw counts instead of densities. Assess the fit of your three months' observations that way. Maybe LVM can advise you on the robustness of the analysis to departures from the Poisson distribution. GLIMMIX offers the negative binomial as an alternative distribution for counts.
PG
Regarding the general question on the model....Let mu equal the (conditional) expected count for a given value of treatment (trt; fixed effect) and block (blk; random effect). The combination of your model and random statements (with the options) is really specifying the following model.
ln(mu) = constant + trt + blk,
with mu = exp(ln(mu)),
y ~ Poisson(mu).
So yes, GLIMMIX is modeling the link of the expected value for the specified distribution. The LSMEANS statement is giving you ln(mu) for each trt level; because of the ilink option, you are also getting mu = exp(ln(mu)) for each treatment level. Based on the Pearson chi-square for your data, Poisson seems reasonable.
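To make the model above concrete, here is a hedged Python simulation of exactly this setup (all parameter values are invented for illustration): counts are generated from ln(mu) = constant + trt + blk with a normally distributed block effect, and the treatment effect comes out multiplicative on the count scale, which is what back-transforming the log-scale LSMEANS difference (the ilink idea) reports.

```python
import math
import random

random.seed(3)

def rpois(lam):
    """Poisson draw via Knuth's method."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Made-up parameters on the log-link scale:
const, trt_effect, sigma_blk = 0.5, 2.0, 0.8

blank, green = [], []
for _ in range(2000):                  # many blocks so the averages settle
    b = random.gauss(0.0, sigma_blk)   # block effect, shared by both walls
    blank.append(rpois(math.exp(const + b)))               # trt = blank
    green.append(rpois(math.exp(const + trt_effect + b)))  # trt = green

mean_b = sum(blank) / len(blank)
mean_g = sum(green) / len(green)

# On the count scale the treatment effect is multiplicative:
# mean_g / mean_b estimates exp(trt_effect) = e^2, about 7.39.
print(round(mean_g / mean_b, 1))
```

In other words, a log-scale treatment difference of 2 means green walls have roughly e^2 times the expected insect count of blank walls, regardless of block.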
Based on your most recent post, you may not be getting the maximum use of your data. You indicated 10 samples for each wall (block) and treatment. You could actually be specifying all 10 observations for each trt and blk combination, using a label of 1-10. That is, each record would have a trt, blk, and sample code, together with y (data set with 10 times the number of records). A model could then be fit of the form:
proc glimmix data=abundancevisit3 method=quad;
class trt blk sample;
model y = trt / dist=poisson link=log;
random int sample / sub=blk;
lsmeans trt / cl ilink;
run;
But if my experimental unit is a wall, then the 10 quadrat samples taken on the wall are just subsamples of the EU and must be averaged...right? Otherwise aren't I introducing some sort of false replication?
Thanks again,
S
The new code I gave you is intended to properly treat the samples as true samples within the experimental units (not as independent replicates). However, I wrote my code too quickly. I recommend:
proc glimmix data=abundancevisit3 method=quad;
class trt blk sample;
model y = trt / dist=poisson link=log;
random int trt / sub=blk;
lsmeans trt / cl ilink;
run;
This will handle the samples as samples, if you take this approach. Or, just deal with the simpler situation dealt with originally.
Thanks again
Last questions on this topic-I promise!
If I include subsamples, does it matter that I have unequal replication of subsamples for the two different treatments? Blank walls did not require as many quadrats as vegetated ones, thus there were 3 on the blank and 10 on the vegetated... (although I still had equal replication of experimental units). I am not sure if this would require modifications to the code at all.
And, LVM mentioned that Pearson chi-square/DF less than 1 shows adequate fit of the Poisson; is there a lower limit of some sort? I wasn't sure if lack of fit is shown by departures from 1 or by actually being greater than 1. Data from my second sampling visit yields a value of 0.17, and I am not sure if this is too low.
Serena