I am a grad student and using GLIMMIX for the first time (I am normally a JMP girl). I have spent a lot of time reading over the GLIMMIX procedure, SAS for mixed models and this forum but I was wondering if someone would be willing to take a look at my model to make sure that I am asking what I think I am asking?
proc glimmix data=batactivity
Class period date;
Model passespernight = trees_plot period canopy sb_killed Lake Urban2 period*canopy Lake*Urban2 Lake*Urban2 / dist = negbin solution
random date / sub=period type= ARH(1);
random _residual_ ;
I sampled bat activity (overdispersed negative binomial count data) in several bouts (date in my model), in many different sites (not repeated measures) over the summer and had from 1-5 detectors out at any one time. I have specific hypothesis that relate to how bat activity is likely to differ during the early part of the summer vs. late (period in my model). Date is random but period is not. Date should be nested within period. I specified an ARH(1) covariance structure because there is unequal time between sampling bouts and I expect that bat activity from sampling bouts that occur closer together are likely to be more correlated. I have also included some other covariates of interest.
Would it be appropriate to specify a covariance structure for the _residuals_ statement?
Any assistance would be most appreciated; also if anyone could suggest any other reading resources on the topic I would be grateful.
Was each site sampled more than once? For example, was each site sampled in both periods, with multiple dates in each period? If so, you'll likely need a variable identifying site in your code, so that site can be used as a subject for repeated measures.
Or was each site sampled on only one date? Or on multiple dates but only in one period?
When you say that you had 1-5 detectors out at any one time, does that mean that there were there 1-5 detectors at a particular site in a particular period on a particular date? If so, what is your intent with respect to the multiple detector values? For example, are you interested in estimating variability among the detectors, or might you use an average count computed over the detectors to represent bat activity at that particular site in that particular period on that particular date?
Or does it mean that you had 1-5 activity counts at 1-5 sites on any particular date?
The ARH(1) covariance structure only makes sense when repeated measures in time are evenly spaced. You can use SP(POW)(time) as an alternative to AR(1) for unequally-spaced levels of time. There is no heterogenous SP analogue for ARH(1), although if one*really* knew what one was doing, one might be able to accomplish a heterogeneous SP(POW) with some clever use of multiple RANDOM statements. I've never attempted that!
Typically (perhaps always), the subject in a RANDOM statement is a random-effects factor (it's a design element, not a treatment element); you probably want something related to site as the subject.
But a focus on covariance structure and the syntax for RANDOM statements is premature. What is your intent with respect to DATE? Intent would depend on how DATE relates to site: whether you have one date at each site, or multiple dates within a period at each site.
Are covariates measured on sites, or do any covariates vary through time (between periods or among dates)?
Hi Susan, thanks for the reply!
In response to your questions:
Each site was only sampled once, so no repeated measures. I had one detector at each site, so I had 1-5 activity counts at 1-5 sites on any particular date. Covariates are measured at each site and do not vary with time.
I thought ARH(1) was the heterogeneous analog of AR(1)? I am certainly willing to give it a try with SP(POW) though. Here is my new model.
proc glimmix data=batactivity
Class period date;
Model passespernight = trees_plot period canopy sb_killed Lake Urban2 period*canopy Lake*Urban2 Lake*Urban2
/ dist = negbin solution ;
random intercept / type= sp(pow)(date2) sub=date(period) ;
random _residual_ ;
I put date (the random effect) as the subject as you suggested and tried different variations of the random statement including:
random intercept / type= sp(pow)(date2) sub=date;
random date(period) / type= sp(pow)(date2) sub=date ;
random date / type= sp(pow)(date2) sub=date;
random date / type= sp(pow)(date2) ;
random date(period) / type= sp(pow)(date2);
I get pretty much the same result. Modeling the covariance structure actually doesn’t seem to change my significance but Gener. Chi-Square/DF and AIC are better.
Hmm. Sites are nested within dates. Your fundamental question is whether bat activity differs between two periods. Now we have to decide which design element "truly" replicates period: site or date or a combination of both?
How far, geographically, apart are your sites? How far, temporally, apart are your dates?
Yes, there is also likely spatial autocorrelation as well. My sites are a minimum of 400 m apart and are all within a 20 km radius (not on a grid). I have tried modeling the spatial variability using the following statement:
random site / type= sp(pow)(x y);
the output says that the model converged but I don't get any associated p-values and the residuals look terrible.
My fundamental question is actually whether bat activity differs with sb_killed (# of trees killed by spruce beetles) but I also had a hypothesis that bats may select different habitats at different times of the year. However, neither hypothesis seems to be supported based upon my model. Oh, well that's the way it goes sometimes. As long as I can get a working model I would be happy. Ecology is messy.
Thanks again for all your help! I was feeling all alone with my stats.....
I missed the temporal part of the question....I have 26 sampling bouts which are not evenly spaced but it is safe to say that each bout is within 3 days of the end of a previous bout (I sampled each site for 2 days).
Regarding unequal spacing of factor levels and the ARH(1) covariance structure: The “heterogeneous” part means that a separate variance will be estimated for each level of the factor (e.g., a variance for each level of DATE). Given that you have 1-5 observations at each DATE, you don’t have enough information to support heterogeneous variance estimation. The heterogeneous part has nothing to do with the spacing of DATE levels. rho is the correlation among observations between levels 1 and 2, between 2 and 3, between 3 and 4, etc. Consequently, if the number of days between levels 1 and 2 is different than the number of days between levels 3 and 4, rho may be nonsensical; this is true for other structures as well, like AR(1) and TOEP. With unequal level spacing, other covariance structures like ANTE(1) may be preferable, although at the price of appreciably more parameter estimates. See Littell et al., SAS for Mixed Models, 2nd ed., p 176.
Not all time variables are repeated measures variables (where multiple observations are made on the same subject). In your study, as I understand it, DATE is not a repeated measures factor: there is no design element (identified as SUBJECT) on which multiple observations have been made at different DATEs. But you are right that DATEs may not be independent: bat counts may be temporally correlated, or they could be functionally independent (meaning that although theoretically correlated, no correlation is apparent in the data). This all suggests that the syntax for a RANDOM statement for DATE might look like
random date(period) / type= g gcorr ;
(notably, without SUBJECT= ). The G and GCORR show you what the estimated structure is; these matrices are usually large so I typically save them to a SAS dataset using ODS OUTPUT, and exclude them from the output window using ODS EXCLUDE. I would expect (and hope) that denominator df for the test of PERIOD would reflect the number of DATEs.
But wait!!! Is this the (or part of the) appropriate model for your study? This approach to DATE does not use sites as replicates of PERIOD. What do you think about that? Are sites replicates, or are they subsamples for a particular date? Are sites individual bat colonies? If so, do you have movement of bats between colonies? Or are detectors set up “in the woods” somewhere?
One approach that uses sites as replicates might be coded as
random site(date) / type= ;
This syntax pool variability among DATEs with variability among sites. A sensible choice of covariance structure could be tricky, since the covariance has both spatial and temporal components. (For an example of what is possible with “ingenuity,” see Dale McLerran’s recent post on SAS-L: http://listserv.uga.edu/cgi-bin/wa?A1=ind0905a&L=sas-l and find the “arh(1)” thread. Dale is a NLMIXED wizard.) I would expect that denominator df for the test of PERIOD would reflect the number of sites.
Another approach might be
random date / type=;
random site(date) / type=;
This syntax partitions variability between DATEs and sites. I’m not sure how this syntax would sort out with respect to a test of PERIOD; I’d like to see denominator df reflect the number of sites minus a few df for DATE. But it might make specifying temporal and spatial covariance structures easier.
I really recommend approaching this analysis problem by starting simple and building up, rather than diving directly into a complex model. First you need to resolve the “what’s a replicate” issue. I would look graphically at the relationship between bat activity and DATE to see whether dates represent a trend or just noise.
Then I would try an appropriate model in GLIMMIX using a normal distribution assumption (maybe with a square-root transformation) with PERIOD only (no covariates yet) and check residuals and such, before patiently pursuing the incorporation of covariates and alternative distributions like the negative binomial.
Regarding data distribution: what is the range of typical values for bat activity? Are counts small, moderate, or large? Are there lots of zeros? If counts are large enough, normal distribution models often work well (or well enough).
With respect to covariates, be sure that the assumption of a linear (or at least monotonic) relationship between the response (which will be on the link scale for a non-normal distribution) and each covariate is appropriate if you are specifying linear relationships in the MODEL statement. This is a biological system: bat activity might peak at an intermediate (i.e., optimal) level of a covariate; this would not be a monotonic, much less linear, relationship and your model would need to accommodate the “true” shape of the relationship.
The last thing I’d address is the covariance structure. My experience with ecological data analysis has been that it’s usually sufficient to use the simplest random-effects structure that generates the appropriate denominator df for tests. Conclusions based on the tests typically (but not always) are consistent for various refinements in the covariance structure. It’s tempting to invoke elaborate, elegant models, but when information is limited (as is common for ecological field data), parsimony is more practical.
Other thoughts: you have “Lake*Urban2” twice in the MODEL statement. You’re also missing several semicolons, so this code obviously is not what you’re actually running. Are Lake and Urban2 both continuous scale variables? What variable represents different habitats?
The analysis of ecological data is an exercise in shaving off enough of the square corners of the data-peg to get it through the round hole of statistical analysis while staying true to the biology. Sounds daunting, but it can be quite fun, especially if it’s your own data!