Hi Lea,
Regarding unequal spacing of factor levels and the ARH(1) covariance structure: The “heterogeneous” part means that a separate variance will be estimated for each level of the factor (e.g., a variance for each level of DATE). Given that you have 1-5 observations at each DATE, you don’t have enough information to support heterogeneous variance estimation. The heterogeneous part has nothing to do with the spacing of DATE levels. The parameter rho is the correlation between observations at adjacent levels: between levels 1 and 2, between 2 and 3, between 3 and 4, etc. Consequently, if the number of days between levels 1 and 2 is different from the number of days between levels 3 and 4, a single rho may be nonsensical; this is true for other structures as well, like AR(1) and TOEP. With unequal level spacing, other covariance structures like ANTE(1) may be preferable, although at the price of appreciably more parameter estimates. See Littell et al., SAS for Mixed Models, 2nd ed., p. 176.
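To make the spacing point concrete, here is the ARH(1) covariance written out. The symbols are my own notation for illustration: k is the number of DATE levels, sigma_i is the standard deviation at level i, and i and j are the ordinal positions of the levels (1, 2, 3, ...), not calendar days.

```latex
\operatorname{Cov}(y_i, y_j) = \sigma_i \, \sigma_j \, \rho^{\,|i-j|},
\qquad i, j = 1, \dots, k
```

Only the position difference |i-j| enters the exponent, so two adjacent levels get the same correlation rho whether they are 1 day or 10 days apart. In terms of parameter counts, ARH(1) estimates k variances plus one rho (k+1 parameters), whereas ANTE(1) estimates k variances plus k-1 separate adjacent-level correlations (2k-1 parameters).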
Not all time variables are repeated measures variables (where multiple observations are made on the same subject). In your study, as I understand it, DATE is not a repeated measures factor: there is no design element (identified as SUBJECT) on which multiple observations have been made at different DATEs. But you are right that DATEs may not be independent: bat counts may be temporally correlated, or they could be functionally independent (meaning that although theoretically correlated, no correlation is apparent in the data). This all suggests that the syntax for a RANDOM statement for DATE might look like
random date(period) / type=
g gcorr ;
(notably, without SUBJECT= ). The G and GCORR show you what the estimated structure is; these matrices are usually large so I typically save them to a SAS dataset using ODS OUTPUT, and exclude them from the output window using ODS EXCLUDE. I would expect (and hope) that denominator df for the test of PERIOD would reflect the number of DATEs.
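As a rough sketch of the ODS handling I mean, something like the following could work. The dataset name (bats), the response name (activity), the output dataset names, and TYPE=VC are all placeholders I made up for illustration; substitute your own names and whatever covariance structure you settle on.

```sas
proc glimmix data=bats;                 /* bats: hypothetical dataset name */
  class period date;
  model activity = period;              /* activity: hypothetical response name */
  /* TYPE=VC is only a placeholder structure for this sketch */
  random date(period) / type=vc g gcorr;
  ods output G=g_est GCorr=gcorr_est;   /* save both matrices to SAS datasets */
  ods exclude G GCorr;                  /* keep them out of the output window */
run;
```

The saved datasets (g_est and gcorr_est here) can then be inspected with PROC PRINT or a viewer at your leisure.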
But wait!!! Is this the (or part of the) appropriate model for your study? This approach to DATE does not use sites as replicates of PERIOD. What do you think about that? Are sites replicates, or are they subsamples for a particular date? Are sites individual bat colonies? If so, do you have movement of bats between colonies? Or are detectors set up “in the woods” somewhere?
One approach that uses sites as replicates might be coded as
random site(date) / type= ;
This syntax pools variability among DATEs with variability among sites. A sensible choice of covariance structure could be tricky, since the covariance has both spatial and temporal components. (For an example of what is possible with “ingenuity,” see Dale McLerran’s recent post on SAS-L: http://listserv.uga.edu/cgi-bin/wa?A1=ind0905a&L=sas-l and find the “arh(1)” thread. Dale is a NLMIXED wizard.) I would expect that denominator df for the test of PERIOD would reflect the number of sites.
Another approach might be
random date / type=;
random site(date) / type=;
This syntax partitions variability between DATEs and sites. I’m not sure how this syntax would sort out with respect to a test of PERIOD; I’d like to see denominator df reflect the number of sites minus a few df for DATE. But it might make specifying temporal and spatial covariance structures easier.
I really recommend approaching this analysis problem by starting simple and building up, rather than diving directly into a complex model. First you need to resolve the “what’s a replicate” issue. I would look graphically at the relationship between bat activity and DATE to see whether dates represent a trend or just noise.
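For the graphical look, a minimal sketch might be a scatter plot of activity against DATE with a loess smooth laid over it. The dataset and response names (bats, activity) are placeholders of mine, and this assumes DATE is stored as a numeric SAS date.

```sas
proc sgplot data=bats;                        /* bats: hypothetical dataset name */
  scatter x=date y=activity / group=period;   /* raw observations, colored by PERIOD */
  loess   x=date y=activity / nomarkers;      /* smooth curve to reveal any trend */
run;
```

If the smooth wanders without direction, DATE is probably just noise; a clear rise, fall, or hump suggests a trend the model should account for.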
Then I would try an appropriate model in GLIMMIX using a normal distribution assumption (maybe with a square-root transformation) with PERIOD only (no covariates yet) and check residuals and such, before patiently pursuing the incorporation of covariates and alternative distributions like the negative binomial.
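A starting model along those lines could be sketched as below. The names bats, bats2, activity, and sqrt_activity are placeholders, and the RANDOM statement is just one of the candidates discussed above; which one is appropriate depends on how the replicate question is resolved.

```sas
data bats2;                        /* bats, bats2: hypothetical dataset names */
  set bats;
  sqrt_activity = sqrt(activity);  /* square-root transform of the count */
run;

ods graphics on;
proc glimmix data=bats2 plots=residualpanel;
  class period date;
  model sqrt_activity = period;    /* normal distribution is the default */
  random date(period);             /* one candidate random-effects structure */
run;
ods graphics off;
```

The residual panel gives a quick check on the normality and constant-variance assumptions before anything more elaborate is tried.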
Regarding data distribution: what is the range of typical values for bat activity? Are counts small, moderate, or large? Are there lots of zeros? If counts are large enough, normal distribution models often work well (or well enough).
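A quick way to answer those questions is a set of summary statistics plus a frequency table of the counts. Again, bats and activity are placeholder names of mine.

```sas
proc means data=bats n min q1 median q3 max mean var;
  var activity;          /* range and typical size of the counts */
run;

proc freq data=bats;
  tables activity;       /* shows how often zero (and each other count) occurs */
run;
```

A variance much larger than the mean hints at overdispersion, which is one reason the negative binomial might eventually be worth considering.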
With respect to covariates, be sure that the assumption of a linear (or at least monotonic) relationship between the response (which will be on the link scale for a non-normal distribution) and each covariate is appropriate if you are specifying linear relationships in the MODEL statement. This is a biological system: bat activity might peak at an intermediate (i.e., optimal) level of a covariate; this would not be a monotonic, much less linear, relationship and your model would need to accommodate the “true” shape of the relationship.
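If a covariate looks like it has an intermediate optimum, one simple (though certainly not the only) way to accommodate that is a quadratic term. In this sketch temp is a made-up continuous covariate, bats and activity are placeholder names, and the RANDOM statement is again just one of the candidates from above.

```sas
proc glimmix data=bats;
  class period date;
  /* temp is NOT in the CLASS statement, so temp*temp is a quadratic term */
  model activity = period temp temp*temp
        / dist=negbin link=log solution;
  random date(period);
run;
```

A negative coefficient on temp*temp would be consistent with activity peaking at an intermediate value of the covariate, on the log (link) scale.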
The last thing I’d address is the covariance structure. My experience with ecological data analysis has been that it’s usually sufficient to use the simplest random-effects structure that generates the appropriate denominator df for tests. Conclusions based on the tests typically (but not always) are consistent for various refinements in the covariance structure. It’s tempting to invoke elaborate, elegant models, but when information is limited (as is common for ecological field data), parsimony is more practical.
Other thoughts: you have “Lake*Urban2” twice in the MODEL statement. You’re also missing several semicolons, so this code obviously is not what you’re actually running. Are Lake and Urban2 both continuous scale variables? What variable represents different habitats?
The analysis of ecological data is an exercise in shaving off enough of the square corners of the data-peg to get it through the round hole of statistical analysis while staying true to the biology. Sounds daunting, but it can be quite fun, especially if it’s your own data!
Cheers,
Susan