I am analyzing count (behaviour data) data deriving from a study with cows of different genotype on different farms with repeated measures (on each farm on 3 consecutive days within periods - 3 periods for each of the farms, for each of which approx. 4 weeks lie in between the visits (=periods)).
For my interval data in this study I have build the following mixed model (my dependent variable are means of measurements within a certain time of the day e.g. from 8 to 10.00, 10-14.00, 14-17.00):
proc mixed data=xxxx;
class genotype animal farm day period;
model y = genotype period / solution;
repeated day(period) / subj=animal(farm) type=cs;
first question concerning covariance error structure:
within one period my timepoints are equally spaced, but as there are 4 weeks to subsequent three-days repetition(=period) on the same animals(farm) I seem to have a kind of mixture between equally spaced and unequally spaced timepoints. In Littel et al. - SAS for mixed models I have found type = cs is one covariance error structure which may be used for both equally and unequally spaced timepoints (UN is not running, and AR(1) does not seem to be the right choice (?), as the subjects (in my case the cows) of one genotype haven't really been randomly chosen to account for genotype confounding effects (such as lactation stage, age etc)).
Do you have any advice for appropriate covariance structure?
by the way, proc mixed also doesn't seem to understand the nesting of animal(farm) and day(period) in the repeated statement, independent of the codes for the variables day (numerical or date) and period.
second question concerning count data with zero-inflated distribution:
as I have many zeros in my count data, I think it is best to choose a zero-inflated poisson distribution to fit the data. But in proc glimmix it seems that there is no option for dist=zip or dist=zero-inflated negative binomial (?). Do you have any idea what to do in such a case ???
Maybe there's another solution for zero-inflated distribution and a hierarchical data set ?
I would appreciate any comment and idea, as for this kind of data structure the problems seem to exceed the possible statistical solutions so far...
what I mean by "proc mixed also doesn't seem to understand the nesting of animal(farm) and day(period) in the repeated statement" is that it produces exact the same results (fit criteria, estimates of fixed factor, SE and p-values) if you omit the nesting of day in period and/or animal(farm) in the repeated statement, and if change the code for day(1,2,3 ...9 days per subject = 9 repetitions per subject) or even change the period from 1,2,3 to 1,2,3,...9...
I am helpless in the face of this problem and hope that anyone may have a possible explanation for this.
Thanks very much in advance,
and how SAS will parameterize this statement. The parameterization is the same as:
repeated day*period/subject=animal*farm type=cs;
This makes sense to me that you would get the same results if you respecify as you mention. Try adding the rcorr option after the slash, and take a look at the correlations under your various approaches. They should be identical, as they are restatements of the same approach.
Some alternate approaches might be:
repeated day*period/subject=animal*farm type=csh
repeated day*period/subject=animal*farm type=sp(pow)(time)
where time is actual elapsed days post study start
repeated period day/subject=animal*farm type=UN@AR(1)
which would model an unstructured relationship between periods, but a common autoregressive relationship within periods.
Note also that for the latter two structures, you should include animal*farm in the random statement, so that between animal variability is correctly modeled.
thanks very much for your comments. I have been away for a few days, but my mind has still been engaged in statistics to some extent ....
> Aha. Now I understand part of what is going on.
> Think about the repeated statement:
> repeated day(period)/subject=animal(farm) type=cs;
> and how SAS will parameterize this statement. The
> parameterization is the same as:
> repeated day*period/subject=animal*farm type=cs;
> This makes sense to me that you would get the same
> results if you respecify as you mention. Try adding
> the rcorr option after the slash, and take a look at
> the correlations under your various approaches. They
> should be identical, as they are restatements of the
> same approach.
Your idea has raised some hope, but I do not understand, what the * means in the repeated statement?
I have tried the repeated statement
repeated day(period)/subject=animal(farm) type=cs
with my different codings and the day*period/subject=animal*farm type=cs parametrization.
1. day in date format = 27 different dates/levels, which should be the same code as 1,2,3 ...-27; period = 1,2,3;
2. day numerical = 1,2,3,....9 according to 9 repetitions for each animal, which are cut into 3 periods so that day 1,2,3 = period 1, day 4,5,6= period 2 etc ,
; period = 1,2,3;
3. instead of period 1,2,3 period 1 to 9 (3 visits=periods on each of the 3 farms),
though the latter doesn't make sense, since period is regarded as fixed effect in the model: periods may be regarded as 3 different stages of the whole observation period since there are more days between visit=period 1 and 2 and between visit=period 2 and 3 respectively for each of the 3 farms(farm A: 25 and 25 days respectively, farm B: 24 and 23 days respectively, farm C: 23 and 27 days respectively) than within period 1, 2 and 3 between the different farms - for example 5 days between period 1 on farm A and period 1 on farm B, or 2 days between period 2 on farm B and period 2 on farm C etc....this was due to feasibility reasons, since I was dependent on weather conditions for my observations and measurements].
Yes, correlations are identical for both approaches with specifying as mentioned above under 1. and 2.; and identical if I omit (nesting in) "(period)" or "(farm)" in the repeated statement;
both approaches are also the same if I specify as in 3.(1 to 9 periods=no repetition in periods) but with different correlations than in 1. and 2. (both with period 1,2,3).
But what does this mean?
Does it mean, that I don't have to tell SAS that day 1 in period 1 is not the same as day 1 in period 2 and 3, and day 1 on farm A is not the same as day 1 in farm B respectively
and that e.g. animal 7 just exists on farm A, not on farm B and C? so that I can omit (nesting of days in) "period" and (nesting of animals in) "farm" ?
Does SAS understand that day (1,2,3...to 9) nested in period is the same as day (date format 1 to 27) nested in period ?
Alternatively, I was also thinking about modeling my data with another approach
proc mixed data=xxxx;
class genotype animal farm day period;
model y = genotype period / solution;
random farm day(farm);
repeated period / subj=animal(farm) type=cs;
resulting in a smaller -2 Res Log-Likelihood and AIC... and a much smaller estimate for the residual covariance parameter, but still, it doesn't matter, if I omit "(farm)" in the repeated statement or random statement.
class origin farm animal period date;
model mean_rr1= origin period / solution;
random farm date(farm);
repeated period / subject=animal(date farm) type=cs;
which converged, but the hessian was not positive definite, and SAS notes that a linear combination of covariance parameters is confounded with the residual variance (meaniing?)
> Some alternate approaches might be:
> repeated day*period/subject=animal*farm
> repeated day*period/subject=animal*farm
> where time is actual elapsed days post study start
> or even:
> repeated period day/subject=animal*farm
> which would model an unstructured relationship
> between periods, but a common autoregressive
> relationship within periods.
Do you mean a doubly repeated structure with period day (without *)? if yes, does SAS again know that days are nested within period?
> Note also that for the latter two structures, you
> should include animal*farm in the random statement,
> so that between animal variability is correctly
As for covariance structures, do structures such as UN@AR(1) or sp(pow)(time) with animal*farm in the random statement assume animals(farm) or animals*farm to be randomly chosen (assigned to treatments respectively, which in my case are the different genotypes) ?? This would not be the case in my study, since the cows were chosen based on the same age and same stage of lactation, which are confounders if one wants to look for genotype effects.
Farm is already a random factor, does it make any difference if I include animal*farm additionally?
I think my questions reveal that I am no statistician ....nevertheless, I hope that they make sense somehow.
Do you have some more advice?
lots of thanks in advance,
Wow. This is a very information dense post, and I don't think I can address all of it, so let's try some pieces.
First, nesting in SAS. You need to look at the documentation for MIXED or GLM that covers this. The parameterization of A nested in B and A crossed with B is the same. That is why it makes no difference in the rcorr. Here is what the online manual says:
Nested effects are generated in the same manner as crossed effects. Hence, the design columns generated by the following two statements are the same (but the ordering of the columns is different):
model Y=A B(A);
model Y=A A*B;
The nesting operator in PROC MIXED is more a notational convenience than an operation distinct from crossing. Nested effects are typically characterized by the property that the nested variables never appear as main effects. The order of the variables within nesting parentheses is made to correspond to the order of these variables in the CLASS statement. The order of the columns is such that variables outside the parentheses index faster than those inside the parentheses, and the rightmost nested variables index faster than the leftmost variables (Table 56.18).
Table 56.18 Example of Nested Effects Data
(I deleted the table as it doesn't come across very well)
Note that nested effects are often distinguished from interaction effects by the implied randomization structure of the design. That is, they usually indicate random effects within a fixed-effects framework. The fact that random effects can be modeled directly in the RANDOM statement might make the specification of nested effects in the MODEL statement unnecessary.
The other part that I want to address is the doubly repeated measures. This assumes that the correlation of residuals amongst days is the same, no matter which period we have. I was thinking that observations on the close time points would be more highly related, and probably similar, so that an ar(1) or cs structure might fit. The separated time points (i.e., periods) would have a different relationship.
If you want to deal with all at once without those assumptions, you have unequal spacing and the spatial power method goes after that with the assumption that the errors are "constantly correlated" as in ar(1), but that the points used to estimate that correlation are not equally spaced.
A read of the TYPE= section of the PROC MIXED documentation might point you at the right kind of structure.
Just a short post in reply (comments between your lines):
> First, nesting in SAS. You need to look at the
> documentation for MIXED or GLM that covers this. The
> parameterization of A nested in B and A crossed with
> B is the same. That is why it makes no difference in
> the rcorr. Here is what the online manual says:
> Nested Effects
> Nested effects are generated in the same manner as
> crossed effects. Hence, the design columns generated
> by the following two statements are the same (but the
> ordering of the columns is different):
> model Y=A B(A);
> model Y=A A*B;
> The nesting operator in PROC MIXED is more a
> notational convenience than an operation distinct
> from crossing. Nested effects are typically
> characterized by the property that the nested
> variables never appear as main effects.
Hm, meaning must not appear as main effect ?
In my example the nested variable (within parentheses) period would appear in the model as main effect, since I am interested in the period effect.
Did I miss the full meaning of this?
> Note that nested effects are often distinguished from
> interaction effects by the implied randomization
> structure of the design. That is, they usually
> indicate random effects within a fixed-effects
> framework. The fact that random effects can be
> modeled directly in the RANDOM statement might make
> the specification of nested effects in the MODEL
> statement unnecessary.
Hm, I don't have any nested effect in the MODEL statement anyway, neither in the random statement, but in the repeated statement.
> The other part that I want to address is the doubly
> repeated measures. This assumes that the correlation
> of residuals amongst days is the same, no matter
> which period we have. I was thinking that
> observations on the close time points would be more
> highly related, and probably similar, so that an
> ar(1) or cs structure might fit. The separated time
> points (i.e., periods) would have a different
> If you want to deal with all at once without those
> assumptions, you have unequal spacing and the spatial
> power method goes after that with the assumption that
> the errors are "constantly correlated" as in ar(1),
> but that the points used to estimate that correlation
> are not equally spaced.
I am certainly willing to try your recommendations.
In Proc MIXED I'd prefer the former, and, if I get your point, to do that I have to model doubly repeated measures expressed as: repeated period day / subj=animal*farm type=UN@AR(1) with period in the first place instead of a nested repeated , right?
does the order of these variables make any difference for the type=cov structure option?
and in Proc GLIMMIX (for count data) maybe a spatial power method with sp(pow)(time) with repeated day*period / subj=animal*farm as the UN@AR(1) is not available.
Would you confirm or reject my statements?
Thanks a lot for your patient comments anyway, at least some of the points may be a bit clearer now.
I skipped the thing about animal*farm inclusion in the random statement when using covariance structures as sp(pow)(time) and UN@AR(1) for the repeated structure. Does it imply animals to be a random sample rather than a selected sample or should between animal variability be modeled in this way regardless of such design features/premises?
I'll try for both answers at once. First, the order is critical for the Kronecker product. The first term is always unstructured, while the second may be unstructured (UN), autoregressive (AR(1)), or compound symmetric (CS). Think about which is most appropriate. Your choices for structures seem logical to me.
The second point--the inclusion of the repeated subject in the random statement--for covariate structures with a correlation term (see the TYPE= tab in the documentation) comes from the paper "Statistical Analysis of Repeated Measures Data Using SAS Procedures" Littell et al. J. Anim. Sci. 1998. 76:1216–1231, and has to do with getting proper standard errors when using structures like AR(1) that involve estimating a correlation. Nothing about a random sample, really.
Message was edited by: SteveDenham
The covariance structure type=sp(pow) time for my glimmix model (repeated date*period/subj=animal(farm) and the subject effect additional to the random farm effect in the random statement) seems to work fine (smaller -2 Res LogL and AIC ....closer to zero, respectively, than with cs), the same for type=UN@AR(1) for my mixed model;
but it seems to me that my repeated structure is not correctly modeled as the dimensions table in my output(glimmix) tells me:
Subjects = 1, and max Obs per subject = 234 while I have 26 subjects and 9 observations (3 repetitions per period) per subject
and the model information table states that
variance matrix = not blocked;
Dimensions of G and R-side effects seem to be ok.
similarly, in my output for the mixed model with
the residual variance method = Parameter in the model information and
subject=1 , max obs per subject = 234 in the dimensions table;
So what's wrong with my model?
proc glimmix data=xxxx;
class genotype farm animal period date;
model x =genotype period genotype*period mean_bg / dist=negbin solution;
random farm animal(farm);
random date(period) / subject=animal(farm) type=sp(pow)(time) residual;
I use the univariate format (one row per experimental unit at each timepoint) as emphasized in Littell et al, J. Anim. Sci. 1998. 76:1216–1231 for proc mixed/repeated measurements
and other sources e.g. in http://www.sportsci.org/resource/stats/threetrials.html
I assume it's the same for proc glimmix.
In the SAS glimmix documentation (keyword: processing by subjects) I found the statement that "if a random statement does not have a subject=effect (as I do have in my model: random = farm), processing by subjects is not possible unless the random effect is a pure R-side overdispersion effect".
As far as I understand, processing by subject is a question of how glimmix mathematically processes data but not a question of correctly modeling covariance structures, right ? if not, do I have to choose a subject for farm ???
Anyone a clou about this ?
I also want to look at the significance of my random farm effect by omitting it from the model (in mixed as well as in glimmix) above, but SAS doesn't give a result in the output for the Likelihood Ratio Test.
Is the Likelihood Ratio Test not possible with other effects in the random statement (e.g. the subject effect required by UN@AR(1) and sp(pow)(time)) ?