Dear Brain trust,
I am submitting a challenge I am trying to solve in analyzing my data.
I collected data from several dairy farms, recording some farm-specific parameters in both summer and winter (so repeated measurements). On every farm I visited I focused on the calves (up to 12) and counted the number of calves with a specific event (cons3). So my dataset is in the format:
farm/season/Cov1_farm/..../Covn_farm /n_cons/total_calves
I want to assess the association between the farm-level covariates (Cov1_farm, ...) and the number of cons calves (n_cons) out of the total number of tested calves (total_calves).
I first used a logit link in Glimmix with farm as a random intercept:
proc glimmix data=dairy noitprint noclprint ic=q chol or method=quad (qpoints=7) ;
class farm season Cov1_farm;
model n_cons/Total_calves = season Cov1_farm / cl solution link=logit dist=binomial chisq oddsratio;
random intercept / subject=farm ;
run;
The feedback I got from my collaborators was discordant. The first one told me that this model describes what happens at the calf level and not at the farm level, since it models the proportion of calves with cons using farm-level covariates. The second one said that this model was correct for investigating specific herd factors (so the model was at the farm level). Both told me that Poisson or negative binomial models would also be suitable (count of abnormal calves out of n calves at risk) for looking at these herd-specific factors, taking into account that each farm was visited in 2 different seasons.
I would be very grateful if you had any advice on the following questions:
1) Am I correct to say that the events/trials model above, using logistic regression, lets me assess herd-level rather than individual-animal associations with risk?
2) How do I fit a Poisson or negative binomial model to clustered data (each farm measured twice) using glimmix or genmod?
I suppose that I could then look at fit measures to determine which model (logistic vs Poisson or NB) I should choose.
Thanks a lot.
Seb
I think POISSON and NEGBIN are not appropriate for your data since they would, for example, assign positive probability (likelihood) to outcomes where n_cons > total_calves.
I agree with @PGStats: neither Poisson nor negative binomial appears to be appropriate. I'd say binomial, which is what your model specification would use by default.
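To make that point concrete, here is a small sketch in a DATA step. The values (12 calves, event probability 0.5) are hypothetical and chosen only for illustration; the dataset name poisson_vs_binomial is also made up. It compares the probability that each distribution assigns to counts larger than the number of calves examined.

```sas
/* Hypothetical values, for illustration only (not from the poster's data) */
data poisson_vs_binomial;
  total_calves = 12;       /* calves examined at one farm visit        */
  p  = 0.5;                /* assumed probability of the event         */
  mu = total_calves * p;   /* Poisson mean matching the binomial mean  */

  /* P(count > total_calves): outcomes that cannot occur in the data   */
  p_over_poisson  = 1 - cdf('POISSON',  total_calves, mu);
  p_over_binomial = 1 - cdf('BINOMIAL', total_calves, p, total_calves);
run;

proc print data=poisson_vs_binomial noobs;
run;
```

The Poisson tail probability is strictly positive, while the binomial one is zero, because a binomial count can never exceed its number of trials.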
I can't say whether your model is correctly specified without knowing (1) whether you observed the same calves in both seasons and (2) whether you are thinking of FARM as a random effects factor (i.e., in which case farm is the fundamental replicate) or as a fixed effects factor (in which case, calf is the replicate). (I'm fond of the paper by Bennington and Thayne in Ecology 1994 on this topic.) I'm guessing that FARM is random (that you are thinking of your farms as a random sample of a larger statistical population of farms to which you would like to make inference), and I'm guessing that you have different calves in different seasons. If so, then I would consider
model n_cons/Total_calves = season Cov1_farm / cl solution link=logit dist=binomial chisq oddsratio;
random intercept season / subject=farm ;
This model is a "model at the farm level" because it clusters (or nests) calves within farms and considers farms to be replicates.
Because you use the name "Cov1_farm", I suspect that Cov1_farm might be measured on a continuous scale, in which case it should not be included in the CLASS statement; then you should assess whether there is a linear relationship (or at least monotonic) between the logit of the proportion (n_cons/Total_calves) and Cov1_farm because logistic regression is assumed to be linear on the logit scale (lots of jargon, sorry). If Cov1_farm is measured on a categorical scale, then it should be in the CLASS statement.
Because calves are clustered within farms, you will need to determine whether overdispersion is a problem.
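One common first check, sketched here on the assumption that the dataset and variable names are those from the original post, is to fit the model with METHOD=QUAD (or METHOD=LAPLACE) and read the "Pearson Chi-Square / DF" line in the "Fit Statistics for Conditional Distribution" table. A ratio well above 1 is usually taken as a sign of overdispersion; there is no hard cut-off, so treat it as a diagnostic rather than a test.

```sas
/* Sketch: same model as in the original post, trimmed to the essentials. */
/* With METHOD=QUAD or METHOD=LAPLACE, GLIMMIX reports fit statistics for */
/* the conditional distribution, including Pearson Chi-Square / DF.       */
proc glimmix data=dairy method=quad(qpoints=7);
  class farm season Cov1_farm;
  model n_cons/Total_calves = season Cov1_farm
        / solution link=logit dist=binomial;
  random intercept / subject=farm;
run;
```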
This is a generalized linear mixed model (GLMM), and there are lots of important details to be considered. It's not a trivial undertaking; if you can find an applied statistician at your institution, that would be fabulously useful and efficient. I'm a huge fan of Walt Stroup's text, Generalized Linear Mixed Models (CRC Press, 2012), but it is a dense resource for people who are just starting with GLMMs and are trying to self-learn.
Thank you very much for your useful comments and references and sorry for the lack of details.
In fact, I assessed calves (0-2 months old) on the same farms in two different seasons (winter and summer). I did not measure the same calves each time (only whichever calves were 0-2 months old at the visit). I counted the calves that had the event of interest (n_cons) out of the total number of 0-2-month-old calves assessed on the day of the visit (Total_calves).
We randomly selected farms from a larger population of available farms.
The covariates I want to fit in the model are all categorical (which is why I put them in the CLASS statement).
So the question I am trying to answer is: which farm-level covariates are associated with the farm-level proportion of the event in 0-2-month-old calves?
Thank you so much!
Thinking a bit more about the model...
There are two random statements that could be considered.
The simpler is your original statement
random int / subject=farm;
The more complex that I suggested
random int season / subject=farm;
includes what is sometimes referred to as an observation-level random effect (here, one per farm-by-season visit) and is one way to address overdispersion in the binomial model.
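For reference, a minimal sketch of the full program with the more complex random statement, using the dataset and variable names from the original post. METHOD=LAPLACE is my assumption here, chosen to keep the computation light now that there are several random effects per farm; the original program used quadrature.

```sas
/* Sketch: farm-level random intercept plus a farm-by-season effect.  */
/* METHOD=LAPLACE is an assumption; the original used METHOD=QUAD.    */
proc glimmix data=dairy method=laplace;
  class farm season Cov1_farm;
  model n_cons/Total_calves = season Cov1_farm
        / cl solution link=logit dist=binomial oddsratio;
  random intercept season / subject=farm;
run;
```

Fitting this alongside the simpler model and comparing the Pearson Chi-Square / DF value in the conditional fit statistics shows whether the extra random effect has absorbed the overdispersion.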
You might find this paper useful:
https://dl.sciencesocieties.org/publications/aj/pdfs/107/2/811
I think you are on the right track. Good luck!
Many thanks for all your support as well as for the reference you've submitted to me.