06-03-2014 04:23 PM
I have mentioned my data elsewhere in this forum (5 waves of data with a 2x2 [relatively balanced] intervention). I am continuing to have issues with elements of PROC GLIMMIX; in particular, with interpreting the LSMEANS. I have read a lot about how LSMEANS are calculated and think I get it. However, in one particular model, the LSMEANS are extremely far from (and not following the patterns of) the actual means. The LSMEANS actually look quite a bit like what we would *like* to see, but their difference from the raw data makes me very anxious.
I feel like the cause of the difference must have something to do with the balancing over time (treated as a CLASS variable). But I don't really understand what is going on. We have more data at the first time point than at any follow-up (as would be expected in an intervention with people). Additionally, because this outcome only applies to some participants, different participants may have data at some time points and not others (but there are 562+ participants at each time point, with over 900 providing some data).
If I run this syntax:
PROC GLIMMIX data=XXX METHOD=laplace EMPIRICAL=mbn;
CLASS subject time;
MODEL dv=time / SOLUTION DIST=negbin CHISQ DDFM=BW;
RANDOM time / SUBJECT=subject TYPE=ARH(1);
LSMEANS time / ILINK;
RUN;
The LSMEANS for time (back-transformed with the ILINK option) are 13.9, 10.9, 10.9, 10.6, and 10.4. The actual means are 17.9, 17.8, 19.8, 18.4, and 16.4. Although no other sort of repeated-measures analysis finds differences in the raw means, the 'time' factor in this PROC GLIMMIX model is significant.
Obviously my actual model is more complex and includes the treatment variables and predictors, but the problem with means seems to emerge even at this very basic level. Something about my model must be incorrect, yes?
Thank you for any advice!
06-04-2014 08:56 AM
There is no reason to expect the least squares means to equal the raw means, and that is especially true when you are dealing with a non-Gaussian distribution. The purpose of the canonical link (which for the negative binomial is the log) is to 'linearize' the values so that mixed-model methodology can be applied. When the estimates are put back on the original scale, they are not going to be as biased as the raw means.
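One way to see the linearization effect without any modeling at all: averaging on the log (link) scale and then back-transforming gives a geometric mean, which always sits at or below the arithmetic mean, and the gap grows with right-skew. A toy sketch in Python (illustrative numbers only, not the asker's data):

```python
import math

# Three right-skewed counts, negbin-like: one long-tail observation.
counts = [5, 10, 40]

# Raw (arithmetic) mean, the quantity the asker is comparing against.
raw_mean = sum(counts) / len(counts)

# Average on the log scale, then back-transform -- the geometric mean.
link_scale_mean = sum(math.log(c) for c in counts) / len(counts)
back_transformed = math.exp(link_scale_mean)

print(raw_mean)          # about 18.3
print(back_transformed)  # about 12.6 -- pulled down far less by the 40
```

The 18.3-vs-12.6 gap in this toy example mirrors the asker's 17.9-vs-13.9 pattern: the right-tail observations inflate the raw mean much more than they inflate the log-scale estimate.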
06-04-2014 09:25 AM
Is there a way to explain this linearization process to someone interested in the content area rather than statistics? If the LSMEANS are lower at follow-up points than at baseline, does that mean the behavior of interest has changed, even if the raw means seem consistent across time? Does the discrepancy likely relate to the 'extreme' (right tail) observations in the negative binomial distribution? Is there anything about my model that I would want to double-check before reporting the significant time factor as evidence of behavior change over time?
Thanks again for all your help! I have been working with these models for months and months, and every time I think I have a final set there is one weird thing that holds me up.
06-04-2014 10:18 AM
Hi Steve - One more note related to this analysis. You had advised me to avoid R-side random effects with the negative binomial distribution. In the old model (the same except that the repeated measure was specified through R-side instead of G-side random effects, which meant using PL), the LSMEANS were 17.2, 16.7, 18.5, 16.8, and 15.8 -- quite similar to the raw means. Time was not significant. That's one reason why this new model threw me.
06-04-2014 10:22 AM
This would be the difference between conditional and marginal means, then. See Stroup's Generalized Linear Mixed Models for a detailed discussion. He makes the point that the G-side conditional estimates are less apt to be biased than the R-side marginal estimates.
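The conditional/marginal distinction can be made concrete with a small simulation. The sketch below (Python, with made-up parameters; Poisson counts stand in for the negative binomial, since the effect comes from the random intercept, not the count distribution) builds subjects whose log-means vary, then compares the marginal mean -- what the raw means and an R-side/PL fit track -- against the conditional mean for the typical subject, which is what a G-side fit back-transforms to:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed (hypothetical) parameters: 900 subjects, 5 waves, a G-side
# random intercept on the log scale. sigma is chosen so the gap roughly
# resembles the asker's ~13.9 (LSMEAN) vs ~17.9 (raw mean) pattern.
n_subj, n_waves = 900, 5
mu, sigma = np.log(13.9), 0.72

b = rng.normal(0.0, sigma, n_subj)                 # subject intercepts
lam = np.exp(mu + b)[:, None] * np.ones(n_waves)   # per-subject wave means
y = rng.poisson(lam)                               # observed counts

marginal = y.mean()        # population-average mean; ~ exp(mu + sigma^2/2)
conditional = np.exp(mu)   # mean for the typical subject (b = 0)

print(conditional, marginal)  # conditional is noticeably smaller
```

Both numbers are "right"; they answer different questions. The G-side LSMEANS describe the typical subject, while the raw means (and the old R-side model) describe the population average, which is inflated by the right-skewed between-subject variation.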