Some of the data is attached. Lambs were fed 1 of 6 treatment diets in individual pens.
Blood serum collected/analyzed on days 0, 14, 57.
Analysis was done by a machine (some serum variables look like count data, but are not).
Some of the serum variables have funky distributions (see below): ALT, TP, and a few others (stairs); AST (long tail).
I'll be using GLIMMIX, but not sure how to appropriately handle these distributions.
There is no law that says that the explanatory variables need to be normally distributed, so you might be worrying prematurely.
Clearly, these are rounded data. As such they will never follow any continuous distribution. If you were to jitter the data and compute a KDE, you would probably see density estimates that look more like what you are expecting.
If you post the syntax for the model, we might be able to weigh in as to whether we think these data will present problems in the analysis.
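As a rough sketch of the jitter-then-KDE idea: the data set name SERUM, the variable ALT, and the assumption that values were rounded to the nearest whole unit are all placeholders, so adjust the jitter width to whatever rounding the machine actually applied.

```sas
/* Sketch only: SERUM and ALT are assumed names, not taken from the attached data. */
data serum_jit;
   set serum;
   call streaminit(12345);                 /* fixed seed so the jitter is reproducible */
   /* Assumes rounding to the nearest integer: spread each value over +/- 0.5 */
   alt_jit = alt + rand("Uniform") - 0.5;
run;

/* Histogram of the jittered values with a kernel density estimate overlaid */
proc sgplot data=serum_jit;
   histogram alt_jit;
   density alt_jit / type=kernel;
run;
```

If the "stairs" disappear in the jittered plot, they were most likely an artifact of rounding rather than a feature of the underlying distribution.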
PROC GLIMMIX;
CLASS TRT DAY ID;
MODEL x = TRT|DAY / DDFM=KR SOLUTION;
RANDOM DAY/SUBJECT=ID TYPE = CSH;
Contrast 'CNTL vs. others' TRT 5 -1 -1 -1 -1 -1;
Contrast 'CNTL vs. BLU' TRT 1 -1;
Contrast 'CNTL vs. ERC' TRT 1 0 -1;
Contrast 'CNTL vs. MESQ' TRT 1 0 0 -1;
Contrast 'CNTL vs. ONE' TRT 1 0 0 0 -1;
Contrast 'CNTL vs. RED' TRT 1 0 0 0 0 -1;
LSMEANS TRT|DAY / DIFF ADJUST=SIMULATE (REPORT SEED=121211) cl adjdfe=row SLICEDIFF=DAY;
RUN;QUIT;
Hate to say this on a discussion board, but I am thoroughly confused.
Each blood serum variable (e.g., ALT, glucose, urea nitrogen) is a dependent variable.
I thought that if the residuals didn't have a normal distribution (Q-Q plots, etc.), then you had to try to fit other distributions in GLIMMIX, such as lognormal, Weibull, beta, gamma, etc.
Sorry, I did not realize that the variables were all dependent. But as you say, it is the distribution of the RESIDUALS that is important, not the distribution of the variables themselves. Unless you have a reason to suspect that the errors are non-normal, you might start out with DIST=NORMAL and see what happens. Some of the long tails you see might be fit by the explanatory variables.
When you run the regressions, add
plots=residualpanel
to the PROC GLIMMIX statement. Your syntax looks similar to the example in the GLIMMIX documentation, so see the section "Diagnostic Plots."
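For concreteness, here is roughly where the option goes. The model statements are copied from the posted code; the data set name SERUM and the response ALT are assumed placeholders.

```sas
/* Sketch only: SERUM and ALT are assumed names; the model is the one posted above. */
ods graphics on;   /* diagnostic plots require ODS Graphics */

proc glimmix data=serum plots=residualpanel;
   class trt day id;
   model alt = trt|day / ddfm=kr solution;
   random day / subject=id type=csh;
run;
```

The panel includes a residual-versus-linear-predictor plot, a histogram, a Q-Q plot, and a box plot of the residuals.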
Thanks, Rick. I'll read the info. in your link, to try and figure out the plots below.
I ran the plot as suggested and got the following. Thoughts?
1. Your residuals are very tiny ~1E-6, so this is almost a perfect fit.
2. Your residuals show a linear pattern, so there appears to be unexplained structure, perhaps due to another variable that is not in the model.
First, I would add the residual option to the random statement:
RANDOM DAY/SUBJECT=ID TYPE = CSH residual;
and see what happens. I suspect the model is overparameterized because it is trying to essentially estimate variances for the residual twice. Hence the very small residual variance that @Rick_SAS notes.
My experience has been that if the G matrix is not positive definite, you can see this sort of pattern in the plot of residual versus linear predictor.
Edit: Adding "residual" to the random statement is necessary if you are using a normal distribution. If the distribution is non-normal (other than lognormal), then I don't add "residual" because for distributions where the variance is a function of the mean, there are residuals, but there is no such thing as residual variance. Stroup (2013) Generalized Linear Mixed Models is a good resource on this topic.
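To make the distinction concrete, here is a sketch of the two cases. The data set name SERUM and the responses ALT and AST are assumed placeholders, and the gamma model with a random intercept per lamb is only one possible non-normal specification, not something proposed in this thread.

```sas
/* Sketch only: SERUM, ALT and AST are assumed names. */

/* Case 1: normal response. RESIDUAL makes the CSH structure the      */
/* R-side (residual) covariance, so the residual is modeled only once. */
proc glimmix data=serum plots=residualpanel;
   class trt day id;
   model alt = trt|day / dist=normal ddfm=kr solution;
   random day / subject=id type=csh residual;
run;

/* Case 2: non-normal response (gamma assumed here; needs positive values). */
/* No RESIDUAL option: the variance is a function of the mean, and the      */
/* within-lamb correlation comes from a G-side random intercept instead.    */
proc glimmix data=serum plots=residualpanel;
   class trt day id;
   model ast = trt|day / dist=gamma link=log solution;
   random intercept / subject=id;
run;
```

In either case, check the log and the covariance parameter estimates for a note that the G matrix is not positive definite before interpreting the residual plots.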