Hi all,
I have a question about checking residual normality for binomial data analyzed with PROC GLIMMIX. I have 291 subjects, with treatment and pen as the class variables. The goal is to determine whether the proportion of subjects that graded AAA differs significantly between treatments, so I coded the response as binary: 1 if the subject graded AAA, 0 if it did not. Below is the code I am using to analyze the data.
proc glimmix;
class trt pen;
model aaa = trt / d=binomial link=log;
output out=resid pred=predicted residual=residual;
lsmeans trt / diff lines ilink;
contrast 'CTL vs Treatments' trt 4 -1 -1 -1 -1;
contrast 'CTL vs Low Straw' trt 2 0 0 -1 -1;
contrast 'CTL vs High Straw' trt 2 -1 -1 0 0;
contrast 'Low Straw vs High Straw' trt 0 1 1 -1 -1;
contrast 'Canola vs Flax' trt 0 1 -1 1 -1;
run;
However, I'm running into issues when I try to analyze the residuals, primarily because our stats class didn't cover how to analyze residuals for binomial data or what the assumptions are for models of binomial data. Are the residuals still supposed to be normally distributed? If not, how would you analyze them? Would you still use PROC UNIVARIATE? I tried the following code (similar to what I use to check residuals for my linear regression models):
proc univariate plot data=resid;
var residual;
ods select Extremeobs plots;
run;
proc univariate data=resid normal plot;
var residual;
run;
However, the Shapiro-Wilk p-value comes out to <0.0001 and the residual plots do not look normal at all. I've attached my SAS syntax as a file. Any help would be appreciated.
Thanks
https://communities.sas.com/t5/Statistical-Procedures/bd-p/statistical_procedures
and calling @StatDave @SteveDenham @lvm
Will do, thanks!
I merged the posts into the one on the Stat Procs community.
@Ksharp as a superuser, you can directly move a post to the appropriate community.
Since there is only a single predictor with 5 levels and a binary response, the data can be summarized in a 5x2 table. An overall assessment of whether there are any differences among the 5 event probabilities can be obtained without a model by using PROC FREQ.
proc freq;
table trt*aaa / chisq;
run;
If you want to take a modeling approach and want to examine residuals, use PROC LOGISTIC since it is specialized for this model and provides various goodness of fit statistics and residuals. However, with only 5 levels of a single predictor, there are only 5 predicted values and therefore only 5 residuals, so examination of residuals is of limited value. This code provides the goodness of fit statistics and plots of all of the diagnostic residuals. It also uses the LSMEANS statement to provide pairwise comparisons among the treatments.
proc logistic;
class trt/param=glm;
model aaa(event='1')=trt / gof iplots;
lsmeans trt/plots=none ilink diff;
run;
For interpretation of the diagnostic plots, see the following:
- The example titled "Logistic Regression Diagnostics" in the PROC LOGISTIC documentation
- "Regression Diagnostics" in the Details section of the PROC LOGISTIC documentation
- This note on goodness of fit in generalized linear models (a class of models of which the logistic model is a part)
As noted in the above, these diagnostics are mostly used to look for extreme outlying values, which makes them more useful when your model contains continuous predictors or at least has many distinct predicted values. Fixed cutoff values for the diagnostics are not really possible, but the usage note above gives some idea of how to decide whether values are extreme using some of the diagnostics.
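To see why there are only 5 distinct predicted values here, it can help to tabulate the observed proportions by treatment. The following is a minimal sketch, not part of the original reply; the data set name `grades` is an assumption, since the posted code does not name the input data. With a single CLASS predictor the model fits one probability per treatment, so the fitted probabilities should match these observed proportions.

```sas
/* Sketch only: GRADES is a hypothetical data set name.            */
/* AAA is the 0/1 response and TRT the treatment, as in the post.  */
proc sql;
  create table trt_summary as
  select trt,
         count(*)  as n_subjects,  /* subjects per treatment       */
         sum(aaa)  as n_aaa,       /* number that graded AAA       */
         mean(aaa) as prop_aaa     /* observed proportion AAA      */
  from grades
  group by trt;
quit;

/* One row per treatment, so five rows in total */
proc print data=trt_summary noobs;
run;
```

Every subject in a treatment shares the same predicted probability, so a subject-level residual can take only two values within each treatment.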
Some general thoughts regarding the modeling of binomial data, some of which might apply here.
First, using a generalized linear model (LOGISTIC, GENMOD) or a generalized linear mixed model (GLIMMIX) does not require either homogeneous variance or normally distributed residuals. Why not homogeneous variance? Because the variance can be expressed directly as a function of the mean, so if the group means differ, the group variances differ. There is no way around that on the original observed scale. What about normality? Look at the model being fit. There is no separate residual variance term as there is in a linear model. That does not rule out overdispersion, where variability beyond that implied by the mean is present, but even then the extra variability is not required to be Gaussian (normal). Thus the usual checks of assumptions for a linear model aren't quite as appropriate.
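As a worked illustration of the point about variance and normality (the notation is introduced here and is not from the original post), let $y_{ij} \in \{0, 1\}$ be the response of subject $j$ in treatment $i$, with $\pi_i$ the probability of grading AAA:

```latex
\begin{align*}
  \mathrm{E}(y_{ij})   &= \pi_i, \\
  \mathrm{Var}(y_{ij}) &= \pi_i (1 - \pi_i), \\
  y_{ij} - \hat{\pi}_i &\in \{\, 1 - \hat{\pi}_i,\; -\hat{\pi}_i \,\}, \\
  r_{ij} = \frac{y_{ij} - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}
         &\in \left\{ \sqrt{\frac{1 - \hat{\pi}_i}{\hat{\pi}_i}},\;
                      -\sqrt{\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}} \right\}.
\end{align*}
```

The variance changes with $\pi_i$, and both the raw residual and the Pearson residual $r_{ij}$ take only two values within a treatment, so a normality test such as Shapiro-Wilk will reject regardless of how well the model fits.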
But that does not mean that examination of the residuals is a wasted effort. It can help you evaluate model appropriateness, or aid in checking the distributional assumption. I hope this helps. If you are going to continue to fit models to binomial responses, get a good text (or on-line text) that covers the assumptions. Hosmer and Lemeshow's Applied Logistic Regression, McCullagh and Nelder's Generalized Linear Models or Stroup's Generalized Linear Mixed Models are three really good sources, in my opinion.
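For anyone who wants to look at residuals from the GLIMMIX fit itself, here is a minimal sketch that keeps the model from the original post. The data set name `grades` and the output variable names are assumptions. As far as I know, the OUTPUT statement returns the prediction and residual on the linked (linear predictor) scale by default, and the ILINK suboption requests them on the probability scale.

```sas
/* Sketch only: GRADES is a hypothetical data set name;            */
/* the model is the one posted in the original question.           */
proc glimmix data=grades;
  class trt pen;
  model aaa = trt / dist=binomial link=log;
  /* ILINK requests quantities on the probability (data) scale */
  output out=resid pred(ilink)=p_hat
                   resid(ilink)=raw_resid
                   pearson(ilink)=pearson_resid;
run;

/* Scan for extreme values by treatment instead of testing normality */
proc means data=resid n min max mean;
  class trt;
  var p_hat raw_resid pearson_resid;
run;
```

With a binary response and one CLASS predictor, expect only two residual values per treatment, so this is mainly a check that the fitted probabilities look sensible.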
SteveDenham
This helped a lot as well. Thanks for the explanation; it cleared up some of the questions I had about modeling.
Thanks for your help, this helped a lot. I've also gotten some advice from my thesis advisor, who cleared up some of my questions as well.