Statistical Procedures

Programming the statistical procedures from SAS
jimmymonte
Calcite | Level 5

Hi all, 

 

I have a question about checking residual normality for binomial data analyzed with PROC GLIMMIX. I have 291 subjects, with treatment and pen as the class variables. The goal is to determine whether there is a statistically significant difference between treatments in the proportion of subjects that graded AAA, so I've coded the response as binary: 1 if the subject graded AAA and 0 if it didn't. Below is the code I am using to analyze the data.

 

proc glimmix;

class trt pen;
model aaa = trt / d=binomial link=log;
output out=resid pred=predicted residual=residual;
lsmeans trt / diff lines ilink;

contrast 'CTL vs Treatments' trt 4 -1 -1 -1 -1;
contrast 'CTL vs Low Straw' trt 2 0 0 -1 -1;
contrast 'CTL vs High Straw' trt 2 -1 -1 0 0;
contrast 'Low Straw vs High Straw' trt 0 1 1 -1 -1;
contrast 'Canola vs Flax' trt 0 1 -1 1 -1;

run;

 

However, I'm running into issues when I try to analyze the residuals, mainly because our stats class didn't cover how to analyze residuals for binomial data or what the assumptions are for models fit to binomial data. Are the residuals still supposed to be normally distributed? If not, how would you go about analyzing them? Would you still use PROC UNIVARIATE? I tried the following code (which is similar to the code I use to check residuals for my other linear regression model):

 

proc univariate plot data=resid;
var residual;
ods select Extremeobs plots;
run;

 

proc univariate data=resid normal plot;
var residual;
run;

 

However, my Shapiro-Wilk p-value comes out to <0.0001 and my residual plots don't look normal at all. I've attached my SAS syntax as a file. Any help with this would be appreciated.

 

Thanks

StatDave
SAS Super FREQ

Since there is only a single predictor with 5 levels and a binary response, the data can be summarized in a 5x2 table. An overall test of whether the 5 event probabilities differ can be obtained without a model by using PROC FREQ.

proc freq; 
table trt*aaa / chisq; 
run;

If you want to take a modeling approach and want to examine residuals, use PROC LOGISTIC since it is specialized for this model and provides various goodness of fit statistics and residuals. However, with only 5 levels of a single predictor, there are only 5 predicted values and therefore only 5 residuals, so examination of residuals is of limited value. This code provides the goodness of fit statistics and plots of all of the diagnostic residuals. It also uses the LSMEANS statement to provide pairwise comparisons among the treatments. 

proc logistic; 
class trt/param=glm;
model aaa(event='1')=trt / gof iplots;
lsmeans trt/plots=none ilink diff;
run;

For interpretation of the diagnostic plots, see the following:

  • The example titled "Logistic Regression Diagnostics" in the PROC LOGISTIC documentation
  • "Regression Diagnostics" in the Details section of the PROC LOGISTIC documentation 
  • This note on goodness of fit in generalized linear models (a class of models of which the logistic model is a part)

As noted in the references above, these diagnostics are mostly used to look for extreme outlying values, which makes them more useful when your model contains continuous predictors or at least has many distinct predicted values. Firm cutoff values for the diagnostics are not really possible, but the usage note above gives some idea of how to decide whether values are extreme for some of the diagnostics.

SteveDenham
Jade | Level 19

Some general thoughts regarding the modeling of binomial data, some of which might apply here.

 

First, fitting a generalized linear model (LOGISTIC, GENMOD) or a generalized linear mixed model (GLIMMIX) does not require either homogeneous variance or normally distributed residuals. Why not homogeneous variance? Because the variance is a direct function of the mean, so if the group means differ, the group variances differ too. There is no way around that on the original observed scale. What about normality? Look at the model being fit: there is no additional variance term as in a linear model. That does not rule out overdispersion, where there is more variability than the mean alone accounts for, but even then the extra variability is not required to be Gaussian (normal). Thus the usual checks for assumptions in a linear model aren't quite as appropriate.
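To make the mean-variance point concrete, here is the standard result for a binary (Bernoulli) response, written as a short sketch. The symbols $Y_{ij}$ (response of subject $j$ in treatment $i$) and $\pi_i$ (probability of grading AAA in treatment $i$) are notation introduced here for illustration only; they are not from the original post.

```latex
\begin{aligned}
Y_{ij} &\sim \text{Bernoulli}(\pi_i), \\
E(Y_{ij}) &= \pi_i, \qquad \operatorname{Var}(Y_{ij}) = \pi_i(1-\pi_i), \\
Y_{ij} - \pi_i &\in \{-\pi_i,\; 1-\pi_i\}.
\end{aligned}
```

The last line shows that, within a treatment, a raw residual can take only two values. A normality test on such residuals should be expected to reject regardless of how well the model fits, so a small Shapiro-Wilk p-value by itself says little about the model.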

 

But that does not mean that examination of the residuals is a wasted effort. It can help you evaluate model appropriateness, or aid in checking the distributional assumption. I hope this helps. If you are going to continue to fit models to binomial responses, get a good text (or on-line text) that covers the assumptions. Hosmer and Lemeshow's Applied Logistic Regression, McCullagh and Nelder's Generalized Linear Models, and Stroup's Generalized Linear Mixed Models are three really good sources, in my opinion.
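As a rough sketch of what such an examination might look like in GLIMMIX, the code below requests residuals on the probability (data) scale and plots Pearson residuals against predicted probabilities to look for extreme values. Several things here are assumptions rather than part of the original analysis: the data set name have and the output variable names p_hat, raw_resid, and pearson_resid are hypothetical, and the model uses a logit link with dist=binary instead of the log link in the original code.

```sas
/* Sketch only: 'have' is a hypothetical name for the input data set */
proc glimmix data=have;
   class trt;
   /* Assumption: logit link and binary distribution, not the original link=log */
   model aaa(event='1') = trt / dist=binary link=logit;
   /* ILINK requests statistics on the probability (data) scale;
      Pearson residuals are raw residuals scaled by the estimated
      standard deviation of the response */
   output out=resid pred(ilink)=p_hat
                    resid(ilink)=raw_resid
                    pearson(ilink)=pearson_resid;
run;

/* Plot Pearson residuals against predicted probabilities */
proc sgplot data=resid;
   scatter x=p_hat y=pearson_resid;
   refline 0 / axis=y;
run;
```

With a single five-level class predictor the plot will show only five distinct predicted values, each with two residual values, so it is mainly useful as a template for models with more predictors.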

 

SteveDenham


jimmymonte
Calcite | Level 5

This helped a lot as well. Thanks for the explanation; it helped clear up some of the questions I had about modeling.

jimmymonte
Calcite | Level 5

Thanks for your help, this helped a lot. I've also gotten some advice from my thesis advisor, who cleared up some questions I had as well.
