I've got a really simple model in which I am predicting total counts (of countries' medals at Pyeongchang) from countries' populations. I am interested in interpreting the residuals as a measure of "sportiness". (I will also be adjusting for GDP per capita, latitude, and a few other things, but let's keep it simple for now.) Here's the essential code:
proc glimmix data=all;
class Mean Country;
model Total=Mean &pred/s noint link=log dist=Poisson;
random _residual_/subject=Country;
lsmeans Mean/ilink cl alpha=&alpha;
estimate "&pred %/%" &pred 1/alpha=&alpha;
output out=pred resid=Resid resid(ilink)=ResidBT student=StudentResid predicted=Pred;
ods output covparms=cov;
run;
I've got lots of zero medal counts for countries that sent a team to the Olympics but won no medals. The countries with zero counts have a residual of -1. I cannot find anything in the SAS documentation or on the web to explain why that value has been chosen and what it means. Here are the first few lines of the pred dataset to illustrate:
ResidBT is the difference between the observed count and the back-transformed predicted count, and I have checked manually with the parameter estimates that it's correct. What I can't figure out is how to work with Resid: it is not exactly the same as log(Total) minus Pred, and, of course, you can't use log(Total) anyway when Total=0. But why -1, and how are the other resids estimated for non-zero counts?
Thanks, guys.
Will
Hi Will,
Not sure if this is the issue, but if you have a lot of zeros in your data, then a simple fit to a Poisson distribution may not be the best approach. You may have a zero-inflated distribution. There are two ways to approach this: a mixture of a Poisson and a point mass at zero, or a hurdle model (also a mixture, but one of the distributions is binomial, with the probability there splitting the full population into "structured" zeros and everything else). That might be interesting for your other predictors, as they may have a bigger influence on the split than on the actual level in the non-zero observations.
Of course, that answer tells you close to nothing about how to calculate residuals. Try PROC FMM for the binomial-Poisson mixture and PROC GENMOD for a strictly zero-inflated Poisson.
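As a rough sketch only, the two suggestions might look like this. It assumes the data set all, the response Total, and the macro variable &pred from the original post; it drops the Mean class variable and noint in favor of a default intercept, and using &pred as the predictor of the zero part is purely an assumption for illustration.

```sas
/* Zero-inflated Poisson in PROC GENMOD: Poisson counts plus extra zeros. */
proc genmod data=all;
   model Total = &pred / dist=zip link=log;
   zeromodel &pred / link=logit;   /* logit model for the excess-zero probability */
run;

/* Hurdle-type mixture in PROC FMM: truncated Poisson for the positive
   counts plus a constant component at zero. */
proc fmm data=all;
   model Total = &pred / dist=truncpoisson;
   model + / dist=constant;        /* degenerate component at zero */
   probmodel &pred;                /* model for the mixing probability */
run;
```

Comparing the ZEROMODEL or PROBMODEL estimates with the count-model estimates would show whether a predictor acts mainly on the zero/non-zero split or on the level of the non-zero counts.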
SteveDenham
Thanks, Steve. Yes, I was thinking of getting around to playing with Proc FMM, but first I want to see if including all my predictors reduces the overdispersion enough not to have to worry about it. Certainly, at the moment the overdispersion is huge, owing partly to the high proportion of zeros.
Meantime, as per your suggestion, I got it going in Proc Genmod. There was no "straight" residual in Genmod. I got the same predicted values as in Glimmix (with or without overdispersion), and the Pearson residual in Genmod was the same as the chi-squared residual in Glimmix (without overdispersion) for all observations, including those with zero counts (where the values differed from one observation to the next). The Pearson or chi-squared residuals are presumably derived by dividing -1 by the appropriate sampling SD. It looks like -1 is the lowest possible value for the straight resid; even one medal for a country with a huge population gives a value a bit less negative than -1. So the value of -1 might have something to do with getting the right estimate for the Pearson residual when the count is zero, but it's still weird, because you would think that the residual when the observed count is zero should be more negative for larger predicted values. So I am stumped.
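One possible reading, offered as an assumption to check rather than something confirmed in this thread: Resid may be the residual of the linearized pseudo-data, (Total - mu)/(d mu/d eta), where mu = exp(Pred). With a log link d mu/d eta = mu, so Resid = Total/mu - 1, which is exactly -1 whenever Total = 0 and a bit above -1 for one medal and a large mu. Dividing by the Poisson SD on that scale, 1/sqrt(mu), would then give the Pearson residual (Total - mu)/sqrt(mu). A minimal check against the pred dataset from the original code (check, Mu, ResidCheck, PearsonCheck and Diff are names made up for this sketch):

```sas
/* Assumes PRED was written by the OUTPUT statement in the original code,
   so it contains Total, Pred (the linear predictor) and Resid. */
data check;
   set pred;
   Mu = exp(Pred);                          /* back-transformed prediction */
   ResidCheck = (Total - Mu) / Mu;          /* conjectured linked-scale residual */
   PearsonCheck = (Total - Mu) / sqrt(Mu);  /* Poisson Pearson residual, no overdispersion */
   Diff = Resid - ResidCheck;               /* near 0 if the conjecture holds */
run;

proc means data=check min max;
   var Diff;
run;
```

If Diff is essentially zero for every country, the -1 is just Total/mu - 1 evaluated at Total = 0, and the dependence on the predicted value shows up in ResidBT and the Pearson residual rather than in Resid.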