Hello everyone,
First post, but long-time SAS user! I've been having trouble with proc genmod. It's stumped a few people in our department, so I thought I would take it online. I am on SAS version 9.3.
I will briefly give some background on our study. We are examining motor vehicle fatalities (varname pvofat2) at the county level. One of the predictor variables we want to look at is a certain type of seat belt law in effect in each state. We are using a four-level categorical variable (varname lawall). Because we care about the rate per person, we use the natural log of the population in each county as an offset (varname lnpop).
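For context, lnpop is just the natural log of each county's population, built in a data step along these lines (a sketch; pop here stands in for whatever the population variable is actually called in our data set):

data totrate;
    set totrate;
    lnpop = log(pop);   /* offset: log of the county population at risk */
run;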
To begin, I am simply trying to estimate crude rates/rate ratios for each type of law. When I run the following model with a Poisson distribution, it works as I would expect:
*Model to get p-value by law only (rate ratios and rate estimate for prim/prim county type, poisson distribution);
proc genmod data=totrate;
class lawall/param=glm;
model pvofat2 =LAWALL/offset=lnpop dist=poisson link=log type3;
lsmeans lawall/ilink exp;
ods output lsmeans=myfile;
run;
These results match what I calculate by hand in Excel, and the p-values match what I get from Open-Epi (an online app for basic statistics, including chi-square tests). However, when I run the exact same code with a negative binomial distribution, the rate estimates are substantially different, and the rate ratio estimates are off as well (judging from the exponentiated model parameters). Here is the code:
*Model to get p-value by law only (rate ratios and rate estimates, negbin);
proc genmod data=totrate;
class lawall/param=glm;
model pvofat2 =LAWALL/offset=lnpop dist=negbin link=log type3;
lsmeans lawall/ilink exp;
ods output lsmeans=myfile;
run;
As I said, the actual rate estimates for counties are substantially off. With the Poisson distribution I get a rate of 7.65 per 100,000 for one level of my categorical predictor, which matches what I calculated by hand. The same estimate under the negative binomial distribution is 13.68 per 100,000. According to the log, the model is converging.
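For reference, the hand calculation I am comparing against is just total deaths over total population within each law type, along these lines (a sketch; pop again stands in for the county population variable):

proc sql;
    select lawall,
           sum(pvofat2) as deaths,
           sum(pop) as population,
           calculated deaths / calculated population * 100000 as rate_per_100k
    from totrate
    group by lawall;
quit;

The Poisson model reproduces these crude rates; the negative binomial model does not.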
Exploring things a bit on my own, I think the problem has something to do with the offset option. When I don't use an offset, the model should be estimating the average number of deaths per county for each type of law. These numbers match across the Poisson model, the negative binomial model, and my hand calculations:
*Model to get p-value by law only (average fats/county, NB);
proc genmod data=totrate;
class lawall/param=glm;
model pvofat2 =LAWALL/dist=negbin link=log type3;
lsmeans lawall/ilink exp;
ods output lsmeans=myfile;
run;
Getting identical (and correct) results without the offset option seems fishy to me. Do I need to use a different offset option in negative binomial models? Is offsetting not possible in NB models? Are there other things I have not yet considered? I've tried running these models on other variables as well, and I consistently get the same problem. One last thing I will note is that the dispersion parameter is <1 (underdispersion). Unless I'm mistaken, the point estimates should be the same in Poisson and negative binomial models, correct? I'm really just rambling at this point.
I would greatly appreciate any help anyone can offer. Output available upon request.
EDIT: to clean up the code and correct one error in the code comments.
It is wrong to apply the negative binomial distribution to this kind of data, even if you see significant overdispersion when applying the Poisson option. The reason is that the data do not need to be Poisson distributed. The data behind your number of events are actually time-to-event data. If the assumption of piecewise constant rates is fulfilled, the data can be analyzed by Poisson regression. This is because the likelihood function in the Poisson regression is exactly the likelihood function you would maximize if you had the original time-to-event data. It is therefore wrong to test whether the data are Poisson distributed; the Poisson distribution is only a trick to maximize the right likelihood function and thereby obtain estimates and relevant hypothesis tests about covariates. Don't use the negative binomial distribution for such data: it may give correct rate-ratio estimates, but it will not give valid confidence intervals.
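To sketch the argument (notation is mine): for subject $i$ with event indicator $d_i$ and follow-up time $t_i$, a constant rate $\lambda$ gives the likelihood

$$
L(\lambda) = \prod_i \lambda^{d_i} e^{-\lambda t_i} = \lambda^{D} e^{-\lambda T},
\qquad D = \sum_i d_i, \quad T = \sum_i t_i,
$$

which, viewed as a function of $\lambda$, is proportional to the Poisson likelihood for $D$ events with mean $\lambda T$, i.e. a log-linear model with offset $\log T$. The same argument goes through with covariates as long as the rate is piecewise constant.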
It is quite easy to illustrate this by simulating time-to-event data and fitting it with Poisson regression. The model is correct even though it gives a significant p-value for over- or underdispersion; that does not mean you should apply the negative binomial model.
data silly_data;
    rate=0.2;                                 /* true event rate */
    do group=1 to 2;
        do i=1 to 10000;
            time=rand('exponential',1/rate);     /* event time with mean 1/rate */
            censortime=rand('exponential',1);    /* independent censoring time */
            event=(time<censortime);             /* 1 if the event is observed before censoring */
            obstime=min(censortime,time);        /* observed follow-up time */
            logtime=log(obstime);                /* offset for the Poisson regression */
            output;
        end;
    end;
    drop time censortime;
run;
proc summary data=silly_data nway;
    var event obstime;
    class group;
    output out=summary sum=event sumtime;   /* total events and total follow-up time per group */
run;
data summary;
    set summary;
    logtime=log(sumtime);   /* offset = log of total follow-up time */
run;
*estimate the rate parameter using the aggregated form of the data;
proc genmod data=summary;
    model event= / dist=poisson link=log offset=logtime;
    estimate 'rate' intercept 1;
run;
If the data really were count data, then it would be much more relevant to examine the assumption of Poisson-distributed data, and the negative binomial distribution could then be a better and more meaningful way to model the data.
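If you do want to check that for genuine count data in GENMOD, one simple approach is to fit the Poisson model first and look at the Value/DF column in the "Criteria For Assessing Goodness Of Fit" table before reaching for dist=negbin. A sketch with placeholder names (mydata, y, x are not from this thread):

proc genmod data=mydata;
    class x;
    /* Poisson fit first: Value/DF far from 1 for Deviance or Pearson Chi-Square flags over/underdispersion */
    model y = x / dist=poisson link=log type3;
run;

proc genmod data=mydata;
    class x;
    /* only then consider the negative binomial, which adds a dispersion parameter */
    model y = x / dist=negbin link=log type3;
run;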
Are you seeing problems with all counties or just those with low populations? I ask because, having done a little work with crashes and fatalities, I have seen some very low-population counties with odd fatality rates, since most of the actual traffic, and most of the fatality count, comes from out-of-county residents driving through.
When I say low population, think fewer than 1,100 residents in a county the size of Rhode Island, or 12,000 in a county the size of Massachusetts.
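One quick check along these lines (a sketch; the 10,000 cutoff and the pop variable are placeholders, not values from your study) is to rerun the model on the larger counties only and see whether the Poisson/negative binomial gap shrinks:

proc genmod data=totrate;
    where pop >= 10000;   /* placeholder cutoff to drop very low-population counties */
    class lawall / param=glm;
    model pvofat2 = lawall / offset=lnpop dist=negbin link=log type3;
    lsmeans lawall / ilink exp;
run;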
Thank you for this, @JacobSimonsen. My version of SAS will not run some of your code, including the model without predictors. I think this explains why I could not find many examples of negative binomial models being used in similar situations, though. Your point that the rate ratios come out roughly right while the rates themselves do not matches what I see in my data as well (edit: rate ratio estimates were relatively accurate, not entirely accurate).
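In case it helps anyone else on an older release: I am not certain which pieces are version-related, but if the two-argument rand('exponential', 1/rate) call is one of them, an equivalent that only uses the unit-mean form is to scale the draw yourself:

time = rand('exponential') / rate;   /* unit-mean exponential scaled to mean 1/rate */
censortime = rand('exponential');    /* already unit mean */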
I would appreciate any textbook or journal article references that elaborate on your points.
Many books on survival analysis explain how Poisson regression can be used to estimate rates and rate ratios. One I can especially recommend is "Survival and Event History Analysis: A Process Point of View" by Odd O. Aalen. The argument I refer to is on pages 223-225.
For a GENMOD-centric treatment, see this SAS Usage Note: "Modeling rates and estimating rates and rate ratios (with confidence intervals)"
I believe the solution by @JacobSimonsen could be helpful for my analysis. I am analyzing the number of procedures performed by a Fellow each year (count) from 2008 to 2017, with an offset for the number of admissions in that year (lnAdmissions). The predictor variables are the type of procedure (categorical; 1, 2, 3) and the number of years since 2008 (my time variable). I am also including the procedure*time interaction because there is a hypothesis that each type of procedure affects the rate differently over time.
When running the Poisson model:
proc genmod data = longSummary order=data;
    class procedure;
    model count = yearsSince2008 procedure procedure*yearsSince2008
          / dist=poisson type3 offset=lnAdmissions link=log;
    title "Poisson regression with offset and interaction";
run;
title;
footnote;
I am faced with (deviance / DF) = 4, which suggests overdispersion. I have also seen from multiple sources that as lambda increases, the variability of the data increases, so it is no surprise this shows overdispersion. I was inclined to go the negative binomial route to account for the overdispersion; it does result in a lower AIC and does not change any of the type 3 p-value conclusions. My question is: does my data fit the statement you made:
It is wrong to apply the negative binomial distribution to this kind of data, even if you see significant overdispersion when applying the Poisson option. The reason is that the data do not need to be Poisson distributed. The data behind your number of events are actually time-to-event data. If the assumption of piecewise constant rates is fulfilled, the data can be analyzed by Poisson regression.
Any help/suggestions would be greatly appreciated.
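For completeness, the negative binomial fit I compared against is just the same MODEL statement with dist=negbin (a sketch, same variables as above):

proc genmod data = longSummary order=data;
    class procedure;
    model count = yearsSince2008 procedure procedure*yearsSince2008
          / dist=negbin type3 offset=lnAdmissions link=log;
    title "Negative binomial regression with offset and interaction";
run;
title;
footnote;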