BookmarkSubscribeRSS Feed
wernie
Quartz | Level 8

I am trying to calculate incidence rate ratios and 95% CIs for the interaction between two variables: area and time. My variables are n, area, period, and age (using log of population as the offset). I have two questions that tie together for the code I am using, which is:

proc genmod data=inf;

(where age=1.....depends if I need to specify age group or not. If I want all ages, then I just leave this statement out)

class area age;

model n=area period area*period / dist=nb link=log offset=logpop type3;

estimate "Area 1 v area 2*period" area*period 1 -1 0/exp e;

estimate "Area 2 v area 3*period" area*period 0 -1 1/exp e;

estimate "Area 1 v area 3*period" area*period 1 0 -1/exp e;

run;

Q1 - I have 12 different age groups. If I include 'age' in the class statement, I get slightly different IRRs and certainly different p-values than if I do not include 'age' in the class statement. Should I be including age in the class statement for the model? There are times where I want to look at everything together (all-ages) and other times where I need to specify age groups that I want to look at, so I don't know if 'age' is needed in the class statement for each model or not.

Q2 - I'm not totally sure how to be interpreting the output for the IRs and CIs. I understand these (e.g., how to interpret IRR of something like 1.3), but it isn't seeming to make sense when I'm looking at the output. For example, with the above code (NO age listed in class statement), I get the following:

Area 1 v 2:    IRR (1.0007) 95% CI (0.9999-1.0015) p-value 0.1043

Area 2 v 3:    IRR (0.9985) 95% CI (0.9977-0.9994) p-value 0.0007

Area 1 v 3:    IRR (1.0022) 95% CI (1.0015-1.0028) p-value <0.0001

This doesn't quite make sense looking at the p-value that goes with the IRRs and CIs. However, when I DO include age in the class statement for the same dataset, I get:

Area 1 v 2:    IRR (1.0006) 95% CI (0.9990-1.0022) p-value 0.4867

Area 2 v 3:    IRR (0.9996) 95% CI (0.9979-1.0013) p-value 0.6697

Area 1 v 3:    IRR (1.0009) 95% CI (0.9994-1.0025) p-value 0.2411

This seems to make more sense to me, as looking at the IRR and CI I wouldn't expect significant results. So does this mean that age should be included in the class statement?  Thanks!

3 REPLIES 3
JacobSimonsen
Barite | Level 11

Hi Wernie,

First of all, I think it is wrong to use the negative binomial distribution to model number of events in a population study. The reason why you can use Poisson distribution with "logpop" as offset is that the likelihood function will be similar to what you get if the number of events had been Poisson-distributed. You can make inference similar to what you would do if events were Poisson-distributed, but you can not make model-assessment were you use the distribution-assumption. Therefore, it is not meaningfull to say that data is overdispersed because it is a conclusion that use the distribution-assumption, and I guess it is of that reason you use the negative-binomial distribution.

It should not matter whether you include a factor in the class statement which is not used in the model statement. That will give same result. Though, if there are missing values in that variable, these observations will be deleted when the variable is included in the class statement. I will recommend not to include variables in class that are not used in the model.

It will matter a lot for the p-values in a negative-binomial model whether or not you aggregate on a variable before you run the analysis even though it is not included it neither the model-statement or the class statement. This is not the case for the poisson-model.

Jacob

wernie
Quartz | Level 8

Thanks for the reply. I ran the Vuong macro to see which test would be best and NB was suggested. So is this wrong to follow this or is it more just common sense that it would be better to use Poisson rather than NB for something like this? (Sorry, I'm not really familiar with knowing which model is best - I had just thought because it is modelling count data that is overdispersed that it would be NB.)

I shouldn't have missing data, just some datasets that have a lot of zeros (which is also why I used the Vuong macro - to see if a zero-inflated model would be better). I will remove 'age' from the class statement then because it is not used in the model, only in a 'where' statement. I know at one point I saw the number of observations was only saying 612, but it should have been 7344 if all of the age groups were there. I think that is what you mean with aggregating the variable. So if I include age in the proc sql statement after importing the data / before running the model, then that gives me 7344 observations when I run the model (for all ages) excluding age from the class statement. I think that was one of my previous errors.

Can you clarify something for me in terms of interpreting the mean estimate with the p-value? I know traditionally if the IRR is something like 1.3 it would mean that the exposed group has 1.3 times the cases than the reference group and if the CI does not cross 1 then it is significant. But, for example, in one of the models the IRR is 1.0028 (CI 1.0015 - 1.0042) p-value <0.0001. So this would only be 0.28% greater than the reference scenario and it is showing it as significant. But am I interpreting this incorrectly? I get similar results trying it with Poisson as well. Thanks again!

JacobSimonsen
Barite | Level 11

It is not always meaningfull to use the disperson index to select a model. If you have time-to-event data, and summarize these data with number of events and number of personyears on each combination of covariates, then you can analyze the data with Poisson-regression. But, that is just a trick to make inference about parameters. In such case you can not use anykind of statistics that make use of the poisson-distribution, therefore you can also not use the dispersonindex and analyze the data with negative binomial instead.

Even if you generate your own time-to-event data with piecewise-constant hazard-rates, and analyze with Poisson regegression you can observe a dispersion index far from 1. Even though that all assumptions for Poisson regression was fulfilled.

The reason why the p-values can change so dramatically is that in the negative binomial distribution it involves a variance parameter. If your data is aggregated on one more covaraite (even that this covariate is not included in the model!) then the variance parameter will be much smaller, which in turns will make confidenceintervals more narrow and p-values smaller.

I can illustrate the princip in this simple example: Let say you just have one covariate "a". then


data mydata;
  input a count personyears;
  logpop=log(personyears);
  cards;
0 10 10
1 20 40
;
run;
proc genmod data=mydata;
  class a;
  model count=a/dist=nb link=log offset=logpop;
  estimate 'a' a 1 -1;
run;

Now, same data, but in addition a covariate "b" is also observed and data is aggregated on "b" as well;

data mydata;
  input a b count personyears;
  logpop=log(personyears);
  cards;
0 0 6 5
0 1 4 5
1 0 10 20
1 1 10 20
;
run;


proc genmod data=mydata;
  class a;
  model count=a/dist=nb link=log offset=logpop;
  estimate 'a' a 1 -1;
run;

*with the negative binomial model you will get other p-values, but same mean estimates.

*with Poisson you will get same results with the two models.;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 3576 views
  • 0 likes
  • 2 in conversation