I am experiencing a similar problem discussed in the following post:
I am attempting to use graphics to illustrate the fit of both a Poisson model and negative binomial model to my data prior to model building. The data I have is count data with an offset examining incidence risk. The outcome is the number of new infections and the offset is the natural log of the number of subjects susceptible. According to the fit statistics ( Deviance/Pearson Chi-Square) a negative binomial distribution has a better fit. However, when I try to fit expected distribution curves to my data (PDF) or to create CDF curves, negative binomial clearly does not fit my data. I believe my problem is in the parameter estimation for my curves and not actually the lack-of-fit. When I use my code without an offset, both graphs are produced "correctly". When I include the offset, then the negative binomial graph does not fit. Can anyone provide me with assistance or suggestions in parameter estimation? Below is the code (including the offset) I have utilized from various resources online and within the SAS community. I did not include the code without the offset, for the sake of space, but can provide if desired. The code is identical without the offset included within my Proc Genmod commands. I have also included my data.
/* Tabulate counts and plot data */
proc freq data=herdsummary;
tables HerdConv / out=FreqOut plots=FreqPlot(scale=percent);
run;
proc univariate data=herdsummary;
var herdconv;
cdfplot herdconv / vscale=proportion
odstitle="Empirical CDF" odstitle2="PROC UNIVARIATE";
ods output cdfplot=outCDF; /* data set contains ECDF values */
run;
**********************CDF- OFFSET************************;
proc genmod data=herdsummary;
TITLE "Poisson With Offset";
model herdconv= /dist=Poisson offset=logherdretest dscale; /* No variables are specified, only mean is estimated. */
output out=PoissonFitOS p=lambda;
run;
/* Save Poisson parameter lambda in macro variables. */
data _null_;
set PoissonFitOS(obs=1);
call symputx('lambda', lambda);
run;
/* Use Min/Max values and the fitted Lambda to create theoretical Poisson Data. */
data TheoreticalPoissonOS;
do HerdConv= 0 to 15;
po=cdf('Poisson', HerdConv, &lambda);
output;
end;
run;
proc genmod data=herdsummary;
TITLE "Negative Binomial with Offset";
model HerdConv= /dist=NegBin offset=logherdretest; /* No variables are specified, only mean is estimated */
output out= herdsummaryNB p=NBPred;
ods output parameterestimates=NegBinParametersOS;
run;
/* Transpose Data. */
proc transpose data=NegBinParametersOS out=NegBinParametersOS;
var estimate;
id parameter;
run;
/* Calculate k and p from intercept and dispersion parameters. */
data NegBinParametersOS;
set NegBinParametersOS;
k = 1/(dispersion);
p = 1/(1+exp(intercept)*dispersion);
run;
/* Save k and p in macro variables. */
data _null_;
set NegBinParametersOS;
call symputx('k', k);
call symputx('p', p);
run;
/* Calculate theoretical Negative Binomial PMF based on fitted k and p. */
data TheoreticalNegBinOS;
do HerdConv=0 to 15;
NegBin=cdf('NegBinomial', HerdConv, &p, &k);
output;
end;
run;
data testOS;
Merge theoreticalpoissonOS TheoreticalNegBinOS outcdf;
Run;
proc sgplot data=testOS;
TITLE " Cumulative with Offset";
step x=ECDFX y=ECDFY / lineattrs=(pattern=1 thickness=2)legendlabel="ECDF";
step x=HerdConv y=PO /lineattrs=(pattern=15 thickness=2) legendlabel="Poisson Model Fit";
Step x=Herdconv y=NegBin/lineattrs=(pattern=2 thickness=2) legendlabel="NegBin Model Fit";
xaxis grid label="x" offsetmin=0.05 offsetmax=0.05;
yaxis grid min=0 label="Cumulative Proportion";
run;
***********;
/* Fit Poisson distribution to data. */
proc genmod data=herdsummary;
TITLE "Poisson with Offset";
model herdconv= /dist=Poisson offset=logherdretest; /* No variables are specified, only mean is estimated. */
output out=PoissonFitO p=lambda;
run;
/* Save Poisson parameter lambda in macro variables. */
data _null_;
set PoissonFitO(obs=1);
call symputx('lambda', lambda);
run;
/* Use Min/Max values and the fitted Lambda to create theoretical Poisson Data. */
data TheoreticalPoissonO;
do HerdConv= 0 to 15;
po=pdf('Poisson', HerdConv, &lambda);
output;
end;
run;
/* Fit Negative Binomial distribution to data. */
proc genmod data=herdsummary;
TITLE "Negative Bin with Offset";
model herdconv= /dist=NegBin offset=logherdretest; /* No variables are specified, only mean is estimated */
ods output parameterestimates=NegBinParametersO;
run;
/* Transpose Data. */
proc transpose data=NegBinParametersO out=NegBinParametersO;
var estimate;
id parameter;
run;
/* Calculate k and p from intercept and dispersion parameters. */
data NegBinParametersO;
set NegBinParametersO;
k = 1/dispersion;
p = 1/(1+exp(intercept)*dispersion);
run;
/* Save k and p in macro variables. */
data _null_;
set NegBinParametersO;
call symputx('k', k);
call symputx('p', p);
run;
/* Calculate theoretical Negative Binomial PMF based on fitted k and p. */
data TheoreticalNegBinO;
do herdconv=0 to 15;
NegBin=pdf('NegBinomial', herdconv, &p, &k);
output;
end;
run;
/* Merge The datasets for plotting. */
data PlotDataO(keep=herdconv freq po negbin);
merge TheoreticalPoissonO TheoreticalNegBinO FreqOut;
by herdconv;
freq = PERCENT/100;
run;
/* Overlay fitted densities with original data. */
title 'Count data overlaid with fitted distributions-WITH OFFSET';
proc sgplot data=PlotDataO noautolegend;
vbarparm category=herdconv response=freq / legendlabel='Count Data';
series x=herdconv y=po / markers markerattrs=(symbol=circlefilled color=red)
lineattrs=(color=red)legendlabel='Fitted Poisson PMF';
series x=herdconv y=NegBin / markers markerattrs=(symbol=squarefilled color=green)
lineattrs=(color=green)legendlabel='Fitted Negative Binomial PMF';
xaxis display=(nolabel);
yaxis display=(nolabel);
keylegend / location=inside position=NE across=1;
run;
title;
You asked:
Does this assume that each set of cows has the same infection probability or each cow within a set has the same probability.
The latter is assumed. Each set (observation) is considered a population by default and each population can have a different probability depending on the setting of the predictors. So, if you just have sets of cows and want to model differences in the probabilities among the sets, then you should be able to just fit a simple logistic model in PROC LOGISTIC. You do not need a mixed model or GEE (MA) model unless there are correlations among the sets to be accounted for.
It looks like you might have used the article "Fitting a Poisson distribution to data in SAS" for some of these computations. The ideas and code in that article assume that you are dealing with raw counts (no offset).
Your data are not raw counts, they are rates of the form HeardConv / HerdRetest.
To model the rates, use the offset variable log(herdretest). I only briefly looked at your program (can you reformat?) but it looks like you are modeling rates but plotting the CDF and PMF of the raw counts, which doesn't make sense. You need to compare rates to rates.
Thank you for your response @Rick_SAS. I did indeed use your article. I do see that the code is not fully showing, and I am uncertain why. I have included the offset in my calculations, I believe. The reason I overlayed the raw counts with the PMF was I visually trying to show what the outcome actually was versus what it should be when the offset is included. So should I only be using the CDF? Or maybe I misunderstanding. If so, how do I graph the predicted rate versus the observed rate?
Can you explain your variables?
HerdConv = ?
HerdRetest = ?
What does each row represent?
I don't understand what you gain in the Poison model by keeping the different herds. For the code that you submitted, it seems like the parameter estimates will be the same as if you aggregate the data and declare that 350 cows got sick out of a herd of 1317.
I don't understand why a Poisson rate model is suitable for these data. There are lots of people on this forum who are more experienced with generalized linear models than me, so I'm calling for assistance. It looks like you used this article by @PeterClemmensen, so he might have some thoughts. I'd also be interested in insight from @StatDave.
If a cow can only be counted as infected if it was retested, then the ratio HerdConv/HerdRetest is a proportion rather than a rate with each cow being a Bernoulli trial and the number of cows of those retested is distributed binomial. This assumes that the set of cows retested are independent and have the same infection probability. As such, you would model this using logistic regression with one observation in your data set representing each set of cows retested. Any predictors would apply to the set of cows in each observation. Code would look like:
proc logistic; model herdconv/herdretest = x1 x2 ; run;
You asked:
Does this assume that each set of cows has the same infection probability or each cow within a set has the same probability.
The latter is assumed. Each set (observation) is considered a population by default and each population can have a different probability depending on the setting of the predictors. So, if you just have sets of cows and want to model differences in the probabilities among the sets, then you should be able to just fit a simple logistic model in PROC LOGISTIC. You do not need a mixed model or GEE (MA) model unless there are correlations among the sets to be accounted for.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.