Solved: Re: Count Model Fit: Parameter Estimation for Visual Assessment

HHutch · Posted 10-30-2017 10:21 AM

I am experiencing a similar problem discussed in the following post:

https://communities.sas.com/t5/SAS-GRAPH-and-ODS-Graphics/proc-genmod-graphics-for-count-data-model-...

I am attempting to use graphics to illustrate the fit of both a Poisson model and negative binomial model to my data prior to model building. The data I have is count data with an offset examining incidence risk. The outcome is the number of new infections and the offset is the natural log of the number of subjects susceptible. According to the fit statistics ( Deviance/Pearson Chi-Square) a negative binomial distribution has a better fit. However, when I try to fit expected distribution curves to my data (PDF) or to create CDF curves, negative binomial clearly does not fit my data. I believe my problem is in the parameter estimation for my curves and not actually the lack-of-fit. When I use my code without an offset, both graphs are produced "correctly". When I include the offset, then the negative binomial graph does not fit. Can anyone provide me with assistance or suggestions in parameter estimation? Below is the code (including the offset) I have utilized from various resources online and within the SAS community. I did not include the code without the offset, for the sake of space, but can provide if desired. The code is identical without the offset included within my Proc Genmod commands. I have also included my data.

/* Tabulate counts and plot data */
proc freq data=herdsummary;
 tables HerdConv / out=FreqOut plots=FreqPlot(scale=percent);
run;

proc univariate data=herdsummary;
var herdconv;
cdfplot herdconv / vscale=proportion 
 odstitle="Empirical CDF" odstitle2="PROC UNIVARIATE";
ods output cdfplot=outCDF; /* data set contains ECDF values */
run;
**********************CDF- OFFSET************************;

proc genmod data=herdsummary;
 TITLE "Poisson With Offset";
 model herdconv= /dist=Poisson offset=logherdretest dscale; /* No variables are specified, only mean is estimated. */
 output out=PoissonFitOS p=lambda;
run;
 
/* Save Poisson parameter lambda in macro variables. */
data _null_;
 set PoissonFitOS(obs=1);
 call symputx('lambda', lambda);
run;
 
/* Use Min/Max values and the fitted Lambda to create theoretical Poisson Data. */
data TheoreticalPoissonOS;
 do HerdConv= 0 to 15;
 po=cdf('Poisson', HerdConv, &lambda);
 output;
 end;
run;


proc genmod data=herdsummary;
 TITLE "Negative Binomial with Offset";
 model HerdConv= /dist=NegBin offset=logherdretest; /* No variables are specified, only mean is estimated */
 output out= herdsummaryNB p=NBPred;
 ods output parameterestimates=NegBinParametersOS;
run;
 
/* Transpose Data. */
proc transpose data=NegBinParametersOS out=NegBinParametersOS;
 var estimate;
 id parameter;
run;
 
/* Calculate k and p from intercept and dispersion parameters. */
data NegBinParametersOS;
 set NegBinParametersOS;
 k = 1/(dispersion);
 p = 1/(1+exp(intercept)*dispersion);
run;
 
/* Save k and p in macro variables. */
data _null_;
 set NegBinParametersOS;
 call symputx('k', k);
 call symputx('p', p);
run;
 
/* Calculate theoretical Negative Binomial PMF based on fitted k and p. */
data TheoreticalNegBinOS;
 do HerdConv=0 to 15;
 NegBin=cdf('NegBinomial', HerdConv, &p, &k);
 output;
 end;
run;

data testOS;
 Merge theoreticalpoissonOS TheoreticalNegBinOS outcdf;
Run;

proc sgplot data=testOS;
 TITLE " Cumulative with Offset";
 step x=ECDFX y=ECDFY / lineattrs=(pattern=1 thickness=2)legendlabel="ECDF";
 step x=HerdConv y=PO /lineattrs=(pattern=15 thickness=2) legendlabel="Poisson Model Fit";
 Step x=Herdconv y=NegBin/lineattrs=(pattern=2 thickness=2) legendlabel="NegBin Model Fit";
 xaxis grid label="x" offsetmin=0.05 offsetmax=0.05;
 yaxis grid min=0 label="Cumulative Proportion";
run;
***********;


/* Fit Poisson distribution to data. */
proc genmod data=herdsummary;
 TITLE "Poisson with Offset";
 model herdconv= /dist=Poisson offset=logherdretest; /* No variables are specified, only mean is estimated. */
 output out=PoissonFitO p=lambda;
run;
 
/* Save Poisson parameter lambda in macro variables. */
data _null_;
 set PoissonFitO(obs=1);
 call symputx('lambda', lambda);
run;
 
/* Use Min/Max values and the fitted Lambda to create theoretical Poisson Data. */
data TheoreticalPoissonO;
 do HerdConv= 0 to 15;
 po=pdf('Poisson', HerdConv, &lambda);
 output;
 end;
run;

/* Fit Negative Binomial distribution to data. */

proc genmod data=herdsummary;
 TITLE "Negative Bin with Offset";
 model herdconv= /dist=NegBin offset=logherdretest; /* No variables are specified, only mean is estimated */
 ods output parameterestimates=NegBinParametersO;
run;

 
/* Transpose Data. */
proc transpose data=NegBinParametersO out=NegBinParametersO;
 var estimate;
 id parameter;
run;
 
/* Calculate k and p from intercept and dispersion parameters. */
data NegBinParametersO;
 set NegBinParametersO;
 k = 1/dispersion;
 p = 1/(1+exp(intercept)*dispersion);
run;
 
/* Save k and p in macro variables. */
data _null_;
 set NegBinParametersO;
 call symputx('k', k);
 call symputx('p', p);
run;
 
/* Calculate theoretical Negative Binomial PMF based on fitted k and p. */
data TheoreticalNegBinO;
 do herdconv=0 to 15;
 NegBin=pdf('NegBinomial', herdconv, &p, &k);
 output;
 end;
run;

/* Merge The datasets for plotting. */
data PlotDataO(keep=herdconv freq po negbin);
 merge TheoreticalPoissonO TheoreticalNegBinO FreqOut;
 by herdconv;
 freq = PERCENT/100;
run;


/* Overlay fitted densities with original data. */
title 'Count data overlaid with fitted distributions-WITH OFFSET';
proc sgplot data=PlotDataO noautolegend;
 vbarparm category=herdconv response=freq / legendlabel='Count Data';
 series x=herdconv y=po / markers markerattrs=(symbol=circlefilled color=red) 
 lineattrs=(color=red)legendlabel='Fitted Poisson PMF';
 series x=herdconv y=NegBin / markers markerattrs=(symbol=squarefilled color=green) 
 lineattrs=(color=green)legendlabel='Fitted Negative Binomial PMF';
 xaxis display=(nolabel);
 yaxis display=(nolabel);
 keylegend / location=inside position=NE across=1;
run;
title;

I have attached graphs generated with and without the offset. As you see, the code "works/gives an expected distribution" when an offset is not included. However, when an offset IS included, there is a spike at zero not explained by the data.

StatDave · Posted 10-31-2017 01:44 PM

You asked:

Does this assume that each set of cows has the same infection probability or each cow within a set has the same probability.

The latter is assumed. Each set (observation) is considered a population by default and each population can have a different probability depending on the setting of the predictors. So, if you just have sets of cows and want to model differences in the probabilities among the sets, then you should be able to just fit a simple logistic model in PROC LOGISTIC. You do not need a mixed model or GEE (MA) model unless there are correlations among the sets to be accounted for.

View solution in original post

Rick_SAS · Posted 10-30-2017 03:33 PM

It looks like you might have used the article "Fitting a Poisson distribution to data in SAS" for some of these computations. The ideas and code in that article assume that you are dealing with raw counts (no offset).

Your data are not raw counts, they are rates of the form HeardConv / HerdRetest.

To model the rates, use the offset variable log(herdretest). I only briefly looked at your program (can you reformat?) but it looks like you are modeling rates but plotting the CDF and PMF of the raw counts, which doesn't make sense. You need to compare rates to rates.

HHutch · Posted 10-30-2017 04:38 PM

Thank you for your response @Rick_SAS. I did indeed use your article. I do see that the code is not fully showing, and I am uncertain why. I have included the offset in my calculations, I believe. The reason I overlayed the raw counts with the PMF was I visually trying to show what the outcome actually was versus what it should be when the offset is included. So should I only be using the CDF? Or maybe I misunderstanding. If so, how do I graph the predicted rate versus the observed rate?

Rick_SAS · Posted 10-30-2017 05:01 PM

Can you explain your variables?

HerdConv = ?

HerdRetest = ?

What does each row represent?

HHutch · Posted 10-30-2017 05:04 PM

So I am trying to explain disease incidence across 108 cow herds. Each row represents a different herd. HerdConv= the total number of cows that became infected and HerdRetest= the total number of cows that were retested. This was during a 2 yr follow up period and cows were only retested once. Therefore, I can't really do rate analysis because I do not know the time at which they became infected. I just know that they were infected within the two year period. So I was trying to utilize a count model while keeping in mind that the number retested/herd was not the same.

Rick_SAS · Posted 10-30-2017 08:25 PM

I don't understand what you gain in the Poison model by keeping the different herds. For the code that you submitted, it seems like the parameter estimates will be the same as if you aggregate the data and declare that 350 cows got sick out of a herd of 1317.

I don't understand why a Poisson rate model is suitable for these data. There are lots of people on this forum who are more experienced with generalized linear models than me, so I'm calling for assistance. It looks like you used this article by @PeterClemmensen, so he might have some thoughts. I'd also be interested in insight from @StatDave.

HHutch · Posted 10-30-2017 08:34 PM

@Rick_SAS I appreciate your feedback and time. The intentions for my
analysis is to use different herd-level factors as potential predictors for
cows to become sick (such as herd prevalence). I am fairly new to
statistical analysis so I could be completely wrong and poisson rate model
may be inappropriate. I have considered a logistic regression model
accounting for random herd effects. However, I do not have cow-level
predictors. That is why I thought a count model might be more appropriate.
Yes, I did use the article you linked. Thanks again!

StatDave · Posted 10-31-2017 10:13 AM

If a cow can only be counted as infected if it was retested, then the ratio HerdConv/HerdRetest is a proportion rather than a rate with each cow being a Bernoulli trial and the number of cows of those retested is distributed binomial. This assumes that the set of cows retested are independent and have the same infection probability. As such, you would model this using logistic regression with one observation in your data set representing each set of cows retested. Any predictors would apply to the set of cows in each observation. Code would look like:

proc logistic; 
model herdconv/herdretest = x1 x2 ; 
run;

HHutch · Posted 10-31-2017 12:26 PM

Does this assume that each set of cows has the same infection probability or each cow within a set has the same probability. I believe I could say the probability is the same within but is not the same between.

Rick_SAS · Posted 10-31-2017 12:55 PM

if you think there is a "herd effect," then perhaps a mixed model where 'herd number' is a random effect.

HHutch · Posted 10-31-2017 01:03 PM

I have considered a mixed model incorporating a random herd effect. Since
all variables of interest are herd-level, would I need to make a
population-averaged/marginal model (PA) to accurately reflect the effects
of management factors that vary between herds? I am currently attempting to
do this with GENMOD until I figure out whether a count model is valid or
not. If I am making a PA model, would my repeated statement be "repeated
subject=herd" or "repeated subject=cow(herd)". There is only one
observation per cow and herd level is the interest. However, there are a
few instances where a "Cow ID" is the same number for two different cows on
two different herds. Again, thank you for all of your help!

StatDave · Posted 10-31-2017 01:44 PM

You asked:

Does this assume that each set of cows has the same infection probability or each cow within a set has the same probability.

The latter is assumed. Each set (observation) is considered a population by default and each population can have a different probability depending on the setting of the predictors. So, if you just have sets of cows and want to model differences in the probabilities among the sets, then you should be able to just fit a simple logistic model in PROC LOGISTIC. You do not need a mixed model or GEE (MA) model unless there are correlations among the sets to be accounted for.