Making Zero-Inflation Count

 

In a previous post, we discussed using PROC GENMOD to extend our regression techniques to distributions beyond the normal. In this post, let’s discuss another capability of PROC GENMOD: zero-inflation. Within PROC GENMOD, we can add zero-inflation to both the Poisson and the Negative Binomial distributions. Let’s talk about these mixture distributions and how they can be useful in our analysis.

 

Using Poisson or Negative Binomial

Have you ever had a response variable that was a count, where the support was the non-negative integers? Have you wanted to model calls to a call center, visits to an emergency room, or the number of roots germinating from a plant? Each of these situations has a response variable that can never be negative, could be zero, and could be a positive integer. Variables like these are not normally distributed and so are a poor fit for standard linear regression. They are also not dichotomous, so logistic regression does not apply. Say hello to what I like to call the next most popular type of regression: count regression.

 

Within PROC GENMOD, we can use either Poisson or Negative Binomial as our distribution with the log link for these count regressions.
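Written out, the log link models the logarithm of the expected count as a linear function of the predictors. The notation below (the mean \mu_i, predictors x_{ij}, and coefficients \beta_j) is introduced here for illustration and does not come from the original post.

```latex
\[
\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}
\quad\Longleftrightarrow\quad
\mu_i = \exp\!\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right)
\]
```

Because of the exponential, the fitted mean can never be negative, and a one-unit increase in a predictor multiplies the expected count by exp(\beta_j).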

 

Poisson vs Negative Binomial

Why choose a Poisson distribution over a Negative Binomial, or the other way around? Put simply, the Negative Binomial is the more general distribution: it includes the Poisson as a limiting case, reached as its dispersion parameter approaches zero. A Poisson distribution has the property that the mean and variance of the variable are equal. The Negative Binomial does not have this requirement. Many analysts, when they first learn about count regression, go directly to the Poisson distribution. This can be problematic if they skip comparing the mean and variance of the response. But when should you worry about this issue? Let’s focus on the case where your count response contains excess zeros.
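One quick way to explore the mean and variance before choosing a distribution is PROC MEANS. The sketch below assumes the data set and response variable used in the example later in this post (sasuser.roots and roots); substitute your own names.

```sas
/* Compare the sample mean and variance of the count response.  */
/* Assumes the data set sasuser.roots and the variable roots    */
/* from the example later in this post.                         */
proc means data=sasuser.roots n mean var;
   var roots;
run;
```

A variance much larger than the mean hints at overdispersion. Keep in mind that the Poisson assumption concerns the variance conditional on the predictors, so treat this marginal comparison as a first look rather than a formal test.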

 

What is Zero-Inflation?

“But I thought zeros were allowed within the Poisson distribution.” You are correct. However, there are times when the number of zeros in the data exceeds the number that would be expected if the response truly followed a Poisson distribution. This surplus of zeros is what we call zero-inflation. The extra zeros pull the mean and variance apart, resulting in overdispersion: the variance of the response is larger than the Poisson model expects.
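To make "the number of zeros that would be expected" concrete: for a Poisson variable with mean \mu, the probability of a zero follows directly from the probability mass function. The value \mu = 4 below is only an illustrative number, not one taken from the data in this post.

```latex
\[
P(Y = 0) = e^{-\mu},
\qquad \text{for example } \mu = 4 \;\Rightarrow\; P(Y = 0) = e^{-4} \approx 0.018
\]
```

So a Poisson count with a mean near 4 should produce zeros less than two percent of the time. An observed share of zeros far above that is the signature of zero-inflation.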

 

01_damodl_blog2_counthisto.png

 


 

The good news is that the Negative Binomial is a good alternative for dealing with overdispersion. In fact, many analysts have started going directly to Negative Binomial for count data rather than immediately to the Poisson distribution.
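The reason the Negative Binomial can absorb overdispersion shows up in its variance function. The sketch below uses the mean-dispersion form, with k denoting the dispersion parameter; the symbol k is notation chosen here, not taken from the original post.

```latex
\[
\text{Poisson: } \operatorname{Var}(Y) = \mu
\qquad\qquad
\text{Negative Binomial: } \operatorname{Var}(Y) = \mu + k\,\mu^{2}, \quad k \ge 0
\]
```

When k is close to zero the two variance functions nearly coincide. Larger values of k let the variance grow faster than the mean, which is what overdispersed data require.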

 

But what if you wanted to determine which aspects of your problem could be causing the excess zeros? Could you build a model that focuses attention on those causes? This is where the zero-inflated distributions are effective.

 

Modeling with Zero-Inflation

Let’s look at an example to understand how zero-inflation modeling works.

 

02_damodl_blog2_examplesetup.png

 

In our example, we are scientists investigating a new fertilizer and its effect on root germination in apple tree saplings. In an experiment, we look at four concentration levels of the fertilizer and two levels of photo exposure. We then count the number of roots germinated from each sapling. This is clearly a count regression problem, but let’s see if a zero-inflation structure could help.

 

03_damodl_blog2_examplehisto.png

 

Look at that spike on the zero. If this were a true Poisson distribution, we would expect the zero-bar height to be closer to two percent. In our situation, we have much more than that.
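The comparison between the observed zero bar and the Poisson expectation can also be computed rather than read off the histogram. This is a minimal sketch, assuming the data set sasuser.roots and the variable roots from the example code later in this post, and assuming roots has no missing values. It uses a single Poisson with the overall sample mean, which ignores the predictors.

```sas
/* Observed share of zeros versus the share a single Poisson   */
/* with the same overall mean would produce.                    */
/* Assumes sasuser.roots and roots from the example code below. */
proc sql;
   select mean(roots)        as mean_roots,
          mean(roots = 0)    as observed_zero_share,
          exp(-mean(roots))  as poisson_zero_share
   from sasuser.roots;
quit;
```

If observed_zero_share is several times larger than poisson_zero_share, the data contain more zeros than a plain Poisson model can account for.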

 

Let’s get into some details about zero-inflation modeling. Both the zero-inflated Poisson and the zero-inflated Negative Binomial are mixture distributions available within PROC GENMOD. Full disclosure: neither of these zero-inflated distributions is a member of the exponential family, but each mixture does contain a member of the exponential family. The zero-inflated Poisson distribution is a mixture of the Poisson distribution and a point mass at zero. When a response value is non-zero, we know that value came from the Poisson distribution. If the response value is zero, it could have come either from the Poisson distribution or from the point mass at zero. Wouldn’t it be nice to model the probability that a zero comes from the point mass and not the Poisson? That sounds like a logistic regression question, and that is basically what the mixture gives us.
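The mixture described above can be written as a probability mass function. In this sketch, \omega is the probability of the point mass at zero and \lambda is the Poisson mean; the symbols and the covariate vectors x and z are notation introduced here for illustration.

```latex
\[
P(Y = y) =
\begin{cases}
\omega + (1 - \omega)\, e^{-\lambda}, & y = 0 \\[4pt]
(1 - \omega)\, \dfrac{e^{-\lambda} \lambda^{y}}{y!}, & y = 1, 2, \ldots
\end{cases}
\qquad
\log(\lambda) = \mathbf{x}'\boldsymbol{\beta}, \quad
\operatorname{logit}(\omega) = \mathbf{z}'\boldsymbol{\gamma}
\]
```

Under this mixture the mean is (1 - \omega)\lambda and the variance is (1 - \omega)\lambda(1 + \omega\lambda), so the variance exceeds the mean whenever \omega > 0. The zero-inflated Negative Binomial has the same structure, with the Poisson term replaced by the Negative Binomial.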

 

Within the code, we will be able to generate two models simultaneously. One will be for the Poisson component (MODEL) and the other will be for the logistic component (ZEROMODEL).

 

Let’s return to our apple example. Do you have a guess as to why we have excess zeros in our data? Could it be that high concentrations of fertilizer poisoned the plant? Could it be that the plant was cooked after being exposed to more light than usual? Could it even be a combination of the two? This is the purpose of the ZEROMODEL part of the code. In this line, we place the variables that we suspect contribute to the extra zeros. In the MODEL line, we place the variables that we think affect the Poisson side of the story. It is important to note that the same variables can appear in both the MODEL and ZEROMODEL lines. Let’s proceed with the thought that the photo period is causing the extra zeros.
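As an illustration of placing the same variables in both lines, here is a hypothetical zero-inflated Poisson specification in which photo and bap appear in both the count part and the zero part. It uses the variable names from the example code in this post, but this particular model is an assumption for illustration and is not one fitted in the original analysis.

```sas
/* Hypothetical variation: zero-inflated Poisson with photo and */
/* bap in both the count model and the zero-inflation model.    */
proc genmod data=sasuser.roots;
   /* Count (Poisson) component */
   model roots = photo bap photo_bap / dist=zip link=log;
   /* Logistic component for the probability of an excess zero */
   zeromodel photo bap / link=logit;
run;
```

Each variable then gets two sets of estimates: one describing its effect on the expected count and one describing its effect on the odds of an excess zero.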

 

Example

Let’s start with modeling our data directly with a Poisson distribution.

 

proc genmod data=sasuser.roots;
   /* Poisson count regression with a log link */
   model roots = photo bap photo_bap / dist=poi link=log;
run;
quit;

 

04_damodl_blog2_poissonGOF.png

 

Focusing on the goodness-of-fit statistics, the scaled Pearson chi-square divided by its degrees of freedom indicates that we may have an issue. This value should be close to 1, not the 2.8453 that we have here. Together with the mismatch between the mean and variance, this points to overdispersion. Let’s move to the Negative Binomial.
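For reference, the Pearson statistic behind this diagnostic can be sketched as follows, where n is the number of observations, p the number of estimated parameters, and V the variance function of the chosen distribution. The notation is introduced here and is not from the original post.

```latex
\[
\chi^{2}_{P} = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^{2}}{V(\hat{\mu}_i)},
\qquad V(\mu) = \mu \ \text{for the Poisson},
\qquad \frac{\chi^{2}_{P}}{n - p} \approx 1 \ \text{when the variance assumption holds}
\]
```

A ratio well above 1 means the squared residuals are, on average, larger than the assumed variance allows, which is how overdispersion shows up in this table.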

 

proc genmod data=sasuser.roots;
   /* Same effects, now with a Negative Binomial response */
   model roots = photo bap photo_bap / dist=nb link=log;
run;
quit;

 

05_damodl_blog2_negbinGOF.png

 

Looking at the scaled Pearson chi-square divided by its degrees of freedom, we see that under the Negative Binomial we are now closer to 1. This indicates that the Negative Binomial, by accounting for the overdispersion, fits our data better than the Poisson. The information criteria are also smaller in the Negative Binomial case.
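To compare the fit statistics side by side rather than across two separate outputs, the goodness-of-fit tables can be captured with ODS OUTPUT and stacked. This is a sketch: it assumes the ODS table name ModelFit for the GENMOD goodness-of-fit table, and the data set names fit_poisson, fit_negbin, and fit_compare are hypothetical.

```sas
/* Capture the goodness-of-fit table from each model.          */
/* ModelFit is assumed to be the ODS name of that table.       */
ods output ModelFit=fit_poisson;
proc genmod data=sasuser.roots;
   model roots = photo bap photo_bap / dist=poi link=log;
run;

ods output ModelFit=fit_negbin;
proc genmod data=sasuser.roots;
   model roots = photo bap photo_bap / dist=nb link=log;
run;

/* Stack the two tables and label which model each row is from */
data fit_compare;
   length model $ 20;
   set fit_poisson(in=from_poisson) fit_negbin;
   if from_poisson then model = 'Poisson';
   else model = 'Negative Binomial';
run;

proc print data=fit_compare;
run;
```

Running ODS TRACE ON before PROC GENMOD is a way to confirm the table name in your SAS release.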

 

Now let’s take this Negative Binomial viewpoint and move to zero-inflation.

 

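proc genmod data=sasuser.roots;
   /* Count component: zero-inflated Negative Binomial */
   model roots = photo bap photo_bap / dist=zinb;
   /* Zero-inflation component: logistic model for excess zeros */
   zeromodel photo / link=logit;
run;
quit;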

 

06_damodl_blog2_zinbGOF.png

 

07_damodl_blog2_zinbmodelparms.png

 

08_damodl_blog2_zinbzeromodelparms.png

 

The ZEROMODEL line contains the effects that we believe could contribute to the additional zeros in our collected data. If we thought that other items could explain this excess, we could include them in the ZEROMODEL line.
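To see what the zero-inflation part implies for each observation, the estimated probability of an excess zero can be written to a data set. This is a sketch under assumptions: it relies on the PZERO= keyword of the GENMOD OUTPUT statement, and the names zinb_pred, pred_mean, and p_zero are hypothetical.

```sas
/* Refit the zero-inflated model and save per-observation      */
/* predictions. PZERO= is assumed to request the estimated     */
/* zero-inflation probability.                                 */
proc genmod data=sasuser.roots;
   model roots = photo bap photo_bap / dist=zinb;
   zeromodel photo / link=logit;
   output out=zinb_pred pred=pred_mean pzero=p_zero;
run;

/* Summarize the saved predictions by level of photo */
proc means data=zinb_pred mean;
   class photo;
   var p_zero pred_mean;
run;
```

Because photo is the only effect in the ZEROMODEL line, p_zero should take one value per level of photo, which makes it easy to see how strongly the photo period is associated with the excess zeros.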

 

Now that we have explored zero-inflated count models, give zero-inflation a try and see whether it helps with your own count data.

 

 

