Re: Behavior of offset term in PROC GENMOD for ZINB model

RyanSimmons · Posted 12-03-2015 01:14 PM

I am fitting a ZINB model in PROC GENMOD. I have noticed some odd behavior with the way the model is fit with respect to the offset term that I was hoping people here could help clarify.

The outcome of interest in the study is the count of acts of unprotected sex. I am offsetting this outcome in the ZINB model by the total count of sex acts. To put it another way, the outcome is the rate or proportion of unprotected sex. However, a non-trivial subset of my study population (about 77 of N=474 individuals) had no sex at all during the specified time frame. That is, both their outcome AND their offset term are equal to 0.

First, I fit a model where those 77 individuals had their outcomes set to missing (and they were thus excluded from the analysis). The code looks like this (where TI is the offset variable of interest, and UAVI is the outcome variable of interest):

DATA test1;
     set dat_analysis;
     logTI = log(TI);
run;

/* SAS sets values of logTI to missing where TI=0 */

PROC GENMOD data=test1;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

Here is some output from that model:

Number of Observations Read: 474
Number of Observations Used: 396
Missing Values: 78

Intercept Estimate = -0.6249
Intercept St. Err: 0.0619
Treatment Estimate = 0.0915
Treatment St. Err: 0.0865

I then ran the following code, where I manually set the offset terms to 0 instead of missing:

DATA test4;
     set dat_analysis;
     if TI>0 then logTI=log(TI);
     if TI=0 then logTI=0;
run;

PROC GENMOD data=test4;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

In this case, my results are slightly different. Some output:

Number of Observations Read: 474
Number of Observations Used: 473
Missing Values: 1

Intercept Estimate = -0.6554
Intercept St. Err: 0.0623
Treatment Estimate = 0.0713
Treatment St. Err: 0.0855

Now, why are these results different? If the offset term is set to 0, then those individuals have a rate of 0/0. I would think that SAS would ignore those cases, because it makes no mathematical sense, but clearly SAS IS incorporating that information into the model. But how? What is SAS doing, here?

Then, I fit two more models where I imputed a value for the offset term. In one model I replaced each 0 of the offset with a very small value (0.001), and in the other I replaced it with a large value (1000). Since all individuals with a 0 offset also had a 0 on the outcome of interest (by definition), I figured that the value of the offset would be irrelevant, and I would get analogous results. However, this turned out to be incorrect:

DATA test2;
     set dat_analysis;
     if TI=0 then TI=0.001;
     logTI = log(TI);
run;

/* MODEL 1 */
PROC GENMOD data=test2;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

DATA test3;
     set dat_analysis;
     if TI=0 then TI=1000;
     logTI = log(TI);
run;

/* MODEL 2 */
PROC GENMOD data=test3;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

Number of Observations Read: 474
Number of Observations Used: 473
Missing Values: 1

MODEL 1:
Intercept Estimate = -0.6250
Intercept St. Err: 0.0619
Treatment Estimate = 0.0915
Treatment St. Err: 0.0864

MODEL 2:
Intercept Estimate = -0.5763
Intercept St. Err: 0.0586
Treatment Estimate = 0.0815
Treatment St. Err: 0.0831

Now, admittedly, the differences between the models are small, but I don't understand why there are differences at all. In all of these cases, the individuals whose offset terms have been modified have 0 outcomes, so in all cases they are being modelled with a rate of 0. So shouldn't these models all be equivalent?

For the model in which I set the 0 offsets to 0.001, the results are equivalent to the model that ignored those observations entirely (i.e. they were set to missing). The models where I set the log offset to 0 manually and where I gave the 0 offsets a value of 1000 each gave results different from both of the other models.

How can we explain these results? And what is the most principled method for dealing with this case with 0 offsets?

RyanSimmons · Posted 12-03-2015 01:15 PM

I just realized the reason that the model with the 0 offsets set to 0.001 is equivalent to the model with the missing values. log(0.001) is about -6.9 (SAS's LOG is the natural log). Since negative counts aren't possible, they are being ignored by the model. Still, I find it odd that SAS doesn't give any warning or note in its output that tells you that invalid count numbers were encountered and dropped from the model. It still says that it used 473 observations, which clearly isn't the case. But I still don't understand why the other models give such incompatible results.
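
A quick check of the log value mentioned above, as a minimal sketch: in SAS the LOG function is the natural logarithm and LOG10 is the base-10 logarithm, so the two give different answers for 0.001. The variable names below are made up for illustration.

```sas
/* LOG is the natural log in SAS; LOG10 is base 10. */
data _null_;
     nat_log = log(0.001);     /* about -6.908 */
     log_ten = log10(0.001);   /* -3 */
     put nat_log= log_ten=;    /* writes both values to the log */
run;
```

Either way the offset is negative, which is a valid offset: it corresponds to an exposure below 1, not to a negative count. The output quoted earlier also lists 473 observations used for that model, so those rows do not appear to have been dropped.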

lvm · Posted 12-03-2015 05:20 PM

I am not sure where the problem lies. I just tried a similar approach and all worked just as expected. If you use an offset of missing (that is, log(0)), that observation is indeed missing in the analysis. I get the same results by removing the missing-offset observations or by simply using missing (.) for the offset. Using a missing offset, GENMOD correctly indicates that some values are missing. I was using a Poisson distribution. I think your problem is with the zinb distribution. As far as I can tell, the offset is used only in the model defined by the MODEL statement. I have not investigated this, but I am guessing that the link is not used with the ZEROMODEL. If that is the case, the observations with offset=. would still be used for some of the model fit.

lvm · Posted 12-03-2015 05:25 PM

I just tried my example with a ZIP model and I get the same results if I take out the observations with a missing offset or if I just use the missing value for the offset. So, I don't see a problem.

A response count of 0 is a perfectly valid number for the Poisson or Negative binomial; it is only the offset that matters here.

RyanSimmons · Posted 12-07-2015 11:12 AM

I think you slightly misunderstood my question. Of course the results will be the same when you set the values to missing versus if you remove them from the dataset entirely, but that wasn't the discrepancy. PROC GENMOD handles missingness the way I expect it to.

My question is what happens WHEN THE OFFSET IS MANUALLY SET TO 0. As you can see in the code I provided in the OP:

DATA test4;
     set dat_analysis;
     if TI>0 then logTI=log(TI);
     if TI=0 then logTI=0;
run;

And the accompanying text:

"If the offset term is set to 0, then those individuals have a rate of 0/0. I would think that SAS would ignore those cases, because it makes no mathematical sense, but clearly SAS IS incorporating that information into the model. But how? What is SAS doing, here?"

That's the issue. When you define an offset, you are inherently defining a rate. Yet a rate of 0/0 makes no mathematical sense, but SAS is able to fit the model to those observations. But I don't really understand HOW it is doing this.

lvm · Posted 12-07-2015 12:09 PM

An offset of 0 is a perfectly valid value for the offset. With a log link, it is interpreted on a log scale. Log(x)=0 is perfectly valid, because it means that x=1. So you are modeling 0/1, not 0/0.
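
To make the point above concrete, here is a small sketch that back-transforms each offset coding tried in this thread to the exposure it implies. With a log link the offset enters as log(exposure), so exposure = exp(offset). The dataset name implied_exposure and the variables coding and implied_TI are hypothetical.

```sas
/* Back-transform each logTI coding used for the TI=0 rows. */
data implied_exposure;
     infile datalines dlm=',';
     length coding $ 24;
     input coding $ logTI;
     implied_TI = exp(logTI);   /* exposure the model actually sees */
     datalines;
logTI set to 0,0
TI replaced by 0.001,-6.907755
TI replaced by 1000,6.907755
;
run;

proc print data=implied_exposure noobs;
run;
```

The first row comes back as an exposure of 1, so those individuals are treated as 0 events out of 1 act rather than 0 out of 0.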

RyanSimmons · Posted 12-07-2015 12:36 PM

Ah, crap, you're right. I kept forgetting that the 0 was on the log scale, so by setting those to 0 I am inherently giving them a non-transformed value of 1 for the offset. My mistake! I was getting confused in my head about where the log was being applied.
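
As a rough illustration of why the exposure assigned to the zero-count individuals changes the fit, the sketch below computes the log-likelihood contribution of a single UAVI=0 observation under a zero-inflated negative binomial for the three exposures used in this thread. The dispersion k and the zero-inflation probability pi0 are made-up values (the thread does not report them), and the linear predictor reuses the intercept estimate of -0.6249 quoted earlier purely as an example.

```sas
/* Contribution of one zero count to a ZINB log-likelihood.     */
/* k and pi0 are hypothetical; they are not taken from the fit. */
data zero_contribution;
     xb  = -0.6249;   /* linear predictor without the offset   */
     k   = 1;         /* NB dispersion (assumed)               */
     pi0 = 0.2;       /* zero-inflation probability (assumed)  */
     do TI = 0.001, 1, 1000;
          mu     = TI * exp(xb);            /* log link: log(mu) = log(TI) + xb */
          p0_nb  = (1 + k*mu)**(-1/k);      /* NB probability of a zero         */
          p0     = pi0 + (1 - pi0)*p0_nb;   /* ZINB probability of a zero       */
          loglik = log(p0);                 /* this row's contribution          */
          output;
     end;
     keep TI mu p0 loglik;
run;

proc print data=zero_contribution noobs;
run;
```

Under these assumed values, the TI=0.001 row contributes almost exactly 0 to the log-likelihood whatever the parameters are, which is consistent with that model matching the fit that excluded those individuals. With TI=1 or TI=1000 the zeros carry information, so the estimates can shift.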
