RyanSimmons
Pyrite | Level 9

I am fitting a ZINB model in PROC GENMOD. I have noticed some odd behavior with the way the model is fit with respect to the offset term that I was hoping people here could help clarify.

 

The outcome of interest in the study is the count of acts of unprotected sex. I am offsetting this outcome in the ZINB model by the total count of sex acts. To put it another way, the outcome is the rate or proportion of unprotected sex. However, a non-trivial subset of my study population (about 77 of the N=474 individuals) had no sex at all during the specified time frame. That is, both their outcome AND their offset term are equal to 0.

 

First, I fit a model where those 77 individuals had their outcomes set to missing (and they were thus excluded from the analysis). The code looks like this (where TI is the offset variable of interest, and UAVI is the outcome variable of interest):

 

DATA test1;
     set dat_analysis;
     logTI = log(TI);
run;

/* SAS sets values of logTI to missing where TI=0 */

PROC GENMOD data=test1;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

 

Here is some output from that model:

 

Number of Observations Read: 474
Number of Observations Used: 396
Missing Values: 78

Intercept Estimate = -0.6249
Intercept St. Err: 0.0619
Treatment Estimate = 0.0915
Treatment St. Err: 0.0865

 

 I then ran the following code, where I manually set the offset terms to 0 instead of missing:

DATA test4;
     set dat_analysis;
     if TI>0 then logTI=log(TI);
     if TI=0 then logTI=0;
run;

PROC GENMOD data=test4;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

In this case, my results are slightly different. Some output:

Number of Observations Read: 474
Number of Observations Used: 473
Missing Values: 1

Intercept Estimate = -0.6554
Intercept St. Err: 0.0623
Treatment Estimate = 0.0713
Treatment St. Err: 0.0855

Now, why are these results different? If the offset term is set to 0, then those individuals have a rate of 0/0. I would think that SAS would ignore those cases, because it makes no mathematical sense, but clearly SAS IS incorporating that information into the model. But how? What is SAS doing, here?

 

Then, I fit two more models in which I imputed a value for the offset term. In one model I replaced each 0 of the offset with a very small value (0.001), and in the other I replaced it with a large value (1000). Since all individuals with a 0 offset also had a 0 on the outcome of interest (by definition), I figured that the value of the offset would be irrelevant, and I would get analogous results. However, this turned out to be incorrect:

  

DATA test2;
     set dat_analysis;
     if TI=0 then TI=0.001;
     logTI = log(TI);
run;

/* MODEL 1 */
PROC GENMOD data=test2;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

DATA test3;
     set dat_analysis;
     if TI=0 then TI=1000;
     logTI = log(TI);
run;

/* MODEL 2 */
PROC GENMOD data=test3;
     class Treatment(ref="Control") / param=ref;
     model UAVI = Treatment / dist=zinb offset=logTI;
     zeromodel;
run;

 

Number of Observations Read: 474
Number of Observations Used: 473
Missing Values: 1

MODEL 1:
Intercept Estimate = -0.6250
Intercept St. Err: 0.0619
Treatment Estimate = 0.0915
Treatment St. Err: 0.0864

MODEL 2:
Intercept Estimate = -0.5763
Intercept St. Err: 0.0586
Treatment Estimate = 0.0815
Treatment St. Err: 0.0831

Now, admittedly, the differences between the models are small, but I don't understand why there are differences at all. In all of these cases, the individuals whose offset terms have been modified have 0 outcomes, so in all cases they are being modelled with a rate of 0. So shouldn't these models all be equivalent?

 

For the model in which I set the 0 offsets to 0.001, the results are essentially the same as for the model that ignored those observations entirely (i.e. they were set to missing). The cases where I set the log offset to 0 manually and where I gave the 0 offsets a value of 1000 each gave different results from either of the other models.

 

How can we explain these results? And what is the most principled method for dealing with this case with 0 offsets?

6 REPLIES
RyanSimmons
Pyrite | Level 9
I think I see why the model with the 0 offsets set to 0.001 matches the model with the missing values. log(0.001) is about -6.9 (LOG in SAS is the natural log), so the expected count for those observations is essentially 0. An observed 0 is then almost certain under the count part of the model, so those observations contribute next to nothing to the likelihood. They are not dropped, though: the output still reports 473 observations used, and the estimates differ from the missing-value model in the fourth decimal place. But I still don't understand why the other models give such different results.
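As a rough check on how much a zero outcome can matter at different offsets, here is a minimal DATA step sketch that evaluates the ZINB log-likelihood contribution of one observation with UAVI=0 at the three exposures tried above. The values of pi_zero, k and xb are assumptions for illustration only (xb borrows the intercept from the first model); they are not output from any of the fits.

```sas
/* Sketch: log-likelihood contribution of a single observation with UAVI = 0
   under a ZINB model, evaluated at three exposures.
   pi_zero, k and xb are hypothetical values, not estimates from the fits. */
data zero_contrib;
    pi_zero = 0.2;      /* assumed zero-inflation probability            */
    k       = 1;        /* assumed negative binomial dispersion          */
    xb      = -0.6249;  /* linear predictor; intercept from first model  */
    do TI = 0.001, 1, 1000;
        logTI  = log(TI);                       /* natural log, as in the DATA steps above */
        mu     = exp(logTI + xb);               /* NB mean = exposure * rate               */
        nb0    = (1 + k*mu)**(-1/k);            /* NB probability of a zero count          */
        p0     = pi_zero + (1 - pi_zero)*nb0;   /* ZINB probability of a zero              */
        loglik = log(p0);                       /* contribution to the log-likelihood      */
        output;
    end;
run;

proc print data=zero_contrib noobs;
    var TI logTI mu nb0 p0 loglik;
run;
```

Under these assumed values, loglik is very close to 0 at TI=0.001, so such a row barely moves the fit, while at TI=1000 it is roughly -1.6, because a 0 with a large expected count can only be explained by the zero-inflation part. That would be consistent with the offset value mattering even when every affected outcome is 0.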
lvm
Rhodochrosite | Level 12

I am not sure where the problem lies. I just tried a similar approach and everything worked as expected. If you use a missing offset (that is, log(0)), that observation is indeed missing in the analysis. I get the same results by removing the missing-offset observations or by simply using missing (.) for the offset. With a missing offset, GENMOD correctly indicates that some values are missing. I was using a Poisson distribution. I think your problem is with the zinb distribution. As far as I can tell, the offset is used only in the model defined by the MODEL statement. I have not investigated this, but I am guessing that the link is not used with the ZEROMODEL. If that is the case, the observations with offset=. would still be used for some of the model fit.

lvm
Rhodochrosite | Level 12

I just tried my example with a ZIP model, and I get the same results whether I take out the observations with a missing offset or just use the missing value for the offset. So, I don't see a problem.

A response count of 0 is a perfectly valid number for the Poisson or Negative binomial; it is only the offset that matters here.

RyanSimmons
Pyrite | Level 9

I think you slightly misunderstood my question. Of course the results will be the same when you set the values to missing versus when you remove them from the dataset entirely, but that wasn't the discrepancy. PROC GENMOD handles missingness the way I expect it to.

 

My question is what happens WHEN THE OFFSET IS MANUALLY SET TO 0. As you can see in the code I provided in the OP:

 

DATA test4;
     set dat_analysis;
     if TI>0 then logTI=log(TI);
     if TI=0 then logTI=0;
run;

And the accompanying text:

 

 

"If the offset term is set to 0, then those individuals have a rate of 0/0. I would think that SAS would ignore those cases, because it makes no mathematical sense, but clearly SAS IS incorporating that information into the model. But how? What is SAS doing, here?"

 

That's the issue. When you define an offset, you are inherently defining a rate. Yet a rate of 0/0 makes no mathematical sense, but SAS is able to fit the model to those observations. But I don't really understand HOW it is doing this.

lvm
Rhodochrosite | Level 12

An offset of 0 is a perfectly valid value for the offset. With a log link, it is interpreted on the log scale. Log(x)=0 is perfectly valid, because it means that x=1. So you are modeling 0/1, not 0/0.
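To spell that out, here is a short sketch of how the offset enters the count part of the model with a log link. The symbols \mu_i (expected count), x_i (covariates) and \beta (coefficients) are notation chosen here for illustration, not GENMOD output names.

```latex
\log \mu_i = \log(\mathrm{TI}_i) + x_i^{\top}\beta
\quad\Longleftrightarrow\quad
\mu_i = \mathrm{TI}_i \,\exp\!\left(x_i^{\top}\beta\right)

% Setting the offset variable logTI_i to 0 is the same as setting TI_i = e^{0} = 1:
\log \mu_i = 0 + x_i^{\top}\beta
\quad\Longrightarrow\quad
\mu_i = 1 \cdot \exp\!\left(x_i^{\top}\beta\right)
```

So an observation with logTI=0 and UAVI=0 is treated as zero unprotected acts out of one act in total, which is a legitimate data point to the model even though it is not what was observed.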

RyanSimmons
Pyrite | Level 9

Ah, crap, you're right. I kept forgetting that the 0 was on the log scale, so by setting those to 0 I am inherently giving them a non-transformed value of 1 for the offset. My mistake! I was getting confused in my head about where the log was being applied.
