Hi,
I'm modelling claims frequency by using proc genmod for a GLM with Poisson distribution. I was hoping that someone could please help me understand the "offset" term better and when it should and shouldn't be used?
The data is at a per-policy level as in the example below, so I am unsure whether or not I should include the offset term. Please help!
Policyholder | Gender | Age | Months insured | Total number of claims | Total amount of claims |
Peter | Male | 22 | 6 | 1 | $10 000 |
Sue | Female | 32 | 12 | 0 | 0 |
- Thank you
For a much simpler view, since I'm not sure how stats savvy you are, you want to use an offset with Poisson modeling when you are modeling a rate (concentration, density, etc.) instead of a count. The log of the denominator of your rate, ln(rateDenom), becomes your offset. As youtoub mentions, you need to create ln(rateDenom) in a dataset before running genmod.
If you are modeling how many cars pass various intersections, and you observed them all for the exact same amount of time, you are modeling a count, so no offset is needed. If you observed each intersection for a different length of time, you would want to model the rate (cars/hour) rather than total cars; now you would need an offset of ln(hours) for each intersection.
If you are modeling the total number of bacteria in a colony, and all your colonies have the same observation time and resources (size of petri dish, etc.), then you are modeling a count and do not need an offset. If you are observing a concentration of bacteria (cells/mg fluid), then you have a "rate" and need an offset of ln(total mg fluid).
In your example, I think you are measuring a rate (claims/month). In this case, you will need to have an offset of ln(months).
An offset term should be used when the linear predictor includes a term whose coefficient is known (fixed at 1) rather than estimated, i.e. a term that should not be multiplied by any parameter.
In Poisson regression you will often have an offset because the mean value is proportional to the length of time each unit is observed. That is also the case in your question.
Here I call the observation time PY (Person Years). Then the expected count is
λ = PY * exp(β X) = exp(log(PY) + β X)
Therefore, log(PY) is an offset in the model equation.
This is only true when you model the mean with a multiplicative mean structure (log as the link function).
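To spell out that step as a worked equation (a sketch using the same symbols λ, PY, β and X; the numbers take the two example policies from the question and convert months to years):

```latex
\[
\lambda = PY \cdot \exp(\beta X)
\;\Longleftrightarrow\;
\log \lambda = \log(PY) + \beta X
\;\Longleftrightarrow\;
\log\!\left(\frac{\lambda}{PY}\right) = \beta X
\]
% Example offsets, with exposure measured in person-years:
%   Peter: 6 months  = 0.5 PY, offset = \log(0.5) \approx -0.693
%   Sue:   12 months = 1 PY,   offset = \log(1)   = 0
```

So the regression coefficients describe the log of the rate per person-year, while the response being fitted is still the raw count.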
You should include the offset term in your model statement since time of insurance coverage "Insured Month" is not the same across customers.
Assume you are interested in estimating the claim rate per month (per subject-month). In a simple Poisson regression model: log(λ) = β X + log(time), where λ is the expected number of claims and log(time) enters with its coefficient fixed at 1.
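Using the GENMOD procedure:
/* Create the offset: log of the exposure (months insured) */
data mydata;
  set mydata;
  log_time = log(Insured_Month);
run;
/* Poisson regression of the claim count y with log(exposure) as offset */
proc genmod data=mydata;
  class gender;
  model y = gender age / type3 dist=poisson link=log offset=log_time;
run;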
If you are interested in calculating the incidence of claims per subject-year, calculate log_time as log(Insured_Month/12).
If your data appear to be overdispersed (i.e. more variation than expected under a Poisson model, where E(Y) = VAR(Y) = λ), consider using a negative binomial distribution.
If the data appear to be zero-inflated (more zeros than expected under a Poisson model), consider using a zero-inflated Poisson (ZIP) model.
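As a rough sketch of how that check and refit might look, assuming the same variable names used in this thread (y, gender, age, Insured_Month) and a working dataset called claims, which is a name made up for illustration:

```sas
/* Assumption: mydata holds y (claim count), gender, age, Insured_Month */
data claims;
  set mydata;
  log_time = log(Insured_Month);   /* offset: log of exposure in months */
run;

/* Poisson fit: in the goodness-of-fit table, a Pearson Chi-Square */
/* Value/DF well above 1 is a common sign of overdispersion        */
proc genmod data=claims;
  class gender;
  model y = gender age / dist=poisson link=log offset=log_time type3;
run;

/* Same model with a negative binomial response; the offset is unchanged */
proc genmod data=claims;
  class gender;
  model y = gender age / dist=negbin link=log offset=log_time type3;
run;
```

The negative binomial fit adds a dispersion parameter to the output; an estimate close to zero suggests the Poisson model was adequate.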
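A minimal ZIP sketch in GENMOD, under the same assumptions about variable names (y, gender, age, Insured_Month); the dataset name claims and the choice of covariates in the ZEROMODEL statement are illustrative only:

```sas
/* Assumption: mydata holds y (claim count), gender, age, Insured_Month */
data claims;
  set mydata;
  log_time = log(Insured_Month);   /* offset: log of exposure in months */
run;

proc genmod data=claims;
  class gender;
  /* Count part: Poisson mean with log link and exposure offset */
  model y = gender age / dist=zip link=log offset=log_time;
  /* Zero-inflation part: probability of an excess zero, logit link */
  zeromodel gender age / link=logit;
run;
```

Which covariates belong in the zero-inflation part is a modeling decision; an intercept-only ZEROMODEL statement is a reasonable starting point.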
I would like to emphasize that log(time) should only be used as an offset in Poisson regression when the log is used as the link function, which, by the way, is the default.
As I mentioned above, when log is used as the link function, one has a multiplicative structure:
λ =time * exp(β X)=exp(log(time)+β X),
and it turns out that the log(time) should be used as offset.
But if one uses an additive mean structure, e.g. with identity as the link function, then the expected count is
λ = time * β X = β (time * X).
In that case one should not use log(time) as an offset, but should instead regress on the covariate vector multiplied by the time.
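A sketch of what the identity-link version could look like, assuming gender is coded 'Male'/'Female' as in the example table and the count is called y; the names claims_id, male, t_male and t_age are hypothetical and introduced only for this illustration:

```sas
/* Assumption: mydata holds y (claim count), gender, age, Insured_Month */
data claims_id;
  set mydata;
  male   = (gender = 'Male');        /* 0/1 indicator                 */
  t_male = Insured_Month * male;     /* time * covariate              */
  t_age  = Insured_Month * age;      /* time * covariate              */
run;

/* lambda = time*(b0 + b1*male + b2*age)                              */
/*        = b0*time + b1*(time*male) + b2*(time*age)                  */
/* so time itself replaces the intercept and no offset is used        */
proc genmod data=claims_id;
  model y = Insured_Month t_male t_age / dist=poisson link=identity noint;
run;
```

With an identity link the fitted mean is not forced to stay positive, so this kind of model can fail to converge; that is one practical reason the log link with an offset is the usual choice.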
I am new to the SAS community and want to thank you very much for your responses, I really appreciate it and must say I am amazed by how helpful and knowledgeable you are! What I did to model the monthly frequency is use the data to create a "Frequency" column by dividing the number of claims by the exposure months. Please see below:
Policyholder | Gender | Age | Months insured | Total number of claims | Total amount of claims | Frequency |
Peter | Male | 22 | 6 | 1 | $10 000 | 0.16666667 |
Sue | Female | 32 | 12 | 0 | 0 | 0 |
So, using your formula JacobSimonsen:
λ =time * exp(β X)
I, in effect, divided both sides by time, to end up with a model for frequency:
λ/time =exp(β X)=frequency
where λ/time=frequency, and I therefore modelled frequency without the offset term, as below:
proc genmod data=mydata;
  class gender;
  model frequency = gender age /
    dist = poisson
    link = log;
run;
So the fact that the months of cover are not equal for all clients is accounted for by working out the frequency beforehand. The offset term is therefore not necessary, since I am modelling
λ/time = exp(β X). And this is still a Poisson distribution, since if X ~ Poi(λ), then X/t ~ Poi(λ/t).
Is my logic correct?
Thanks in advance for your help!
If you are assuming a Poisson distribution, your response y should be the total number of claims, not the pre-divided frequency: model Total_Number_of_Claims = gender age / type3 dist = poisson link = log offset = log_time;
I second youtoub. The Poisson distribution should have an integer response variable. Also, when you pre-divide, you are losing some information about that policy (namely, how long the policy has been active).
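One way to see why the pre-divided frequency is no longer Poisson, sketched with the notation used earlier in the thread (X is the claim count, t the months insured):

```latex
\[
X \sim \mathrm{Poisson}(\lambda), \qquad Y = \frac{X}{t}
\]
\[
\mathrm{E}[Y] = \frac{\lambda}{t}, \qquad
\mathrm{Var}(Y) = \frac{\lambda}{t^{2}}
\]
% A Poisson variable with mean \lambda/t would need variance \lambda/t,
% so mean and variance of Y agree only when t = 1.
% Y also takes the values 0, 1/t, 2/t, \dots rather than integers
% (for example 1/6 for Peter), which a Poisson variable cannot do.
```

Keeping the count as the response and putting log(t) in the offset models the same rate λ/t without breaking the distributional assumption.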