LTromp
Calcite | Level 5

Hi,

I'm modelling claims frequency by using proc genmod for a GLM with Poisson distribution.  I was hoping that someone could please help me understand the "offset" term better and when it should and shouldn't be used?

The data is at a per-policy level as in the example below, so I am unsure whether or not I should include the offset term.  Please help!

Policyholder  Gender  Age  Months insured  Total number of claims  Total amount of claims
Peter         Male    22   6               1                       $10 000
Sue           Female  32   12              0                       0

- Thank you

1 ACCEPTED SOLUTION

Kastchei
Pyrite | Level 9

For a much simpler view, since I'm not sure how stats savvy you are, you want to use an offset with Poisson modeling when you are modeling a rate (concentration, density, etc.) instead of a count.  The log of the denominator of your rate, ln(rateDenom), becomes your offset.  As youtoub mentions, you need to create ln(rateDenom) in a dataset before running genmod.

If you are modeling how many cars pass various intersections, and you observed them all for the exact same amount of time, you are modeling a count, so no offset.  If you observed each intersection for different lengths of time, you of course would want to model the rate (cars/hour) rather than total cars; now, you would need an offset of ln(hours) for each intersection.

If you are modeling total number of bacteria in a colony, and all your colonies have the same observation time and resources (Petri dish size, etc.), then you are modeling a count and do not need an offset.  If you are observing a concentration of bacteria (cells/mg fluid), then you have a "rate" and need an offset of ln(total mg fluid).

In your example, I think you are measuring a rate (claims/month).  In this case, you will need to have an offset of ln(months).


7 REPLIES
JacobSimonsen
Barite | Level 11

An offset should be used when the linear predictor contains a term whose coefficient is fixed at 1 rather than estimated.

Poisson regression often has an offset because the mean is proportional to the length of time over which each observation is followed. That is also the case in your question.

Here I call the observation time PY (Person Years). Then the expected count is

λ = PY * exp(βX) = exp(log(PY) + βX)

Therefore, log(PY) is an offset in the model equation.

This holds only when the mean is modelled with a multiplicative structure, i.e. with log as the link function.
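
The identity above can be checked numerically. The following is only a minimal sketch with made-up values for PY and βX; the data set name offset_check and all variable names are hypothetical, not from the thread. Both expressions should give the same expected count.

```sas
/* Hypothetical values, chosen only for illustration */
data offset_check;
  py   = 0.5;                  /* observation time in person-years    */
  xb   = -1.2;                 /* assumed linear predictor (beta * X) */
  lam1 = py * exp(xb);         /* rate multiplied by exposure         */
  lam2 = exp(log(py) + xb);    /* same mean, log(PY) as the offset    */
  diff = lam1 - lam2;          /* zero up to floating-point rounding  */
  put lam1= lam2= diff=;       /* write the three values to the log   */
run;
```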

youtoub
Fluorite | Level 6

You should include the offset term in your MODEL statement, since the length of insurance coverage ("Insured Month") is not the same across customers.

Assume you are interested in estimating the claim rate per month (subject-month). In a simple Poisson regression model: log(λ) = βX + log(time), where log(time) is the offset, i.e. its coefficient is fixed at 1 rather than estimated.

Using the GENMOD PROCEDURE:

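/* Create the offset: the log of the exposure (months insured) */
data mydata;
set mydata;
log_time = log(Insured_Month);
run;

/* Poisson regression of the claim count y with log(exposure) as the offset */
proc genmod data=mydata;
class gender;
model y = gender age / type3 dist = poisson offset = log_time;
run;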

If you are interested in calculating the incidence of claims per subject-year, calculate log_time as log(Insured_Month/12);

If your data appear to be overdispersed (i.e. more variation than expected under a Poisson model, where E(Y) = Var(Y) = λ), consider using a negative binomial distribution.

If the data appear to be zero-inflated (more zeros than expected under a Poisson model), consider using a zero-inflated Poisson (ZIP) model.
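
As a rough sketch of those two alternatives, assuming the same mydata, y, gender, age and log_time as in the GENMOD code earlier in this reply, and a SAS release whose PROC GENMOD supports DIST=ZIP and the ZEROMODEL statement. The use of age in the ZEROMODEL statement is purely illustrative, not a recommendation. One common overdispersion check is the Pearson chi-square divided by its degrees of freedom in the goodness-of-fit table of the Poisson fit; values well above 1 suggest overdispersion.

```sas
/* Negative binomial: same mean structure and offset, extra dispersion parameter */
proc genmod data=mydata;
  class gender;
  model y = gender age / type3 dist = negbin link = log offset = log_time;
run;

/* Zero-inflated Poisson: the count part keeps the offset;              */
/* the ZEROMODEL statement models the probability of an excess zero.    */
/* The effect listed in ZEROMODEL (age) is an assumption for this demo. */
proc genmod data=mydata;
  class gender;
  model y = gender age / dist = zip offset = log_time;
  zeromodel age / link = logit;
run;
```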

JacobSimonsen
Barite | Level 11

I would like to emphasize that the log offset should be used in Poisson regression only when log is the link function, which, by the way, is the default.

As I mentioned above, when log is the link function, the mean has a multiplicative structure:

λ = time * exp(βX) = exp(log(time) + βX),

so log(time) should be used as the offset.

But if one uses an additive mean structure, e.g. identity as the link function, then the expected count is

λ = time * βX = β(time * X).

In that case one should not use log(time) as an offset, but instead regress on the covariate vector multiplied by time.
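
A minimal sketch of the identity-link version, assuming the mydata, y, Insured_Month and age names from youtoub's code; the new names mydata_id, time and time_age are hypothetical. Only age is used as a covariate to keep the example short. With λ = time * (β0 + β1*age), the intercept becomes the coefficient of time itself, so the model is fitted without a separate intercept. Note that an identity link can produce negative fitted means, so such a fit may fail to converge.

```sas
/* Multiply each covariate (and the implicit intercept) by the exposure */
data mydata_id;
  set mydata;
  time     = Insured_Month;   /* exposure; its coefficient plays the role of beta0 */
  time_age = time * age;      /* exposure times covariate                          */
run;

/* Additive mean structure: lambda = beta0*time + beta1*(time*age), no offset */
proc genmod data=mydata_id;
  model y = time time_age / dist = poisson link = identity noint;
run;
```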

LTromp
Calcite | Level 5

I am new to the SAS community and want to thank you very much for your responses. I really appreciate it and must say I am amazed by how helpful and knowledgeable you are!  To model the monthly frequency, I used the data to create a "Frequency" column by dividing the number of claims by the exposure months. Please see below:

Policyholder  Gender  Age  Months insured  Total number of claims  Total amount of claims  Frequency
Peter         Male    22   6               1                       $10 000                 0,16666667
Sue           Female  32   12              0                       0                       0

So, using your formula, JacobSimonsen:

          λ = time * exp(βX)

I, in effect, divided both sides by time, to end up with a model for frequency:

          λ/time = exp(βX) = frequency

where λ/time = frequency, and I therefore modelled frequency without the offset term, as below:


proc genmod data=mydata;
class gender;
model frequency = gender age / dist = poisson link = log;
run;

So the unequal months of cover across clients are accounted for by working out the frequency beforehand, and the offset term is therefore not necessary, since I am modelling λ/time = exp(βX).  And this is still a Poisson distribution, since if X ~ Poi(λ), then X/t ~ Poi(λ/t).


Is my logic correct?

Thanks in advance for your help!

youtoub
Fluorite | Level 6

If you are assuming a Poisson distribution, your response y should be the total number of claims (under whatever name that variable has in your data): model Total_Number_of_Claims = gender age / type3 dist = poisson link = log offset = log_time;

Kastchei
Pyrite | Level 9

I second youtoub.  The Poisson distribution should have an integer response variable.  Also, when you pre-divide, you are losing some information about that policy (namely, how long the policy has been active).
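
To illustrate the point about lost information, here is a small sketch with two made-up policies (the data set two_policies and its values are hypothetical, not the poster's data). Both end up with the same pre-divided frequency, even though policy B was observed for twice as long; keeping the integer count and log(months) as the offset preserves that difference.

```sas
/* Two hypothetical policies with identical frequency but different exposure */
data two_policies;
  input policy $ months claims;
  frequency  = claims / months;   /* pre-divided rate: 1/6 for both policies   */
  log_months = log(months);       /* offset that keeps the exposure information */
  datalines;
A 6 1
B 12 2
;
run;

proc print data=two_policies;
run;
```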
