Fitting Tweedie’s Compound Poisson-Gamma Mixture Model by Using PROC HPGENSELECT

Overview

Semicontinuous random variables are characterized by a continuous distribution that has point masses at one or more locations. One way to model semicontinuous data is to fit a generalized linear model by using a Tweedie distribution for the response variable. Tweedie distributions have been used in such diverse fields as actuarial science, economics, telecommunications, ecology, medicine, and meteorology. This example illustrates how to fit a Tweedie model to aggregate insurance claims payments data by using the HPGENSELECT procedure which is available in SAS/STAT 12.3, recently released with SAS 9.4.

Analysis

Exponential dispersion models are the response distributions for generalized linear models. Any exponential dispersion model can be characterized by its variance function V(), which describes the mean-variance relationship of the distribution when the dispersion is held constant. If Y follows an exponential dispersion model distribution that has mean μ, variance function V(), and dispersion φ, then the variance of Y can be written as

V(Y) = φV(μ)

Tweedie distributions are a special case of the exponential dispersion family for which V(μ) = μ^p and V(Y) = φ μ^p and (Dunn and Smyth, 2005). The distribution is defined for all values of p except values of p in the open interval (0, 1). Many important known distributions are a special case of Tweedie distributions including normal (p = 0), Poisson (p = 1), gamma (p = 2), and inverse Gaussian (p = 3). Apart from these special cases, the probability density unction of the Tweedie distribtion does not have an analyticl expression. For p > 1, it has the form

where for p ≠ 2 and for p = 2. The function a(y, φ) does not have an analytical expression. It is usually evaluated by using the series expansion methods that are described in Dunn and Smyth (2005).

For 1 < p < 2, the Tweedie distribution is a compound Poisson-gamma mixture distribution, which is the distribution of S defined as

where and are independently and identically distributed gamma random variables with the shape parameter α and the scale parameter θ. At Y = 0, the density is a probability mass that is governed by the Poisson distribution, and for values of Y > 0, the density is a mixture of gamma variates with Poisson mixing probability. The parameters λ, α, and θ are related to the natural parameters μ, φ, and p of the Tweedie distribution as

The mean of a Tweedie distribution is positive for p > 1.

Example: Modeling Insurance Claims Data

When modeling aggregate payments from insurance claims, if you assume that the arrival of claims follows a Poisson distribution, that the size of individual claims are independently and identically gamma distributed, and that the arrival and sizes are independent of one another, then the aggregate payments follow a Tweedie compound Poisson-gamma mixture distribution (Frees, 2010). This example uses PROC HPGENSELECT to fit a Tweedie model to the aggregate loss data from a Swedish study about third-party automobile insurance claims for 1977. The data were compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance (Andrews and Herzberg, 1985). The following SAS statements create the data set MotorIns, and Table 1 describes the variables:

data motorins;
  input Kilometres Zone Bonus Make Insured Claims Payment;
  LogInsured=log(insured);
  Zeros=ifn(payment ne 0,1,0);
  datalines;
1 1 1 1 455.13 108 392491
1 1 1 2 69.17 19 46221
1 1 1 3 72.88 13 15694
1 1 1 4 1292.39 124 422201
1 1 1 5 191.01 40 119373
1 1 1 6 477.66 57 170913

   ... more lines ...

5 7 7 8 13.06 0 0
5 7 7 9 384.87 16 112252
;
run;

Table 1: Example Data Set MotorIns

Variable	Type	Description
Kilometres	Class	Distance traveled
Zone	Class	Geographical zone
Make	Class	Make of automobile
Bonus	Continuous	No-claims bonus
Insured	Continuous	Number of insured drivers (years 100,000)
LogInsured	Continuous	Natural logarithm of Insured
Claims	Continuous	Number of insurance claims
Payment	Continuous	Sum of insurance claims payments, in Swedish kronor
Zeros	Binary	Indicator variable for Payment not equal to 0

Table 2 describes the levels of the classification variable Kilometres.

Table 2: Values and Labels of Kilometres

Value	Label
1	Less than 1,000 km per year
2	1,000–15,000 km per year
3	15,000–20,000 km per year
4	20,000–25,000 km per year
5	More than 25,000 km per year

Table 3 describes the levels of the classification variable Zone. The zones are given from a detailed investigation of 100 areas in 1972 and represent combinations of traffic intensity, state of roads, climatic differences, and so on (Andrews and Herzberg, 1985).

Table 3: Values and Labels of Zone

Value	Label
1	Stockholm, Göteborg, Malmö with surroundings
2	Other large cities with surroundings
3	Smaller cities with surroundings in southern Sweden
4	Rural areas in southern Sweden
5	Smaller cities with surroundings in northern Sweden
6	Rural areas in northern Sweden
7	Gotland

The models of cars are classified into 10 premium classes, but in a special investigation for 1977 eight common pure models were chosen and the rest were put in a combined class for reference (Andrews and Herzberg, 1985). The levels 1–8 of the classification variable Make represent the eight pure models, and level 9 is the combined class.

The variable Bonus is a measure of individual claim history. The insured motorist starts in the class Bonus = 1. Every year that no claim is filed, the insured moves up one class (Andrews and Herzberg, 1985).

Figure 1 shows a histogram of the response variable Payment and the proportion of zeros. The variable Payment exhibits a distribution that is fairly typical of semicontinuous variables: a significant density mass at zero and a continuous, right-skewed distribution elsewhere.

Figure 1: Distribution of Payment

Output 1: Frequency of Zeros

The FREQ Procedure

Zeros	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	385	17.64	385	17.64
1	1797	82.36	2182	100.00

The following SAS statements fit a Tweedie compound Poisson-gamma mixture model to the response variable Payment.

The CLASS statement specifies that the variables Kilometres, Zone, and Make are categorical variables. The SPLIT option requests that the columns of the design matrix that correspond to any effect that contains a split classification variable be able to be selected to enter or leave a model independently of the other design columns of that effect. The PARAM= option specifies a reference cell encoding for the classification variables.

The MODEL statement specifies that the response variable have a Tweedie distribution with a log link function. The candidates for the linear predictor include the main effects and the interactions between the classification variables Kilometres, Zone, and Make and the continuous variable Bonus. The OFFSET= option specifies that the variable LogInsured be included in the linear predictor with a coefficient of 1.

The SELECTION statement requests that stepwise selection be used and that the final model be chosen based on the AICC criterion. The DETAILS=SUMMARY option requests that only a summary of the selection process be displayed rather than the details from each step of the selection process.

The OUTPUT statement requests that the selected model’s prediction and residuals be saved to the SAS data set Tweedie.

The ID statement requests that the variable Payment also be included in the output data set.

proc hpgenselect data=motorins;
  class Kilometres Zone Make / split param=reference;
  model payment = Kilometres|Zone|Make|Bonus /
        dist=tweedie link=log offset=loginsured;
  selection method=stepwise(choose=aicc) details=summary;
  output out=tweedie P R;
  id payment;
run;

The “Performance Information” table in Output 2 shows that the procedure executed in single-machine mode (that is, on the server where SAS is installed). When high-performance procedures run in single-machine mode, they use concurrently scheduled threads. In this case, four threads were used.

The “Model Information” table reports that a Tweedie model was fit with a log link function.

The “Selection Information” table reports that stepwise selection was used, the selection and stopping criteria are the significance level of each individual effect, the entry significance level is the default value of 0.05, and the choose criterion is AICC.

Output 2: Performance, Model, and Selection Information

The HPGENSELECT Procedure

Performance Information
Execution Mode	Single-Machine
Number of Threads	4

Model Information
Data Source	WORK.MOTORINS
Response Variable	Payment
Offset Variable	LogInsured
Class Parameterization	Reference
Distribution	Tweedie
Link Function	Log
Optimization Technique	Quasi-Newton

Selection Information
Selection Method	Stepwise
Select Criterion	Significance Level
Stop Criterion	Significance Level
Choose Criterion	AICC
Effect Hierarchy Enforced	None
Entry Significance Level (SLE)	0.05
Stay Significance Level (SLS)	0.05
Stop Horizon	1

Output 3 shows that the sample size is 2,182, and the “Class Level Information” table shows the number of levels and the level values of the three classification variables.

Output 3: Sample Size and Classification Variable Levels

Number of Observations Read	2182
Number of Observations Used	2182

Class Level Information
Class	Levels		Reference Value	Values
Kilometres	5	*	5	1 2 3 4 5
Zone	7	*	7	1 2 3 4 5 6 7
Make	9	*	9	1 2 3 4 5 6 7 8 9

* Associated Parameters Split

The “Selection Summary” table in Output 4 reports the variables that are added at each step of the selection process. The summary shows that 30 effects plus an intercept were selected and that the selection process terminated because the sequence of effect additions and removals began cycling.

Output 4: Effect Selection Summary

The HPGENSELECT Procedure

Selection Summary
Step	Effect Entered	Effect Removed	Number Effects In	AICC	p Value
0	Intercept		1	44299.3018	.
1	Bonus		2	43694.2249	<.0001
2	Zone_1		3	43576.8235	<.0001
3	Kilometres_1		4	43414.2670	<.0001
4	Make_4		5	43251.5268	<.0001
5	Make_6		6	43186.2263	<.0001
6	Bonus*Kilometres_2		7	43128.0129	<.0001
7	Zone_5*Make_8		8	43103.4200	<.0001
8	BonusKilometres_2Zone_1*Make_8		9	43087.3135	<.0001
9	Zone_2		10	43057.5628	<.0001
10	Bonus*Kilometres_3		11	43035.8693	<.0001
11	Make_5		12	43021.0275	<.0001
12	Kilometres_1*Make_1		13	43010.8831	0.0003
13	Make_8		14	43000.7509	0.0003
14	Make_2		15	42990.3324	0.0003
15	BonusKilometres_4Zone_6*Make_4		16	42987.4687	0.0007
16	Kilometres_1Zone_5Make_7		17	42981.7743	0.0008
17	Kilometres_1Zone_6Make_7		18	42976.6767	0.0019
18	Bonus*Zone_5		19	42969.6789	0.0020
19	Bonus*Zone_3		20	42961.8306	0.0015
20	BonusKilometres_1Make_1		21	42954.6503	0.0019
21	Kilometres_4		22	42945.4183	0.0008
22	Make_1		23	42937.8402	0.0017
23	Bonus*Make_5		24	42931.7907	0.0039
24	Bonus*Kilometres_1		25	42925.8663	0.0045
25	Zone_2*Make_6		26	42920.2139	0.0075
26	BonusKilometres_1Zone_6		27	42915.7147	0.0082
27	BonusKilometres_3Zone_5*Make_2		28	42913.6024	0.0183
28	BonusKilometres_4Zone_1*Make_5		29	42911.4418	0.0190
29	Zone_1*Make_2		30	42908.4035	0.0202
30	BonusZone_1Make_2		31	42904.3219*	0.0099
31		Kilometres_1Zone_6Make_7	30	42907.6359	0.0534

* Optimal Value of Criterion

Stepwise selection stopped because the sequence of effect additions and removals is cycling.

The model at step 30 is selected where AICC is 42904.32.

The “Selected Effects” note in Output 5 lists the effects that are selected for the final model. The “Dimensions” table reports that 31 effects are included in the final model and 33 parameters are estimated. The “Fit Statistics” table reports that the value of AICC for the final model is 42,904.

Output 5: Selected Effects, Dimensions, Convergence Status, and Fit Statistics

Selected Effects:

Intercept Kilometres_1 Kilometres_4 Zone_1 Zone_2 Make_1 Make_2 Make_4 Make_5 Make_6 Make_8 Kilometres_1*Make_1 Zone_1*Make_2 Zone_2*Make_6 Zone_5*Make_8 Kilometres_1*Zone_5*Make_7 Kilometres_1*Zone_6*Make_7 Bonus Bonus*Kilometres_1 Bonus*Kilometres_2 Bonus*Kilometres_3 Bonus*Zone_3 Bonus*Zone_5 Bonus*Kilometres_1*Zone_6 Bonus*Make_5 Bonus*Kilometres_1*Make_1 Bonus*Zone_1*Make_2 Bonus*Kilometres_2*Zone_1*Make_8 Bonus*Kilometres_3*Zone_5*Make_2 Bonus*Kilometres_4*Zone_1*Make_5 Bonus*Kilometres_4*Zone_6*Make_4

Dimensions
Number of Effects	31
Number of Effects after Splits	31
Number of Parameters	33
Columns in X	31

Fit Statistics
-2 Log Likelihood	42837
AIC (smaller is better)	42903
AICC (smaller is better)	42904
BIC (smaller is better)	43091
Pearson Chi-Square	1015375
Pearson Chi-Square/DF	472.04777

Convergence criterion (GCONV=1E-8) satisfied.

Output 6 displays the estimates of the model parameters. The estimate of the dispersion parameter, φ, is 349.65 and the estimate of the power, p, is 1.36. The effect of using the SPLIT option in the CLASS statement is apparent. None of the classification variables have all their main effects or complete sets of interactions included in the model. The result is a more parsimonious model than you would achieve without enabling the design columns to enter and leave the model independently.

Output 6: Parameter Estimates

Parameter Estimates
Parameter	DF	Estimate	Standard Error	Chi-Square	Pr > ChiSq
Intercept	1	6.422717	0.030076	45604.6548	<.0001
Kilometres_1	1	-0.396625	0.050252	62.2944	<.0001
Kilometres_4	1	-0.163970	0.038900	17.7679	<.0001
Zone_1	1	0.431027	0.028089	235.4780	<.0001
Zone_2	1	0.234752	0.028513	67.7853	<.0001
Make_1	1	0.102037	0.031974	10.1842	0.0014
Make_2	1	0.104421	0.048899	4.5601	0.0327
Make_4	1	-0.676695	0.051711	171.2483	<.0001
Make_5	1	0.467489	0.091598	26.0480	<.0001
Make_6	1	-0.200767	0.039391	25.9776	<.0001
Make_8	1	0.240254	0.056656	17.9826	<.0001
Kilometres_1*Make_1	1	0.443491	0.132307	11.2358	0.0008
Zone_1*Make_2	1	0.728101	0.202768	12.8939	0.0003
Zone_2*Make_6	1	-0.286627	0.101679	7.9464	0.0048
Zone_5*Make_8	1	0.520894	0.160295	10.5599	0.0012
Kilometres_1Zone_5Make_7	1	0.736407	0.250920	8.6132	0.0033
Kilometres_1Zone_6Make_7	1	0.453568	0.230985	3.8558	0.0496
Bonus	1	-0.138215	0.006689	426.9209	<.0001
Bonus*Kilometres_1	1	-0.035637	0.010883	10.7225	0.0011
Bonus*Kilometres_2	1	-0.069006	0.006272	121.0511	<.0001
Bonus*Kilometres_3	1	-0.045452	0.006450	49.6621	<.0001
Bonus*Zone_3	1	0.019307	0.005369	12.9336	0.0003
Bonus*Zone_5	1	0.027528	0.007469	13.5859	0.0002
BonusKilometres_1Zone_6	1	0.031829	0.012152	6.8603	0.0088
Bonus*Make_5	1	-0.055964	0.018012	9.6539	0.0019
BonusKilometres_1Make_1	1	-0.063791	0.025037	6.4914	0.0108
BonusZone_1Make_2	1	-0.107665	0.039459	7.4448	0.0064
BonusKilometres_2Zone_1*Make_8	1	0.166930	0.043453	14.7578	0.0001
BonusKilometres_3Zone_5*Make_2	1	0.101962	0.045120	5.1068	0.0238
BonusKilometres_4Zone_1*Make_5	1	0.098160	0.045496	4.6550	0.0310
BonusKilometres_4Zone_6*Make_4	1	0.233105	0.094307	6.1097	0.0134
Dispersion	1	349.647997	28.740411	.	.
Power	1	1.363043	0.008891	.	.

The following SAS statements generate a scatter plot that compares the model predictions with the observed values of the response variable.

proc sort data=tweedie out=tweedie;
  by payment;
run;

proc sgplot data=tweedie;
  scatter x=payment y=pred / legendlabel="Predicted";
  series x=payment y=payment / lineattrs=(pattern=solid color=red)
                               legendlabel="45 degree line";
  yaxis label="Predicted";
run;

Figure 2 shows that the predictions of the final model compare favorably with the observed responses.

Figure 2: Scatter Plot of Predicted versus Observed Payments

References

Andrews, D. F. and Herzberg, A. M. (1985), A Collection of Problems from Many Fields for the Student and Research Worker, New York: Springer-Verlag.
Dunn, P. K. and Smyth, G. K. (2005), “Series Evaluation of Tweedie Exponential Dispersion Model Densities,” Statistics and Computing, 15, 267–280.
Frees, E. W. (2010), Regression Modeling with Actuarial and Financial Applications, Cambridge: Cambridge University Press.
Jørgensen, B. and Paes de Souza, M. C. (1994), “Fitting Tweedie’s Compound Poisson Model to Insurance Claims Data,” Scandinavian Actuarial Journal, 1, 69–93.