jonathanch
Calcite | Level 5

Good afternoon everyone!

I've tried to use cluster analysis to combine small groups of similar risks (same characteristics) to allow easier incorporation into GLMs (PROC GENMOD here).

I'm having difficulty making the link between step 1 and step 2. I have a traditional insurance table.

Question 1 (step 1): May I add the dependent variable freq_adj as a SUPPLEMENTARY VARIABLE?

Question 2 (step 2): I obtain my four clusters, but how do I apply them in PROC GENMOD? Where can I integrate my clusters?

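/* STEP 1: MIXED CLUSTERING ANALYSIS */

/* Preliminary k-means clusters; presegm holds the preliminary cluster number */
PROC FASTCLUS DATA=ins.insurance MAXC=20 MAXITER=50 CONVERGE=0.01 MEAN=centres OUT=partial CLUSTER=presegm DELETE=5 DRIFT;
  VAR ageconducteur region;
RUN;

/* Ward clustering of the FASTCLUS cluster means, carrying presegm along */
PROC CLUSTER DATA=centres OUTTREE=tree METHOD=ward CCC PSEUDO PRINT=10;
  VAR ageconducteur region;
  COPY presegm;
RUN;

PROC SORT DATA=tree;
  BY _ncl_;
RUN;

/* Cut the tree at 4 clusters; segm1 maps each presegm to its final CLUSTER */
PROC TREE DATA=tree NCL=4 OUT=segm1;
  COPY presegm;
RUN;

PROC SORT DATA=partial; BY presegm; RUN;

PROC SORT DATA=segm1; BY presegm; RUN;

/* Attach the final 4-level CLUSTER to every original observation */
DATA segm;
  MERGE partial segm1;
  BY presegm;
RUN;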

/* STEP 2: PROC GENMOD */

PROC GENMOD DATA=???;
  ODS OUTPUT ParameterEstimates=Genmod1_Param;
  CLASS ageconducteur;
  WEIGHT exposition;
  MODEL freq_adj = ageconducteur region / MAXITER=2000 DIST=poisson LINK=log;
  FORMAT ageconducteur forage.;
  OUTPUT OUT=poisson;
RUN; QUIT;

Thanks for your help.

This message was edited by: CHARBIT Jonathan

stat_sas
Ammonite | Level 13

Hi,

What is the source of dataset segm1?

jonathanch
Calcite | Level 5

An oversight on my part, thanks. I have corrected it.

stat_sas
Ammonite | Level 13

What is freq_adj? Is it a frequency variable based on the 4 clusters?

jonathanch
Calcite | Level 5

freq_adj is my dependent variable (number of claims). It was not included in the cluster analysis because I can't manage to make the link between the cluster analysis and PROC GENMOD, GAM...

stat_sas
Ammonite | Level 13

Please correct me if I am wrong.

freq_adj is included in the ins.insurance data, and you used only the predictors to run the cluster analysis and ended up with a four-cluster solution, right?

Now you want to run a model for freq_adj using the dataset that has the 4 clusters, right?

jonathanch
Calcite | Level 5

I'm lost.

Yes, freq_adj is included in the ins.insurance data.

"You used only the predictors to run the cluster analysis": must I run a regression model before the cluster analysis?

Yes, I want to run a model for freq_adj (number of claims) using the dataset that has the 4 clusters obtained from the cluster analysis.

Thanks for your time.

stat_sas
Ammonite | Level 13

So you are trying to run 4 models, one per cluster, after merging the freq_adj variable into the cluster dataset, with the objective of producing better results within each cluster, right?

jonathanch
Calcite | Level 5

Yes, that was one of my ideas: combining groups of similar risks and using PROC GENMOD for each cluster to extract predictors.

I don't know whether this is an acceptable method in insurance (or in another field), or how to incorporate it into GLMs.
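A minimal sketch of the "one GLM per cluster" idea, not a definitive implementation. It assumes the merged data set `segm` has one row per policy and that the final 4-level cluster from PROC TREE is stored in a variable named `cluster` (the PROC TREE default). The names `segm_sorted` and `Genmod_Param_ByCluster` are hypothetical, and `forage.` is the poster's own format.

```sas
/* Assumption: segm holds freq_adj, exposition, the predictors,
   and the final cluster number in the variable cluster. */
PROC SORT DATA=segm OUT=segm_sorted;   /* hypothetical output name */
  BY cluster;
RUN;

/* BY-group processing fits the same Poisson GLM once per cluster */
PROC GENMOD DATA=segm_sorted;
  BY cluster;
  ODS OUTPUT ParameterEstimates=Genmod_Param_ByCluster;
  CLASS ageconducteur;
  WEIGHT exposition;
  MODEL freq_adj = ageconducteur region / DIST=poisson LINK=log;
  FORMAT ageconducteur forage.;
RUN; QUIT;
```

The ODS output data set then contains one block of parameter estimates per value of `cluster`, which makes it easy to compare coefficients across clusters.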

stat_sas
Ammonite | Level 13

The idea looks right, but clustering can produce better predictions than an overall model if freq_adj differs significantly across the 4 clusters.

jonathanch
Calcite | Level 5

Ok I'm going to develop that idea.

In your experience, what is the best method to check whether there is significant heterogeneity across the 4 clusters? How can I compare an overall model (a single GLM) with 4 GLMs?
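One possible way to look at this inside a single model, offered as a sketch rather than a recommendation: add the cluster as a CLASS effect to the overall GLM and request Type 3 likelihood ratio statistics. It assumes the merged data set `segm` with the final cluster in a variable named `cluster`; `Fit_WithCluster` is a hypothetical data set name.

```sas
/* Assumption: segm contains the final 4-level cluster variable cluster. */
PROC GENMOD DATA=segm;
  ODS OUTPUT ModelFit=Fit_WithCluster;   /* goodness-of-fit criteria */
  CLASS ageconducteur cluster;
  WEIGHT exposition;
  /* TYPE3 requests a likelihood ratio test for each effect,
     including the cluster effect */
  MODEL freq_adj = ageconducteur region cluster / DIST=poisson LINK=log TYPE3;
  FORMAT ageconducteur forage.;
RUN; QUIT;
```

The fit criteria saved in `Fit_WithCluster` can be set side by side with those of the model without `cluster`. Because the clusters were built from `ageconducteur` and `region`, the cluster effect may be partly redundant with those predictors.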

Have a good day.

stat_sas
Ammonite | Level 13

PROC ANOVA can be used to check differences among the 4 clusters. To learn more: why did you use PROC FASTCLUS first and then PROC CLUSTER for the cluster solution, and why create only 4 clusters?
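A short sketch of that check, assuming the merged data set `segm` carries the final cluster in a variable named `cluster`:

```sas
/* One-way ANOVA: does mean freq_adj differ across the 4 clusters?
   Assumption: segm contains freq_adj and the variable cluster. */
PROC ANOVA DATA=segm;
  CLASS cluster;
  MODEL freq_adj = cluster;
  MEANS cluster / TUKEY;   /* pairwise comparisons of cluster means */
RUN; QUIT;
```

The F test in the output addresses the overall difference among clusters, and the Tukey comparisons show which pairs of clusters differ.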

jonathanch
Calcite | Level 5

Because I have a large dataset (many clients), I began with PROC FASTCLUS and then took the cluster means (MEAN=centres) to run PROC CLUSTER. That is a mixed CAH (agglomerative hierarchical clustering) method.

I chose 4 clusters because, in PROC CLUSTER, I interpreted the CCC, the semi-partial R-squared and the other indicators, as well as what the dendrogram showed.

I will run PROC ANOVA to see whether there is a significant difference between clusters.

So if the test is not significant, my predictions will be less accurate than those of the overall model.

Step 1: Mixed CAH

Step 2: PROC ANOVA

Step 3: GLMs for each cluster, if the ANOVA shows a significant difference between clusters.

Right?


I think the number of claims or the claim costs can be very volatile between clusters depending on the coverage. I'm going to see.

Thanks.

stat_sas
Ammonite | Level 13

Seems like the right approach. What is step 3? Why are you using a GLM?

jonathanch
Calcite | Level 5

I'm using a GLM to estimate the pure premium (claim frequency * claim cost). I started on this after reading the following text on casact.org (Casualty Actuarial Society):

"Cluster analysis applies a collection of different algorithms to group these units into clusters based on historical experience, modeled experience, or well-defined similarity rules. This allows easier incorporation into GLMs."

It is essential to take heterogeneity into account in pricing, yet... I don't understand their reasoning.

Here is how the GLM is used to estimate the pure premium:

1) Classical method (without STEP 1):

The average claim frequency for customers in Area A1 and in the ageGroup 20-29 is then:

0.044 * 0.689 * 0.472 = 0.014 (intercept = 0.044)

In the same way, we calculate the average claim size for this group to be:

61037 * 1.873 * 0.789 = 90211

The pure premium for this group is then 0.014 * 90211 = 1263.

(SAS source: http://www2.sas.com/proceedings/forum2008/333-2008.pdf)
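The multiplications above follow from the log link: the linear predictor is a sum, so the fitted mean is a product of exponentiated coefficients. As a worked sketch, with the symbols $\beta$ and $\gamma$ introduced here only for illustration (frequency and severity model coefficients respectively):

```latex
\begin{align*}
\text{frequency} &= \exp(\beta_0 + \beta_{\text{area}} + \beta_{\text{age}})
                  = e^{\beta_0}\, e^{\beta_{\text{area}}}\, e^{\beta_{\text{age}}}
                  = 0.044 \times 0.689 \times 0.472 \approx 0.014, \\
\text{severity}  &= \exp(\gamma_0 + \gamma_{\text{area}} + \gamma_{\text{age}})
                  = e^{\gamma_0}\, e^{\gamma_{\text{area}}}\, e^{\gamma_{\text{age}}}, \\
\text{pure premium} &= \text{frequency} \times \text{severity}
                  \approx 0.014 \times 90211 \approx 1263.
\end{align*}
```

Here 0.044 is the exponentiated intercept of the frequency model and 90211 is the average claim size quoted above. The other factors are the relativities for Area A1 and ageGroup 20-29.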

Do you understand my questions?

Thanks for your help.
