08-15-2014 11:03 AM
Good afternoon everyone!
I've tried to use cluster analysis to combine small groups of similar risks (same characteristics) so that they are easier to incorporate into GLMs (PROC GENMOD here).
I'm having trouble linking step 1 and step 2. I have a traditional insurance table.
Question 1 (step 1): May I add the dependent variable freq_adj as a supplementary variable?
Question 2 (step 2): I obtain my four clusters, but how do I apply them in PROC GENMOD? Where do I integrate my clusters?
* STEP 1: MIXED CLUSTERING (k-means, then Ward on the cluster centres);
PROC FASTCLUS DATA=ins.insurance MAXC=20 MAXITER=50 CONVERGE=0.01 MEAN=centres OUT=partial CLUSTER=presegm DELETE=5 DRIFT;
VAR ageconducteur region;
RUN;
* presegm = k-means pre-cluster id, copied along so it can serve as the merge key;
PROC CLUSTER DATA=centres OUTTREE=tree METHOD=ward CCC PSEUDO PRINT=10;
VAR ageconducteur region;
COPY presegm;
RUN;
PROC TREE DATA=tree NCL=4 OUT=segm1;
COPY presegm;
RUN;
PROC SORT DATA=partial; BY presegm; RUN;
PROC SORT DATA=segm1; BY presegm; RUN;
* Attach the final cluster (variable CLUSTER, 1 to 4) to every policy;
DATA segm; MERGE partial segm1; BY presegm; RUN;
* STEP 2: PROC GENMOD;
PROC GENMOD DATA = ???;
ODS OUTPUT ParameterEstimates=Genmod1_Param;
CLASS ageconducteur;
MODEL freq_adj = ageconducteur region / MAXITER=2000 DIST=poisson LINK=log;
FORMAT ageconducteur forage.;
OUTPUT OUT=poisson;
RUN; QUIT;
Thanks for your help.
This message was edited by: CHARBIT Jonathan
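One possible way to connect the two steps is to treat the final cluster label as a categorical rating variable in the GLM. The sketch below is only an illustration under assumptions: it supposes step 1 ends with a merged dataset called WORK.SEGM (a hypothetical name) that holds freq_adj and the four-level variable CLUSTER created by PROC TREE, and pred_freq is a made-up output name. The OUT= dataset of PROC FASTCLUS keeps all variables of the input data, so freq_adj is carried through step 1 even though it is not listed in VAR.

```sas
/* Sketch only. Assumes a merged step-1 dataset WORK.SEGM (hypothetical  */
/* name) containing freq_adj and the 4-level variable CLUSTER.           */
PROC GENMOD DATA=segm;
  CLASS cluster;                              /* cluster as a categorical factor */
  MODEL freq_adj = cluster / DIST=poisson LINK=log TYPE3;
  ODS OUTPUT ParameterEstimates=Genmod1_Param;
  OUTPUT OUT=poisson PRED=pred_freq;          /* pred_freq: hypothetical name    */
RUN;
```

Here the cluster variable stands in for ageconducteur and region, since the clusters were built from those two variables. The TYPE3 option requests a likelihood ratio test of the cluster effect.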
08-15-2014 06:22 PM
freq_adj is my dependent variable (number of claims). It was not included in the cluster analysis because I haven't managed to link the cluster analysis to PROC GENMOD, GAM...
08-15-2014 11:53 PM
Please correct me if I am wrong.
freq_adj is included in the ins.insurance data, and you used only the predictors to run the cluster analysis and ended up with a four-cluster solution, right?
Now you want to model freq_adj using the dataset that has the 4 clusters, right?
08-16-2014 06:48 AM
I'm lost.
Yes, freq_adj is included in the ins.insurance data.
"you just used predictors to run cluster analysis". Must I run a regression model before the cluster analysis?
Yes, I want to model freq_adj (number of claims) using the dataset that has the 4 clusters obtained from the cluster analysis.
Thanks for your time.
08-16-2014 07:54 AM
So you are trying to run 4 models, one per cluster, after merging the freq_adj variable into the cluster dataset, with the objective of getting better results within each cluster, right?
08-16-2014 08:11 AM
Yes, that was one of my ideas: combine groups of similar risks and use PROC GENMOD on each cluster to extract the predictors.
I don't know whether this is an accepted method in insurance (or in another field), or how to incorporate it into GLMs.
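Fitting one model per cluster can be sketched with BY-group processing. This is a minimal illustration, not a validated pricing approach: it assumes a merged dataset WORK.SEGM (hypothetical name) with the variable CLUSTER, and Genmod_ByCluster is a made-up output name.

```sas
/* Sketch only: one Poisson GLM per cluster via BY-group processing.     */
/* WORK.SEGM and Genmod_ByCluster are hypothetical names.                */
PROC SORT DATA=segm; BY cluster; RUN;

PROC GENMOD DATA=segm;
  BY cluster;                                 /* separate fit for each cluster */
  CLASS ageconducteur;
  MODEL freq_adj = ageconducteur region / DIST=poisson LINK=log;
  FORMAT ageconducteur forage.;
  ODS OUTPUT ParameterEstimates=Genmod_ByCluster;
RUN;
```

Because the clusters were formed on ageconducteur and region, those predictors may vary little inside a cluster, so some levels can be empty or poorly estimated within a given BY group.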
08-16-2014 12:14 PM
The idea looks right, but clustering can produce better predictions than the overall model if freq_adj differs significantly across the 4 clusters.
08-16-2014 03:09 PM
OK, I'm going to develop that idea.
In your experience, what is the best method to check whether there is significant heterogeneity across the 4 clusters? How can I compare an overall model (a single GLM) with 4 GLMs?
Have a good day.
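One way to frame the comparison, offered as a sketch under assumptions: for a Poisson model, four separate fits correspond to a single model in which every predictor interacts with the cluster, so testing the cluster terms compares "4 GLMs" against "1 GLM". WORK.SEGM is a hypothetical name for the merged step-1 dataset.

```sas
/* Sketch only: single model with cluster interactions.                  */
/* WORK.SEGM is a hypothetical name for the merged step-1 dataset.       */
PROC GENMOD DATA=segm;
  CLASS cluster ageconducteur;
  MODEL freq_adj = ageconducteur region cluster
                   cluster*ageconducteur cluster*region
                   / DIST=poisson LINK=log TYPE3;
  FORMAT ageconducteur forage.;
RUN;
```

The TYPE3 likelihood ratio tests on cluster and its interactions indicate whether the cluster-specific effects add anything to the overall model. Refitting without the cluster terms and comparing AIC is another option. If some age levels never occur in a cluster, the corresponding interaction parameters cannot be estimated.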
08-16-2014 09:51 PM
PROC ANOVA can be used to check for differences among the 4 clusters. To understand your approach better: why did you first use PROC FASTCLUS and then PROC CLUSTER for the cluster solution, and why create only 4 clusters?
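A one-way ANOVA of this kind could look like the sketch below. It assumes the merged dataset is called WORK.SEGM (hypothetical name) and contains freq_adj and the variable CLUSTER.

```sas
/* Sketch only: one-way ANOVA of freq_adj across the clusters.           */
/* WORK.SEGM is a hypothetical name for the merged step-1 dataset.       */
PROC ANOVA DATA=segm;
  CLASS cluster;
  MODEL freq_adj = cluster;
  MEANS cluster / TUKEY;                      /* pairwise cluster comparisons */
RUN;
QUIT;
```

The F test in the output addresses whether mean freq_adj differs across clusters. Since claim counts are usually far from normal, a Poisson likelihood ratio test on the cluster effect may be a more natural check.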
08-17-2014 06:31 AM
Because I have a large dataset (many clients), I began with PROC FASTCLUS and then took the cluster means (MEAN=centres) to run PROC CLUSTER. This is the mixed method: k-means followed by agglomerative hierarchical clustering (CAH in French).
I chose 4 clusters because of how I interpreted the PROC CLUSTER indicators (CCC, semi-partial R-squared...) and what the dendrogram showed.
I will run PROC ANOVA to see whether there is a significant difference between the clusters.
So if the test is not significant, my per-cluster predictions will be less accurate than the overall model's.
Step 1: Mixed clustering (k-means + CAH)
Step 2: PROC ANOVA
Step 3: GLMs for each cluster if the ANOVA shows a significant difference between clusters.
I think the number of claims or the claim costs can vary a lot between clusters depending on the guarantee. I'll see.
08-17-2014 01:59 PM
I am using a GLM to estimate the pure premium (claim frequency × claim cost). I started on this after reading the following text,
which I found on casact.org (Casualty Actuarial Society):
"Cluster analysis applies a collection of different algorithms to group these units into clusters based on historical
experience, modeled experience, or well-defined similarity rules. This allows easier incorporation into
It is essential to take heterogeneity into account in pricing, yet... I don't understand their reasoning.
GLM to estimate the pure premium (claim frequency × claim cost).
1) Classical method (without STEP 1):
The average claim frequency for customers in Area A1 and in age group 20-29 is:
0.044 * 0.689 * 0.472 = 0.014
In the same way, the average claim size for this group is:
61037 * 1.873 * 0.789 = 90211
The pure premium for this group is then 0.014 * 90211 = 1263.
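The multiplicative calculation above can be reproduced in a short DATA step. This is only an illustrative sketch: the variable names are made up, and the three numbers in each product are taken from the post as a base value times two relativities.

```sas
/* Sketch only: reproduce the multiplicative pure premium calculation.   */
/* Variable names are hypothetical; figures are those quoted above.      */
DATA _NULL_;
  freq = 0.044 * 0.689 * 0.472;   /* base frequency x two relativities  */
  sev  = 61037 * 1.873 * 0.789;   /* base claim size x two relativities */
  pure = freq * sev;              /* pure premium = frequency x severity */
  PUT freq= sev= pure=;           /* write the three values to the log  */
RUN;
```

Carrying unrounded intermediate values gives a frequency near 0.0143 and a pure premium somewhat above the 1263 quoted, so the figures in the post depend on where the rounding is applied.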
Do you understand my questions?
Thanks for your help.