Combine cluster analysis with proc GENMOD

08-15-2014 11:03 AM

Good afternoon everyone!

I've tried to use cluster analysis to combine small groups of similar risks (same characteristics) so that they are easier to incorporate into GLMs (PROC GENMOD here).

I'm having difficulty linking step 1 to step 2. I have a traditional insurance table.

Question 1 (step 1): May I add the dependent variable freq_adj as a SUPPLEMENTARY VARIABLE?

Question 2 (step 2): I obtain my four clusters, but how do I apply them in PROC GENMOD? Where can I integrate my clusters?

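**STEP 1: MIXED CLUSTERING ANALYSIS**

/* Pre-cluster the policies into at most 20 groups; presegm holds the pre-cluster id */

PROC FASTCLUS DATA=ins.insurance MAXC=20 MAXITER=50 CONVERGE=0.01 MEAN=centres OUT=partial CLUSTER=presegm DELETE=5 DRIFT;

VAR ageconducteur region;

RUN;

/* Ward hierarchical clustering of the pre-cluster centres; carry presegm along */

PROC CLUSTER DATA=centres OUTTREE=tree METHOD=ward CCC PSEUDO PRINT=10;

VAR ageconducteur region;

COPY presegm;

RUN;

PROC SORT DATA=tree;

BY _ncl_;

RUN;

/* Cut the tree at 4 clusters; segm1 maps each presegm to its final CLUSTER (1 to 4) */

PROC TREE DATA=tree NCL=4 OUT=segm1;

COPY presegm;

RUN;

PROC SORT DATA=partial; BY presegm; RUN;

PROC SORT DATA=segm1; BY presegm; RUN;

/* Attach the final cluster to every policy through its pre-cluster id */

DATA segm;

MERGE partial segm1;

BY presegm;

RUN;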

**STEP 2: PROC GENMOD**

PROC GENMOD DATA = ???;

ODS OUTPUT ParameterEstimates=Genmod1_Param;

CLASS ageconducteur;

WEIGHT exposition;

MODEL freq_adj = ageconducteur region / MAXITER=2000 DIST=poisson LINK=log;

FORMAT ageconducteur forage.;

OUTPUT OUT=poisson;

RUN; QUIT;

Thanks for your help.

This message was edited by: CHARBIT Jonathan

08-15-2014 01:02 PM

Hi,

What is the source of dataset segm1?

08-15-2014 04:40 PM

An oversight on my part, thanks. I have edited the post.

08-15-2014 05:44 PM

What is freq_adj? Is it a frequency variable based on the 4 clusters?

08-15-2014 06:22 PM

freq_adj is my dependent variable (number of claims). It was not included in the cluster analysis because I can't work out how to link the cluster analysis to PROC GENMOD, PROC GAM, etc.

08-15-2014 11:53 PM

Please correct me if I am wrong.

freq_adj is included in the ins.insurance data, and you used only the predictors to run the cluster analysis and ended up with a four-cluster solution, right?

Now you want to fit a model for freq_adj using the dataset that has the 4 clusters, right?

08-16-2014 06:48 AM

I'm lost.

Yes, freq_adj is included in the ins.insurance data.

"You used only the predictors to run the cluster analysis": must I run a regression model before the cluster analysis?

Yes, I want to fit a model for freq_adj (number of claims) using the dataset that has the 4 clusters produced by the cluster analysis.

Thanks for your time.

08-16-2014 07:54 AM

So you are trying to run 4 models, one per cluster, after merging the freq_adj variable into the cluster dataset, with the objective of producing better results within each cluster, right?

08-16-2014 08:11 AM

Yes, that was one of my ideas: combine groups of similar risks and run PROC GENMOD for each cluster to extract predictors.

I don't know whether this is an accepted method in insurance (or in other fields), or how to incorporate it into GLMs.
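One possible way to plug the clusters into PROC GENMOD is sketched below, under stated assumptions rather than as a definitive recipe. It assumes the merged dataset segm from step 1 has one row per policy and a variable named cluster holding the final four-group label, and that forage. is the user-defined format from the original step 2. The names segm_sorted and Genmod_ByCluster are hypothetical. The BY statement fits a separate Poisson model within each cluster.

```sas
/* Assumption: segm contains freq_adj, exposition, ageconducteur, region
   and a variable named cluster with the final group label (1 to 4). */
proc sort data=segm out=segm_sorted;   /* segm_sorted is a hypothetical name */
  by cluster;
run;

/* One Poisson GLM per cluster: BY repeats the fit for each cluster value */
proc genmod data=segm_sorted;
  by cluster;
  class ageconducteur;
  weight exposition;
  model freq_adj = ageconducteur region / dist=poisson link=log maxiter=2000;
  format ageconducteur forage.;        /* user-defined format, assumed to exist */
  ods output ParameterEstimates=Genmod_ByCluster;  /* hypothetical name */
run;
```

Genmod_ByCluster then holds one set of parameter estimates per cluster, identified by the cluster column. An alternative design is to keep a single model and list cluster in the CLASS and MODEL statements, which estimates one relativity per cluster instead of four separate models.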

08-16-2014 12:14 PM

The idea looks right, but clustering can produce better predictions than the overall model if freq_adj differs significantly across the 4 clusters.

08-16-2014 03:09 PM

OK, I'm going to develop that idea.

In your experience, what is the best way to check whether there is significant heterogeneity across the 4 clusters? How can I compare an overall model (a single GLM) with 4 GLMs?

Have a good day.

08-16-2014 09:51 PM

PROC ANOVA can be used to check for differences among the 4 clusters. To help me understand, why did you use PROC FASTCLUS first and then PROC CLUSTER for the cluster solution, and why create only 4 clusters?
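A minimal sketch of that check, assuming the merged dataset segm holds freq_adj and a variable named cluster with the four-group label. The first step is the one-way ANOVA suggested above. The second is an alternative added here for illustration, not something proposed in the thread: because freq_adj is a claim count, a Poisson model with cluster as the only effect tests the same hypothesis on the GLM scale.

```sas
/* One-way ANOVA: does mean freq_adj differ across the clusters? */
proc anova data=segm;
  class cluster;
  model freq_adj = cluster;
  means cluster / tukey;      /* pairwise comparisons between clusters */
run;
quit;

/* Alternative (assumption of this sketch): likelihood-ratio test of the
   cluster effect in a Poisson model, using the thread's exposure weight */
proc genmod data=segm;
  class cluster;
  weight exposition;
  model freq_adj = cluster / dist=poisson link=log type3;
run;
```

A small p-value for the cluster effect in either output suggests claim frequency is heterogeneous across clusters. It does not by itself show that four separate GLMs predict better than a single one.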

08-17-2014 06:31 AM

Because I have a large dataset (many clients), I began with PROC FASTCLUS and then took the cluster means (MEAN=centres) to run PROC CLUSTER. This is a mixed method: k-means pre-clustering followed by agglomerative hierarchical clustering (CAH in French).

I chose 4 clusters because, in PROC CLUSTER, I interpreted the CCC, the semi-partial R-squared and similar indicators, together with what the dendrogram showed.

I will run PROC ANOVA to see whether there is a significant difference between clusters.

So if the test is not significant, my per-cluster predictions will be less accurate than those of the overall model.

Step 1: Mixed CAH

Step 2: PROC ANOVA

Step 3: GLMs for each cluster, if the ANOVA reveals a significant difference between clusters.

Right?

I think the number of claims or the claim costs can vary a lot between clusters depending on the coverage. I'll see.

Thanks.

08-17-2014 12:23 PM

Seems like the right approach. What is step 3? Why are you using GLMs?

08-17-2014 01:59 PM

I use GLMs to estimate the pure premium (claim frequency * claim cost). I started down this path after reading the following text on casact.org (Casualty Actuarial Society):

"Cluster analysis applies a collection of different algorithms to group these units into clusters based on historical experience, modeled experience, or well-defined similarity rules. **This allows easier incorporation into GLMs.**"

**Yet it is essential to take heterogeneity into account in pricing.** I don't understand their reasoning.

1) Classical method **(this method does not use STEP 1)**:

The average claim frequency for customers in Area A1 and in the age group 20-29 is then:

0.044 * 0.689 * 0.472 = 0.014

where 0.044 is the intercept.

In the same way, we calculate the average claim size for this group to be:

61037 * 1.873 * 0.789 = 90211

The pure premium for this group is then 0.014 * 90211 = 1263.
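The arithmetic above follows the multiplicative structure of a log-link GLM, which can be written out as a short sketch. The symbols $\hat f$ (frequency), $\hat s$ (claim size), $\hat P$ (pure premium) and the generic relativities $r_1, r_2$ are introduced here for illustration only. The text does not say which factor belongs to the area and which to the age group, so they are left unlabeled. Multiplying the rounded severity factors as printed gives about 90,200 rather than the quoted 90,211; the gap presumably comes from rounding in the displayed coefficients.

```latex
\begin{aligned}
\hat f &= e^{\beta_0}\, r_1\, r_2 = 0.044 \times 0.689 \times 0.472 \approx 0.014,\\
\hat s &= 61037 \times 1.873 \times 0.789 \approx 9.02 \times 10^{4},\\
\hat P &= \hat f \times \hat s \approx 0.014 \times 90211 \approx 1263.
\end{aligned}
```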

(SAS source: http://www2.sas.com/proceedings/forum2008/333-2008.pdf)

Do you understand my questions?

Thanks for your help.