04-05-2017 01:52 PM

I'm planning to perform a cluster analysis in SAS EG 6.1 on a health care dataset containing about 4,000 observations and ~100 nominal, interval, and ratio variables. The Gower similarity coefficient is a recommended measure for mixed variable types and can be calculated with the DISTANCE procedure. The Gower dissimilarity matrix generated by PROC DISTANCE, gower_distance, will then be used as the input dataset for PROC CLUSTER.

    proc distance data=health_data method=dgower out=gower_distance;
       var nominal(...) interval(...) ratio(...);
       id member_id;
    run;

**Question: Which clustering method is recommended in PROC CLUSTER for a Gower dissimilarity matrix?**

Clustering methods that assume a Euclidean distance measure, such as Centroid and Ward's Minimum Variance, can be ruled out, but that still leaves a number of options.

Thanks!

All Replies

Posted in reply to RobertF

04-07-2017 02:28 PM

Hi Robert,

Let's first look at how the Gower similarity is defined:

1) For interval variables: $S_{ijk} = 1 - |x_{ik} - x_{jk}| / r_k$, where $i$ and $j$ index the two observations, $k$ indexes the variable, and $r_k$ is the range of the $k$-th variable. In other words, the Gower similarity is 'one minus the range-normalized Manhattan distance'.

2) For nominal variables: $S_{ijk} = 1$ if $x_{ik} = x_{jk}$, and $S_{ijk} = 0$ if $x_{ik} \neq x_{jk}$. In other words, the Gower similarity is 'one minus the binary distance'.
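For context, the per-variable scores above are usually combined into a single similarity by a weighted average. The following is a sketch of the standard Gower definition; the weights $w_k$ are an assumption here (commonly all equal to 1), since the thread does not state which weights were used. $D_{ij}$ denotes the corresponding dissimilarity.

```latex
S_{ij} = \frac{\sum_{k} w_k \, S_{ijk}}{\sum_{k} w_k},
\qquad
D_{ij} = 1 - S_{ij}
```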

PROC DISTANCE with METHOD=DGOWER computes the Gower dissimilarity, which drops the 'one minus' from the equations above. The Gower dissimilarity can therefore be regarded as a type of distance for mixed-type input with both interval and nominal variables.
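As a rough check of the definitions above, here is a minimal sketch on a hypothetical three-row data set. The data set `toy`, its variables `age` and `smoker`, and the ID values are made up for illustration. By the formulas, with equal weights and range standardization (the range of `age` is 40), M1 and M2 differ by 20/40 = 0.5 on `age` and by 1 on `smoker`, so their Gower dissimilarity should be (0.5 + 1)/2 = 0.75. It is worth confirming that the PROC DISTANCE output agrees in your SAS release.

```sas
/* Hypothetical toy data: one interval variable (age) and one
   nominal variable (smoker); the ID is read as character. */
data toy;
   input member_id $ age smoker;
   datalines;
M1 30 1
M2 50 0
M3 70 1
;
run;

/* Gower dissimilarity between every pair of rows */
proc distance data=toy method=dgower out=toy_dist;
   var interval(age) nominal(smoker);
   id member_id;
run;

/* Inspect the resulting lower-triangular matrix */
proc print data=toy_dist;
run;
```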

Because the input is then a distance matrix rather than a data table with one observation per row, we can pass it to PROC CLUSTER, as in the example below, which produces a tree showing the clustering structure in the data.

    title 'Protein Consumption in Europe';
    proc distance data=Protein out=Dist method=Euclid;
       var interval(RedMeat--FruitVeg / std=Std);
       id Country;
    run;

    proc cluster data=Dist method=Ward outtree=Tree noprint;
       id Country;
    run;

    axis1 order=(0 to 1 by 0.1);
    proc tree data=Tree haxis=axis1 horizontal;
       height _rsq_;
       id Country;
    run;

In addition, I would recommend PROC KCLUS as the clustering method. PROC KCLUS provides the k-prototypes clustering algorithm for mixed-type input. For the interval part, the distance options are Euclidean and Manhattan; for the nominal part, they are Binary, GlobalFreq, and RelativeFreq. This covers the Gower similarity as a special case. For details on using PROC KCLUS, please see its SAS procedure documentation.

Best Regards,

Yingjian

Posted in reply to YingjianWang

04-07-2017 03:20 PM

Yingjian,

Thank you for responding. PROC KCLUS looks interesting; it looks like I can access it by downloading the free 14-day trial of SAS Viya.

In my question, I picked the Gower dissimilarity as the method in PROC DISTANCE (METHOD=DGOWER); however, after checking the SAS documentation, I see there is also the option to choose the Gower similarity (METHOD=GOWER), if that is the more correct methodology.

I'm hoping I can then use PROC CLUSTER in the base SAS STAT module. Maybe average linkage would be appropriate?

Robert

Solution

04-11-2017
10:51 AM

Posted in reply to RobertF

04-07-2017 03:46 PM

Hi Robert,

PROC CLUSTER needs a distance measure as input, and PROC DISTANCE with METHOD=DGOWER produces a dissimilarity, which plays the same role as a distance. So DGOWER is what we need here.

Yes, PROC CLUSTER provides 11 different methods for finding the hierarchical clustering structure in the data. The methods differ in how they define the distance between two clusters. Average linkage takes the average of the distances between all pairs of observations, one from each cluster, as the distance between the two clusters. It is a commonly chosen method and may also fit your case. Detailed descriptions of the 11 methods, including their definitions, emphases, and references, can be found in the SAS documentation for PROC CLUSTER. You may want to read it, compare the advantages and disadvantages, and choose the best method for your case.
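To make the suggestion concrete, here is a minimal sketch of average linkage on the Gower matrix from the original question. It assumes `gower_distance` and `member_id` as named there; the output data set names `gower_tree` and `gower_clusters` and the choice of 5 clusters are hypothetical. NOSQUARE is included on the assumption that the Gower dissimilarities should be used as they are rather than squared, which PROC CLUSTER otherwise does for METHOD=AVERAGE.

```sas
/* Average linkage on the precomputed Gower dissimilarity matrix.
   TYPE=DISTANCE tells PROC CLUSTER the input is a distance matrix;
   NOSQUARE keeps the dissimilarities unsquared. */
proc cluster data=gower_distance(type=distance) method=average
             nosquare outtree=gower_tree noprint;
   id member_id;
run;

/* Cut the tree into a fixed number of clusters.
   NCLUSTERS=5 is an arbitrary, hypothetical choice. */
proc tree data=gower_tree nclusters=5 out=gower_clusters noprint;
   id member_id;
run;
```

The `gower_clusters` data set then holds one row per member with its assigned cluster, which can be merged back onto the original data by `member_id` for profiling.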

Yingjian