topic Which clustering method to use in PROC CLUSTER after inputting Gower dissimilarity matrix? in SAS Data Science

Which clustering method to use in PROC CLUSTER after inputting Gower dissimilarity matrix?

RobertF — Wed, 05 Apr 2017 17:52:34 GMT

I'm planning on performing a cluster analysis in SAS EG 6.1 on a health care dataset containing about 4,000 observations and ~100 nominal, interval, and ratio variables. The Gower similarity coefficient is a recommended proximity measure for mixed variable types, and it can be calculated using the DISTANCE procedure. The Gower dissimilarity matrix generated by PROC DISTANCE, gower_distance, will then be used as the input dataset for PROC CLUSTER.

proc distance data=health_data method=dgower out=gower_distance;
	var nominal(...) interval(...) ratio(...);
	id member_id;
run;

Question: Which clustering method is recommended in PROC CLUSTER for a Gower dissimilarity matrix?

Clustering methods that assume a Euclidean distance measure, such as Centroid and Ward's Minimum Variance, can be ruled out, but that still leaves a number of options.

Thanks!

Re: Which clustering method to use in PROC CLUSTER after inputting Gower dissimilarity matrix?

YingjianWang — Fri, 07 Apr 2017 18:28:39 GMT

Hi Robert,

Let's first look at what a Gower similarity is:

1) for interval parts: S_{ijk} = 1 − |x_{ik} − x_{jk}| / r_{k}, where i,j are the indices of the two observations, k is the index of the variable, r_{k} is the range of the k^{th} variable. In other words, the Gower similarity is 'one minus the normalized Manhattan distance'.

2) for nominal parts: S_{ijk} = 1 if x_{ik} = x_{jk}, or 0 if x_{ik} != x_{jk}. In other words, the Gower similarity is 'one minus the binary distance'.

'proc distance' with method=dgower calculates the Gower dissimilarity, which is one minus the Gower similarity, so the 'one minus' in the above equations drops out. The Gower dissimilarity can therefore be regarded as a type of distance (for mixed-type input with both interval and nominal variables).
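To make the combination of the per-variable terms concrete, here is a small worked sketch. It assumes equal variable weights and no missing values (a simplification; PROC DISTANCE also supports weights and handles missing values), with p denoting the number of variables and d_{ijk} = 1 − S_{ijk} the per-variable dissimilarity:

```latex
% Gower dissimilarity between observations i and j (equal weights, no missing values)
d_{ij} = 1 - \frac{1}{p}\sum_{k=1}^{p} S_{ijk} = \frac{1}{p}\sum_{k=1}^{p} d_{ijk},
\qquad
d_{ijk} =
\begin{cases}
  |x_{ik} - x_{jk}| / r_{k} & \text{interval variable } k \\
  0 \text{ if } x_{ik} = x_{jk},\ 1 \text{ otherwise} & \text{nominal variable } k
\end{cases}

% Hypothetical example with p = 3:
%   interval variable with range r = 50, values 30 and 40  ->  d = 10/50 = 0.2
%   nominal variable, same category                        ->  d = 0
%   nominal variable, different categories                 ->  d = 1
d_{ij} = \frac{0.2 + 0 + 1}{3} = 0.4
```

Every per-variable term lies between 0 and 1, so the overall dissimilarity does as well.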

And since the input is a distance matrix, not a data table in which each row is an observation, we can use 'proc cluster' for the clustering. The example below (which uses Euclidean distances on a different dataset) shows the workflow and produces a tree that displays the clustering structure in the data.

title 'Protein Consumption in Europe';
proc distance data=Protein out=Dist method=Euclid;
   var interval(RedMeat--FruitVeg / std=Std);
   id Country;
run;

proc cluster data=Dist method=Ward outtree=Tree noprint;
   id Country;
run;

axis1 order=(0 to 1 by 0.1);
proc tree data=Tree haxis=axis1 horizontal;
   height _rsq_;
   id Country;
run;

In addition, I would recommend considering 'proc kclus' as the clustering method. 'proc kclus' provides the k-prototypes clustering algorithm for mixed-type input. For the interval part, the distance options are Euclidean and Manhattan; for the nominal part, they are Binary, GlobalFreq, and RelativeFreq. These options cover the Gower similarity as a special case. For details on using 'proc kclus', please see its SAS procedure documentation.

Best Regards,

Yingjian

Re: Which clustering method to use in PROC CLUSTER after inputting Gower dissimilarity matrix?

RobertF — Fri, 07 Apr 2017 19:20:04 GMT

Yingjian,

Thank you for responding. PROC KCLUS looks interesting, and it looks like I can access it by downloading the free 14-day trial of SAS Viya.

In my question, I picked the Gower dissimilarity as the method in PROC DISTANCE (METHOD=DGOWER). However, after checking the SAS documentation, I see there is also the option to choose the Gower similarity (METHOD=GOWER), if that is the more correct methodology.

I'm hoping I can then use PROC CLUSTER in the base SAS STAT module. Maybe average linkage would be appropriate?

Robert

Re: Which clustering method to use in PROC CLUSTER after inputting Gower dissimilarity matrix?

YingjianWang — Fri, 07 Apr 2017 19:46:48 GMT

Hi Robert,

We need a distance measure to feed into 'proc cluster', and 'proc distance (method=dgower)' produces a dissimilarity, which plays the same role as a distance measure. So METHOD=DGOWER is what we need here.

Yes, 'proc cluster' provides 11 different methods for finding the hierarchical clustering structure in the data. These methods differ in how they define the distance between clusters. 'Average linkage' takes the average of the distances between all pairs of observations, one from each cluster, as the distance between the two clusters. It is a commonly chosen method and it may also fit your case. Detailed descriptions of these 11 methods, including their definitions, emphases, and references, can be found in the SAS documentation linked below. You may want to read it, compare the advantages and disadvantages, and choose the best one for your case.

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_cluster_sect012.htm
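
As a rough sketch of what the average-linkage route discussed above might look like, the following feeds the gower_distance dataset from the original question into PROC CLUSTER. The dataset names gower_tree and cluster_members and the choice of 5 clusters are hypothetical placeholders, not recommendations. NOSQUARE is included on the assumption that the Gower dissimilarities should be used as they are, rather than being squared as METHOD=AVERAGE does by default.

```sas
/* Sketch only: average linkage on the Gower dissimilarity matrix.      */
/* gower_distance and member_id come from the PROC DISTANCE step in the */
/* original question; gower_tree, cluster_members, and nclusters=5 are  */
/* hypothetical choices made for illustration.                          */
proc cluster data=gower_distance method=average nosquare
             outtree=gower_tree noprint;
   id member_id;
run;

/* Cut the tree at a chosen number of clusters and save the assignments. */
proc tree data=gower_tree nclusters=5 out=cluster_members noprint;
   id member_id;
run;
```

The cluster_members dataset would then hold one row per member_id with its assigned CLUSTER value. Swapping METHOD=AVERAGE for another linkage (for example COMPLETE or SINGLE) is a one-word change, which makes it easy to compare the resulting trees.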

Yingjian