About YingjianWang

YingjianWang · 04-19-2017

Hi,
- Even when the input to clustering is normally distributed, we can still apply a standardization to it. For example, if the input follows a normal distribution with mean \mu and standard deviation \sigma and we choose 'std' as the standardization, the input is converted to a normal distribution (still) with mean 0 and standard deviation 1.
- Setting the standardization to 'std' or to 'range' gives different outputs. 'std' subtracts the mean and divides by the standard deviation of the data; 'range' subtracts the minimum and divides by the range (max - min), so 'range' maps all input values into [0, 1] and makes them non-negative.
- Both standardizations, 'std' and 'range', are linear transforms of each variable: they shift and rescale it without changing the order or the relative spacing of its values. In that sense they do not change the cluster structure in the data when a Euclidean distance is in use.
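To make the two options concrete, here is a minimal sketch of the arithmetic in plain Python. It is only an illustration of the formulas described above, not the SAS implementation; the function names and the sample data are made up, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import statistics


def standardize_std(values):
    """'std': subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample (n - 1) standard deviation.
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def standardize_range(values):
    """'range': subtract the minimum, then divide by (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data for one variable.
data = [2.0, 4.0, 4.0, 6.0, 9.0]
z = standardize_std(data)     # mean 0, standard deviation 1, can be negative
r = standardize_range(data)   # all values between 0 and 1

print(z)
print(r)
```

Both functions shift and rescale the values of a single variable, so the ordering of the observations on that variable is the same before and after.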

YingjianWang · 04-19-2017

Hi, welcome to SAS. How to determine the number of clusters for a specific dataset is a very interesting question. The 'proper' number of clusters is determined not only by the data itself but also by the emphasis we place during the analysis of the data.
- For k-means, one of its most distinctive issues is that the number of clusters must be set before the clustering process starts. The k-means algorithm does not provide an estimate of the number of clusters as an output. On the contrary, we need to feed the number of clusters (the 'k') into k-means for it to work.
- The clustering criteria (Average, Centroid, Ward, etc.) each define the distance between clusters differently. Which criterion to choose depends on the user's specific perspective in analyzing the data. For example, 'average linkage' takes the average of the distances between all pairs of observations, one from each of the 2 clusters, as the distance between them. It is a commonly chosen method and it may also fit your case. Detailed descriptions of the 11 methods available in 'proc cluster', with their definitions, focuses, and references, can be found in the SAS documentation below. You may want to read it, compare the advantages and disadvantages, and choose the best one for your case. https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_cluster_sect012.htm
- In SAS there is also a procedure, 'kClus', which provides the Aligned Box Criterion (the 'ABC' algorithm) to heuristically determine the number of clusters before launching a k-means clustering. (See the 'proc kclus' SAS documentation.)
- Yes, it is very difficult to say which number is the best number of clusters in the data. Indeed, it is sometimes difficult to say whether there are any clusters in the data at all. Few datasets present perfect, uncontroversial cluster structures by themselves. All clustering processes involve dimension reduction and information loss to some extent. At the same time, 'soft clustering' such as a Gaussian mixture model, in contrast to 'hard clustering' such as k-means, yields a probability distribution over all the clusters for each observation, which in some situations gives more insight than the single cluster index given by hard clustering.
- Preprocessing the data before clustering will usually change the cluster structure in it, since the transformation may distort the defined distance between the observations in the space where they live. For example, a non-isometric transformation in Euclidean space can change the cluster structure in the data.
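The difference between hard and soft assignment mentioned above can be sketched as follows. This is an illustrative one-dimensional example in plain Python, not SAS code: the two Gaussian components, their weights, and the cluster centers are hypothetical values chosen by hand, whereas in practice they would be estimated from the data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def soft_assignment(x, components):
    """Membership probabilities of x under a Gaussian mixture.

    components is a list of (weight, mean, sd) tuples.
    """
    weighted = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]


def hard_assignment(x, centers):
    """Index of the nearest center, as a k-means style assignment."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


# Hypothetical mixture and centers (assumed, not fitted).
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
centers = [0.0, 4.0]

for x in (0.5, 1.9, 3.5):
    print(x, hard_assignment(x, centers), soft_assignment(x, components))
```

For a point close to the midpoint between the two centers, such as 1.9, the hard assignment reports only cluster 0, while the soft assignment shows that the two clusters are almost equally plausible.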

YingjianWang · 04-07-2017

Hi Robert, 'proc cluster' needs a distance measure as its input, and 'proc distance (method=dgower)' produces a dissimilarity, which plays the same role as a distance measure. So this is what we need here. Yes, 'proc cluster' provides 11 different methods for finding the hierarchical cluster structure in the data. These methods differ in how they define the distance between clusters. 'Average linkage' takes the average of the distances between all pairs of observations, one from each of the 2 clusters, as the distance between them. It is a commonly chosen method and it may also fit your case. Detailed descriptions of these 11 methods, with their definitions, focuses, and references, can be found in the SAS documentation below. You may want to read it, compare the advantages and disadvantages, and choose the best one for your case.
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_cluster_sect012.htm
Yingjian
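As a rough illustration of the average-linkage idea described above, the sketch below computes the mean pairwise distance between two small clusters in plain Python. It follows the textbook definition only; 'proc cluster' may apply its own conventions (for example, whether the distances are squared), and the function names and points here are hypothetical.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.dist(p, q)


def average_linkage(cluster_a, cluster_b, dist=euclidean):
    """Mean of the distances over all pairs (one point from each cluster)."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


# Two hypothetical clusters of 2-D points.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

print(average_linkage(a, b))
```

Any other dissimilarity, such as a Gower dissimilarity, could be passed as the `dist` argument in place of the Euclidean distance.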

YingjianWang · 04-07-2017

Hi Robert, Let's first look at what a Gower similarity is:
1) For interval parts: S_{ijk} = 1 − |x_{ik} − x_{jk}| / r_{k}, where i and j are the indices of the two observations, k is the index of the variable, and r_{k} is the range of the k^{th} variable. In other words, the Gower similarity is 'one minus the normalized Manhattan distance'.
2) For nominal parts: S_{ijk} = 1 if x_{ik} = x_{jk}, or 0 if x_{ik} != x_{jk}. In other words, the Gower similarity is 'one minus the binary distance'.
Since 'proc distance' calculates a Gower dissimilarity, there is no 'one minus' in the above equations. So the Gower dissimilarity can be regarded as a type of distance (for mixed-type input with both interval and nominal variables). And since the input is then a distance matrix, not a data table in which each row is an observation, we can use 'proc cluster' for the clustering as shown in the example below, which produces a tree showing the cluster structure in the data. The example uses method=Euclid on interval variables; for mixed-type input, the Gower dissimilarity is requested with method=dgower in 'proc distance' instead.

    title 'Protein Consumption in Europe';

    /* Compute the distance matrix from the standardized interval variables */
    proc distance data=Protein out=Dist method=Euclid;
       var interval(RedMeat--FruitVeg / std=Std);
       id Country;
    run;

    /* Hierarchical clustering on the distance matrix */
    proc cluster data=Dist method=Ward outtree=Tree noprint;
       id Country;
    run;

    /* Draw the tree */
    axis1 order=(0 to 1 by 0.1);
    proc tree data=Tree haxis=axis1 horizontal;
       height _rsq_;
       id Country;
    run;

Besides, I would recommend using 'proc kclus' as the clustering method. 'proc kclus' provides the k-prototypes clustering algorithm for mixed-type input. For the interval part, the distance options are Euclidean and Manhattan; for the nominal part, they are Binary, GlobalFreq, and RelativeFreq. This covers the Gower similarity as a special case. For details on using 'proc kclus', please see the SAS procedure documentation for it.
Best Regards, Yingjian
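The two per-variable rules above can be combined into a single dissimilarity between two observations. The following is a minimal sketch in plain Python, not the 'proc distance' implementation: averaging the per-variable terms with equal weights and ignoring missing values are assumptions, and the function name, variable kinds, and sample observations are hypothetical.

```python
def gower_dissimilarity(a, b, kinds, ranges):
    """Gower dissimilarity between observations a and b.

    kinds[k] is 'interval' or 'nominal'; ranges[k] is the range r_k of an
    interval variable (ignored for nominal variables).
    Assumption: equal weights for all variables and no missing values.
    """
    parts = []
    for x, y, kind, r in zip(a, b, kinds, ranges):
        if kind == "interval":
            # Normalized Manhattan distance: |x_ik - x_jk| / r_k
            parts.append(abs(x - y) / r)
        else:
            # Binary distance: 0 if the categories match, 1 otherwise
            parts.append(0.0 if x == y else 1.0)
    return sum(parts) / len(parts)


# Hypothetical mixed-type observations: (interval, nominal, interval).
kinds = ("interval", "nominal", "interval")
ranges = (20.0, None, 4.0)
obs_i = (10.0, "red", 1.0)
obs_j = (15.0, "blue", 1.0)

print(gower_dissimilarity(obs_i, obs_j, kinds, ranges))
```

Here the three per-variable terms are 0.25, 1.0, and 0.0, and their equal-weight average is the dissimilarity. Computing this value for every pair of observations gives the kind of matrix that a hierarchical clustering procedure takes as input.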

Online Status	Offline
Date Last Visited	‎04-26-2017 10:44 AM

Re: SAS miner internal standardization property

Re: Clustering in SAS Miner: Number of clusters determination, input d...

Re: Which clustering method to use in PROC CLUSTER after inputting Gow...
