07-25-2016 04:50 AM - edited 07-25-2016 04:57 AM
I'm reading mixed messages on this, and seeking clarification from you.
Many say to standardize each variable before doing clustering.
"Standardize the data; since it is based on Euclidean distance, we need to change all the factors into the same scale."
proc standard data=mydata mean=0 std=1 out=mydata1;
   var x1 x2 ... xn;
run;
In my particular case, however, keeping the variables in their current form seems valuable, for comparison purposes.
The cluster proc I am presently using is Fastclus.
Here is the code I'm using:
proc fastclus data=n_slope_means maxc=21 maxiter=100 out=clus;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;
By not standardizing, am I simply getting bogus garbage on which nothing can be based?
Any thoughts appreciated.
07-25-2016 10:10 AM
The following quote is from p. 43 of Gan, Ma, and Wu (2007), Data Clustering:
"It is necessary to standardize variables in cases where the dissimilarity measure, such as the Euclidean distance, is sensitive to the differences in the magnitudes or scales of the input variables." They reference Milligan and Cooper (1988), "A study of standardization of variables in cluster analysis."
So, yes, when you have continuous variables and are using a clustering algorithm that relies on distance to cluster the observations, you should either standardize or transform the variables. If you standardize, you have lots of options. I recommend PROC STDIZE rather than PROC STANDARD because PROC STDIZE enables you to do robust scaling if your data have outliers. For example, you can use the median as a location parameter and use a robust estimate of scale as a scaling parameter.
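As a rough sketch of the robust option, reusing the data set and variable names from your FASTCLUS call (the output names n_slope_std and clus_std are made up for illustration), METHOD=MAD centers each variable at its median and scales it by the median absolute deviation:

```sas
/* Robust standardization: location = median, scale = median absolute deviation */
proc stdize data=n_slope_means out=n_slope_std method=mad;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;

/* Cluster the standardized copy; the original data set is left unchanged */
proc fastclus data=n_slope_std maxc=21 maxiter=100 out=clus_std;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;
```

METHOD=IQR is a similar choice (median and interquartile range), and METHOD=STD gives the usual mean-0, standard-deviation-1 scaling.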
The other option is to use a transformation on the data first, such as a PCA. That might make interpretation harder because you are then clustering in the space of principal components.
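If you want to try the PCA route, something along these lines might be a starting point. The choice of three components (N=3) and the output names n_slope_pcs and clus_pcs are assumptions for illustration, not recommendations:

```sas
/* Principal components; PROC PRINCOMP analyzes the correlation matrix by default */
proc princomp data=n_slope_means out=n_slope_pcs n=3 noprint;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;

/* Cluster on the component scores Prin1-Prin3 written to the OUT= data set */
proc fastclus data=n_slope_pcs maxc=21 maxiter=100 out=clus_pcs;
   var Prin1-Prin3;
run;
```

By default the scores have variances equal to the eigenvalues, so the first component carries the most weight in the distance. The STD option on the PROC PRINCOMP statement rescales the scores to unit variance if you would rather weight the components equally.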
Intuitively, you standardize to eliminate the measurement scale. If I measure lengths by centimeters, I want the same clusters as if I measure in meters. If I measure height above sea level, I want the same clusters as if I measure height above the local ground. Standardization makes sure that results do not depend on the units I used to measure the variables.
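To put numbers on that, here is a small sketch with two made-up observations (the heights and weights are invented purely for illustration). It computes the Euclidean distance between them twice, once with height in meters and once with height in centimeters:

```sas
data _null_;
   /* Two hypothetical observations: height in meters, weight in kilograms */
   h1_m = 1.60; w1 = 60;
   h2_m = 1.90; w2 = 62;

   /* Euclidean distance with height in meters */
   d_m  = sqrt((h1_m - h2_m)**2 + (w1 - w2)**2);

   /* Same observations, height converted to centimeters */
   d_cm = sqrt((100*h1_m - 100*h2_m)**2 + (w1 - w2)**2);

   put d_m= d_cm=;
run;
```

With these values the distance is about 2.02 when height is in meters (the weight difference dominates) and about 30.07 when height is in centimeters (the height difference dominates). Which variable drives the distance depends only on the unit.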
You ask if the result is bogus if you don't standardize. Not necessarily. If the variables are all measured in the same (or similar) units, then your analysis has meaning. For example, you are probably okay if your variables are all times in seconds or distances in meters. However, if you are clustering variables that represent heights, weights, ages, and income, I would recommend that you standardize.
From your code (data=n_slope_means), maybe you are clustering slopes? If so, I think that a transformation to angles might be preferable.
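The reasoning is that slopes are unbounded: a nearly vertical line has an enormous slope, so a few steep values can dominate the Euclidean distances, whereas the arctangent maps every slope into the range -90 to 90 degrees. A minimal sketch, assuming the six variables really are slopes (that is only my guess from the data set name) and using made-up names for the new variables and data sets:

```sas
/* Convert each slope to an angle in degrees: angle = arctan(slope) */
data n_angle_means;
   set n_slope_means;
   array s_arr{*} _50501 _50502 _50503 _50504 _50505 _50508;
   array a_arr{*} a_50501 a_50502 a_50503 a_50504 a_50505 a_50508;
   do _i = 1 to dim(s_arr);
      /* ATAN returns radians; multiply by 180/pi to get degrees */
      a_arr{_i} = atan(s_arr{_i}) * 180 / constant('pi');
   end;
   drop _i;
run;

/* Cluster on the angles instead of the raw slopes */
proc fastclus data=n_angle_means maxc=21 maxiter=100 out=clus_angle;
   var a_50501 a_50502 a_50503 a_50504 a_50505 a_50508;
run;
```

Because all six angles are then in the same unit and on the same bounded scale, this is a case where clustering without further standardization is easier to justify.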