NKormanik
Barite | Level 11

I'm reading mixed messages on this, and seeking clarification from you.

 

Many say to standardize each variable before doing clustering.

 

"Standardize the data; since it is based on Euclidean distance, we need to change all the factors into the same scale."

 

proc standard data=mydata mean=0 std=1 out=mydata1;
var x1 x2 ... xn;
run;

 

In my particular case, however, keeping the variables in their current form seems valuable, for comparison purposes.

 

The cluster proc I am presently using is Fastclus.

 

Here is the code I'm using:

 

proc fastclus data=n_slope_means maxc=21 maxiter=100 out=clus;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;

By not standardizing, am I simply getting garbage results on which nothing can be based?

 

Any thoughts appreciated.

 

Nicholas Kormanik

 

 

Rick_SAS
SAS Super FREQ

The following quote is from p. 43 of Gan, Ma, and Wu (2007), Data Clustering:

"It is necessary to standardize variables in cases where the dissimilarity measure, such as the Euclidean distance, is sensitive to the differences in the magnitudes or scales of the input variables." They reference Milligan and Cooper (1988), "A study of standardization of variables in cluster analysis."

 

So, yes, when you have continuous variables and are using a clustering algorithm that relies on distance to cluster the observations, you should either standardize or transform the variables. If you standardize, you have lots of options. I recommend PROC STDIZE rather than PROC STANDARD because PROC STDIZE enables you to do robust scaling if your data have outliers. For example, you can use the median as a location parameter and use a robust estimate of scale as a scaling parameter.
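
As a rough sketch of the robust option, the step below uses METHOD=MAD in PROC STDIZE, which centers each variable at its median and scales by the median absolute deviation. The data set and variable names come from the question; the output names slopes_std and clus_std are made up for illustration.

```sas
/* Robust standardization: location = median, scale = median absolute deviation */
proc stdize data=n_slope_means out=slopes_std method=mad;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;

/* Cluster the standardized copy; slopes_std and clus_std are hypothetical names */
proc fastclus data=slopes_std maxclusters=21 maxiter=100 out=clus_std;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;
```

The OUT= data set from PROC FASTCLUS then holds the standardized values, so if you want to compare clusters in the original units you would need to carry the CLUSTER variable back to the original data.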

 

The other option is to transform the data first, for example with a principal component analysis (PCA). That might make interpretation harder because you are then clustering in the space of principal components.
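
A minimal sketch of the PCA route, assuming you keep only the first few components: PROC PRINCOMP writes the component scores (Prin1, Prin2, ...) to its OUT= data set, and PROC FASTCLUS then clusters on those scores. The choice N=3 and the names pc_scores and clus_pc are assumptions for illustration, not a recommendation for your data.

```sas
/* By default PROC PRINCOMP analyzes the correlation matrix,        */
/* so the variables are standardized as part of the transformation. */
/* N=3 is an arbitrary illustrative choice of how many to keep.     */
proc princomp data=n_slope_means out=pc_scores n=3;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;

/* Cluster in the space of the retained principal components */
proc fastclus data=pc_scores maxclusters=21 maxiter=100 out=clus_pc;
   var Prin1-Prin3;
run;
```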

 

Intuitively, you standardize to eliminate the measurement scale. If I measure lengths in centimeters, I want the same clusters as if I measure in meters. If I measure height above sea level, I want the same clusters as if I measure height above the local ground. Standardization makes sure that the results do not depend on the units I used to measure the variables.
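
To see the unit-invariance concretely, here is a small sketch on made-up toy data: the same five lengths are stored in centimeters and in meters, and both columns are standardized to mean 0 and standard deviation 1. The data set and variable names (lengths, len_cm, len_m) are hypothetical.

```sas
/* Toy data: identical lengths recorded in two different units */
data lengths;
   input len_cm @@;
   len_m = len_cm / 100;
   datalines;
150 162 171 180 195
;
run;

/* METHOD=STD subtracts the mean and divides by the standard deviation */
proc stdize data=lengths out=lengths_std method=std;
   var len_cm len_m;
run;

proc print data=lengths_std;
run;
```

After standardization the two columns should agree up to floating-point rounding, so any distance-based clustering sees the same data regardless of the original unit.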

 

You ask if the result is bogus if you don't standardize. Not necessarily. If the variables are all measured in the same (or similar) units, then your analysis has meaning. For example, you are probably okay if your variables are all times in seconds or distances in meters. However, if you are clustering variables that represent heights, weights, ages, and income, I would recommend that you standardize.
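
One quick way to judge whether the variables are on similar scales is to compare their spreads before clustering. This is only a sketch using the variable names from the question; how large a difference counts as "too different" is a judgment call.

```sas
/* Compare location and spread across the clustering variables */
proc means data=n_slope_means n mean std min max;
   var _50501 _50502 _50503 _50504 _50505 _50508;
run;
```

If the standard deviations differ by orders of magnitude, the variables with the largest spread will dominate the Euclidean distances, and standardizing is the safer choice.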

 

From your code (data=n_slope_means), maybe you are clustering slopes? If so, I think that a transformation to angles might be preferable.
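
If the variables really are slopes (rise over run), which is only a guess from the data set name, a conversion to angles could look like the sketch below. The new variable names ang1-ang6 and the data set names slope_angles and clus_angles are hypothetical.

```sas
/* Convert each slope to an angle in degrees: angle = atan(slope) * 180/pi */
data slope_angles;
   set n_slope_means;
   array slope {6} _50501 _50502 _50503 _50504 _50505 _50508;
   array angle {6} ang1-ang6;   /* hypothetical names for the angles */
   do i = 1 to 6;
      angle{i} = atan(slope{i}) * 180 / constant('pi');
   end;
   drop i;
run;

/* Cluster on the angles instead of the raw slopes */
proc fastclus data=slope_angles maxclusters=21 maxiter=100 out=clus_angles;
   var ang1-ang6;
run;
```

The angles are bounded between -90 and 90 degrees, so very steep slopes no longer produce extreme values that dominate the distances.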
