Solved: Re: SAS miner internal standardization property

hassan_masood90 · Posted 04-17-2017 04:49 PM

Hello everyone,

I have a clustering case I am working on. Before applying the Cluster node, I have a transformation node that takes the log of all variables to be used in the Cluster node.

However, the Cluster node itself has an internal standardization property, which can be set to None, Range, or Standardization. My question is: if I already have somewhat normally distributed data from the log transformation, should this be set to None? If not, how do I figure out whether Range or Standardization is the way to go?

I am using only interval variables for this analysis. Thanks

YingjianWang · Posted 04-19-2017 01:31 PM

Hi,

- Even when the input to clustering is normally distributed, we can still apply standardization to it. For example, if the input follows a normal distribution with mean \mu and standard deviation \sigma and we choose 'std' for the standardization, then the input is converted to a distribution that is still normal, with mean 0 and standard deviation 1.

- Setting the standardization to 'std' or 'range' gives different outputs. 'std' subtracts the mean and divides by the standard deviation of the data; 'range' subtracts the minimum and divides by the range (max - min), so 'range' maps all input values into the interval [0, 1], which makes them non-negative.

- Both standardization methods, 'std' and 'range', are linear transforms. They don't change the cluster structure in the data when Euclidean distance is in use.
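
The two options described in the bullets above can be sketched in plain Python. This is only an illustration, not SAS Enterprise Miner code: the function names `standardize` and `range_scale` and the sample values are made up, and the use of the population standard deviation (`pstdev`) is an assumption, since the thread does not say which divisor the Cluster node uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """'std' option: subtract the mean, divide by the standard deviation.

    Assumption: population standard deviation (divisor n).
    """
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def range_scale(values):
    """'range' option: subtract the minimum, divide by (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample of one interval variable (mean 5, population std 2).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(x)   # centred on 0 with standard deviation 1
r = range_scale(x)   # squeezed into [0, 1]

print("std  :", z)
print("range:", r)
print("mean/std after 'std'  :", mean(z), pstdev(z))
print("min/max after 'range' :", min(r), max(r))
```

After 'std' the values can be negative and are centred on 0; after 'range' the smallest value becomes 0 and the largest becomes 1.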

hassan_masood90 · Posted 04-19-2017 02:40 PM

Hi,
Thank you for that response. The distinction between normalization and standardization is clearer with your answer.

In your answer you said that 'std' and 'range' result in different outputs. You also said that since both are linear transformations, they don't change the structure of the clusters when Euclidean distance is in use.
So what is the different output if the clusters don't change?
And how would one generally decide which internal standardization method is the best one for a particular dataset?
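
One way to look into the first question is to compare Euclidean distances after each kind of scaling on a small example. The sketch below is plain Python, not SAS Enterprise Miner code; the helper `scale_columns`, the toy data, and the use of the population standard deviation are all assumptions made for illustration. Each method divides every variable by a different quantity (standard deviation versus range), so the relative weight of the variables in the distance can differ between the two.

```python
from math import dist
from statistics import mean, pstdev


def scale_columns(rows, method):
    """Scale each variable (column) separately with 'std' or 'range'."""
    scaled_cols = []
    for col in zip(*rows):
        if method == "std":
            shift, width = mean(col), pstdev(col)  # assumption: divisor n
        elif method == "range":
            shift, width = min(col), max(col) - min(col)
        else:
            raise ValueError(f"unknown method: {method}")
        scaled_cols.append([(v - shift) / width for v in col])
    # Turn the scaled columns back into one tuple per observation.
    return list(zip(*scaled_cols))


# Made-up data: variable 1 is split evenly between 0 and 4,
# variable 2 is mostly 0 with two large values (strongly skewed).
data = [
    (0.0, 0.0),   # observation A
    (4.0, 0.0),   # observation B: differs from A only in variable 1
    (0.0, 9.0),   # observation C: differs from A only in variable 2
    (4.0, 10.0),
    (0.0, 0.0),
    (4.0, 0.0),
    (0.0, 0.0),
    (4.0, 0.0),
]

for method in ("std", "range"):
    a, b, c = scale_columns(data, method)[:3]
    print(method, "d(A,B) =", round(dist(a, b), 3),
          "d(A,C) =", round(dist(a, c), 3))
```

In this made-up data the second variable's range is large relative to its standard deviation. Under 'std', observation A is closer to B than to C; under 'range', A is closer to C than to B. Whether this matters for a real dataset would need to be checked on that data.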