04-17-2017 04:49 PM
I have a clustering case I am working on. Before the cluster node I have a transformation node that takes the log of all variables to be used in the cluster node.
However, the cluster node itself has an internal standardization property, which can be set to none, range, or standardization. My question is: if I already have roughly normally distributed data from the log transformation, should this be set to None? If not, how do I decide whether range or standardization is the way to go?
I am using only interval variables for this analysis. Thanks
04-19-2017 01:31 PM
- Even when the input to clustering is normally distributed, we can still standardize it. For example, if an input follows a normal distribution with mean \mu and standard deviation \sigma and we choose 'std' for the standardization, the input is converted to a distribution that is still normal, with mean 0 and standard deviation 1.
- Setting the standardization to 'std' or 'range' gives different outputs. 'std' subtracts the mean and divides by the standard deviation of the data; 'range' subtracts the minimum and divides by the range (max - min), so 'range' maps every input value into the interval [0, 1].
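- Both 'std' and 'range' are linear transforms applied to each variable separately, so they do not change the shape of any variable's distribution. They do change the scale of each variable relative to the others, and Euclidean distance gives more weight to variables with a larger spread, so the resulting clusters can differ between 'none', 'std' and 'range'. The clusters stay the same only if every variable is rescaled by the same factor.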
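As a rough illustration of the two options described above, here is a minimal NumPy sketch, not the cluster node's actual implementation. The function names `standardize_std` and `standardize_range`, the simulated data, and the use of the sample standard deviation (n - 1 in the denominator) are all assumptions made for the example.

```python
import numpy as np

def standardize_std(x):
    """Subtract each column's mean and divide by its standard deviation.
    Assumption: sample standard deviation (ddof=1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

def standardize_range(x):
    """Subtract each column's minimum and divide by its range (max - min)."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=0)
    return (x - lo) / (x.max(axis=0) - lo)

# Hypothetical data: two skewed interval variables on very different scales.
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.lognormal(mean=3.0, sigma=1.0, size=200),
    rng.lognormal(mean=0.0, sigma=0.2, size=200),
])

logged = np.log(raw)            # the log transformation step
z = standardize_std(logged)     # analogue of the 'std' option
r = standardize_range(logged)   # analogue of the 'range' option

# Spread of each column: after the log alone the columns still differ in scale,
# so the first column would dominate a Euclidean distance.
print("std after log only:", logged.std(axis=0, ddof=1))
print("std after 'std':   ", z.std(axis=0, ddof=1))
print("std after 'range': ", r.std(axis=0, ddof=1))
```

Comparing the printed spreads shows why a log transformation alone does not make standardization redundant: the log changes the shape of each variable, while 'std' and 'range' put the variables on a comparable scale.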
04-19-2017 02:40 PM