Hello everyone,
I have a clustering case I am working on. Before the cluster node I have a transformation node that takes the log of all the variables used in the cluster node.
However, the cluster node itself has an internal standardization property, which can be set to None, Range, or Standardization. My question: if I already have roughly normally distributed data from the log transformation, should this be set to None? If not, how do I figure out whether Range or Standardization is the way to go?
I am using only interval variables for this analysis. Thanks
Hi,
- Even when the input to clustering is normally distributed, we can still standardize it. For example, if an input follows a normal distribution with mean \mu and standard deviation \sigma and we choose 'std', the input is converted to a variable that is still normally distributed, now with mean 0 and standard deviation 1.
- Setting the standardization to 'std' or 'range' gives different outputs. 'std' subtracts the mean and divides by the standard deviation of the data; 'range' subtracts the minimum and divides by the range (max - min), so 'range' maps every input value into the interval [0, 1], and all values become non-negative.
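- Both 'std' and 'range' are linear transforms applied to each variable separately, so they don't change the shape of any single variable's distribution. They do change each variable's scale, and with it how much that variable contributes to a Euclidean distance. If the log-transformed variables still have noticeably different spreads, the clusters found with None can differ from those found with 'std' or 'range', because without standardization the variables with the largest spread dominate the distance.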
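To make the two options concrete, here is a minimal sketch in Python with NumPy (not SAS code, and not the cluster node's actual implementation). The function names `standardize_std` and `standardize_range` and the simulated data are made up for illustration, and using the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
import numpy as np

def standardize_std(x):
    """'std' style: subtract each column's mean, divide by its standard deviation.

    Assumption: sample standard deviation (ddof=1).
    """
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

def standardize_range(x):
    """'range' style: subtract each column's minimum, divide by (max - min)."""
    x = np.asarray(x, dtype=float)
    mins = x.min(axis=0)
    return (x - mins) / (x.max(axis=0) - mins)

# Hypothetical data: two skewed interval variables on very different scales.
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.lognormal(mean=10.0, sigma=1.0, size=500),
    rng.lognormal(mean=0.0, sigma=0.2, size=500),
])

# Log transform first, as in the question, then standardize each way.
logged = np.log(raw)
z = standardize_std(logged)
r = standardize_range(logged)

for name, data in [("log only", logged), ("std", z), ("range", r)]:
    print(name)
    print("  mean:", data.mean(axis=0))
    print("  std: ", data.std(axis=0, ddof=1))
    print("  min: ", data.min(axis=0))
    print("  max: ", data.max(axis=0))
```

Comparing the printed summaries for the log-only columns against the 'std' and 'range' versions is a quick way to see what each setting would do to your own variables before you pick one.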