Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

SAS miner internal standardization property

New Contributor
Posts: 4


Hello everyone,

I have a clustering case I am working on. Before applying the cluster node, I have a transformation node that takes the log of all variables to be used in the cluster node.


However, the cluster node itself has an internal standardization property, which can be set to None, Range, or Standardization. My question is: if I already have somewhat normally distributed data from the log transformation, should this be set to None? If not, how do I figure out whether Range or Standardization is the way to go?

 

I am using only interval variables for this analysis. Thanks

SAS Employee
Posts: 4

Re: SAS miner internal standardization property

Posted in reply to hassan_masood90

Hi, 

 

- Even when the input to clustering is normally distributed, we can still apply standardization to it. For example, if an input follows a normal distribution with mean \mu and standard deviation \sigma and we choose 'std' for the standardization, the input is converted to a (still) normal distribution with mean 0 and standard deviation 1.

 

- Setting the standardization to 'std' or 'range' produces different outputs. 'std' subtracts the mean and divides by the standard deviation of the data; 'range' subtracts the minimum and divides by the range (max - min), so 'range' maps all input values into the interval [0, 1], which makes them non-negative.

 

- Both standardization methods, 'std' and 'range', are linear transforms. They don't change the cluster structure in the data when a Euclidean distance is in use.
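
To make the two formulas above concrete, here is a minimal Python sketch (not SAS code, and not how the cluster node is implemented internally). The function names `standardize` and `range_scale` and the sample values in `log_income` are hypothetical, made up for illustration. The use of the sample standard deviation (n - 1 divisor) is also an assumption.

```python
import statistics


def standardize(values):
    """'std' option: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 divisor); an assumption
    return [(v - mean) / sd for v in values]


def range_scale(values):
    """'range' option: subtract the minimum, then divide by (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical log-transformed interval variable (illustrative values only).
log_income = [9.2, 10.1, 10.4, 10.9, 11.6]

z = standardize(log_income)   # centered at 0 with unit standard deviation
r = range_scale(log_income)   # squeezed into [0, 1]

print("std  :", [round(v, 3) for v in z])
print("range:", [round(v, 3) for v in r])
```

Both outputs keep the observations in the same order and differ only in location and scale. Since each variable is rescaled by its own factor (its standard deviation or its range), it may be worth running the clustering under both settings and comparing the resulting assignments on your own data.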

New Contributor
Posts: 4

Re: SAS miner internal standardization property

Posted in reply to YingjianWang
Hi,
Thank you for that response. The distinction between normalization and standardization is clearer with your answer.

In your answer you said that 'std' and 'range' result in different outputs. You also said that since both are linear transformations, they don't change the structure of the clusters when Euclidean distance is in use.
So what is the different output if the clusters don't change?
And how would one generally decide which internal standardization method is the best one for a particular dataset?