BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
hassan_masood90
Calcite | Level 5

Hello everyone,

I have a clustering case i am working on. Before applying the cluster node i have a transformation node which takes the log of all variables to be used in the cluster node. 


However, the cluster node itself has a internal standardization property which can be set to none,range or standardization. My question is if i already have somewhat normally distributed data from the log transformation then should this be set to None? if not, then how do i figure if range or standardization is the way to go. 

 

I am using only interval variables for this analysis. Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
YingjianWang
SAS Employee

Hi, 

 

- Even in cases that we have a normal distributed data as the input to clustering, we can still set some standardization on it. For example, in the case that the input follows a normal distribution with mean \mu and standard deviation \sigma, and for the standardization we choose 'std', then the input is converted to (still) a normal distribution with mean 0 and standard deviation 1.

 

- To set the standarization as 'std' or 'range' results in different outputs. 'std' is to remove the mean and divide by the standard deviation of the data; 'range' is to remove the minimum and devide by the range (max - min), so 'range' will convert all the input values to non-negative.

 

- Both the 2 ways of standardization, 'std' and 'range', are linear transforms. They don't change the clusters structure in the data when an Euclidean distance is in use.

View solution in original post

2 REPLIES 2
YingjianWang
SAS Employee

Hi, 

 

- Even in cases that we have a normal distributed data as the input to clustering, we can still set some standardization on it. For example, in the case that the input follows a normal distribution with mean \mu and standard deviation \sigma, and for the standardization we choose 'std', then the input is converted to (still) a normal distribution with mean 0 and standard deviation 1.

 

- To set the standarization as 'std' or 'range' results in different outputs. 'std' is to remove the mean and divide by the standard deviation of the data; 'range' is to remove the minimum and devide by the range (max - min), so 'range' will convert all the input values to non-negative.

 

- Both the 2 ways of standardization, 'std' and 'range', are linear transforms. They don't change the clusters structure in the data when an Euclidean distance is in use.

hassan_masood90
Calcite | Level 5
Hi,
Thank you for that response. The distinction between normalization and standardization is more clear with your answer.

In your answer you said the 'std' or 'range' results in different outputs. You also said that since both are linear transformation it doesnt change the structure of the clusters when Euclidean distance is in use.
So what's the different output if the clusters dont change?
And how would one generally decide which internal standardization method is the best one for a particular dataset?

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 2975 views
  • 0 likes
  • 2 in conversation