topic Normalization and standardization in SAS Data Science

Normalization and standardization

geniusgenie — Sun, 14 May 2017 14:31:57 GMT

Hi,

I wonder if anyone can help me with some simple questions. I have a labelled dataset to which I am looking to apply decision tree, neural network, SVM and random forest algorithms.

I have done basic normalization and standardization on all columns except three, which contain 0 or 1 values as flags,

for example three flags called read, write and execute, which may only contain 0 or 1 as a value. In addition, my main target variable, called CAT, initially contained only two values, 1 or 2, for two categories, let's say hardware=1 and software=2.

My standardization routine also changed it to -1.598497 for hardware and 0.625538 for software.

My first question is: do I really need to convert this CAT variable to standardized values for the above-mentioned algorithms, or can I skip standardization for this column and use 1 and 2 as they are?

My second question: if I manually replace the values for these two categories with 0 for hardware and 1 for software, is that bad practice, or will it produce wrong results compared to the values 1 and 2, or -1.598497 and 0.625538?

Please help me with this: which of these values would be appropriate for ANN, DTs, RF and SVM?

Regards

Re: Normalization and standardization

PaigeMiller — Sun, 14 May 2017 16:04:31 GMT

Generally, these algorithms react to the variance of the input variables, so setting the variance of all input variables to 1 gives each variable equal importance a priori. If you leave the 0/1 binary variables as 0/1, they will have a different variance and become less important, or more important, than the other variables. So a good first analysis would not use 0/1 but would use the standardized values.
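To make the variance point concrete, here is a minimal Python sketch (not SAS, and not the poster's actual routine). The data, the column names `flag_01`, `cat_12` and `size`, and the helper `standardize` are all made up for illustration. It uses the population standard deviation; a routine that divides by n-1 would give slightly different numbers.

```python
from statistics import mean, pstdev, pvariance


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std dev."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - m) / s for v in values]


# Hypothetical data, invented for this example only.
flag_01 = [0, 1, 1, 0, 1, 1, 1, 0]         # a 0/1 flag column
cat_12 = [v + 1 for v in flag_01]          # the same information coded as 1/2
size = [120.0, 340.0, 95.0, 410.0, 220.0, 180.0, 305.0, 150.0]  # a continuous input

z_flag = standardize(flag_01)
z_cat = standardize(cat_12)
z_size = standardize(size)

# Before standardizing, the variances are on very different scales.
print("raw variances:", pvariance(flag_01), pvariance(size))

# After standardizing, every column has variance 1.
print("standardized variances:", pvariance(z_flag), pvariance(z_size))

# A two-valued column still has exactly two distinct values after standardizing.
print("distinct z values for the flag:", sorted(set(round(z, 6) for z in z_flag)))
```

In this sketch the 0/1 coding and the 1/2 coding produce identical standardized values, because standardization removes any shift and any positive rescaling. That is why a two-valued column only ever maps to two numbers such as -1.598497 and 0.625538.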

Re: Normalization and standardization

Reeza — Sun, 14 May 2017 19:06:31 GMT

If you search "Andrew Gelman Variable Standardization" you'll get some interesting background thoughts on standardizing variables including binary variables. The last two links are quite informative IMO.

http://andrewgelman.com/2009/07/11/when_to_standar/

http://andrewgelman.com/2012/08/18/standardizing-regression-inputs/

http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf

Re: Normalization and standardization

geniusgenie — Sun, 14 May 2017 19:13:52 GMT

Thanks a lot, PaigeMiller and Reeza. I will follow your suggestions. Hope to get good results.

Regards