Normalization and standardization

05-14-2017 10:31 AM

Hi,

I wonder if anyone can help me with some simple questions. I have a labelled dataset to which I am looking to apply decision tree, neural network, SVM and random forest algorithms.

I have done basic normalization and standardization on all columns except three columns that contain only 0 or 1 values as flags, for example three flags called read, write and execute. In addition, my main target variable, called CAT, initially contained only the values 1 and 2 for two categories, let's say hardware = 1 and software = 2.

My standardization routine also changed it to -1.598497 for hardware and 0.625538 for software.

My first question: do I really need to convert this CAT variable to standardized values for the above-mentioned algorithms, or can I skip standardization for this column and use 1 and 2 as-is?

My second question: if I manually replace the values for these two categories with 0 for hardware and 1 for software, is that bad practice, or will it produce wrong results compared with the values 1 and 2 or -1.598497 and 0.625538?

Please help me with this: which of these values would be appropriate for ANN, DTs, RF and SVM?

Regards
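To make the second question concrete, here is a minimal sketch in plain Python, not SAS. The column contents and class mix are made up, and `zscores` is a hypothetical helper rather than the poster's actual routine. It illustrates that a z-score standardization gives the same result whether a two-level column is coded 1/2 or 0/1, because subtracting the mean removes any constant shift.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical CAT column: 1 = hardware, 2 = software (made-up class mix).
cat_12 = [1, 2, 2, 1, 2, 2, 2, 1, 2, 2]
# The same labels recoded as 0 = hardware, 1 = software.
cat_01 = [v - 1 for v in cat_12]

z_12 = zscores(cat_12)
z_01 = zscores(cat_01)

# Both codings standardize to the same values, row by row.
same = all(abs(a - b) < 1e-12 for a, b in zip(z_12, z_01))
# After standardizing there are still only two distinct values, one per class.
distinct = sorted(set(round(z, 6) for z in z_12))
print(same, distinct)
```

In other words, the standardized column is still just a two-valued relabeling of the categories. Which two numbers come out depends on the class proportions, not on whether the raw codes were 1/2 or 0/1.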

Accepted Solutions

Solution

05-15-2017 09:40 AM

Posted in reply to geniusgenie

05-14-2017 03:13 PM

Thanks a lot, Paigemiller and Reeza. I will follow your suggestions. Hope to get good results.

Regards


All Replies

Posted in reply to geniusgenie

05-14-2017 12:04 PM

Generally, these algorithms react to the variance of the input variables, so setting the variance of all input variables to 1 gives each variable equal importance *a priori*. If you leave the 0/1 binary variables as 0/1, they will have a different variance and so become less important, or more important, than the other variables. So a good first analysis would use the standardized values rather than 0/1.
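The variance point can be sketched in a few lines of plain Python. The data below is invented for illustration, and `standardize` and `size_kb` are hypothetical names. A 0/1 flag with a share p of ones has variance p(1 - p), which is at most 0.25, while a raw numeric input can have a variance that is orders of magnitude larger. After standardization both have variance 1.

```python
from statistics import mean, pstdev, pvariance

def standardize(values):
    """Rescale to mean 0 and (population) variance 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical inputs: a 0/1 flag and a raw numeric column on a large scale.
flag = [0, 1, 1, 0, 1, 0, 0, 1]
size_kb = [120.0, 15.5, 980.0, 42.0, 310.0, 7.25, 64.0, 501.0]

# Variances before standardization differ by orders of magnitude.
raw_variances = (pvariance(flag), pvariance(size_kb))

# After standardization each input has variance 1 (up to rounding).
std_variances = (pvariance(standardize(flag)), pvariance(standardize(size_kb)))

print(raw_variances)
print(std_variances)
```

This mostly matters for methods that work with distances or gradients on the inputs, such as SVMs and neural networks. Splits in decision trees and random forests depend only on the ordering of values within a column, so rescaling an input generally does not change them.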

Posted in reply to geniusgenie

05-14-2017 03:06 PM

If you search for "Andrew Gelman Variable Standardization", you'll find some interesting background on standardizing variables, including binary variables. The last two links are quite informative, IMO.

http://andrewgelman.com/2009/07/11/when_to_standar/

http://andrewgelman.com/2012/08/18/standardizing-regression-inputs/

http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf
