geniusgenie
Obsidian | Level 7

Hi,

I wonder if anyone can help me with a few simple questions. I have a labelled dataset to which I want to apply decision tree, neural network, SVM and random forest algorithms.

I have applied basic normalization and standardization to all columns except three, which contain 0 or 1 values as flags, for example three flags called read, write and execute, which may only contain 0 or 1. In addition, my main target variable, called CAT, initially contained only the two values 1 and 2 for two categories, say hardware = 1 and software = 2.

 

My standardization routine also changed CAT, to -1.598497 for hardware and 0.625538 for software.
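To make the effect described above concrete, here is a minimal sketch of z-score standardization applied to a two-valued column. It assumes the routine computes (x - mean) / standard deviation, which the post does not confirm, and the `cat` list with its counts is made up for illustration, so it will not reproduce the exact values -1.598497 and 0.625538.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in values]

# Hypothetical CAT column: 1 = hardware, 2 = software (made-up counts).
cat = [1] * 3 + [2] * 7

z = standardize(cat)

# A two-valued column stays two-valued after standardization:
# every 1 maps to one negative number, every 2 to one positive number.
codes = sorted(set(round(v, 6) for v in z))
print(codes)
```

The standardized column still carries only two distinct codes, so it labels the same two categories as 1/2 or 0/1 would; only the numbers attached to them change.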

 

My first question: do I really need to convert the CAT variable to standardized values for the algorithms mentioned above, or can I skip standardization for this column and use 1 and 2 as they are?

 

My second question: if I manually replace the values for these two categories with 0 for hardware and 1 for software, is that bad practice, or will it produce wrong results compared with the values 1 and 2 or -1.598497 and 0.625538?

 

Please help me with this: which of these values would be appropriate for ANN, DTs, RF and SVM?

 

Regards

 

3 REPLIES
PaigeMiller
Diamond | Level 26

Generally, these algorithms react to the variance of the input variables, so setting the variance of all input variables to 1 gives each variable equal importance a priori. If you leave the 0/1 binary variables as 0/1, they will have a different variance and become less important, or more important, than the other variables. So a good first analysis would not use 0/1 but would use the standardized values.
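The variance point above can be checked with a short sketch. The `flag` values are hypothetical, and the population standard deviation is an assumption made here so that the rescaled variance comes out at 1; a routine using the sample standard deviation would land very slightly below 1.

```python
from statistics import mean, pstdev, pvariance

def standardize(values):
    """Return z-scores using the population standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(x - m) / s for x in values]

# Hypothetical 0/1 flag (for example "read"): 2 ones out of 10 rows.
flag = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# Variance of a raw 0/1 column is p * (1 - p), which can never exceed 0.25.
raw_var = pvariance(flag)

# After standardization the same column has variance 1,
# matching every other standardized input.
std_var = pvariance(standardize(flag))

print(raw_var, std_var)
```

Because p * (1 - p) is at most 0.25, a raw 0/1 flag always has less spread than an input rescaled to unit variance. Distance-based and gradient-based learners such as SVMs and neural networks are the ones most affected by this; tree splits depend only on the ordering of the values.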

--
Paige Miller
Reeza
Super User

If you search "Andrew Gelman Variable Standardization" you'll get some interesting background thoughts on standardizing variables including binary variables. The last two links are quite informative IMO. 

 

http://andrewgelman.com/2009/07/11/when_to_standar/

http://andrewgelman.com/2012/08/18/standardizing-regression-inputs/

http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf

geniusgenie
Obsidian | Level 7
Thanks a lot, PaigeMiller and Reeza. I will follow your suggestions. Hope to get good results.

Regards
