geniusgenie
Obsidian | Level 7

Hi,

I was wondering if someone can clear my concepts about a question?

I am using a decision tree with standardized data. In my previous post, I asked whether I should use standardized data with a decision tree, and one of the members suggested that it won't hurt my model.

The problem I am facing is that a few columns are hard to interpret in both standardized and non-standardized form. The difference standardization makes is to bring them into a range; at least I know that my standardized data falls between 0 and 1.

The only thing to worry about is: "How should I interpret the results in tree algorithms?"

Should I maintain a data dictionary with both the unique standardized and non-standardized values and compare them, or is there some other recommended way?

 

Regards

 

1 ACCEPTED SOLUTION
DougWielenga
SAS Employee

I am using a decision tree with standardized data. In my previous post, I asked whether I should use standardized data with a decision tree, and one of the members suggested that it won't hurt my model.

The problem I am facing is that a few columns are hard to interpret in both standardized and non-standardized form. The difference standardization makes is to bring them into a range; at least I know that my standardized data falls between 0 and 1.

The only thing to worry about is: "How should I interpret the results in tree algorithms?"

Should I maintain a data dictionary with both the unique standardized and non-standardized values and compare them, or is there some other recommended way?

 

My question to you is: why are you using standardized data in the first place? Trees are usually of interest when interpretability is desired, and standardizing your data can make that almost impossible. Interpreting splits based on a variable containing 'price in dollars' is easily understood, while replacing that variable with 'log(price in dollars)' is almost completely useless. Note that the tree is creating splits, so in theory any ordinally equivalent set of values (values that are naturally ordered in the same way) will generate the same results. In practice, transforming your input variables can change which split points are considered, since not every possible value is necessarily considered for splitting. If the tree differs in how it splits the observations at any level, all subsequent splits will likewise be affected.
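The ordinal-equivalence point can be seen in a small, self-contained sketch (plain Python rather than SAS, with made-up price data): a tree split is just a threshold, so an order-preserving transform such as min-max scaling moves the threshold but partitions the observations identically, and the scaled threshold back-transforms exactly to the raw one.

```python
# Pure-Python illustration (not SAS): a single best split chosen by Gini
# impurity on raw data vs. min-max-standardized data. The 'price' values
# and binary target below are hypothetical.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return the threshold (midpoint between adjacent distinct values)
    that minimizes the weighted Gini impurity of the two child nodes."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical 'price in dollars' values and a binary target.
price = [120, 340, 560, 610, 900, 1500, 2100, 2400]
target = [0, 0, 0, 0, 1, 1, 1, 1]

# Min-max standardization into [0, 1].
lo, hi = min(price), max(price)
scaled = [(v - lo) / (hi - lo) for v in price]

t_raw = best_split(price, target)
t_scaled = best_split(scaled, target)

# Both splits separate exactly the same observations...
assert [v <= t_raw for v in price] == [s <= t_scaled for s in scaled]
# ...and the scaled threshold back-transforms to the raw one.
assert abs(t_scaled * (hi - lo) + lo - t_raw) < 1e-9
print(t_raw)
```

Min-max scaling is affine, so the equivalence is exact; for a non-linear monotone transform such as log, the partition of the observations is still the same in theory, but the threshold must be inverted through the transform to be read in original units.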

 

Consider that modeling really represents multiple activities, sometimes done jointly:

    * variables are selected for use in possible models

    * candidate models are constructed

    * candidate models are evaluated to choose final model

    * interpretation is attempted

 

If your goal is interpretation, standardization is not likely to help you. If the transformations do generate a better overall model, you can also consider keeping both the transformed and non-transformed values. In this case, you might build the initial model considering both transformed and non-transformed variables as potential input variables, but then perform an analysis of the predicted outcome (based on some threshold you specify) using only the non-transformed variables in a Segment Profile node. In this way you can get the potential benefits of transforming the inputs as well as a more accessible interpretation. In most cases, however, I would not expect all of the extra work to result in meaningful differences, since tree models are not directly impacted by scale or location changes.

 

I hope this helps!


Cordially,

Doug

 

 


5 REPLIES
Reeza
Super User

Standardizing complicates interpretation. Can you transform back to make the rules interpretable?

geniusgenie
Obsidian | Level 7
Hi Reeza, thanks for your reply. How should I transform back? I am a little confused about this. I was thinking of referring back to a data dictionary that shows the non-standardized values of the columns and comparing them with the standardized ones, but that's still a rough idea; I'm not sure about it.
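One way to make the data-dictionary idea concrete (a Python sketch, not SAS; the column names and ranges are hypothetical): instead of tabulating every unique standardized/non-standardized pair, store just the per-column min and max used for standardization, since those two numbers let you map any value in either direction.

```python
# A minimal "data dictionary" for min-max standardization: keep the
# per-column scaling parameters so any standardized value can be mapped
# back to original units. Column names and ranges are illustrative.

scaling_params = {
    "price":   {"min": 120.0, "max": 2400.0},
    "mileage": {"min": 0.0,   "max": 180000.0},
}

def to_scaled(column, original_value):
    """Apply min-max scaling: s = (x - min) / (max - min)."""
    p = scaling_params[column]
    return (original_value - p["min"]) / (p["max"] - p["min"])

def to_original(column, scaled_value):
    """Invert min-max scaling: x = s * (max - min) + min."""
    p = scaling_params[column]
    return scaled_value * (p["max"] - p["min"]) + p["min"]

# Round-trip check for one column: scaling then un-scaling recovers the value.
v = 900.0
assert abs(to_original("price", to_scaled("price", v)) - v) < 1e-9
print(to_scaled("price", v))
```

Two parameters per column are enough because min-max scaling is invertible; a full lookup table of unique values would only be needed for a non-invertible recoding.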
Reeza
Super User

I think back-transforming makes the most sense. The point of a decision tree is to end up with rules humans can read, giving them a set of 'rules' to follow. The more difficult you make that, the less likely it is to happen.
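A minimal sketch of that back-transformation (Python, with illustrative scaling constants, not SAS output): since min-max scaling is s = (x - min) / (max - min), a rule threshold reported on the standardized scale inverts directly to original units, so the final rules can be rewritten in readable terms.

```python
# Back-transform a hypothetical tree rule expressed in min-max-standardized
# units into original units. The price min/max are illustrative values
# that would normally be stored from the training data.

price_min, price_max = 120.0, 2400.0

def unscale(threshold_scaled, lo, hi):
    """Invert min-max scaling: s = (x - lo) / (hi - lo)  =>  x = s * (hi - lo) + lo."""
    return threshold_scaled * (hi - lo) + lo

# A rule the tree might report on standardized data:
#   IF scaled_price <= 0.2785 THEN node A
rule_threshold = 0.2785
readable = unscale(rule_threshold, price_min, price_max)
print(f"IF price <= {readable:.2f} THEN node A")
```

The same idea applies per column: as long as the scaling parameters are kept, every threshold in the final rule set can be rewritten in the original units before the rules are handed to anyone.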

 

 

geniusgenie
Obsidian | Level 7
I will give it a go, and let's hope it gives me something good.

Thanks a lot

