Hi,
I was wondering if someone could clear up a concept for me.
I am using a decision tree with standardized data. In my previous post I asked whether I should use standardized data with a decision tree at all, and one of the members suggested that it won't hurt my model.
The problem I am facing is that a few columns are hard to interpret in both their standardized and non-standardized forms. The difference standardization makes is that it brings them into a common range, so at least I know that my standardized data falls between 0 and 1.
The only thing I'm unsure about is: how should I interpret the results of tree algorithms?
Should I maintain a data dictionary with both the unique standardized and non-standardized values and compare them, or is there some other recommended way?
Regards
My question to you is: why are you using standardized data in the first place? Trees are usually of interest when interpretability is desired, and standardizing your data can make that almost impossible. Interpreting splits based on a variable containing 'price in dollars' is easily understood, while interpreting splits on 'log(price in dollars)' is almost completely useless. Note that the tree is creating splits, so in theory any ordinally equivalent set of values (values that are naturally ordered in the same way) will generate the same results. In practice, transforming your input variables can change which split points are considered, since not every possible value is necessarily evaluated for splitting. If the tree differs in how it splits the observations at any level, all subsequent splits will likewise be affected.
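The "ordinally equivalent values" point can be illustrated with a small sketch (in Python/scikit-learn, which may well differ from the original poster's tooling; the data here is made up). Standardization is an increasing linear transform, so the tree finds the same split, and the standardized threshold maps straight back to the original units:

```python
# Sketch: a depth-1 tree fit on raw vs standardized versions of one variable
# finds the same split; the z-scored threshold back-transforms to dollars.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))   # e.g. "price in dollars"
y = (X[:, 0] > 55).astype(int)                    # simple threshold target

scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

tree_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
tree_std = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_std, y)

t_raw = tree_raw.tree_.threshold[0]               # root split, in dollars
t_std = tree_std.tree_.threshold[0]               # root split, in z-units

# Back-transform the standardized threshold into the original units:
t_back = t_std * scaler.scale_[0] + scaler.mean_[0]
print(t_raw, t_back)  # the two thresholds agree up to floating point
```

Whether this equivalence holds exactly in other tools depends on how they enumerate candidate split points, which is Doug's caveat above.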
Consider that modeling really represents multiple activities, sometimes done jointly:
* variables are selected for use in possible models
* candidate models are constructed
* candidate models are evaluated to choose final model
* interpretation is attempted
If your goal is interpretation, the standardization is not likely to help you. If the transformations do generate a better overall model, you can also consider keeping both the transformed and non-transformed values. In this case, you might build the initial model considering both transformed and non-transformed variables as potential input variables, but then perform an analysis of the predicted outcome (based on some threshold you specify) using only the non-transformed variables in a Segment Profile node. In this way you can get the potential benefits of transforming the inputs as well as a more accessible interpretation. In most cases, however, I would not expect all of the extra work to result in meaningful differences since the Tree models are not impacted directly by scale or location changes.
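A hedged sketch of that workflow, translated to Python/pandas rather than the SAS Segment Profile node (column names and data here are purely illustrative): offer both the raw and transformed versions of a variable to the tree, then profile the predicted segments using only the raw, interpretable variable.

```python
# Sketch: let the tree choose between raw and log-transformed candidate
# inputs, then describe the resulting segments in the original units only.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
df = pd.DataFrame({"price": rng.lognormal(mean=4, sigma=0.5, size=300)})
df["log_price"] = np.log(df["price"])             # transformed candidate
y = (df["price"] > df["price"].median()).astype(int)

# Model is free to split on either version of the variable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    df[["price", "log_price"]], y
)

# Profile the predicted segments using only the interpretable column.
df["segment"] = tree.predict(df[["price", "log_price"]])
profile = df.groupby("segment")["price"].agg(["count", "mean", "min", "max"])
print(profile)
```

The profile table plays the role of the Segment Profile node here: the model may use the transformed input, but the description of each segment stays in dollars.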
I hope this helps!
Cordially,
Doug
Standardizing raises an interesting interpretation question: can you transform back to make the rules interpretable?
I think back-transforming makes the most sense. The point of decision trees is to produce rules humans can read, giving them a set of 'rules' to follow in the end. The more difficult you make that, the less likely it is to happen.
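As a concrete sketch of the back-transformation (with made-up numbers, not anything from the original poster's data): a split reported in z-score units can be restated in the original units using the mean and standard deviation saved at standardization time.

```python
# Sketch: restate a z-scored split threshold in the original units.
mean_price, sd_price = 52.3, 9.8   # saved when standardizing (example values)
z_threshold = 0.48                 # split point reported by the tree

original_threshold = z_threshold * sd_price + mean_price
print(f"price <= {original_threshold:.2f}")   # → "price <= 57.00"
```

This is essentially the "data dictionary" the original poster asked about, reduced to storing each variable's mean and standard deviation.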