I am using a decision tree with standardized data. My previous question was whether I should use standardized data with a decision tree or not, and one of the members suggested that it won't hurt my model.
The problem I am facing is that a few columns are hard to interpret in both standardized and non-standardized form. The only difference standardization makes is to bring them into a common range. Now at least I know that my standardized data falls between 0 and 1.
The only thing left to worry about is: how should I interpret the results of tree algorithms?
Should I maintain a data dictionary with both the unique standardized and non-standardized values and compare them, or is there some other recommended way?
My question to you is: why are you using standardized data in the first place? Trees are usually of interest when interpretability is desired, and standardizing your data can make that almost impossible. A split on a variable containing 'price in dollars' is easily understood, while a split on 'log(price in dollars)' is almost completely useless for interpretation. Note that the Tree is creating splits, so in theory any ordinally equivalent set of values (values that are naturally ordered in the same way) will generate the same results. In practice, transforming your input variables can change which split points are considered, since not every possible value is necessarily considered for splitting. If the tree differs in how it splits the observations at any level, all subsequent splits will likewise be affected.
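To make the "ordinally equivalent" point concrete, here is a minimal sketch in plain Python. It is not the algorithm of any particular tree implementation: `best_split` is a hypothetical, simplified search that tries every midpoint between adjacent values (the "in theory" case), the `price` and `target` values are made-up toy data, and z-score standardization is assumed.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Try the midpoint between each pair of adjacent distinct x values and
    return (threshold, left_rows) for the split with the lowest total SSE."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left, right = order[:k], order[k:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]


# Hypothetical toy data: price in dollars and a 0/1 outcome.
price = [120.0, 250.0, 310.0, 480.0, 900.0, 1500.0]
target = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

# Assumed z-score standardization of the same column.
mu = sum(price) / len(price)
sd = (sum((p - mu) ** 2 for p in price) / len(price)) ** 0.5
z = [(p - mu) / sd for p in price]

thr_raw, left_raw = best_split(price, target)
thr_z, left_z = best_split(z, target)

print(left_raw == left_z)   # same rows go to the left branch either way
print(thr_raw)              # threshold in dollars
print(thr_z * sd + mu)      # standardized threshold mapped back to dollars
```

Both versions send the same rows to each branch; only the units of the threshold differ. A real implementation that bins or samples candidate split points may not behave this cleanly.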
Consider that modeling really represents multiple activities, sometimes done jointly:
* variables are selected for use in possible models
* candidate models are constructed
* candidate models are evaluated to choose a final model
* interpretation is attempted
If your goal is interpretation, standardization is not likely to help you. If the transformations do generate a better overall model, you can also consider keeping both the transformed and non-transformed values. In this case, you might build the initial model considering both transformed and non-transformed variables as potential input variables, but then perform an analysis of the predicted outcome (based on some threshold you specify) using only the non-transformed variables in a Segment Profile node. In this way you can get the potential benefits of transforming the inputs as well as a more accessible interpretation. In most cases, however, I would not expect all of the extra work to result in meaningful differences, since Tree models are not directly affected by changes of scale or location.
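If you do end up with a tree built on scaled inputs, one lighter alternative to a dictionary of every unique value is to keep only the scaling parameters per column and convert each split threshold back to original units. The sketch below assumes min-max scaling to the 0 to 1 range, as described in the question; the column names, the (min, max) pairs, and the helper names are all hypothetical.

```python
# Hypothetical (min, max) of each column before it was scaled to [0, 1].
scaling = {
    "price": (120.0, 1500.0),
    "age": (18.0, 90.0),
}


def to_original(column, scaled_threshold):
    """Invert min-max scaling: x = min + scaled * (max - min)."""
    lo, hi = scaling[column]
    return lo + scaled_threshold * (hi - lo)


def describe(column, scaled_threshold):
    """Render a split rule in the column's original units."""
    return f"{column} <= {to_original(column, scaled_threshold):.2f}"


print(describe("price", 0.25))  # price <= 465.00
print(describe("age", 0.5))     # age <= 54.00
```

For z-score standardization the same idea applies with the stored mean and standard deviation: x = mean + z * std.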
I hope this helps!
Cordially,
Doug