Hi,
I was wondering if someone could clear up a concept for me.
I am using a decision tree with standardized data. In my previous post I asked whether I should use standardized data with a decision tree at all, and one of the members suggested that it won't hurt my model.
The problem I am facing is that a few columns are hard to interpret in both their standardized and non-standardized forms. The difference standardization makes is that it brings them into a common range, so at least I know that my standardized data falls between 0 and 1.
The only thing I'm unsure about is: how should I interpret the results of tree algorithms?
Should I maintain a data dictionary with both the unique standardized and non-standardized values and compare them, or is there some other recommended way?
Regards
My question to you is: why are you using standardized data in the first place? Trees are usually of interest when interpretability is desired, and standardizing your data can make that almost impossible. Interpreting splits based on a variable containing 'price in dollars' is easily understood, while interpreting splits on 'log(price in dollars)' is almost completely useless. Note that the tree is creating splits, so in theory any ordinally equivalent set of values (values that are naturally ordered in the same way) will generate the same results. In practice, transforming your input variables can change which split points are considered, since not every possible value is necessarily evaluated for splitting. If the tree differs in how it splits the observations at any level, all subsequent splits will likewise be affected.
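The "ordinally equivalent values" point can be illustrated with a small sketch (in Python/scikit-learn, which may well differ from the original poster's tooling; the data here is made up). Standardization is an increasing linear transform, so the tree finds the same split, and the standardized threshold maps straight back to the original units:

```python
# Sketch: a depth-1 tree fit on raw vs standardized versions of one variable
# finds the same split; the z-scored threshold back-transforms to dollars.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))   # e.g. "price in dollars"
y = (X[:, 0] > 55).astype(int)                    # simple threshold target

scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

tree_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
tree_std = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_std, y)

t_raw = tree_raw.tree_.threshold[0]               # root split, in dollars
t_std = tree_std.tree_.threshold[0]               # root split, in z-units

# Back-transform the standardized threshold into the original units:
t_back = t_std * scaler.scale_[0] + scaler.mean_[0]
print(t_raw, t_back)  # the two thresholds agree up to floating point
```

Whether this equivalence holds exactly in other tools depends on how they enumerate candidate split points, which is Doug's caveat above.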
Consider that modeling really represents multiple activities, sometimes done jointly:
* variables are selected for use in possible models
* candidate models are constructed
* candidate models are evaluated to choose final model
* interpretation is attempted
If your goal is interpretation, the standardization is not likely to help you. If the transformations do generate a better overall model, you can also consider keeping both the transformed and non-transformed values. In this case, you might build the initial model considering both transformed and non-transformed variables as potential input variables, but then perform an analysis of the predicted outcome (based on some threshold you specify) using only the non-transformed variables in a Segment Profile node. In this way you can get the potential benefits of transforming the inputs as well as a more accessible interpretation. In most cases, however, I would not expect all of the extra work to result in meaningful differences since the Tree models are not impacted directly by scale or location changes.
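A hedged sketch of that workflow, translated to Python/pandas rather than the SAS Segment Profile node (column names and data here are purely illustrative): offer both the raw and transformed versions of a variable to the tree, then profile the predicted segments using only the raw, interpretable variable.

```python
# Sketch: let the tree choose between raw and log-transformed candidate
# inputs, then describe the resulting segments in the original units only.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
df = pd.DataFrame({"price": rng.lognormal(mean=4, sigma=0.5, size=300)})
df["log_price"] = np.log(df["price"])             # transformed candidate
y = (df["price"] > df["price"].median()).astype(int)

# Model is free to split on either version of the variable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    df[["price", "log_price"]], y
)

# Profile the predicted segments using only the interpretable column.
df["segment"] = tree.predict(df[["price", "log_price"]])
profile = df.groupby("segment")["price"].agg(["count", "mean", "min", "max"])
print(profile)
```

The profile table plays the role of the Segment Profile node here: the model may use the transformed input, but the description of each segment stays in dollars.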
I hope this helps!
Cordially,
Doug
Standardizing raises an interesting interpretation question: can you transform back to make the rules interpretable?
I think back-transforming makes the most sense. The point of decision trees is to produce rules humans can read, giving them a set of 'rules' to follow in the end. The more difficult you make that, the less likely it is to happen.
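As a concrete sketch of the back-transformation (with made-up numbers, not anything from the original poster's data): a split reported in z-score units can be restated in the original units using the mean and standard deviation saved at standardization time.

```python
# Sketch: restate a z-scored split threshold in the original units.
mean_price, sd_price = 52.3, 9.8   # saved when standardizing (example values)
z_threshold = 0.48                 # split point reported by the tree

original_threshold = z_threshold * sd_price + mean_price
print(f"price <= {original_threshold:.2f}")   # → "price <= 57.00"
```

This is essentially the "data dictionary" the original poster asked about, reduced to storing each variable's mean and standard deviation.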