Solved: Gradient Boosting Output Understanding in EM

subham · Posted 10-18-2016 11:05 PM

Hi

My objective is to use gradient boosting as an alternative to a credit scorecard (commonly built with logistic regression). The target variable is therefore good (0) or bad (1), and the independent variables are mostly continuous. As I understand the theory, gradient boosting builds a tree in each iteration, and each tree may use a different set of variables. EM also produces score code that I can use to score the validation data. It would be great if you could help me with the questions below.

1. How can I get the list of variables that appear in each iteration?

2. I have to present the final output to a lay audience, so I need the final model with all its splits. What is the best way to get this?

3. Is multicollinearity a problem in gradient boosting?

4. How should I handle ratio variables with special values (say, the variable takes the value -99 when the denominator is zero)? I want the special value to be a separate split in the tree.

Thanks in advance for your help.

PadraicGNeville · Posted 10-24-2016 03:26 PM

Hi, Subham.

Unfortunately, the Boosting node in EM is not designed to output information about the individual trees beyond the number of leaves. The EM node invokes PROC TREEBOOST to build the boosted model. An expert user might know how to run PROC TREEBOOST directly in an EM Code node. In that case, one could include the RULES= dataset, STATSBYNODE= dataset, and TOPOLOGY= dataset options in the SAVE statement. Together they would describe all the splits and node statistics. However, these options are not documented or really supported by tech support for the boosting node.

In my opinion, multicollinearity is not a problem. The boosting algorithm does not invert a matrix, so there is no concern about matrix rank.
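To illustrate the point about matrix rank, here is a small sketch in plain Python (not SAS, and not the PROC TREEBOOST algorithm itself). The function names and the toy data are made up for illustration. With two perfectly collinear inputs, the Gram matrix that a regression would need to invert is singular, while a simple threshold search on either input still finds a clean split.

```python
def gram_determinant(x1, x2):
    """Determinant of the 2x2 Gram matrix [[x1.x1, x1.x2], [x1.x2, x2.x2]]."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    return a * d - b * b


def sum_squared_error(ys):
    """Sum of squared deviations from the mean of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_stump(x, y):
    """Return (threshold, sse) of the best single split 'x <= threshold'."""
    best = None
    for t in sorted(set(x))[:-1]:  # the largest value cannot be a threshold
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        sse = sum_squared_error(left) + sum_squared_error(right)
        if best is None or sse < best[1]:
            best = (t, sse)
    return best


# Hypothetical toy data: x2 is exactly 2 * x1, so the inputs are collinear.
x1 = [1, 2, 3, 4]
x2 = [2, 4, 6, 8]
y = [0, 0, 1, 1]

print(gram_determinant(x1, x2))  # singular Gram matrix
print(best_stump(x1, y))         # split search on x1 still works
print(best_stump(x2, y))         # and equally well on x2
```

The split search only compares candidate thresholds one variable at a time, so a redundant input just offers an equivalent split rather than breaking the fit.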

Regarding the ratio variable, if there are no other missing values in the data, then simply set the -99 values to missing. "Missing" is a special value that boosting will assign to the best branch independently of the other values. The algorithm also considers splitting Missing vs Non-Missing.
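As a rough sketch of that recoding step, written in plain Python rather than SAS: the function name and sample values are hypothetical, and NaN stands in for the SAS missing value. In EM the same recoding would normally be done in a data preparation step before the Boosting node.

```python
import math

SPECIAL = -99.0  # the sentinel used when the denominator is zero


def special_to_missing(values, special=SPECIAL):
    """Replace the sentinel with NaN so it is treated as missing."""
    return [math.nan if v == special else v for v in values]


# Hypothetical ratio values; the second one had a zero denominator.
ratios = [0.5, -99.0, 1.5]
print(special_to_missing(ratios))
```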

If the ratio variable contains other missing values that you do not want to merge with the special -99 value, then one approach is to create a second variable with value 0 for observations with non-special ratio values and 1 for observations with the special value. In the original ratio variable, replace the -99 value with the average of the legitimate values. Setting it to the average value is an attempt to make those observations uninfluential in the split search. If there are more special values than just -99, then create a separate value for each in the second variable, and declare that variable nominal instead of binary.
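The two-variable idea can be sketched as follows, again in plain Python rather than SAS. The function name, the use of None for a genuinely missing value, and the coding 1, 2, ... for the special values are assumptions made for this illustration only.

```python
def split_special_values(values, specials=(-99.0,)):
    """Return (cleaned, indicator) for one ratio variable.

    cleaned:   special values replaced by the mean of the legitimate values
    indicator: 0 for non-special observations, k for the k-th special value
    None marks a genuinely missing value and is left untouched.
    """
    legit = [v for v in values if v is not None and v not in specials]
    if not legit:
        raise ValueError("no legitimate values to average")
    mean = sum(legit) / len(legit)

    cleaned, indicator = [], []
    for v in values:
        if v is not None and v in specials:
            cleaned.append(mean)                     # make it uninfluential
            indicator.append(specials.index(v) + 1)  # 1, 2, ... per special
        else:
            cleaned.append(v)
            indicator.append(0)
    return cleaned, indicator


# One special value: the indicator is binary.
print(split_special_values([0.5, -99.0, None, 1.5]))

# Two special values: the indicator has three levels and should be nominal.
print(split_special_values([2.0, -99.0, -98.0, 4.0], specials=(-99.0, -98.0)))
```

With more than one special value, the indicator codes are labels only, which is why the second variable should be declared nominal rather than treated as a number.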

This approach is just an idea. I cannot think of a better approach.

Good luck.

Padraic

subham · Posted 10-25-2016 02:42 PM

Thanks, Padraic, for your valuable input. I understand that the EM gradient boosting algorithm is still a black box, and that we have to rely on it as long as it produces better results than conventional methods.
