Hi, Subham.
Unfortunately, the Boosting node in EM is not designed to output information about the individual trees beyond the number of leaves. The EM node invokes PROC TREEBOOST to build the boosted model. An expert user might know how to run PROC TREEBOOST directly in an EM Code node. In that case, one could include the RULES= dataset, STATSBYNODE= dataset, and TOPOLOGY= dataset options in the SAVE statement. Together they would describe all the splits and node statistics. However, these options are not documented, and tech support does not really support them for the Boosting node.
In my opinion, multicollinearity is not a problem. The boosting algorithm does not invert a matrix, so there is no concern about matrix rank.
Regarding the ratio variable, if there are no other missing values in the data, then simply set the -99 values to missing. "Missing" is a special value that boosting will assign to the best branch independently of the other values. The algorithm also considers splitting Missing vs Non-Missing.
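To illustrate the recoding only, here is a minimal sketch in plain Python rather than SAS. The function name special_to_missing and the sample values are made up for the example, and None stands in for a missing value. In EM you would do the equivalent in a DATA step or a Replacement node.

```python
SPECIAL = -99  # the special code described in the question


def special_to_missing(values, special=SPECIAL):
    """Return a copy of values with the special code replaced by None (missing)."""
    return [None if v == special else v for v in values]


# Hypothetical ratio values, two of which carry the special code.
ratio = [0.5, -99, 1.2, -99, 0.8]
cleaned = special_to_missing(ratio)
print(cleaned)
```

This is only safe when the variable has no genuine missing values, because afterwards the two kinds of missing cannot be told apart.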
If the ratio variable contains other missing values that you do not want to merge with the special -99 value, then one approach is to create a second variable with value 0 for observations with non-special ratio values and 1 for observations with the special value. In the original ratio variable, replace the -99 value with the average of the legitimate values. Setting it to the average value is an attempt to make those observations uninfluential in the split search. If there are more special values than just -99, then create a separate value for each in the second variable, and declare that variable nominal instead of binary.
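The indicator-plus-average idea can be sketched as follows, again in plain Python for illustration rather than SAS. The name recode_special, the flag coding (0 for ordinary values, 1, 2, ... for each special code), and the sample data are assumptions made for the example. Genuine missing values (None here) are left alone.

```python
from statistics import mean


def recode_special(values, specials=(-99,)):
    """Replace special codes with the mean of the legitimate values and
    return (recoded_values, flags).

    flags holds 0 for ordinary or missing values and k (1-based position in
    specials) for the k-th special code, so it can be used as a nominal input.
    """
    legit = [v for v in values if v is not None and v not in specials]
    fill = mean(legit)  # average of the legitimate, non-missing values
    recoded, flags = [], []
    for v in values:
        if v is not None and v in specials:
            recoded.append(fill)
            flags.append(specials.index(v) + 1)
        else:
            recoded.append(v)  # ordinary value or genuine missing, unchanged
            flags.append(0)
    return recoded, flags


# Hypothetical data: one genuine missing value and two special codes.
ratio = [0.5, -99, None, 1.5, -99]
recoded, flags = recode_special(ratio)
print(recoded, flags)
```

With a single special code the flag is effectively binary. Passing several codes, for example specials=(-99, -98), gives each its own flag value, which is why the second variable should then be declared nominal.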
This approach is just an idea. I cannot think of a better approach.
Good luck.
Padraic