subham
Calcite | Level 5

Hi

My objective is to use gradient boosting as an alternative to a credit scorecard (commonly built with logistic regression). The target variable is therefore good (0) or bad (1), and the independent variables are mostly continuous. In theory, gradient boosting builds a tree in each iteration, and each tree may use a different set of variables. EM also produces code that I can use to score the validation data. It would be great if you could help me with the questions listed below.

 

1. How can I get the list of variables that appear in the different iterations?

2. I have to present the final output to a lay audience, so I need the final model with all the splits. What is the best way to get this?

3. Is multicollinearity taken into account in gradient boosting?

4. How should I deal with ratio variables that have special values (say, the variable takes the value -99 when the denominator is zero)? I want the special value to be a separate split in the tree.

 

Thanks in advance for your help.

Accepted Solutions
PadraicGNeville
SAS Employee

Hi, Subham.

 

Unfortunately, the Boosting node in EM is not designed to output information about the individual trees beyond the number of leaves. The EM node invokes PROC TREEBOOST to build the boosted model. An expert user might know how to run PROC TREEBOOST directly in an EM Code node. In that case, one could include the RULES= dataset, STATSBYNODE= dataset, and TOPOLOGY= dataset options in the SAVE statement. Together they would describe all the splits and node statistics. However, these options are not documented or really supported by tech support for the boosting node.

 

In my opinion, multicollinearity is not a problem. The boosting algorithm does not invert a matrix, so there is no concern about matrix rank.
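To illustrate the point about matrix inversion, here is a minimal Python sketch. It is not how PROC TREEBOOST is implemented. The function `best_split` and the toy data are hypothetical, and the split criterion (sum of squared errors on a single threshold) is only an assumption chosen for simplicity. The sketch shows that two perfectly collinear inputs make the cross-product matrix X'X singular, while a threshold search on either input still runs without any linear algebra.

```python
def best_split(x, y):
    """Return (sse, threshold) of the best single split on x.

    Illustrative only: scans candidate thresholds and compares sums of
    squared errors. No matrix is built or inverted.
    """
    def sse(ys):
        if not ys:
            return 0.0
        m = sum(ys) / len(ys)
        return sum((v - m) ** 2 for v in ys)

    best = (float("inf"), None)
    for t in sorted(set(x))[:-1]:  # the largest value cannot be a threshold
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        best = min(best, (sse(left) + sse(right), t))
    return best


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]  # perfectly collinear with x1
y = [0, 0, 1, 1]

# Determinant of the 2x2 cross-product matrix X'X for the columns [x1, x2].
a = sum(u * u for u in x1)
b = sum(u * v for u, v in zip(x1, x2))
d = sum(v * v for v in x2)
det = a * d - b * b  # zero here, so X'X cannot be inverted

split_x1 = best_split(x1, y)
split_x2 = best_split(x2, y)
```

In this toy case both inputs give an equally good split, so the search can pick either one. A regression that needs the inverse of X'X has no unique solution on the same data.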

 

Regarding the ratio variable: if there are no other missing values in the data, simply set the -99 values to missing. "Missing" is a special value that boosting will assign to the best branch independently of the other values. The algorithm also considers splitting Missing vs Non-Missing.
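As a rough sketch of that recoding, written in Python rather than as EM or SAS code: the helper names `special_to_missing` and `bad_rate_by_missingness` and the toy data are hypothetical, and `None` stands in for a SAS missing value. The second helper only shows what a Missing vs Non-Missing partition looks like on the target. It does not reproduce the split search that the boosting node performs.

```python
def special_to_missing(values, special=-99.0):
    """Replace the special code with None (standing in for missing)."""
    return [None if v == special else v for v in values]


def bad_rate_by_missingness(values, target):
    """Bad rate (mean of the 0/1 target) for missing vs non-missing rows."""
    missing = [t for v, t in zip(values, target) if v is None]
    present = [t for v, t in zip(values, target) if v is not None]

    def rate(ts):
        return sum(ts) / len(ts) if ts else None

    return rate(missing), rate(present)


ratio = [0.2, -99.0, 0.8, -99.0, 1.4]  # -99 marks a zero denominator
bad = [0, 1, 0, 1, 1]                  # 1 = bad, 0 = good

ratio_m = special_to_missing(ratio)
miss_rate, present_rate = bad_rate_by_missingness(ratio_m, bad)
```

If the two bad rates differ noticeably, as they do in this toy data, the special value carries information and is worth isolating.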

 

If the ratio variable contains other missing values that you do not want to merge with the special -99 value, then one approach is to create a second variable with value 0 for observations with non-special ratio values and 1 for observations with the special value. In the original ratio variable, replace the -99 value with the average of the legitimate values. Setting it to the average is an attempt to make those observations uninfluential in the split search. If there are more special values than just -99, then create a separate value for each in the second variable, and declare that variable nominal instead of binary.
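The indicator-plus-average idea could be sketched as follows. This is Python used for illustration only, not EM code. The function `recode_ratio`, the variable names, and the toy data are hypothetical, and `None` again stands in for a genuinely missing value that should stay separate from the -99 code.

```python
from statistics import mean

SPECIAL = -99.0  # code used when the ratio's denominator is zero


def recode_ratio(values, special=SPECIAL):
    """Return (ratio_clean, special_flag) for a list of ratio values.

    ratio_clean: special codes replaced by the mean of the legitimate values.
    special_flag: 1 where the special code occurred, 0 elsewhere.
    Genuinely missing values (None) are left untouched and flagged 0.
    """
    legit = [v for v in values if v is not None and v != special]
    fill = mean(legit)  # raises StatisticsError if there are no legitimate values
    ratio_clean = [fill if v == special else v for v in values]
    special_flag = [1 if v == special else 0 for v in values]
    return ratio_clean, special_flag


ratio = [0.5, -99.0, None, 1.5, -99.0, 1.0]
ratio_clean, special_flag = recode_ratio(ratio)
```

With several special codes, the flag would instead take a distinct value per code and be treated as nominal, as described above.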

 

This approach is just an idea.  I cannot think of a better approach.

 

Good luck.

 

Padraic

 

subham
Calcite | Level 5

Thanks, Padraic, for your valuable input. I understand that the EM gradient boosting algorithm is still a black box, and that we have to rely on it as long as it produces better results than conventional methods.
