Tree-Based Machine Learning Methods and Model Interpretability in SAS® Viya® Q&A, Slides, and On-Demand Recording

Watch this Ask the Expert session to learn how to build machine learning pipelines in SAS Viya.

You will learn:

How to use SAS Model Studio to build machine learning pipelines to improve your predictive modeling workflow.
How nonlinear models, such as tree-based models, can improve model accuracy as compared to traditional linear models.
How to add interpretability plots to black box models and how to use these plots to understand and improve predictive models.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

Can you use a regression tree for a continuous dependent variable?

Regression trees are suitable for a continuous dependent variable. In fact, if you build a decision tree model on a continuous target (the machine learning lingo for a dependent variable) you will be using a regression tree.
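
To make the idea concrete, here is a minimal sketch in plain Python (not SAS code, and not how Model Studio implements it) of a single regression-tree split: it picks the threshold that minimizes the squared error and predicts the mean of the target in each leaf. The data, the variable names, and the function names are all hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_regression_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing total SSE."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical input values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("need at least two distinct input values")
    return best[1:]

# Hypothetical toy data: tenure in months vs. monthly spend (continuous target)
tenure = [1, 2, 3, 10, 11, 12]
spend = [20.0, 22.0, 21.0, 50.0, 52.0, 51.0]
threshold, left_mean, right_mean = best_regression_split(tenure, spend)
print(threshold, left_mean, right_mean)
```

Each leaf returns a number rather than a class, which is what makes this a regression tree.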

Does this have an open source node as does EM and can you utilize PROC Python within it?

We do have the Open Source Code node, and you don't even need to use PROC PYTHON. You can connect to a Python installation and execute Python (or R) code from a Model Studio node. This node can be used for data preprocessing or for machine learning, but it does require data transfer (from the distributed computing environment to the Python environment).

Do you need to bring the data partitioning node to the pipeline? If it is done automatically, could you please show where it is done?

The partition is done when the project is created in Model Studio. You have the option to turn it off, but you don't have to include a node for it (like in EM). This can be found in Model Studio under the Project Settings -> Partition Settings option.

Did your dataset of ~50,000 records include both training and validation data? Or did you upload two (or more) datasets?

I uploaded one dataset and it was split automatically by the graphical interface. There is an option to manage the partition settings under Advanced -> Partition Settings while you are naming and creating the project. You can also access these settings from inside an already created project under Project Settings.
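
For readers who want to see what an automatic split amounts to, here is a rough sketch in plain Python of partitioning one dataset into training and validation sets while keeping the event rate the same in both. The 70/30 split, the seed, the column names, and the stratification itself are assumptions for illustration, not a description of Model Studio's defaults.

```python
import random
from collections import defaultdict

def stratified_partition(rows, target_key, train_fraction=0.7, seed=42):
    """Split rows into (train, validate), stratified on the target value."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)
    train, validate = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's lists are untouched
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        validate.extend(members[cut:])
    return train, validate

# Hypothetical data: 1,000 rows with a 10% event rate
rows = [{"id": i, "churn": 1 if i % 10 == 0 else 0} for i in range(1000)]
train, validate = stratified_partition(rows, "churn")
print(len(train), len(validate))
```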

In the open source nodes, can you use native code or do you need to run R under Proc IML and Python under PROC Python?

You can execute native Python or R code!

At the end of the class, can we go back to view a few charts to show us what may be signs of overfitting? (So we know what to watch out for.)

If training performance is good but validation performance is bad, that is one of the clearest signs of overfitting.
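
The same check can be written down as a small sketch in plain Python. The error rates below are made up, and the 0.05 tolerance is an arbitrary assumption; the point is only that a widening gap between training and validation error is the warning sign.

```python
def overfit_report(depths, train_error, valid_error, tolerance=0.05):
    """Return the depth with the lowest validation error and the depths
    where validation error exceeds training error by more than tolerance."""
    best_idx = min(range(len(depths)), key=lambda i: valid_error[i])
    flagged = [
        d for d, tr, va in zip(depths, train_error, valid_error)
        if va - tr > tolerance
    ]
    return depths[best_idx], flagged

# Hypothetical misclassification rates for trees of increasing depth
depths = [2, 4, 6, 8, 10]
train_error = [0.20, 0.15, 0.10, 0.05, 0.01]
valid_error = [0.21, 0.16, 0.14, 0.17, 0.20]
best_depth, overfit_depths = overfit_report(depths, train_error, valid_error)
print(best_depth, overfit_depths)
```

In this made-up example, training error keeps falling as the tree gets deeper, while validation error bottoms out at depth 6 and then climbs again.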

My company is entertaining the idea of upgrading to Viya - will all the features be available, or do we have to sign (pay) for some of the features? We have separate products like base, EG, EM, etc.

I'm so excited to hear you are thinking about moving to SAS Viya. It would be best to follow up with your SAS Account Executive and they can walk you through how the SAS Viya products translate to the SAS 9 products.

Do you have a Model Interpretability option in SAS EM? (I am still an EM user)

There are macros with which you can build partial dependence plots semi-manually. The macros will do some of the work for you in SAS Enterprise Miner, but it is not included in the graphical interface in an easy way like it is in Model Studio. That is one of the advantages of upgrading. Here is a link to a paper with code for this in EM: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1950-2018.pdf
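
For intuition about what building partial dependence by hand involves, here is a minimal sketch in plain Python. It is not the macro code from the paper. The `predict` function is a hypothetical stand-in for a fitted black-box model, and the column names and data are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is unchanged
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def predict(row):
    """Hypothetical stand-in for a fitted model's predicted probability."""
    score = 0.1
    if row["handset_age"] > 24:
        score += 0.3
    if row["customer_age"] < 30:
        score += 0.2
    return score

rows = [
    {"handset_age": 6, "customer_age": 25},
    {"handset_age": 30, "customer_age": 45},
    {"handset_age": 18, "customer_age": 60},
    {"handset_age": 36, "customer_age": 22},
]
curve = partial_dependence(predict, rows, "handset_age", [12, 24, 36])
for value, avg in curve:
    print(value, round(avg, 3))
```

Plotting the grid values against the averaged predictions gives the partial dependence curve for that input.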

Can these models automatically look for interactions in features, for example, handset age and customer age?

The logistic regression model will not automatically look for interactions because it just fits a linear model. The nonlinear models, however, fundamentally capture interactions in how they fit the data. They look for nonlinear relationships between the inputs and the target, but they will not create an interaction variable and report information about it. The interaction is mixed up behind the scenes in the complex nonlinear function that relates the inputs to the target. With decision trees, the nonlinear function is all those splits: those if-then sequences. For something like a neural network, it is some complex nonlinear function of all the input variables.
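
A tiny sketch in plain Python shows how nested splits encode an interaction without ever creating an interaction variable. The tree, its cutoffs, and its leaf probabilities are all hypothetical.

```python
def tree_predict(handset_age, customer_age):
    """Hypothetical two-level tree; leaf values are made-up churn probabilities."""
    if handset_age > 24:
        # customer age only matters on this branch of the tree
        if customer_age < 30:
            return 0.60
        return 0.25
    return 0.10

# Effect of being a younger customer, measured at two handset ages
effect_old_handset = tree_predict(36, 25) - tree_predict(36, 50)
effect_new_handset = tree_predict(12, 25) - tree_predict(12, 50)
print(round(effect_old_handset, 2), round(effect_new_handset, 2))
```

The effect of customer age depends on handset age, which is exactly what an interaction is, yet nothing in the model is labeled as one.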

What will be considered a threshold for ROC or is it always based on the training ROC value?

The worst case for the area under the ROC curve is 0.5, which corresponds to a random guessing model, so you want to be above that. In my mind, a value of around 0.6 is a weak model, whereas a value above 0.7 is a strong model. I would compare it to the training data to make sure we're not overfitting, but ultimately you want a model that predicts well, one that does a good job making predictions on both the training and the validation data, so I usually want to make sure my ROC is above about 0.7. For an absolute scale, though, I generally like to look at the misclassification rate because that's easy to interpret in the business context. We might want our misclassification rate to be less than 10%, for example, so that's how I would set a standard rather than comparing it to the training data.
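
To show where the 0.5 baseline comes from, here is a minimal sketch in plain Python that computes the area under the ROC curve as the chance that a randomly chosen event is scored higher than a randomly chosen non-event, alongside the misclassification rate at a cutoff. The labels, scores, and the 0.5 cutoff are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random event outscores a random non-event
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def misclassification_rate(labels, scores, cutoff=0.5):
    """Share of rows whose predicted class (score >= cutoff) is wrong."""
    wrong = sum(1 for y, s in zip(labels, scores) if (s >= cutoff) != (y == 1))
    return wrong / len(labels)

# Hypothetical validation data
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
uninformative = [0.5] * len(labels)  # a model that cannot rank anyone

model_auc = auc(labels, scores)
baseline_auc = auc(labels, uninformative)
model_misclass = misclassification_rate(labels, scores)
print(round(model_auc, 3), baseline_auc, model_misclass)
```

A model that gives everyone the same score cannot rank events above non-events, so it lands at exactly 0.5.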

For local interpretability plots, is there a way to select which observation you are looking at?

Yes, you can specify up to 5 observations. You can also specify a key variable in the Data tab and then use that variable to select specific observations from the data.

When the target is imbalanced where one outcome is much less likely to occur than its complement (e.g., cancer diagnosis), is there an option that improves prediction in forests? I know the randomForestSRC package includes the imbalanced function that samples the rarer event more frequently. Does SAS have a similar option available?

Generally, you don’t need to balance classes unless the outcome is very rare (<10%), but there is the option to do event-based sampling in Model Studio. You would modify these settings in the project settings at the same time and place you choose the partition settings (before anything is run in the pipeline).
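
As a rough illustration of the idea behind event-based sampling, the sketch below (plain Python, not Model Studio's implementation) keeps every rare event and randomly samples non-events until the events make up a chosen share of the data. The 50/50 target mix, the seed, and the column names are assumptions.

```python
import random

def event_based_sample(rows, target_key, event_fraction=0.5, seed=42):
    """Keep all events and sample non-events so that events make up
    roughly `event_fraction` of the returned rows."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target_key] == 1]
    non_events = [r for r in rows if r[target_key] != 1]
    wanted = round(len(events) * (1 - event_fraction) / event_fraction)
    kept = rng.sample(non_events, min(wanted, len(non_events)))
    sample = events + kept
    rng.shuffle(sample)
    return sample

# Hypothetical data: 1,000 rows with a 2% event rate
rows = [{"id": i, "event": 1 if i % 50 == 0 else 0} for i in range(1000)]
balanced = event_based_sample(rows, "event")
print(len(balanced), sum(r["event"] for r in balanced))
```

Keep in mind that a model trained on a sample like this sees events far more often than they occur in the full data, so its predicted probabilities are not on the original scale unless they are adjusted.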

Do you have guidelines for sample size for Trees?

In general, you want to have enough training data to build a sensible model. This really depends on the kind of data and the modeling goal, but a good rule of thumb is to have at least 1000 observations in the training data for the tree-based models, although more observations will lead to a better model. A smaller sample size for the training data means a higher chance of overfitting, so that is the main thing to watch out for with small sample sizes. Also, keep in mind that the number of columns will play a large role in how many observations you need – as ideally you need multiple representations of every potential combination of columns.

How do we use a custom function for autotuning? What if I wanted to select the hyperparameters based on Train and Test Gini?

The autotuning algorithm allows you to select the assessment metric for the autotuning, which means you can choose how you want it to select hyperparameters. This can be found under Autotuning -> General Options in the node options. You can choose how the algorithm splits the training data and the objective function for autotuning.

How are the cut-offs in the nodes for the splits determined?

Interval variables are binned so that they are not continuous. In the demo, Ari showed that the “Number of bins” was set to 100. This means that every numeric/interval variable is binned into 100 bins (can be set to quantile binning or bucket binning). Then, when the tree is determining splits it runs through all of the variables (binned numeric/interval and categorical) to determine what the best split would be and chooses that.
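
The bin-then-search procedure can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses bucket (equal-width) binning, a single input, a binary target, and Gini impurity as the split criterion, which is swapped in here for simplicity and is not necessarily the criterion the SAS node uses. All names and data are hypothetical.

```python
def bucket_bin_edges(values, n_bins):
    """Interior edges of n_bins equal-width buckets over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def gini(labels):
    """Gini impurity for a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_binned_split(values, labels, n_bins=100):
    """Try a split at every bin edge; return (edge, weighted impurity)."""
    best_edge, best_impurity = None, float("inf")
    n = len(values)
    for edge in bucket_bin_edges(values, n_bins):
        left = [y for v, y in zip(values, labels) if v < edge]
        right = [y for v, y in zip(values, labels) if v >= edge]
        if not left or not right:
            continue  # an edge that separates nothing is not a usable split
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_edge, best_impurity = edge, impurity
    return best_edge, best_impurity

# Hypothetical toy data, with 4 bins instead of 100 to keep it readable
handset_age = [2, 5, 8, 11, 14, 26, 29, 32, 35, 38]
churned = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
edge, impurity = best_binned_split(handset_age, churned, n_bins=4)
print(edge, impurity)
```

Because only bin edges are tried as cutoffs, the number of candidate splits per variable is capped by the number of bins rather than by the number of distinct values. A full tree repeats this search across every input and picks the best split overall.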

Recommended Resources

SAS Tutorial: Machine Learning Fundamentals

SAS Tutorial: Machine Learning: A Coding Example in SAS

SAS Tutorial: How to Choose a Machine Learning Algorithm

SAS Tutorial: Interpreting Machine Learning Models in SAS

Moving from SAS®9 to SAS® Viya®

Please see additional resources in the attached slide deck (PDF).

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
