
Data-Driven Analytics in SAS Viya – Decision Tree Model Results


 

Getting Started

 

In today’s post, we’ll look at how to interpret the results of a decision tree built in SAS Viya. In my last post in this series, I showed you just how easy it is to build a decision tree in SAS Visual Statistics. I also discussed the origins of decision trees and some of the options available when building them. Moving forward, we will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the various pieces of output from the decision tree that was built using variable annuity (insurance product) data.

 

Insurance Data

 

Remember, the business challenge is to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information and product usage captured over a three-month period. The target variable, named Ins, is binary and indicates whether the customer purchased the annuity.

 

[Figure: a sample of the develop_final data]

 

Autotuning

 

Remember, in my previous post we examined the Summary Bar to find the model fit statistic KS (Youden) with a value of 0.4098. This statistic ranges from 0 to 1, with higher numbers indicating a better model. It's fair to ask, "Can we do better?" In fact, there is a surprisingly easy feature we can take advantage of: autotuning. Autotuning is a process where a model's hyperparameters are automatically adjusted by an algorithm to create several versions of the same model. These versions are then compared to find out which set of hyperparameters works best for that model. But wait a minute, what is a hyperparameter? A hyperparameter is simply one of the model options that can be fine-tuned to improve a model's performance. Unfortunately, optimal values for your hyperparameters cannot be calculated from the data. So, if you don't use autotuning, you are stuck with trial and error to find better values for your model.

 

A good example of a hyperparameter for decision trees is the number of levels in the tree. In other words, how many levels deep should we make our decision tree for the insurance data? Should it be a maximum of 6 levels deep (the default value)? Or could I improve the model's KS value by adding more or fewer levels to the tree? I could build a whole bunch of decision trees with varying depths and compare them, but that would be time-consuming! This is where autotuning can help us out.
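To make the idea concrete, here is a minimal sketch of that trial-and-error loop, written in Python with scikit-learn rather than in SAS (it is not what Visual Statistics runs under the hood). It assumes a pandas DataFrame named develop_final with numeric inputs and the binary target Ins, fits one tree per candidate depth, and scores each with KS (Youden), the maximum of TPR minus FPR across all cutoffs.

```python
# A minimal sketch of the trial-and-error loop described above, written in
# Python with scikit-learn (not SAS code). Assumes a pandas DataFrame named
# develop_final with numeric inputs and the binary target Ins.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve

X = develop_final.drop(columns=["Ins"])
y = develop_final["Ins"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

for depth in [2, 4, 6, 8, 10, 12]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    prob = tree.predict_proba(X_valid)[:, 1]   # P(Ins = 1)
    fpr, tpr, _ = roc_curve(y_valid, prob)
    ks = (tpr - fpr).max()                     # KS (Youden): max TPR - FPR
    print(f"max_depth={depth:2d}  KS={ks:.4f}")
```

Autotuning automates exactly this kind of sweep, and it searches several hyperparameters at once.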

 

[Figure: the Autotune option in the decision tree options pane]

 

Selecting Autotune in the options pane of the decision tree object opens the Autotuning Setup window.

 

[Figure: the Autotuning Setup window]

 

The Autotuning Setup window allows us to do two things: control how the tuning should proceed and choose which hyperparameters should be autotuned. We don't have time to cover all these features in detail, but let's cover this window at a very high level. (The documentation has many more details.) The Search method is the algorithm that determines the hyperparameter values used in each iteration of the autotuning process. It defaults to a genetic algorithm that is initially seeded from Latin hypercube samples. The Objective metric is the statistic used to compare the autotuned models against each other. For categorical response models, Misclassification rate is the default. For measure response models, Average square error is the default.

 

Maximum minutes is the maximum amount of time, in minutes, that the model training process runs. Maximum number of models is the maximum number of different models created during the autotuning process. Opening the Hyperparameters section of this window reveals that the parameters autotuned for decision trees are maximum levels, leaf size, and predictor bins. Remember, we already covered maximum levels and leaf size in the last post. The Predictor bins option specifies the number of bins used to categorize every predictor that is a measure variable. The default is 50. Adding more bins increases model complexity because the decision tree has more potential splits to consider. Increasing the number of bins can also lead to finer granularity in capturing patterns, as well as potential overfitting.
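As a rough analogue of this setup (again in Python with scikit-learn rather than SAS), the sketch below tunes all three hyperparameters at once, caps the number of models tried, and scores candidates by misclassification rate (1 minus accuracy). Note the substitutions: a plain random search stands in for Viya's genetic/Latin hypercube method, and a quantile binning step stands in for the Predictor bins option. It reuses X_train and y_train from the earlier sketch.

```python
# A rough analogue of the Autotuning Setup in scikit-learn (not the SAS
# implementation): a plain random search stands in for Viya's genetic /
# Latin hypercube method, n_iter caps the "maximum number of models", and
# quantile binning stands in for the Predictor bins option.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("bins", KBinsDiscretizer(encode="ordinal", strategy="quantile")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "bins__n_bins": range(10, 200),           # predictor bins
        "tree__max_depth": range(2, 13),          # maximum levels
        "tree__min_samples_leaf": range(5, 300),  # leaf size
    },
    n_iter=50,              # maximum number of models
    scoring="accuracy",     # misclassification rate = 1 - accuracy
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, "misclassification:", 1 - search.best_score_)
```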

 

When I perform autotuning on our insurance data, I get the following results.

 

[Figure: the option settings after autotuning]

 

Maximum levels is tuned from 2 to 12, Leaf size is increased from 6 to 231, and Predictor bins is tuned from 50 to 155. The great news is that, as a result of autotuning, the KS statistic also increased, from 0.4098 to 0.4380.

 

[Figure: the full decision tree output with the new option settings]

 

Decision Tree Results

 

You may remember that my last post ended by covering the following results at a very high level:

 

  • The summary bar across the top of the page.
  • The Tree Window containing both the Decision Tree and the Icicle Plot.
  • The Decision Tree, which is an interactive, navigational tree-map.
  • The Icicle Plot revealing a detailed hierarchical breakdown of the tree data.
  • The Variable Importance Plot displaying the importance of each variable.
  • The Confusion Matrix revealing the correct and incorrect classifications.

 

We want to take a deep dive into each of these items, but let's make it a little easier on ourselves by taking advantage of a Model Display option. We've done this before: simply open the Options pane of the Decision Tree and scroll down to the Model Display options. Under the General category, change the Plot layout from Fit (the default) to Stack. This option specifies how the subplots are displayed on the canvas. By default, all the output subplots are shown together on one page. Changing the Plot layout to Stack improves viewability by letting each subplot fill the canvas; a control bar then enables you to move between subplots.

 

Summary Bar

 

Examining the Summary Bar at the top of the canvas tells us several things. We have created a decision tree on the target variable Ins. Our model has chosen an event level of 1, which means it is designed to predict those customers who purchase an annuity. After autotuning, the default model fit statistic KS (Youden) now has a value of 0.4380. And some 32,000 observations were used in building this model.

 

Decision Tree (tree-map)

 

This interactive, navigational decision tree displays the node statistics and the node rules. You can navigate the decision tree easily with the mouse. Click and hold the mouse button anywhere in the Tree window to move the decision tree within the window. Scroll to zoom in and out: scroll up to zoom in, and scroll down to zoom out. The zoom is always centered on the position of the mouse pointer.

 

[Figure: the decision tree zoomed out]

 

We start with the tree completely zoomed out, so we see the least amount of detail. The color of each node in the tree-map indicates the predicted level for that node, that is, the event level with the most observations in the node. We can see that we have a mixture of nodes that are primarily purchasers (yellow) and primarily non-purchasers (blue). We can also see that the first split (the root node at the top of the tree) is based on Saving Balance. The first split in a decision tree plays a pivotal role in shaping the tree's structure, performance, and ability to generalize; it sets the stage for the entire model's accuracy and efficiency, and changes near the top of the tree typically cascade down through the remainder of the tree. Let's zoom in on one of the terminal nodes and see what kind of detail is revealed.

 

[Figure: the decision tree zoomed in on Node ID 8]

 

Zooming in on and selecting Node ID 8 reveals an amazing amount of detail about these 1,491 customers. First, I can see that they have just over a 65% predicted probability of being purchasers. Second, I can see the tree path, or rules, that were followed to create this node: BIN_DDABal (binned checking balance) has to be 2, 9, or 10; the customer must own a Certificate of Deposit; and their Saving Balance must be less than roughly $1,550 or missing.
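Expressed as code, a node rule is just a predicate. The sketch below is hypothetical: the argument names mirror the variables in the post, and the $1,550 split value is approximate, read from the screenshot.

```python
# A hypothetical restatement of the Node ID 8 rule as a plain predicate.
# Argument names mirror the post's variables; the $1,550 split value is
# approximate, read from the screenshot.
import math

def in_node_8(bin_ddabal, owns_cd, sav_bal):
    """True if a customer follows the tree path into Node ID 8."""
    return (bin_ddabal in (2, 9, 10)
            and owns_cd == 1
            and (sav_bal is None or math.isnan(sav_bal) or sav_bal < 1550))

print(in_node_8(bin_ddabal=9, owns_cd=1, sav_bal=float("nan")))  # True
```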

 

In addition to examining the nodes and rules of the tree-map, there are more features available in the Tree window. You can derive a leaf ID variable. This action creates a category variable that contains the leaf ID for each observation, and you can use this variable in other objects throughout SAS Visual Analytics. You can also derive predicted values: depending on the type of response variable, new variables are created, including predicted values, probability values, and prediction cutoffs. To derive predicted values, right-click in the Tree window and select Derive predicted.
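For intuition, here is a rough scikit-learn analogue (not the SAS feature itself) of those two derivations, continuing with the fitted tree and validation data from the earlier sketch: apply() yields each observation's leaf ID, predict_proba() yields the event probability, and a cutoff converts the probability into a predicted class.

```python
# A rough scikit-learn analogue (not the SAS feature) of deriving a leaf ID
# and predicted values, continuing with the fitted `tree` and validation
# data from the earlier sketch.
import pandas as pd

derived = pd.DataFrame({
    "leaf_id": tree.apply(X_valid),               # leaf node ID per row
    "p_Ins1": tree.predict_proba(X_valid)[:, 1],  # P(Ins = 1)
})
derived["predicted_Ins"] = (derived["p_Ins1"] >= 0.5).astype(int)  # cutoff
print(derived.head())
```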

 

[Figure: the newly derived prediction items]

 

Interactive Mode

 

Let's suppose that as a data scientist, you happen to have some business knowledge and want to modify the design of the decision tree. Just like a gardener tends a tree, in Interactive Mode you can tend (redesign) the decision tree. To enter Interactive Mode, right-click on the decision tree and select the task. To split a node, right-click on it and select Split (for leaf nodes) or Edit split (for non-leaf nodes); in the Split Node window, you select the variable that is used to split the node. To train a leaf node, right-click on it and select Train; in the Train Node window, you can specify the variables and the maximum depth of training. To prune the tree at a node, right-click on an interior node and select Prune. This removes all nodes beneath the selected node and turns it into a leaf node. There is plenty more functionality available in Interactive Mode, and I encourage you to check out the documentation.
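There is no exact scikit-learn equivalent of Interactive Mode, but pruning at a chosen node can be emulated with a small (and admittedly hacky) tweak to a fitted tree's internals: marking a node's children as leaves detaches its subtree, and prediction then stops at that node using its stored class counts. The node ID below is hypothetical and uses sklearn's internal numbering, not the Node IDs shown in Visual Statistics.

```python
# A hacky emulation of pruning at a chosen node, using scikit-learn
# internals (not the SAS interactive mode): marking a node's children as
# leaves detaches its subtree, and prediction then stops at that node.
from sklearn.tree import _tree

def prune_at(fitted_tree, node_id):
    """Turn node_id into a leaf by detaching its children."""
    fitted_tree.tree_.children_left[node_id] = _tree.TREE_LEAF
    fitted_tree.tree_.children_right[node_id] = _tree.TREE_LEAF

prune_at(tree, node_id=3)  # hypothetical internal node to prune
```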

 

Conclusion

 

We’ve continued our journey into supervised classification by autotuning a decision tree and beginning to interpret the output in SAS Visual Statistics. As we continue to develop models in the AI and Analytics lifecycle, we will encounter even more interesting features and options. In my next post, I’ll finish covering the output results for this decision tree. If you are ready to learn more about decision trees, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Tree-Based Machine Learning Methods in SAS® Viya®. See you next time, and never stop learning!

 

 

Find more articles from SAS Global Enablement and Learning here.
