In today’s post, I'll show you how to create decision trees in SAS Visual Statistics. We'll see just how easy it can be to apply this versatile and well-known technique in SAS Viya. This post is my final installment for the year in my comprehensive review of data-driven analytics. We will continue to focus on the part of the AI and Analytics lifecycle that involves developing robust models using best practices. In previous posts I've discussed many types of classification techniques, including supervised and unsupervised methods, and I've already demonstrated clustering and logistic regression. Today, we will learn how to create, interpret, and optimize decision trees in SAS Visual Statistics.
Decision trees are one of the most versatile tools in predictive modeling, offering intuitive insights into complex datasets. SAS Visual Statistics in SAS Viya (a powerful in-memory analytics platform) simplifies the process of building and analyzing decision trees. Whether you're a data scientist, business analyst, or beginner exploring machine learning, SAS Visual Statistics provides an interactive and efficient way to build robust decision trees.
Decision trees, a foundational tool in data science and machine learning, have a rich history rooted in statistical research and decision theory. Their origins trace back to the mid-20th century, when statisticians began exploring algorithms to simplify decision-making. The concept was heavily influenced by Claude Shannon’s Information Theory, which introduced entropy and laid the groundwork for the information gain measure, key principles still used in modern decision tree algorithms. Early decision tree applications focused on optimizing decision-making in uncertain environments, and they eventually found their way into fields like economics, operations research, and artificial intelligence. Over time, as these algorithms became more sophisticated, they evolved from simple binary splits to more complex techniques that could handle both multi-class problems and continuous variables.
By the 1980s, decision trees gained prominence with the development of ID3 (Iterative Dichotomiser 3) by Ross Quinlan, an algorithm designed to construct trees efficiently by selecting attributes that maximize information gain. Later, more robust algorithms such as C4.5 (ID3's successor) and CART (Classification and Regression Trees) introduced pruning techniques to reduce overfitting and improve model generalization. As machine learning moved to the forefront of business analytics, decision trees became a core building block for ensemble methods like Random Forests and Gradient Boosted Trees, further boosting their predictive power. Today, decision trees remain an essential part of data science, valued for their interpretability, efficiency, and versatility in solving both classification and regression problems.
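To make the idea of information gain concrete, here is the standard formulation (not specific to any one software product): for a set of cases S with class proportions p_i, the entropy of S and the information gain from splitting on an attribute A are

```latex
H(S) = -\sum_{i} p_i \log_2 p_i,
\qquad
\mathrm{IG}(S, A) = H(S) \;-\; \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

At each node, ID3 chooses the attribute with the largest information gain; C4.5 refines this with the gain ratio to reduce the bias toward attributes with many levels.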
As we've done in earlier posts, let's assume that we are a data scientist working with variable annuity (an insurance product) data. The table named develop_final is used to identify customers who are likely to respond to a variable annuity marketing campaign. It contains just over 32,000 banking customers and input variables that reflect both demographic information and product usage captured over a three-month period. The target variable, Ins, is binary: a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not. Note that I have already performed some data clean-up (including binning, transforming, and imputation) and variable selection (using variance explained), so we are ready to build supervised models. If you’re interested in seeing some of those data cleansing techniques performed on the develop table, please see Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio.
From SAS Drive, we open the Visual Analytics web application by clicking the Applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click New report. In the left-hand Data pane, select Add data. Find and select the DEVELOP_FINAL table and then click Add.
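If you prefer to make the table available programmatically rather than through the Data pane, a minimal sketch in SAS code might look like the following. It assumes the develop_final table sits in the WORK library and that a CAS session can be started with default settings; adjust the caslib and table names for your environment.

```sas
cas mysession;                          /* start a CAS session                       */
caslib _all_ assign;                    /* assign librefs to the available caslibs   */

proc casutil;
   load data=work.develop_final         /* source SAS data set (assumed location)    */
        outcaslib="casuser"             /* target caslib                             */
        casout="DEVELOP_FINAL"
        promote;                        /* promote so the table is visible to        */
                                        /* Visual Analytics across sessions          */
quit;
```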
With the data already cleaned and prepped for model building, we are ready to create our decision tree. On the left, we'll change from the Data pane to the Objects pane by selecting Objects. From this list, scroll down to the list of Statistics objects. From there we can either double-click or drag-and-drop the Decision Tree object onto the first page of the report.
Next, click on the Data Roles pane on the right. Assign Ins as the Response and all 36 remaining explanatory variables as Predictors.
Select the Options pane on the right. Before we examine the output for this decision tree object, let’s note that there are many options that are available for building decision trees. The developers at SAS carefully select default values that almost always work well with both the software and your data. Let's examine just a few of those default options for decision trees.
The Maximum branches option specifies the maximum number of branches for each node split. By default, you will construct a tree with just two-way splits, and SAS Visual Statistics will allow up to 10 branches per node. Two-way splits are flexible, computationally efficient, and widely used; they tend to build deeper trees with simpler splits. Multi-way splits yield shallower trees but are more computationally expensive and at greater risk of overfitting, so they are best reserved for categorical inputs that have multiple distinct levels.
The Maximum levels option specifies the maximum number of levels in the tree. By default, your decision tree will have no more than 6 levels, and SAS Visual Statistics will allow you to build trees up to 20 levels deep. Trees with fewer levels tend to be simpler, easier to interpret, and less prone to overfitting, but they may underfit complex data by oversimplifying relationships. Trees with more levels can capture intricate patterns and handle complex datasets well, but they are often harder to interpret, computationally expensive, and prone to overfitting. Shallow trees generalize better for small or noisy datasets, while deep trees excel on larger, detailed datasets. Choosing a good depth requires balancing the tradeoff between bias (shallow trees) and variance (deep trees). For more discussion of the bias-variance trade-off, check out the excellent and comprehensive Statistics You Need to Know for Machine Learning.
The Leaf size option specifies the minimum number of cases that are allowed in a leaf node (also known as a terminal node). By default, the minimum number of observations for an individual leaf is 5. Decision trees with smaller leaf sizes tend to be more detailed and capture finer patterns. While this can improve accuracy on training data, it increases the risk of overfitting. Trees with larger leaf sizes can generalize better (less prone to overfitting) and are more computationally efficient. Of course, this comes at the risk of underfitting by oversimplifying the data. In general, smaller leaf size favors complex models and larger leaf size favors simpler models.
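For readers who prefer code, here is a rough sketch of how the three defaults above (maximum branches, maximum levels, and leaf size) map to options of PROC TREESPLIT in SAS Viya. The predictor names are placeholders for the 36 selected inputs, and exact option names can vary by release, so treat this as an illustration rather than a reproduction of the Visual Statistics object.

```sas
proc treesplit data=casuser.develop_final
      maxbranch=2        /* Maximum branches: two-way splits (default)        */
      maxdepth=6         /* Maximum levels: tree no deeper than 6 (default)   */
      minleafsize=5;     /* Leaf size: at least 5 cases per leaf (default)    */
   class Ins;                                    /* binary target             */
   model Ins = SavBal DDABal Dep Teller ATMAmt;  /* placeholder predictors    */
run;
```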
While there are other options available to explore, it's time to examine the output for our decision tree based on the annuity data.
Examining the Summary Bar at the top of the canvas reveals a few things. We have created a decision tree based on the target variable Ins. Our model has chosen an event level of 1, which means it is designed to predict those customers who purchase an annuity. The default model fit statistic is KS (Youden) with a value of 0.4098; this statistic ranges from 0 to 1, with higher numbers indicating a better model. Finally, 32K observations were used to build this model.
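As a quick reminder of what that fit statistic measures: the KS (Youden) statistic is the largest separation between the true positive rate and the false positive rate across all probability cutoffs c, which is the same as the maximum of Youden's J index:

```latex
\mathrm{KS}
  = \max_{c}\bigl(\mathrm{TPR}(c) - \mathrm{FPR}(c)\bigr)
  = \max_{c}\bigl(\text{Sensitivity}(c) + \text{Specificity}(c) - 1\bigr)
```

A value of 0 means the model separates events from non-events no better than chance, while 1 would mean perfect separation, so our 0.4098 indicates meaningful, though far from perfect, separation.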
Underneath the summary bar are the decision tree results, including the Tree Window, Icicle Plot, Variable Importance, and Confusion Matrix. (The Tree Window actually contains both the Tree and the Icicle Plot.) Let’s define each of these at a very high level and save the details for my next post. The Tree is a tree-map, a navigational tool we can use to interactively investigate the node statistics and node rules. The Icicle Plot gives us a detailed hierarchical breakdown of the tree data, starting with the root node on top. The Variable Importance plot shows the relative importance of each effect used in the tree. And finally, the Confusion Matrix displays a summary of the correct and incorrect classifications for both the “event” and the “non-event,” helping us assess the performance of a classification model such as a decision tree.
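For reference, the confusion matrix for a binary target like Ins has the familiar two-by-two layout, where each cell counts how the predicted class compares with the actual class:

```latex
\begin{array}{c|cc}
                            & \text{Predicted event (1)} & \text{Predicted non-event (0)} \\ \hline
\text{Actual event (1)}     & \text{true positives}      & \text{false negatives} \\
\text{Actual non-event (0)} & \text{false positives}     & \text{true negatives}
\end{array}
```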
We’ve continued our journey into supervised classification by creating a decision tree using all the default values in SAS Visual Statistics. As we continue to develop models in the AI and Analytics lifecycle, we will see even more interesting techniques. In my next post, I’ll cover in detail the output results along with their interpretation for this decision tree. If you are ready to learn more about decision trees, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Tree-Based Machine Learning Methods in SAS® Viya®. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.