03-21-2025
AndyRavenna
SAS Employee
Member since
05-14-2019
- 25 Posts
- 9 Likes Given
- 1 Solution
- 0 Likes Received
Activity Feed for AndyRavenna
- Tagged Data-Driven Analytics in SAS Viya – Decision Tree Icicle Plot and Variable Importance on SAS Communities Library. 03-14-2025 01:48 PM
- Posted Data-Driven Analytics in SAS Viya – Decision Tree Icicle Plot and Variable Importance on SAS Communities Library. 03-14-2025 01:47 PM
- Posted Easily Turn Your Automated Explanation Into a Predictive Model Q&A, Slides, and On-Demand Recording on Ask the Expert. 02-18-2025 01:28 PM
- Tagged Data-Driven Analytics in SAS Viya – Decision Tree Model Results on SAS Communities Library. 01-28-2025 12:47 PM
- Posted Data-Driven Analytics in SAS Viya – Decision Tree Model Results on SAS Communities Library. 01-28-2025 12:45 PM
- Tagged Data-Driven Analytics in SAS Viya – Building a Decision Tree on SAS Communities Library. 01-06-2025 04:46 PM
- Posted Data-Driven Analytics in SAS Viya – Building a Decision Tree on SAS Communities Library. 01-06-2025 04:41 PM
- Tagged Data-Driven Analytics in SAS Viya – Logistic Regression Lift and ROC Charts on SAS Communities Library. 10-28-2024 03:57 PM
- Posted Data-Driven Analytics in SAS Viya – Logistic Regression Lift and ROC Charts on SAS Communities Library. 10-28-2024 03:54 PM
- Posted Re: Accessing and Using the SAS Viya for Learners Data Repository on SAS Communities Library. 10-14-2024 11:57 AM
- Tagged Data-Driven Analytics in SAS Viya – Logistic Regression Model Assessment on SAS Communities Library. 10-07-2024 04:54 PM
- Posted Data-Driven Analytics in SAS Viya – Logistic Regression Model Assessment on SAS Communities Library. 10-07-2024 04:50 PM
- Tagged Data-Driven Analytics in SAS Viya – Logistic Regression Model Building on SAS Communities Library. 09-19-2024 02:53 PM
- Tagged Data-Driven Analytics in SAS Viya – Logistic Regression Model Results Interpretation on SAS Communities Library. 09-19-2024 02:52 PM
03-14-2025
01:47 PM
Getting Started
In today’s post, we'll finish looking at the results of a decision tree in SAS Viya by examining icicle plots and variable importance. In my previous post of this series, we examined autotuning and the tree-map of a decision tree built in SAS Visual Statistics. I also discussed how to create predicted values and invoke the interactive mode. Moving forward we will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, we will finish examining the remaining pieces of output from the decision tree that was built using variable annuity (insurance product) data.
Insurance Data
Remember, the business challenge is trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information and product usage captured over a three-month period. The target variable, named Ins, is binary.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
Decision Tree Results
You may remember from my last post we ended by covering the following results in some detail:
The summary bar across the top of the page.
The Decision Tree, which is an interactive, navigational tree-map.
Today we will discuss the remaining output:
The Icicle Plot revealing a detailed hierarchical breakdown of the tree data.
The Variable Importance Plot displaying the importance of each variable.
The Leaf Statistics revealing counts and percentages for each node.
Icicle Plot
Introduced in the 1980s, icicle plots evolved from treemap visualizations to better represent hierarchical structures like decision trees. Unlike traditional tree diagrams, which can spread outward and quickly become unwieldy, icicle plots provide a compact, stacked visualization. Using color and dimension, icicle plots allow users to trace decision paths, identify feature importance, and compare different trees efficiently.
The icicle plot for our insurance data captures the essence of what happens within the decision tree. Starting at the top with the longest blue bar, we have the root node. By highlighting the top of the icicle plot, we can see in the pop-up that it represents Node ID 0 with all 32,264 observations, and the majority (65%) are non-purchasers. Just like the original decision tree, this bar is labeled as Saving Balance, which determines our first split in the tree.

Moving down to the next line, we see two bars representing the result of that first split. If we highlight the blue BIN_DDABal bar, it reveals Node ID 1 with 24,770 observations, and the majority (73%) are also non-purchasers. The pop-up also reveals that the split for Saving Balance occurred at a value of approximately $1,550. The yellow Saving Balance bar reveals that we'll have a second split occurring on the saving balance. However, at this second tier of the icicle plot, we find Node ID 2 with the remaining 7,494 observations, and the majority (61%) are purchasers. The blue bars make it easy to identify the majority non-event nodes (non-purchasers), while the yellow bars represent the majority event nodes (purchasers).

We can continue to work our way down the icicle plot validating the same information we found on the Decision Tree. In fact, the two items are linked such that if you highlight a node in the decision tree, the same node is highlighted in the icicle plot. The icicle plot displays an interesting visualization that allows us to view the decision tree results from a slightly different angle.
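If you like to sanity-check these node-level numbers programmatically, the sketch below pulls the same kind of information (node size, majority class, split variable and threshold) out of a fitted scikit-learn tree. This is purely illustrative and not the SAS Visual Statistics implementation; the file name develop_final.csv and the use of only the numeric columns are my assumptions.

```python
# Illustrative sketch only (not SAS code): read the node-level facts an icicle
# plot displays from a scikit-learn decision tree fit to an assumed CSV export.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("develop_final.csv")                  # assumed export of the table
X = df.select_dtypes("number").drop(columns="Ins")     # numeric inputs only
y = df["Ins"]                                          # 1 = purchaser

tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=5).fit(X, y)
t = tree.tree_

for node in range(t.node_count):
    n = t.n_node_samples[node]
    majority = tree.classes_[t.value[node][0].argmax()]
    if t.children_left[node] == -1:                    # terminal (leaf) node
        print(f"Node {node}: leaf, n={n}, majority class={majority}")
    else:
        print(f"Node {node}: n={n}, majority class={majority}, "
              f"splits on {X.columns[t.feature[node]]} at {t.threshold[node]:.2f}")
```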
Variable Importance
Variable importance and variable importance plots can be very useful when trying to identify which variables have the greatest impact on a model's predictions. When working with models where an input can be included multiple times like decision trees or ensemble models such as random forests and gradient boosting machines, it can be difficult to decipher which inputs are most useful overall. Variable importance helps identify the most influential features and can include visualizations that provide ranking.
Examining the variable importance plot for our insurance data helps validate our understanding of the decision tree. Ranked at the top of the plot is Saving Balance. Since we remember that the first split in our tree is also based on Saving Balance, it is no surprise that it ranks near the top. Keep in mind that this is not always true; sometimes the most important variables might not be the ones near the top of the tree. In SAS Visual Statistics, the variable importance is an RSS-based variable importance measure. In other words, the variable importance measures are based on the change in the residual sum of squares when a split is found at a node. For our insurance data, we see a couple of "really important" inputs, followed by a few others that are less important, followed by inputs that don't even make it into the model. If you examine the details table, you will discover that inputs not included in the model have an importance value of 0.
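As a rough analogue of this idea, here is how an importance ranking could be read off the tree fit in the earlier sketch. Note that scikit-learn reports a mean-decrease-in-impurity importance rather than the RSS-based measure SAS Visual Statistics uses, so the values will differ, but the interpretation carries over: unused inputs score exactly 0 and the top-ranked inputs drive the splits.

```python
# Illustrative sketch, reusing `tree` and `X` from the previous example.
# scikit-learn's importance is impurity-based, not RSS-based, so treat this as
# an analogue of the plot described above rather than a replica.
import pandas as pd

importance = (pd.Series(tree.feature_importances_, index=X.columns)
                .sort_values(ascending=False))
print(importance)   # inputs never chosen for a split show an importance of 0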
Leaf Statistics
The final pieces of output that we will examine for decision trees involve frequency counts and percentages of the leaf nodes. They basically provide insights into the distribution of the data in these terminal nodes.
Of course, frequency counts represent the number of samples that end up in a particular leaf. With classification trees, they show how many instances belong to each class within a leaf. With regression trees, they indicate the number of samples or observations that contribute to the predicted value. When examining the count plot for our insurance data, it is clear that Node ID 9 contains the largest number of customers. The longer blue bar in that node indicates that the majority of customers in that node are non-purchasers (the non-event). Count values are available by mousing over the individual bars or by opening the details table.
Frequency count statistics in decision trees help assess the reliability of predictions. Leaves that contain higher sample counts tend to provide more stable and generalizable outputs. These counts can also assist in detecting things like class imbalances. In a world where building models that are fair and unbiased is important, examining frequency counts can ensure that decisions are not dominated by underrepresented classes. Finally, frequency counts can indicate weak nodes or overfit trees, which can be addressed with pruning.
Percentages (also known as proportions of samples in a leaf) represent the fraction of total samples that fall into a given leaf node. With classification trees, they indicate the proportion of each class (event or non-event) in a leaf. This makes class dominance easy to identify. With regression trees, percentages help assess how much of the data influences a particular prediction. The percentage plot for our insurance data reveals that 3 nodes (6, 18, and 30) are the "purest" nodes in the table, with Node ID 6 having the highest percentage (73%) of purchasers. Percentage values are available by mousing over the individual bars or by opening up the details table.
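To make the leaf-statistics idea concrete, the short sketch below (again reusing the tree, X, and y objects from the earlier example, and again purely illustrative) computes the same two quantities for every leaf: the frequency count and the percentage of purchasers.

```python
# Illustrative sketch: leaf-level frequency counts and purchaser percentages,
# the same facts summarized by the Leaf Statistics plots. Reuses `tree`, `X`,
# and `y` from the earlier example.
import pandas as pd

leaf_id = tree.apply(X)                                    # leaf node per row
leaf_stats = (pd.DataFrame({"leaf": leaf_id, "Ins": y})
                .groupby("leaf")["Ins"]
                .agg(count="size", pct_purchasers="mean"))
leaf_stats["pct_purchasers"] *= 100
print(leaf_stats.sort_values("count", ascending=False))
```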
Percentage statistics in decision trees are useful for many of the same reasons as were identified for frequency counts. They help assess the confidence and reliability of predictions, they aid in detecting class imbalances, and they assist in pruning and model evaluation.
Thanks for joining me in discussing the three remaining pieces of output that are available when building decision trees in SAS Visual Statistics. This completes our examination of building and interpreting a decision tree model in SAS Viya. In my next series of posts, we'll examine ensemble models that can be built with SAS Visual Statistics. We'll use the same annuity data to keep things consistent. If you are ready to learn more about decision trees, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Tree-Based Machine Learning Methods in SAS® Viya®. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- decision trees
- GEL
- Insurance Data
- SAS Visual Analytics
- sas visual statistics
- SAS Viya
- Supervised Classification
02-18-2025
01:28 PM
1 Like
Watch this Ask the Expert session to learn just how easy it is to build a predictive model from an automated explanation.
Watch the Webinar
You will learn how to:
Create an automated explanation in SAS Visual Analytics.
Interpret the results of an automated explanation.
Perform data exploration on financial services data.
The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.
Q&A
In what version of SAS Visual Analytics did automated explanations become available?
I checked the documentation and saw that an Automated Analysis was available starting in version 8.3 of Visual Analytics.
Is Visual Analytics part of the basic SAS package?
It depends on what version of SAS you have. However, I think the very basics of SAS Viya gives you access to things in addition to SAS Studio. You're going to have access to Visual Analytics. It's like a tiered package, but Visual Analytics, the automated explanation, and the automated predictions are all part of that base level of Viya. If you have access to Viya and you can open up Visual Analytics, you're going to be able to get to automated explanations.
How is association between explanatory variables shown?
If you open that details table, you'll see that there are one-level decision trees that are used to determine which of these are the most significant.
Could you talk about the dataset used and best practices for dataset building for automated explanation? For this example, is it aggregated at the customer level, meaning each record is one customer and then the stats in the record are based on that customer's history?
That's a good question. You want raw data. You don't want aggregated data for your automated explanation because it's going to be able to do the aggregation for you. The table that I use, the VS_bank table, it turns out that each row is a unique customer. So, we have over 1,000,000 rows in there, a million independent customers. If I had data that had multiple rows per customer, I would not want to do that aggregation myself. I would want to let the automated explanation do it.
Automated explanation function is not included in SAS EG, correct?
That would be nice, wouldn't it? No, I'm afraid it's not. I'm an old Enterprise Guide user, so I know a lot of you are SAS coders. I went over to the dark side. I started in coding, and I got into Enterprise Guide. That was my first venture into using point-and-click features and finding out how they can help me and sometimes allow me to get a report much more quickly than writing code. So, I'm totally with you: unfortunately, the automated explanation is not available to you from Enterprise Guide.
Is this model the best model with the best performance?
If you really want to start getting into model building, then what you'll want to do is produce your own models, like the logistic regression I created. We would start by building a logistic regression. Then we might choose two or three other models to consider if we want to stick with Visual Analytics. Let's say I build a logistic regression. I'm going to do an ensemble of trees like a forest, and I'm also going to do a neural network. I've got those three models here. In Visual Analytics, there is something called the model comparison object:
It will allow me to compare those three models using my favorite model fit statistic. Whether I want to use the KS statistic or the C statistic, it will share with me what the champion model is. Now, I will mention one other thing, which is that if you are really into building models and you want to build a whole bunch of models very quickly and then find the champion model, you might want to look at Model Studio. Model Studio allows you to build pipelines. Within those pipelines, you can have several different models, and you can build them very, very quickly. Sometimes the way that I like to talk about this is that if you want to get down and dirty into your own models and then start to pick your champion model, you may want to stick with these objects in Visual Analytics. If you don't really want to get too down and dirty into the models, but you basically just want to take a whole bunch of models, throw them at the wall, and see what sticks, Model Studio has some great features that will allow you to do that.
What other objects are available in SAS Visual Analytics?
That is a great question, and I will show you. If we scroll up to the top of the Objects pane, you can see that it's got tables of your data.
It's got different graphs, like bar charts, bubble plots, heat maps, and time series plots. There are also Geo maps if you've got geographic data. If you're into building dashboards where you want to put a whole bunch of different charts together and have them interact, it's got controls. It's also got these basic analytics that are available to you. It'll do some forecasting. The simplest version of SAS Viya will give you some information, but you don't have a lot of flexibility. Down here, we've got a list of statistical objects, so things like nonparametric logistic regressions. We also have some machine learning objects like forest and gradient boosting. So, there are all kinds of different objects available in Visual Analytics. Depending on what version of Viya you have, you're going to get more and more functionality.
Recommended Resources
Automated Explanation in SAS Visual Analytics
Please see additional resources in the attached slide deck.
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
- Find more articles tagged with:
- SAS Visual Analytics
- SAS Viya
01-28-2025
12:45 PM
1 Like
Getting Started
In today’s post, we’ll take a look at how to interpret the results of a decision tree built in SAS Viya. In my last post of this series, I showed you just how easy it was to build a decision tree in SAS Visual Statistics. I also discussed the origins of decision trees and some of the options available when building them. Moving forward we will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the various pieces of output from the decision tree that was built using variable annuity (insurance product) data.
Insurance Data
Remember, the business challenge is trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information and product usage captured over a three-month period. The target variable, named Ins, is binary.
Autotuning
Remember in my previous post, we examined the Summary Bar to find the model fit statistic KS (Youden) with a value of 0.4098. This statistic ranges from 0 to 1 with higher numbers indicating a better model. It's fair to ask the question "Can we do better?" In fact, there is a surprisingly easy feature that we can take advantage of. It is called autotuning. Autotuning is a process where a model's hyperparameters are automatically adjusted using algorithms to create several versions of the same model. These versions are then compared to find out which set of hyperparameters works best for that model. But wait a minute, what is a hyperparameter? A hyperparameter is simply one of the model options that can be fine-tuned in order to improve a model's performance. Unfortunately, optimal values for your hyperparameters cannot be calculated from the data. So, if you don't use autotuning, you are stuck with trial and error to try and find better values for your model.
A good example of a hyperparameter for decision trees is the number of levels in the tree. In other words, how many levels deep should we make our decision tree for the insurance data? Should it be a maximum of 6 levels deep (which is the default value)? Or could I improve the model's KS value by adding more levels or fewer levels to the tree? I could try building a whole bunch of decision trees with varying levels and compare them, but that would be so time consuming! This is where autotuning can help us out.
Selecting Autotune in the options pane of the decision tree object will open up the Autotuning Setup window.
The Autotuning Setup window basically allows us to do two things. It controls how the tuning should proceed and which hyperparameters should be autotuned. We don't have time to cover all these features in detail, but let's cover this window at a very high level. (The documentation has many more details.) The Search method is the algorithm that determines the hyperparameter values that are used in each iteration of the autotuning process. It defaults to the Genetic algorithm that is initially seeded from Latin hypercube samples. The Objective metric is the statistic that is used to compare the autotuned models against each other. For categorical response models, Misclassification rate is the default. For measure response models, Average square error is the default.
Maximum minutes is the maximum amount of time, in minutes, that the model training process runs. Maximum number of models is the maximum number of different models that are created during the autotuning process. You can open up the Hyperparameters section in this window to reveal that the parameters autotuned for decision trees are the following: maximum levels, leaf size, and predictor bins. Remember, we already covered maximum levels and leaf size in the last post. The Predictor bins option specifies the number of bins used to categorize every predictor that is a measure variable. The default is 50. Adding more bins increases model complexity because the decision tree has more potential splits to consider. Increasing the number of bins can also lead to finer granularity in capturing patterns as well as potential overfitting.
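For readers who think in code, the sketch below mimics the spirit of autotuning with a randomized search over analogous hyperparameters (tree depth and leaf size) in scikit-learn. It is not the SAS genetic/Latin-hypercube algorithm, and the file and column names are assumptions; it simply illustrates the idea of trying many hyperparameter combinations, scoring each candidate, and keeping the best one.

```python
# Illustrative sketch only: a randomized hyperparameter search, a simpler
# stand-in for SAS autotuning (which uses a genetic algorithm seeded from
# Latin hypercube samples). File and column names are assumptions.
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("develop_final.csv")
X = df.select_dtypes("number").drop(columns="Ins")
y = df["Ins"]

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": list(range(2, 21)),          # tree depth
                         "min_samples_leaf": list(range(5, 301))}, # leaf size
    n_iter=50,                 # cap on the number of candidate models
    scoring="accuracy",        # i.e., 1 minus the misclassification rate
    cv=5,
    random_state=1,
).fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```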
When I perform autotuning on our insurance data, I get the following results.
The Maximum levels is tuned from 2 to 12. The Leaf size is increased from 6 to 231. The Predictor bins is tuned from 50 to 155. The great news is that as a result of autotuning, the KS statistic also increased from 0.4098 to 0.4380.
Decision Tree Results
You may remember from my last post we ended by covering the following results at a very high level:
The summary bar across the top of the page.
The Tree Window containing both the Decision Tree and the Icicle Plot.
The Decision Tree, which is an interactive, navigational tree-map.
The Icicle Plot revealing a detailed hierarchical breakdown of the tree data.
The Variable Importance Plot displaying the importance of each variable.
The Confusion Matrix revealing the correct and incorrect classifications.
We want to take a deep dive into each of these items, but let's make it a little easier on ourselves by taking advantage of a Model Display option. We've done this before, and it's easy to just open the Options pane of the Decision Tree and scroll down to the Model Display options. Under the General category, change the Plot layout from Fit (default) to Stack. This model display option specifies how the subplots are displayed on the canvas. By default, we see all the output subplots shown together on one page. We can enhance viewability by changing the Plot layout to Stack such that each subplot fills the canvas. Using this option, a control bar enables you to move between subplots.
Summary Bar
Examining the Summary Bar at the top of the canvas lets us know several things. We have created a decision tree on the target variable Ins. Our model has chosen an event level of 1, which means our model is designed to predict those customers that purchase an annuity. After autotuning, the default model fit statistic KS (Youden) now has a value of 0.4380. And there were some 32K observations used in the building of this model.
Decision Tree (tree-map)
This interactive and navigational decision tree displays the node statistics and the node rules. To navigate the decision tree easily, you can use the mouse. Click the mouse button and hold it down anywhere in the Tree window to move the decision tree within the window. Scroll to zoom in and out of the decision tree. Scroll up to zoom in and scroll down to zoom out. The zoom is always centered on the position of the mouse pointer.
We start with the tree completely zoomed out, so we have the least amount of detail. The color of the node in the tree-map indicates the predicted level for that node. It is indicative of the event level that has the most observations in the node. We can see that we have a mixture of nodes that are primarily purchasers (yellow) and primarily non-purchasers (blue). We can also see that the first split (the root node at the top of the tree) is based on Saving Balance. The first split in a decision tree plays a pivotal role in shaping the tree's structure, performance, and ability to generalize. It sets the stage for the entire model's accuracy and efficiency. Changes near the top of the tree typically cascade down through the remainder of the tree. Let's zoom in on one of the terminal nodes and see what kind of detail is revealed.
Zooming in to and selecting Node ID 8 reveals an amazing amount of detail about these 1,491 customers. First, I can see that they have just over a 65% predicted probability of being purchasers. Second, I can see the tree path or rules that were followed to create this node. The BIN_DDABal (binned checking balance) has to be 2, 9, or 10. The customer must own a Certificate of Deposit, and their Saving Balance must be less than around $1,550 or missing.
In addition to examining the nodes and rules of the tree-map, there are more features available in the Tree window. You can derive a leaf ID variable. This action creates a category variable that contains the leaf ID for each observation. You can use this variable in other objects throughout SAS Visual Analytics. You can also derive predicted values. Depending on the type of response variable, new variables will be created including predicted values, probability values, and prediction cutoffs. To derive predicted values, right-click in the Tree window, and select Derive predicted.
Interactive Mode
Let's suppose that as a data scientist, you happen to have some business knowledge and you want to modify the design of the decision tree. Just like a gardener would tend to a tree, in Interactive Mode you can tend to (redesign) the decision tree. To enter Interactive Mode, you can right-click on the decision tree and select the task. To split a leaf node, right-click on a node and select Split (for leaf nodes) or Edit split (for non-leaf nodes). In the Split Node window, you select the variable that is used to split the node. To train a leaf node, right-click on a leaf node and select Train. In the Train Node window, you can specify the variables and the maximum depth of training to train a node. To prune a node, right-click on an interior node and select Prune to prune the decision tree at that node. This removes all nodes beneath the selected node and turns that node into a leaf node. There is plenty more functionality available to you in Interactive Mode. I encourage you to check out the documentation.
Conclusion
We’ve continued our journey into supervised classification by autotuning a decision tree and beginning to interpret the output in SAS Visual Statistics. As we continue to develop models in the AI and Analytics lifecycle, we will witness even more interesting features and options. In my next post, I’ll finish covering the output results for this decision tree. If you are ready to learn more about decision trees, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Tree-Based Machine Learning Methods in SAS® Viya®. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- AUTOTUNING
- decision trees
- GEL
- Insurance Data
- SAS Visual Analytics
- sas visual statistics
- SAS Viya
- Supervised Classification
01-06-2025
04:41 PM
1 Like
Getting Started
In today’s post, I'll show you how to create decision trees in SAS Visual Statistics. We'll see just how easy it can be to apply this versatile and well-known technique in SAS Viya. This post is my final installment for the year in my comprehensive review of data-driven analytics. We will continue to focus in on the part of the AI and Analytics lifecycle that involves developing robust models using best practices. In previous posts I've discussed the many types of classification techniques including supervised and unsupervised methods. In addition, I've already demonstrated clustering and logistic regression. Today, we will learn how to create, interpret, and optimize decision trees in SAS Visual Statistics.
Introduction
Decision trees are one of the most versatile tools in predictive modeling, offering intuitive insights into complex datasets. SAS Visual Statistics in SAS Viya (a powerful in-memory analytics platform) simplifies the process of building and analyzing decision trees. Whether you're a data scientist, business analyst, or beginner exploring machine learning, SAS Visual Statistics provides an interactive and efficient way to build robust decision trees.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
History
Decision trees, a foundational tool in data science and machine learning, have a rich history rooted in statistical research and decision theory. Their origins trace back to the mid-20th century when statisticians began exploring algorithms in order to simplify decision-making. The concept was heavily influenced by Claude Shannon’s Information Theory, which introduced both measures of entropy and information gain. These are key principles that are still used in modern decision tree algorithms. Early decision tree applications focused on optimizing decision-making in uncertain environments. They eventually found their way into fields like economics, operations research, and artificial intelligence. Over time, as these algorithms became more sophisticated, they evolved from simple binary splits to more complex techniques that could handle both multi-class problems and continuous variables.
By the 1980s, decision trees gained importance with the development of ID3 (Iterative Dichotomiser 3) by Ross Quinlan. ID3 is an algorithm designed to construct trees efficiently by selecting attributes that maximize information gain. Eventually, more robust versions like C4.5 and CART (Classification and Regression Trees) introduced pruning techniques to reduce overfitting and improve model generalization. As machine learning moved to the forefront of business analytics, decision trees became a core building block for ensemble methods like Random Forests and Gradient Boosted Trees, further boosting their predictive power. Today, decision trees remain an essential part of data science, valued for their interpretability, efficiency, and versatility in solving both classification and regression problems.
Insurance Data
As we've done in earlier posts, let's assume that we are a data scientist attempting to use variable annuity (insurance product) data. The table named develop_final is used to identify customers who are likely to respond to a variable annuity marketing campaign. The develop_final table contains just over 32,000 banking customers and input variables that reflect both demographic information as well as product usage captured over a three-month period. The target variable is Ins which is a binary variable. For Ins, a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not. Please take note that I have performed some data clean-up (including binning, transforming, and imputation) and variable selection (using variance explained) so that we are ready to build supervised models. If you’re interested in seeing some of those data cleansing techniques performed on the develop table, please see Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio.
Building a Decision Tree in SAS Visual Statistics
From SAS Drive, we open the Visual Analytics web application by clicking the Applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click on New report. In the left-hand Data pane, select Add data. Find and select the DEVELOP_FINAL table and then Add.
With the data already cleaned and prepped for model building, we are ready to create our decision tree. On the left, we’ll change from the Data pane to the Objects pane by selecting Objects. From this list we can scroll down until we find the list of Statistics objects. From there we can either double-click or drag-and-drop the Decision Tree object onto the first page of the report.
Next click on the Data Roles pane on the right. Assign Ins as the Response and all 36 remaining explanatory variables as Predictors.
Decision Tree Options
Select the Options pane on the right. Before we examine the output for this decision tree object, let’s note that there are many options that are available for building decision trees. The developers at SAS carefully select default values that almost always work well with both the software and your data. Let's examine just a few of those default options for decision trees.
The Maximum branches option specifies the maximum number of branches for each node split. By default, you will construct a tree with just two-way splits. SAS Visual Statistics will allow you to have up to 10 branches per split. Two-way splits yield flexibility and are computationally efficient and widely used. They tend to build deeper trees with simpler splits. Multi-way splits yield shallower trees that are more computationally expensive and at risk for overfitting. They are best reserved for categorical inputs that have multiple distinct levels.
The Maximum levels option specifies the maximum number of levels in the tree. By default, your decision tree will have no more than 6 levels. SAS Visual Statistics will allow you to build trees up to 20 levels deep. Trees with fewer levels tend to be simpler, easier to interpret, and less prone to overfitting. They may underfit complex data by oversimplifying relationships. Trees with more levels can capture intricate patterns and handle complex datasets well. They are often harder to interpret, computationally expensive, and prone to overfitting. Shallow trees generalize better for either small or noisy datasets, while deep trees excel in larger, detailed datasets. Choosing a good depth requires balancing the tradeoff between bias (shallow trees) and variance (deep trees). For more discussion of the bias-variance trade-off, check out the excellent and comprehensive Statistics You Need to Know for Machine Learning.
The Leaf size option specifies the minimum number of cases that are allowed in a leaf node (also known as a terminal node). By default, the minimum number of observations for an individual leaf is 5. Decision trees with smaller leaf sizes tend to be more detailed and capture finer patterns. While this can improve accuracy on training data, it increases the risk of overfitting. Trees with larger leaf sizes can generalize better (less prone to overfitting) and are more computationally efficient. Of course, this comes at the risk of underfitting by oversimplifying the data. In general, smaller leaf size favors complex models and larger leaf size favors simpler models.
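If it helps to see these options side by side in code, here is a hedged sketch of their closest scikit-learn analogues: scikit-learn trees always use two-way splits (matching the SAS default for Maximum branches), Maximum levels maps roughly to max_depth, and Leaf size maps roughly to min_samples_leaf. The file and column names are assumptions, and this is an analogue rather than the SAS implementation.

```python
# Illustrative sketch only: rough scikit-learn analogues of the options above.
# Two-way splits are implicit; depth and leaf size are set to the SAS defaults.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("develop_final.csv")                  # assumed export
X = df.select_dtypes("number").drop(columns="Ins")
y = df["Ins"]

tree = DecisionTreeClassifier(
    max_depth=6,           # analogue of Maximum levels (default 6)
    min_samples_leaf=5,    # analogue of Leaf size (default 5)
).fit(X, y)

print(f"depth = {tree.get_depth()}, leaves = {tree.get_n_leaves()}")
```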
Decision Tree Results
While there are other options that are available for investigation, it is time to examine the output for our decision tree based on the annuity data.
Examining the Summary Bar at the top of the canvas reveals a few things. We have created a decision tree based on the target variable Ins. Our model has chosen an event level of 1, which means our model is designed to predict those customers that purchase an annuity. The default model fit statistic is KS (Youden) with a value of 0.4098. This statistic ranges from 0 to 1, with higher numbers indicating a better model. And there were 32K observations used in the building of this model.
Underneath the summary bar there are decision tree results including the Tree Window, Icicle Plot, Variable Importance, and Confusion Matrix. Actually, the Tree Window contains both the Tree and the Icicle Plot. Let’s define all of these at a very high level and save the details for my next post. The Tree is a tree-map, which is a navigational tool we can use to interactively investigate the node statistics and node rules. The Icicle Plot gives us a detailed hierarchical breakdown of the tree data, starting with the root node on top. The Variable Importance plot shows us the importance of each effect in the tree. And finally, the Confusion Matrix displays a summary of both the correct and incorrect classifications for both the “event” and the “non-event.” It helps to define the performance of a classification model such as a decision tree.
Conclusion
We’ve continued our journey into supervised classification by creating a decision tree using all the default values in SAS Visual Statistics. As we continue to develop models in the AI and Analytics lifecycle, we will continue to witness even more interesting techniques. In my next post, I’ll cover in detail the output results along with their interpretation for this decision tree. If you are ready to learn more about decision trees, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Tree-Based Machine Learning Methods in SAS® Viya®. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- decision trees
- GEL
- Insurance Data
- SAS Visual Analytics
- sas visual statistics
- SAS Viya
- Supervised Classification
10-28-2024
03:54 PM
In today’s post, we'll finish our assessment of a logistic regression model built in SAS Viya by examining lift and ROC charts. In my previous post of this series, we began our assessment of a logistic model in SAS Visual Statistics. We examined the confusion matrix, the misclassification plot and the cutoff plot. We want to cap off this discussion by returning to the set of outputs from that logistic regression model known as assessment plots and focusing on lift and ROC charts. We will continue to focus on the part of the AI and Analytics Lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the remaining pieces of output from the logistic regression model that was built using variable annuity (insurance product) data.
Let's keep in mind that the business challenge is trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. It's very likely during the AI and Analytics Lifecycle that we will build more than just a logistic regression model. In fact, in my next post, we will build a decision tree with the same annuity data. With multiple models, how will we decide which is the "best" model? This is where model assessment comes into the picture, including lift and ROC charts. These assessment charts and others will help us evaluate model performance along with aiding us in selecting the best competing model that meets our business goals.
Let's begin by examining the lift chart. First, a little history on lift charts. Lift charts were developed as a practical tool for evaluating predictive models, especially in fields like direct marketing, where businesses needed to identify high-value customers for targeted campaigns. Originating in the 1980s and 1990s, they addressed the need to visualize how well models could prioritize likely responders compared to random selection. Traditional metrics like accuracy were available, but accuracy doesn't fully capture this concept of prioritization. Lift charts plot the improvement of a model over random selection, helping to show the added value of targeting top segments of the model's predictions. As data mining and machine learning advanced, lift charts became a standard tool in model evaluation. They offer clear insights into a model's effectiveness in various applications and have the benefit of being a visualization tool. To summarize, the lift chart is a graphical representation of the advantage (or lift) of using a predictive model to improve upon the target response versus not using a model at all.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
To create a lift chart of our annuity results, we rank-order our customers based on their likelihood of a positive outcome or purchase. Next, divide the ranked data into equal-sized groups. In our chart we will use percentiles. For each group, the cumulative percentage of actual positive outcomes is calculated and compared to the baseline percentage expected from random selection. Finally, chart the cumulative percentages with the x-axis representing the percentage of the population targeted and the y-axis showing the cumulative lift. This allows us to see how effectively the model outperforms random selection. And this makes sense as an assessment statistic because if our model cannot out-perform finding purchasers at random, then why bother using the model?
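As a concrete illustration of that recipe, the sketch below computes cumulative lift from two assumed inputs: the actual Ins values and the model's predicted purchase probabilities for each customer (for example, from a scored copy of develop_final). This is not SAS code, just the calculation spelled out.

```python
# Illustrative sketch of the cumulative-lift calculation described above.
# `y_true` holds the actual Ins values and `p_hat` the predicted probabilities
# of purchase; both are assumed to come from a scored copy of develop_final.
import numpy as np
import pandas as pd

def cumulative_lift(y_true, p_hat, n_bins=100):
    ranked = (pd.DataFrame({"y": y_true, "p": p_hat})
                .sort_values("p", ascending=False)        # rank-order by score
                .reset_index(drop=True))
    base_rate = ranked["y"].mean()                         # random-selection rate
    row = np.arange(len(ranked)) + 1
    depth = row / len(ranked)                              # fraction targeted
    cum_lift = (ranked["y"].cumsum() / row) / base_rate    # lift at each depth
    # report the curve at n_bins equally spaced depths (percentiles)
    idx = (np.linspace(1 / n_bins, 1.0, n_bins) * len(ranked)).astype(int) - 1
    return pd.DataFrame({"depth": depth[idx],
                         "cumulative_lift": cum_lift.iloc[idx].values})

# Example call with hypothetical column names from a scored table:
# lift_table = cumulative_lift(scored["Ins"], scored["p_Ins1"])
```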
Let's examine the lift chart from our logistic regression model.
The baseline model is not actually plotted on the chart, but it is easy to visualize. It is the flat horizontal line at a cumulative lift value of 1.0 for all percentiles. That baseline model reflects the behavior we would see if we just went in and randomly guessed who the purchasers were without the help of a model. In other words, if we were to target 10% of the population, we would expect to find 10% of the purchasers. The lift value of 1 serves as a benchmark and reflects no advantage from using a predictive model like a logistic regression. The blue line represents the performance of our logistic regression model. Higher lift (especially at the lower percentiles) is better. We could actually compare the lift line of one model to the lift line of another model on the same chart. The model with the higher lift would be the better performing model. We will discuss exactly why in just a minute. The other line plotted here is the yellow line, and it represents the best model achievable, or a perfect classifier. It can be useful to us because it shows us the possibilities of performance if we had a stronger model. Think of it as the upper limit of where the blue line could reach at each of the percentiles.
Let's focus in on just one piece of the lift chart to further explain how this plot may be used.
If we mouse over the blue line at the 5th percentile, we get information about the model performance. The logistic regression model has a lift of approximately 2.25 at this percentile. Another way to think about this is as follows: if we were to contact the top 5 percent of our customers, we are over twice as likely to reach a responder versus just picking customers at random. Not bad at all! If we were to compare our logistic regression model against another model that had a higher lift at the 5th percentile, we would consider the other model to be the better performer. Since the data has been rank-ordered by likelihood of responding, the lower percentiles are more meaningful than the higher percentiles. At this point it should be clear that, all other things being equal, higher lift is preferable. One final note on lift charts: since the lift calculations do not depend at all on a model's cutoff value, lift charts are unaffected by a change in cutoff value.
Now let's focus on the ROC or receiver operating characteristic chart. The history of ROC charts is pretty fascinating. These charts were originally developed during World War II (in the 1940s) to identify true signals (versus noise) in radar data. This allowed radars to be designed to detect an enemy aircraft, ship, or missile against the real-world background noise of clouds, birds, or other non-threatening equipment. The ROC curve provides a method to evaluate the trade-off between true positive rates (correctly identifying a signal) and false positive rates (incorrectly identifying noise as a signal). A ROC analysis can help optimize a radar's sensitivity to maximize real threat detection and minimize false alarms. Later, in the 1960s and 1970s, ROC charts were adapted for the medical field. As an example, diagnostic tests could be evaluated on their ability to correctly detect disease while avoiding false alarms. In the 1980s and beyond, the ROC curve became a well-used tool for evaluating binary classification models.
Since the ROC or receiver operating characteristic chart is a plot of the True Positive Rate against the False Positive Rate, you can read this previous post to review the basic definitions of true and false positive counts as well as true and false negative counts. The True Positive Rate (TPR) is defined as the number of True Positives divided by the total of both the True Positives and the False Negatives. TPR is also known as sensitivity or recall. You can think of it as the proportion of actual positives that were correctly identified by the model. The False Positive Rate or FPR is defined by the number of False Positives divided by the total of both the False Positives and the True Negatives. It is also known as 1 - specificity, where specificity is also known as the True Negative Rate. Think of FPR as the proportion of actual negatives that were incorrectly classified as positives.
To create a ROC chart of our annuity results, we calculate both the TPR and the FPR at each cutoff value (over the entire range from 0 to 1). The False Positive Rate is plotted on the x-axis and the True Positive Rate is plotted on the y-axis for each cutoff value. This typically results in a curve starting at (0,0) and ending at (1,1). Let's go ahead and look at the ROC chart from the logistic regression.
The blue line represents the performance of our logistic regression model. You can think of the ROC chart as a representation of how well our model is avoiding misclassifications for the "events" or with our data, purchasers. The "bigger" the curve, the "better" the model. In the ideal world, the curve would stretch out to reach the upper left-hand corner of the graph at (0,1). In fact, the perfect classifier would start at the top left corner (0,1) and continue as a horizontal line until the top right corner of (1,1). The blue curve would completely fill up the upper left corner of the chart, representing a TPR of 1 and FPR of 0 for all cutoff values. The diagonal dashed line plotted from (0,0) to (1,1) represents a random classifier and should be considered for a baseline comparison. A nice summary statistic that is used to assess a model's performance is the AUC or Area Under the Curve. The AUC (also known as the c-statistic or concordance statistic for binary models) typically ranges from .5 (indicating random guessing) to 1.0 (indicating the perfect classifier). Thus, for AUC, higher is better.
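For those who want to see the mechanics, the hedged sketch below computes the ROC curve points and the AUC with scikit-learn, given the same two assumed inputs as before: the actual Ins values (y_true) and the predicted purchase probabilities (p_hat). The SAS chart is produced for you; this only illustrates the underlying calculation.

```python
# Illustrative sketch: ROC curve points and AUC for scored data. `y_true`
# (actual Ins) and `p_hat` (predicted probability of Ins = 1) are assumed to
# exist, e.g. from a scored copy of develop_final.
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_true, p_hat)   # one (FPR, TPR) per cutoff
auc = roc_auc_score(y_true, p_hat)                # area under that curve
print(f"AUC (c-statistic): {auc:.3f}")
```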
You also might have noticed a dashed blue vertical line reaching from the baseline model to the model's curve.
This vertical line represents the maximum vertical difference between the baseline model curve and the model's ROC curve (in our case, the logistic regression). This distance represents Youden's Index and corresponds to the optimal balance between sensitivity and specificity. From this one point on the curve, we are getting two good pieces of information. We can use Youden's (not unlike the Kolmogorov-Smirnov) statistic to compare models. Of course, higher is better with this index. And we are also getting a suggested cutoff value. In this example, a cutoff of 0.32 is where the model achieves the best trade-off between correctly identifying positive cases (high TPR) and correctly rejecting negative cases (high TNR). Just in case you were wondering, the KS statistic and Youden's Index both measure a model's ability to distinguish between two classes, but they are slightly different. They focus on different aspects of the model's performance. Youden's is calculated from the sensitivity and specificity, while KS is calculated from the cumulative distribution functions of predicted scores. One final note on the ROC chart is that (like the lift chart) it is unaffected by a change in the cutoff value because it already contains the entire range of cutoff values.
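Continuing the small ROC sketch above, Youden's index at each cutoff is simply TPR minus FPR, and the suggested cutoff is wherever that gap is largest. The few lines below make that explicit; again, this is an illustration rather than the SAS computation.

```python
# Illustrative sketch, continuing from the roc_curve call above: Youden's index
# is the largest vertical gap between the ROC curve and the diagonal, and the
# threshold where it occurs is the suggested cutoff.
import numpy as np

j = tpr - fpr                                       # Youden's J at every cutoff
best = int(np.argmax(j))
print(f"Youden's index = {j[best]:.3f} at a cutoff of {thresholds[best]:.2f}")
```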
We’ve done a great job today of discussing the remaining two of the five assessment plots that are available for a logistic regression model built in SAS Visual Statistics. And that completes our examination of building and interpreting a logistic regression model in SAS Viya. In my next post we’ll finish up with investigating categorical targets and look at the decision tree model. We'll use the same annuity data to keep things consistent and learn all about the commonly used and easily interpretable decision tree.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
- Lift chart
- logistic
- Model Assessment
- Model Comparison
- regression
- ROC chart
- SAS Viya
- visual analytics
- Visual Statistics
10-14-2024
11:57 AM
There is a link to the repository under “Additional Information”.
FYI, the dataset list and folder names were inherited from the previous version. Some listed folder names are incorrect, and some datasets are no longer available. The updates are currently being made.
10-07-2024
04:50 PM
In today’s post, we'll take a look at how to assess a logistic regression model built in SAS Viya. In my third and fourth post of this series, I showed you just how easy it was to build and interpret a logistic model in SAS Visual Statistics. We want to cap off this discussion by returning to a set of outputs from that logistic regression model known as assessment plots. We will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the remaining pieces of output from the logistic regression model that was built using variable annuity (insurance product) data.
Remember, the business challenge is trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information and product usage captured over a three-month period. The target variable, named Ins, is binary.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
Since we’ve already created a logistic regression model and begun to examine the output, let’s dive right into exploring the various plots of assessment that Visual Statistics provides to us. You may recall from my last post that we ended by giving a high-level overview of the Confusion Matrix. The confusion matrix reveals the correct and incorrect classifications of the model based on a .50 cutoff probability. This cutoff value is the default for SAS Visual Statistics.
The confusion matrix is computed by comparing our predictions to the actual target values. True positives are events that the model correctly predicted to be events (in our data, purchasers classified as purchasers). True negatives are non-events that the model correctly predicted to be non-events. False positives are non-events that the model incorrectly predicted to be events (in our data, a non-purchaser in the original data classified as a purchaser by our model), and false negatives are events the model incorrectly predicted to be non-events.
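To make those four definitions concrete, here is a hedged sketch that tabulates them at the default .50 cutoff. It assumes two inputs, the actual Ins values (y_true) and the predicted purchase probabilities (p_hat), for example from a scored copy of develop_final; it is not the SAS implementation.

```python
# Illustrative sketch: the confusion matrix at a 0.50 cutoff. `y_true` (actual
# Ins) and `p_hat` (predicted probability of Ins = 1) are assumed to exist.
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = (np.asarray(p_hat) >= 0.5).astype(int)            # classify at cutoff
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
print(f"sensitivity = {tp / (tp + fn):.1%}   specificity = {tn / (tn + fp):.1%}")
```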
As we discussed in the previous post, it appears that this model is doing a much better job of correctly classifying a non-event (or non-purchaser in our case) as opposed to correctly classifying an event. Approximately 88% of non-purchasers are correctly classified by our logistic regression, while only about 44% of the purchasers are correctly classified. Does this mean that our model is “bad”? Or perhaps, could there be something else affecting these classifications? Before we answer that question directly, let's look at another assessment plot which is very closely related.
Right-click on the confusion matrix to discover there are a total of 5 assessment plots:
Confusion Matrix
Lift
ROC
Cutoff plot
Misclassification
Let's select the Misclassification plot to examine the results.
The Misclassification plot displays how many observations were correctly and incorrectly classified for each value of the response variable. Honestly, this is the same information contained in the confusion matrix, only here it is grouped by value of the response variable. It is clearly visible by examining the first bar (the bar representing a target value of 1) that there are more incorrect classifications (approximately 6000) than there are correct classifications (approximately 4000). Another way to think of that is the yellow portion of that bar is larger than the blue portion. If we look at the target value of 0, we can see that the blue portion of the bar is much larger than the yellow. The Misclassification plot is based on the exact same values as the Confusion Matrix we saw above. So, it should be no surprise we see further validation that this logistic regression model is doing a much better job of correctly identifying the non-purchasing customers rather than the purchasing customers.
Did you know that both of the plots we are discussing are highly sensitive to the cutoff probability? That's right. If we change the default cutoff value of .50 to something higher or lower, the frequency counts of the true and false positives and negatives will most likely change as well. And this raises the question: can we change the cutoff value to improve the performance of this model? And how would we define or quantify that improvement? Would we settle for just improving the number of true positives (identifying a purchaser as a purchaser)? But what if that causes us to have fewer true negatives? The answer lies in the business objective that is stated for the analysis. In our case, remember we are trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. Another way of thinking about this is that we are trying to maximize the total number of "true" or correct classifications. Whether that classification is negative or positive may not be as important in this scenario. We want to find a model that is going to identify a purchaser as a purchaser and a non-purchaser as a non-purchaser. A great way to find that optimal cutoff value is to examine the Cutoff plot.
The Cutoff plot enables you to visualize how the cutoff value affects model performance. On the x-axis, the full range of cutoff values is plotted from 0 to 1. The cutoff value is specified in the Prediction cutoff option and is represented by the vertical line in the plot. You can drag the vertical line to adjust the cutoff value, which reassesses the model. The cutoff value is currently set at the default of .50. Remember if we were to move this cutoff all the way down to 0, then all rows of the data would be classified as an "event" or a value of 1 or in our case, a purchaser. That also means that we would capture 100% of the purchasers correctly, but we would not correctly identify any of the non-purchasers. In the other extreme, we could move the cutoff all the way up to 1. In that scenario, we correctly capture all of the non-purchasers, but none of the purchasers. So, where should we set the cutoff? The chart can give us a clue.
Let's also note that on the y-axis of the Cutoff plot, we have three different statistics represented by lines. The Accuracy rate is the total number of correct predictions divided by the total number of predictions. Sensitivity (also known as recall or the true positive rate) is a measure of the model's ability to correctly identify the positive class, or the 1's. Specificity (also known as the true negative rate) is a measure of the model's ability to correctly identify the negative class, or the 0's. As the cutoff moves from 0 to 1, you can see that the sensitivity is always decreasing and the specificity is always increasing. In our case, we'd like to maximize both the sensitivity and the specificity, which is reflected in their intersection on this plot. This strategy works well for our business objective and gives equal importance to both. This method of selecting a cutoff works when both error types are equally costly.
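Here is a quick sketch of that logic, using the same assumed y_true and p_hat vectors as earlier: compute accuracy, sensitivity, and specificity over a grid of cutoffs and look for where sensitivity and specificity meet. This only illustrates the idea behind the Cutoff plot, not how SAS draws it.

```python
# Illustrative sketch: sweep the cutoff from 0 to 1, track accuracy,
# sensitivity, and specificity, and report where sensitivity and specificity
# cross. `y_true` and `p_hat` are assumed to exist as before.
import numpy as np

y = np.asarray(y_true)
p = np.asarray(p_hat)
cutoffs = np.linspace(0, 1, 101)
sens, spec, acc = [], [], []
for c in cutoffs:
    pred = (p >= c).astype(int)
    sens.append(np.mean(pred[y == 1] == 1))    # true positive rate
    spec.append(np.mean(pred[y == 0] == 0))    # true negative rate
    acc.append(np.mean(pred == y))             # overall accuracy

crossing = cutoffs[np.argmin(np.abs(np.array(sens) - np.array(spec)))]
print(f"sensitivity and specificity intersect near a cutoff of {crossing:.2f}")
```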
Dragging the vertical line to the intersection of sensitivity and specificity changes the cutoff value from .50 to .34. Since both the Confusion Matrix and the Misclassification plot are affected by this change in cutoff value, it will be interesting to see if they look "better" than they did originally.
Examining the Confusion Matrix with the updated cutoff reveals that the diagonals are similar in color. Remember, a darker color indicates a higher proportion of the value in that cell relative to the number of observed values for that level (or row). Previously, the cell of true negatives was much darker than the cell of true positives. Here, we're seeing a very nice balance. In other words, we see a similar proportion in both of those diagonal cells. This is what we were aiming for when updating the cutoff. We wanted to increase the number of true positives, but not at the expense of the true negatives.
We see a similar improvement in the Misclassification plot. Previously, the first bar (the bar representing a target value of 1) revealed more incorrect classifications than correct classifications. Now, the number of correct classifications has improved. Examining both of the bars at the same time, we see a relatively proportional number of correct classifications. In the previous plot, the proportions were imbalanced, with the second bar holding a much larger share of the correct classifications.
In general, when trying to select an appropriate cutoff value, you can start by examining the percentage of events (also known as the base rate) in the original population of the target variable. In our sample of data in the develop_final table, the percentage of events (of purchasers) is approximately 35%. Many consider the percentage of events in the original population to be an excellent starting point for selecting a cutoff; a quick way to check the base rate in code is sketched after the list below. By examining the Cutoff plot from the Visual Statistics logistic regression model, we ended up selecting a cutoff of 34%, which is nearly identical. This should be an indication that we are in the right neighborhood. There is always time for further fine-tuning of the model and the cutoff. Choosing a probability cutoff level will always depend on the specific goals of the analysis, the costs of classification errors, and the characteristics of the data. Other approaches that were not discussed in this post include:
Cost of Errors
Using the ROC (Receiver Operating Characteristic) Curve
Precision-Recall Tradeoff
Maximizing F1 Score
Cross-Validation
Maximizing Profit/Utility
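As mentioned just before this list, the base rate is an excellent first guess for a cutoff. A quick check of it in code might look like the following sketch, assuming develop_final is available as a SAS data set.

proc freq data=develop_final;
   tables Ins;   /* the percentage of rows where Ins = 1 is the base rate (roughly 35% here) */
run;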
Interestingly enough, the final two assessment plots remain unaffected by a changing cutoff. The ROC chart actually includes the full range of cutoff values, while the Lift chart is created independently of the cutoff.
To see those final two pieces of assessment, you’ll need to read my next post! We’ve done a great job today of discussing three of the five assessment plots that are available for a logistic regression built in SAS Visual Statistics. But we still need to discuss the ROC chart and the Lift Chart. In my next post we’ll finish up with logistic regression by discussing these two plots. We want to chat about how, when, and where they can be useful when assessing models. Thank you for continuing your journey with me of developing models in the AI and Analytics lifecycle. As I’ve mentioned before, if you are ready to learn more about logistic regression, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Predictive Modeling Using Logistic Regression. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- cutoff
- GEL
- logistic
- misclassification
- regression
- SAS Viya
- visual analytics
- Visual Statistics
09-19-2024
02:49 PM
3 Likes
In today’s post, we'll take a look at how to interpret the results of a logistic regression model built in SAS Viya. In my third post of this series, I showed you just how easy it was to build a logistic model in SAS Visual Statistics. I also discussed the origins of regression models along with the details of logistic regression. Moving forward we will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the various pieces of output from the logistic regression model that was built using variable annuity (insurance product) data.
Remember, the business challenge is trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information as well as product usage captured over a three-month period. The target variable is named Ins which is a binary variable.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
Since we’ve already created a logistic regression model, let’s dive right into exploring the various pieces of output that Visual Statistics provides us. You may recall from my last post that we ended by giving a high-level overview of the following:
The summary bar across the top of the page.
The Fit Summary exhibiting the importance of the input variables.
The Odds Ratio Plot displaying odds ratio estimates.
The Residual Plot showing the residual of each observation.
The Confusion Matrix revealing the correct and incorrect classifications.
We want to take a deep dive into each of these items, but let’s make it a little easier on ourselves by taking advantage of two options. First, let’s open the Options pane of the logistic regression and scroll down to the Model Display options. Under the General category, change the Plot layout from Fit(default) to Stack. This model display option specifies how the subplots are displayed on the canvas. By default, we see all the output subplots shown together on one page. We can enhance viewability by changing the Plot layout to Stack such that each subplot fills the canvas. Using this option, a control bar enables you to move between subplots.
Second, scroll back up to the Logistic Regression options. Under the General category, let's change the Variable selection method from the default of None to the Fast backward method. Keep the default Significance level of 0.01. You might recall that we built this logistic regression model with a total of 37 explanatory variables. There are many reasons we want to avoid including too many effects in this model, including the possibility of overfitting, multicollinearity, and loss of interpretability. None of these are desirable traits. Parsimonious models, which achieve a good balance between simplicity and accuracy, are generally preferred. As I like to say: "simpler is better." Using the fast backward method now will most likely reduce the number of input variables. Fast backward is a technique that uses a numeric shortcut to compute each selection iteration more quickly than the regular backward method. Any effects that do not meet our significance level of 0.01 will now be removed from the model.
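For readers who prefer code, here is a minimal sketch of the same idea using backward elimination with the FAST option in PROC LOGISTIC. The input names (AreaClass, BIN_DDABal, DDABal, SavBal) are placeholders standing in for a handful of our 37 effects, not the exact Visual Statistics variable names.

proc logistic data=develop_final;
   class AreaClass BIN_DDABal;                    /* assumed classification effects */
   model Ins(event='1') = AreaClass BIN_DDABal DDABal SavBal
         / selection=backward fast slstay=0.01;   /* fast backward elimination at alpha = 0.01 */
run;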
Now that we have a large footprint of the Fit Summary plot and we’ve removed the insignificant effects, it is time to see exactly what this plot is telling us.
The Fit Summary window reveals the most significant predictor variables that affect the response variable. It displays the effects on the Y axis and the p-values on the X axis. The variable importance is based on the negative log of the p-value. The larger this value is, the more important the variable. You can determine the importance of the effect shown by examining the color, length, and location of the horizontal bars. The most important effects appear at the top of the plot. A blue bar shows that the variable importance is above the significance level. The longer the blue bar, the more meaningful the variable. It appears that in this logistic regression model, there are a total of 19 (out of 37) significant effects. At the very bottom of the plot, we see the first couple of inputs (Age and Amount Deposited) that have no bars. This indicates they have been removed from the model due to the variable selection option that was engaged.
It is also interesting to note that degrees of significance are indicated by the lightness or darkness of the color. For example, dark blue is most significant. Light blue is not as significant because the p-value is very close to the default significance level of .05. The default significance level (also known as alpha) is set to .05 and is plotted as a black vertical line in the pane. Hover your mouse pointer over the line to see the alpha and -log(alpha) values. You can move this line to change the significance level, but it will not affect the variable reduction option. Moving this line will only affect the degree of significance in relation to the lightness or darkness of the colored bars. The histogram bars at the very bottom of the graph display the percent of the data that falls within the displayed range.
The next plot that is listed on the control bar is the Odds Ratio Plot.
Odds ratio estimates compare the odds of an event in one group to the odds of the event occurring in another group. Once the parameter estimates for the logistic regression predictor variables have been computed, it is very easy to calculate the odds ratio. Mathematically speaking, the odds ratio for a predictor variable is obtained by simply exponentiating its corresponding parameter estimate. Odds ratios are particularly useful for interpreting the effects of both categorical and continuous effects on the target of a logistic regression model. This is because the odds ratio quantifies the change in odds of the outcome for a one-unit change in the predictor variable. Let’s pull a specific example from the details table. Select the Maximize button on the object toolbar which opens the details table at the bottom of the canvas. Scroll over and select the Odds Ratio tab to display the odds ratio estimates for each effect in the model. Finally, click the Odds Ratio Estimate column twice to have the odds ratios sorted in descending order.
In a logistic regression, odds ratios that are greater than 1 indicate an increased odds of the target for a one-unit increase in the predictor. Odds ratios that are less than 1 indicate a decreased odds of the target for a one-unit increase in the predictor. Odds ratios close to or equal to 1 indicate there is no effect of the predictor on the target. The odds ratio for Certificate of Deposit (let’s call that COD) is almost 2.5. COD is an input that indicates whether a customer owns a certificate of deposit and is coded 1 for “yes” and 0 for “no”. Having an odds ratio of 2.5 tells us a couple of things. First, customers who have a COD are more likely to respond to our campaign compared to those that do not have a COD. Second, the odds for customers responding to a campaign are increased by approximately 150% for customers who have a COD (since 2.5 – 1 = 1.5).
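If you want to see the arithmetic, here is a tiny sketch that turns a parameter estimate into an odds ratio and a percent change in odds. The 0.916 estimate for COD is hypothetical, chosen only because it exponentiates to roughly 2.5; it is not taken from our model's details table.

data _null_;
   beta = 0.916;                          /* hypothetical parameter estimate for COD */
   odds_ratio = exp(beta);                /* exponentiating the estimate gives roughly 2.5 */
   pct_change = (odds_ratio - 1) * 100;   /* roughly a 150% increase in the odds */
   put odds_ratio= pct_change=;
run;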
Taking a look at the next plot on the control bar reveals the Residual Plot.
As you probably are aware, residuals are the differences between the observed values and the predicted values (of the model). There are a variety of important reasons that we examine plots of residual values during model building. These include, but are not limited to, the following:
Assessing model fit.
Detecting outliers and/or influential points.
Validating assumptions.
Detecting heteroscedasticity (non-constant variance).
The residual plot from our logistic regression does not appear to demonstrate any sort of pattern or trend. Patterns appearing in the residuals can be an indication of poor fit, non-linearity, or missing input variables. Ideally when building models, residuals randomly scattered around the "zero" line are desirable. The only issue appearing in this plot is a set of very large residual values at a predicted probability of 1. You may be able to detect the faint blue line at the lower-right corner of the plot. While we don't have time to dig into those observations right now, in the real world we would investigate these outliers.
The final plot we will discuss today is the Confusion Matrix.
The confusion matrix is a starting point for evaluating model performance. There are several plots, charts, and statistics based on these four key frequency counts. We start by choosing a cutoff probability (the default in Visual Statistics is 50% or 0.5) such that all new cases with a predicted probability greater than the cutoff are classified as events and all others are classified as non-events. The confusion matrix is computed by comparing our predictions to the actual target values. True positives are events that the model correctly predicted to be events (in our data, purchasers classified as purchasers). True negatives are non-events that the model correctly predicted to be non-events. False positives are non-events that the model incorrectly predicted to be events (in our data, a non-purchaser in the original data classified as a purchaser by our model), and false negatives are events the model incorrectly predicted to be non-events.
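As a rough sketch of how those four counts come about, assume we have a scored table named scored that holds the actual target Ins alongside a hypothetical predicted-probability column P_Ins1. Classifying against the cutoff and cross-tabulating against the actual target produces the confusion matrix.

data scored_class;
   set scored;                           /* hypothetical table of actuals plus predicted probabilities */
   Predicted = (P_Ins1 >= 0.5);          /* classify with the default 0.5 cutoff */
run;

proc freq data=scored_class;
   tables Ins*Predicted / norow nocol;   /* 2x2 table of actual versus predicted */
run;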
The color of the four cells of the confusion matrix is also significant. A darker color indicates a higher proportion of the value in that cell to the number of observed values for that level. Typically, one would desire to have high proportions (or darker color) for the correctly predicted observations. We’ll take note that the upper-left cell of true negatives is the darkest in color for our logistic regression. It’s also interesting to observe that the color of that cell is much darker when compared to the (diagonal) lower-right cell of true positives. Instead of discussing color shades, let’s open the details table and look at the actual counts and percentages. Select the Maximize button on the object toolbar which opens the details table at the bottom of the canvas. Scroll over and select the Confusion Matrix tab to display the frequency counts of the correct and incorrect, events and non-events in the model.
Upon preliminary examination, it appears that this model is doing a much better job of correctly classifying a non-event (or non-purchaser in our case) than correctly classifying an event. Approximately 88% of non-purchasers are correctly classified by our logistic regression, while only about 44% of the purchasers are. Does this mean that our model is "bad"? Or perhaps, could there be something else affecting these classifications?
To answer those questions, you’ll need to read my next post! We’ve done a great job today of discussing many of the output features and plots that are available for a logistic regression built in SAS Visual Statistics. But we still have a few more items to discuss. In my next post we’ll finish up with logistic regression by addressing the fact that this logistic regression (at this point in time) does a much better job at predicting non-events than events. In the real world, we typically want a model that is going to do a good job at predicting events. In the business case for our data, we’re much more interested in identifying a purchaser over a non-purchaser! I also want to show you the remaining assessment plots that are available to you with a logistic regression.
Thank you for continuing your journey with me of developing models in the AI and Analytics lifecycle. As I’ve mentioned before, if you are ready to learn more about logistic regression, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Predictive Modeling Using Logistic Regression. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
- logistic
- regression
- SAS Viya
- visual analytics
- Visual Statistics
06-21-2024
10:56 AM
2 Likes
In today’s post, we'll dive into the world of logistic regression. I’ll show you just how easy it can be to apply this powerful and well-known technique in SAS Viya. This post is the third installment in my series, where we utilize statistics and machine learning tools in SAS Visual Analytics to tackle real-world business challenges. We will continue to focus in on the part of the AI and Analytics lifecycle that involves developing robust models. In my previous post I discussed the many types of classification techniques including supervised and unsupervised methods. In addition, I demonstrated the unsupervised method of clustering. Today, we will switch over to looking at supervised methods beginning with an application of logistic regression.
Let’s first back up and briefly review the history of regression in general. The supervised method of regression dates to the early 19th century. The method of least squares was introduced in 1805, providing a systematic approach for fitting a line to a set of data points by minimizing the sum of the squares of the errors. Throughout the 20th century, regression techniques evolved significantly, with the introduction of computational power allowing for more complex models and the handling of larger datasets. The development of linear regression paved the way for various extensions, including multiple regression and logistic regression.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
Logistic regression is one of the fundamental techniques in supervised learning, particularly effective for binary classification tasks where the outcome variable is categorical, typically representing two classes such as "yes" or "no," "success" or "failure." Many times, we generically refer to these two outcomes as the "event" and the "non-event." Since the target is categorical, it’s interesting to note that the output or prediction from a logistic regression is not directly a classification. Rather, the model first uses the logistic function to produce the predicted probability of the event level. That probability is then compared against a cutoff value (typically .5), and the comparison results in a classification.
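Here is a minimal sketch of that two-step process for a made-up one-input model; the intercept, slope, and input value are hypothetical and only illustrate how a probability becomes a classification.

data _null_;
   b0 = -1.2;  b1 = 0.8;  x = 2;          /* hypothetical intercept, slope, and input value */
   eta = b0 + b1*x;                       /* linear predictor */
   p = 1 / (1 + exp(-eta));               /* logistic function gives the predicted probability */
   predicted_class = (p >= 0.5);          /* compare against the cutoff to get a classification */
   put p= predicted_class=;
run;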
The popularity of logistic regression lies in its simplicity, effectiveness, and interpretability making it a first-stop method for many analysts and data scientists. Unlike more complex algorithms like neural networks, logistic regression provides clear insights into the impact of each predictor variable on the outcome. This makes it an invaluable tool for not only prediction but also for understanding the underlying contributors of the data and business problem you are analyzing.
In the real world, logistic regression is used in a myriad of businesses. Three common usages of logistic regression are predicting customer churn, loan default, and presence of disease. Retailers, both online and offline, use it to predict churn by examining purchase history, browsing behavior, and customer feedback to identify at-risk customers and improve retention strategies. Banks and credit card companies use logistic regression to predict account closures and loan default, analyzing transaction history, customer service interactions, and financial health. In healthcare it is used to predict the presence of cardiovascular diseases by analyzing factors such as age, blood pressure, and lifestyle habits. Logistic regression can help identify high-risk patients who might benefit from either preventative measures or early treatment.
In our example of logistic regression, as a data scientist we’re attempting to use a variable annuity (insurance product) data table named develop_final to identify customers who are likely to respond to a variable annuity marketing campaign. I’m going to continue to use the same data that I introduced in the previous post. The develop_final table contains just over 32,000 banking customers and input variables that reflect both demographic information as well as product usage captured over a three-month period. The target variable is Ins which is a binary variable. For Ins, a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not. Please take note that I have performed some data clean-up (including binning, transforming, and imputation) and variable selection (using variance explained) so that we are ready to build supervised models. If you’re interested in seeing some of those data cleansing techniques performed on the develop table, please see Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio.
From SAS Drive, we open the Visual Analytics web application by clicking the Applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click on New report. In the left-hand Data pane, select Add data. Find and select the DEVELOP_FINAL table and then Add.
With the data already cleaned and prepped for model building, we are ready to create our logistic regression. On the left, we’ll change from the Data pane to the Objects pane by selecting Objects. From this list we can scroll down until we find the list of Statistics objects. From there we can either double-click or drag-and-drop the Logistic regression object onto the first page of the report.
Select the Options pane on the right. Before we assign data to roles for this logistic regression object, let’s note that by default the Informative missingness option is not selected. This is the default selection for most models. Informative missingness can be very useful if your data contains missing values and you have not addressed them during your data preparation phase. Since many of the available models use complete case analysis, we might lose many valuable rows of data. Complete case analysis provides a straightforward approach to handling missing data by excluding incomplete observations, even if just one value is missing. The potential drawback of this approach relates to data loss and bias. Informative missingness is an incredibly easy way to address the complete case analysis behavior. If you have not already imputed the missing values, you can select the Informative missingness option. It extends the model to include observations with missing values by imputing continuous effects with the mean, and it treats missing values of classification effects as a distinct level. In addition, an indicator variable is created that denotes missingness. For today’s blog, let's assume that I have not addressed all missing data and select Informative missingness under General in the right-hand Options pane. I’ll also point out that there is no variable selection method selected, but we will chat more about that later.
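If you wanted to approximate informative missingness by hand for a single continuous input, a rough sketch might look like the following. It assumes Income is the input and 40.59 is its mean; this is only an illustration of the idea, not what Visual Statistics does internally.

data develop_im;
   set develop_final;
   Income_missing = (Income = .);           /* indicator variable that denotes missingness */
   if Income = . then Income_imp = 40.59;   /* impute the continuous effect with its mean */
   else Income_imp = Income;
run;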
Next click on the Data Roles pane on the right. Assign Ins as the Response variable, Area Classification and BIN_DDABal as Classification effects and all 34 measures available as Continuous effects. Including the y-intercept, that is a total of 37 effects for this logistic regression. We will want to discuss some variable reduction techniques later on to make this model a little more manageable.
Examining the Summary Bar at the top of the canvas lets us know several things. We have performed a logistic regression on the target variable Ins. Our model has chosen an event level of 1, which means our model is designed to predict those customers who purchase an annuity. The default model fit statistic is KS (Youden) with a value of 0.4246. And there were some 32K observations used in the building of this model.
Underneath the summary bar there are logistic regression results including the Fit Summary, Odds Ratio Plot, Residual Plot, and Confusion Matrix. Let’s define each of these at a very high level and save the details for my next blog. The Fit Summary plot displays the importance of each variable as measured by its p-value. The Odds Ratio Plot displays the odds ratio estimates for each variable in the model, including confidence intervals and p-values. The Residual Plot shows the relationship between the predicted value of an observation and the residual of an observation. And finally, the Confusion Matrix displays a summary of both the correct and incorrect classification for both the “event” and the “non-event.”
We’ve just begun our journey into supervised classification by producing a logistic regression. As we continue to develop models in the AI and Analytics lifecycle, we will witness even more interesting techniques. In my next post, I’ll cover in detail the output results along with their interpretation for this logistic regression. If you are ready to learn more about logistic regression, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Predictive Modeling Using Logistic Regression. See you next time and never stop learning!
- Find more articles tagged with:
- GEL
- logistic
- regression
- SAS Viya
- visual analytics
- Visual Statistics
05-17-2024
05:25 PM
In today’s post I will continue to show you how easy it is to perform data-driven analytics in SAS Viya. This is the second in a series of posts that will use statistics and machine learning objects in SAS Visual Analytics to address real world business problems. Using SAS Viya, we will continue to focus in on the first two parts of the AI and Analytics lifecycle: managing data and developing models. Since we covered many highlights of managing data in the previous post, today we will move into developing models.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
While there are many types of classification techniques (including semi-supervised and reinforcement learning), we will focus on both supervised and unsupervised methods that are currently available in SAS Visual Analytics. The main difference between the two sets of methods is in the presence or absence of a known output or target variable.
Unsupervised classification does not rely on having a target variable in the data (also known as unlabeled data). It focuses on finding structures or patterns within the input data only. Clustering and dimension reduction are two prime tasks accomplished through unsupervised learning models. On the other hand, supervised classification involves learning from a dataset where the target variable or output is known. The goal of supervised classification is to learn the relationship between the input and output variables. We typically feed a target along with several input (or descriptor) variables into a model that can then perform classification or prediction. With supervised models, the target variable can be a class variable (e.g., a binary outcome such as loan default, yes or no) or a continuous number (e.g., dollars spent). Common supervised learning algorithms include regressions, decision trees, and neural networks.
Let’s begin our journey into the model building phase of the analytics life cycle by examining the unsupervised method of clustering.
Three common usages of clustering are finding groups with similar characteristics, product recommendation systems, and anomaly detection. Marketing strategies often involve finding clusters of customers that have different product affinities or product usage patterns. Clustering on groups with similar characteristics allows marketers to label those clusters and potentially find new sales. Products often fall into clusters of items that are frequently purchased together. We’ve all seen this when making online purchases and suggested products are offered for our “check-out basket.” And finally, clustering for anomaly detection makes it easy to identify which records fall outside of all identifiable clusters. Those outliers could represent financial fraud, disease, or any other type of anomaly.
In our example of clustering, as a data scientist we’re attempting to use a variable annuity data table named develop_final to help us better understand our customers. I’m going to continue to use the same data that I introduced in the previous post. The develop_final table contains just over 32,000 banking customers and input variables that reflect both demographic information as well as product usage captured over a three-month period. The target variable is Ins which is a binary variable. For Ins, a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not. Please take note that I have performed some data clean-up (including binning, transforming, and imputation) and variable selection (using variance explained) so that we are ready to perform model building. If you’re interested in seeing some of those techniques performed on the develop table, please see Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio.
From SAS Drive, we open the Visual Analytics web application by clicking the Applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click on New report. In the left-hand Data pane, select Add data. Find and select the DEVELOP_FINAL table and then Add.
With the data already cleaned and prepped for model building, we are ready to create our cluster. On the left, we’ll change from the Data pane to the Objects pane by selecting Objects. From this list we can scroll down until we find the list of Statistics objects. From there we can either double-click or drag-and-drop the Cluster object onto the first page of the report.
Before we assign data to roles for this cluster object, let’s note that by default 5 clusters will be created from our selected data, using standard deviation as the method of standardizing the measure variables to a similar scale. Standardization is very important when clustering because inputs with much larger scales will dominate the distance calculations and bias the results. I’ve found that using the range for standardization tends to give me better separation in the two-dimensional cluster output that we will examine shortly. Select Range under Standardization in the right-hand Options pane. Even though the default of 5 clusters might be fine for our analysis, I’m going to go ahead and select Automatic (Aligned box criterion) under Number of clusters. This is a great option if we have no idea how many clusters are appropriate for our data. The Aligned Box Criterion (ABC) method estimates the number of clusters based on principal components of the input data.
In the center of the cluster object on the page, select Assign data and Add: Age, Credit Score, and Home Value. Then select Apply and then Close. In the Options pane on the right, scroll down and select ABC Statistics under Model Display -> General -> Displayed visuals. This will allow us to view the calculations that resulted in the selection of 3 clusters for our data. Also, under these General options, change the Plot layout from Fit (default) to Stack. This places each piece of output on a separate tab and maximizes the real estate available on the page for easier viewing.
The cluster diagram is a two-dimensional projection of each cluster onto cells that contain a pairing of the inputs. For example, the lower-left cell contains a crossing of Age and Home Value. These projections help us spot cluster similarities and differences as we view each pairing of inputs. Each cluster is assigned a unique color and cluster ID. If we continue to examine the crossing of Age and Home Value, we can make a few observations about the clusters and the data. Cluster 1 (the large blue cluster) consists of the more expensive homes for all age ranges. Clusters 2 and 3 (yellow and purple, respectively) consist of the less expensive homes for two different age ranges. Cluster 2 ranges in age from about 40 to 85, while Cluster 3 ranges from about 20 to 55. Even though each cluster is unique, it is not unexpected to see overlap in the cluster diagram. Remember, this chart is a two-dimensional view of a three-dimensional solution. Next, let’s examine the output provided for the aligned box criterion. Select ABC Statistics.
We can see that by default this calculation will consider from 2 to 6 total clusters. We could change that on the Options pane if necessary. The default estimation criterion for ABC is known as the “Global peak value.” It’s clear that 3 clusters gave the maximum Gap value for our data. Finally, let’s examine the Parallel Coordinates plot.
On the far-left side of the Parallel Coordinates plot we see the same three clusters that we saw on the Cluster Diagram. The clusters have the exact same cluster IDs and colors. Along the top of the grey, binned columns we see the three inputs that were used to create the clusters: Age, Credit Score, and Home Value. Each of the columns has been divided into 10 bins that cover the data range. For example, ages range from 16 to 94, so each bin contains customers in an approximately 8-year age range. If we examine the header at the top, we can see that this plot contains 847 polylines. Follow the polylines from left to right to determine which range of values pertains to each cluster. The thickness of a line indicates the relative number of observations in that bin. For example, if we follow a couple of the thickest polylines for Cluster ID 3 (the purple cluster) from left to right, we observe the following characteristics: a large number of customers in Cluster 3 are younger in age, have middle-of-the-road credit scores, and have lower home values.
Before we finish up this post, let’s not forget to open up the Details Table for some juicy tidbits of information about our cluster analysis. Select Maximize in the upper-right corner of the cluster object.
The Details Tables of the model objects available in SAS Visual Analytics contain a treasure trove of often overlooked information. On the Centroids tab we can find the centroid definition for each cluster. You can see these values reflected back in the first Cluster Diagram that we examined. Simply mouse over the large X in the middle of each cluster and these same numbers will appear. Select the Model Information tab.
On this tab we can see that the k-means clustering algorithm was used in the background. The k-means algorithm is a very popular clustering algorithm that allows you to specify how many clusters (k being an integer) should be created. It begins by randomly selecting k initial cluster centroids known as seeds. It then works by minimizing the within-cluster variance, iteratively updating the cluster centroids until convergence occurs. Advantages of k-means clustering are that it is efficient and scales well to large data. Disadvantages are that it is sensitive to both the initial centroid selection and to outliers. The cluster object in SAS Visual Analytics uses the k-means algorithm when all the inputs are interval. If all inputs are categorical, the k-modes algorithm is used. If the inputs are a mix of interval and categorical, the k-prototypes algorithm is used.
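To connect this to code, here is a hedged sketch using the classic SAS/STAT procedures PROC STDIZE (range standardization) and PROC FASTCLUS (k-means with k = 3). This is not the exact engine the cluster object calls behind the scenes, just an illustration of the same workflow, and the SAS variable names for Age, Credit Score, and Home Value are assumed.

proc stdize data=develop_final out=develop_std method=range;
   var Age CreditScore HomeValue;   /* range-standardize the three clustering inputs */
run;

proc fastclus data=develop_std maxclusters=3 out=cluster_out;
   var Age CreditScore HomeValue;   /* k-means with k = 3; cluster IDs land in cluster_out */
run;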
I hope you’ve enjoyed our journey into the developing models portion of the AI and Analytics lifecycle. For now, clustering is the only unsupervised method that I plan to cover in this series. In my next post, I plan to introduce supervised analysis. If you would like to learn more about clustering, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Applied Clustering Techniques. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- clustering
- GEL
- SAS Visual Analytics
- SAS Viya
04-04-2024
03:22 PM
The purpose of this blog is to show how easy it is to perform data-driven analytics in SAS Viya. This is the first in a series of posts that will use statistics and machine learning objects in SAS Visual Analytics to address real world business problems. SAS Viya does an awesome job of connecting all aspects of the AI and Analytics lifecycle. We’ll be focusing in on the first two parts of that lifecycle: managing data and developing models.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
In today’s world, data scientists need efficient and powerful tools to wrangle their data into easy-to-interpret solutions. We will step through both parts of the analytics lifecycle in order to investigate and address a scenario that involves a variable annuity data table called develop. The develop table contains just over 32,000 banking customers (or cases) and 47 input variables. The scenario involves a banking target marketing campaign that aims to identify those customers most likely to purchase a variable annuity. If you’re not familiar, a variable annuity is a type of insurance product that typically contains features like a death benefit and/or lifetime income. The 47 input variables reflect both demographic information as well as product usage captured over a three-month period. The target variable is Ins, which is a binary variable. For Ins, a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not.
Let's begin our journey through the data phase of the analytics lifecycle with some data exploration. One of the very first things that I learned when I started working at SAS over 33 years ago was "Know Thy Data," which basically means you should always explore the features of your data before you start any serious analysis. Exploration of data involves things such as looking for noticeable anomalies or patterns, finding data errors that need to be corrected, and determining if any imputation or transformations might be needed. Since we're going to be working in the Visual Analytics interface of SAS Viya, we should take advantage of some views of the data that are already available.
From SAS Drive, we open the Visual Analytics web application by clicking the applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click on Start with Data and select the DEVELOP table. The Details tab of the Choose Data window already gives us some useful information about the table. Having a total of 51 columns and just over 32,000 rows, we can see the names, labels, and types of each column.
Moving over to the Sample Data tab we begin our data exploration by examining some of the values of our data. Right away, we’ll notice we have several binary variables and even some missing values. That’s good information to just keep in our back pocket for now.
Where things get really interesting is when we click on the Profile tab. First, let’s note that there are several variables that contain null or missing values. This is something we will want to address since we will be building models. Many of the methods that we’ll be looking at use “complete case analysis” which means that if a value of an input is missing, the entire row is eliminated from the analysis. Complete case analysis can result in the loss of large quantities of data, so it is good to identify which inputs have missing values in our data. Secondly, as we examine the Profile tab, let’s note that the input named Branch has a total of 19 unique levels. Sometimes, categorical variables with too many levels can be problematic in model building. Finally, if we scroll down, it is useful to note that the customerid column is 100% unique which makes it a primary key for this table.
Click on OK to bring the entire develop table into Visual Analytics and we can further explore the data. Grab our target variable Ins with the mouse and drag and drop it onto the canvas. That will cause us to get an Auto Chart which is basically the chart that Visual Analytics “thinks” we would like to see. We end up with a bar chart of frequency counts for the two levels of Ins. In the right-hand pane select Roles and left-click Frequency (under Measure) to open the Replace Data Item window. Select Frequency Percent from the list of variables. The chart now reveals to us that nearly 35% of our customers are purchasers (or have acquired the insurance product). That’s a reasonably healthy percentage of events for our target. When building models, we typically get concerned when the event is 10% of the target or less. It can be difficult to model data with a target that has what is known as a “rare event”. We might choose to address that situation with an over-sampling method if we had that issue. But we are in good shape with our target of Ins.
Next, let’s randomly grab one of our inputs like the household income and examine its distribution. Click on New Page to give us a blank canvas. In the left-hand pane select Data and then drag and drop Income onto the canvas. Income appears to be positive and highly skewed to the right. Though Income might be a good candidate for transformation, it also has some missing values (not shown on this plot). It’s appropriate to address the missing values first and then look at transformation.
If you have several variables that require modifications, it might be easier to perform some of your data cleansing using code. But since we are focusing in on the Visual Analytics interface, let me show you how we can do it in a point and click interface. Let’s begin by imputing the missing values of the Income variable with the average value which is 40.59. An easy way to find this average value quickly is by selecting the Actions icon on the Data pane and then clicking on View measure details.
To begin the imputation process on the Income column, select New data Item and then Calculated item on the Data pane. In the Name field of the New Calculated Item window, enter Imp_Income as the Name of the newly created variable. Click the Operators tab and expand the Boolean group. Double-click the IF…ELSE operator to add it to the expression. Then expand the Comparison group and drag x=y to the condition field in the expression. Next, return to the Data Items tab and expand the Numeric group. Drag Income to the number field on the left of the equal sign in the parenthesis. Enter . (missing) in the number field to the right of the equal sign in the parenthesis. Enter 40.59 on the number field for the RETURN operator. Finally drag Income to the number field for the ELSE operator.
Select OK to close the Edit Calculated Item window and the new column Imp_Income appears in the Data pane. You can create a new Auto Chart and compare the original column to the new one. Drag Imp_Income and drop it on the same page and to the left of the original Income distribution chart. We see a spike or peak where we replaced the missing values with the average value.
Let’s finish up this post by using a log transformation on Imp_Income in hopes of getting what looks like more of a normal distribution. While we could create another new data item, I think it is more efficient to keep the one we have and modify it. Right-click on Imp_Income and select Edit. Update the Name field of the Edit Calculated Item window from Imp_Income to Log_Imp_Income. Right-click the entire condition box and select Use Inside → Numeric (advanced) → Log. Then type 10 into the number field and hit Enter. Once again right-click the entire condition box (and only the condition box) and select Use Inside → x + y. Then type 1 into the number field and hit Enter. Since we know that Income has a minimum of 0, we are adding 1 before we take the log to avoid errors. This is a common practice.
If all that pointing and clicking is just too complex, you can always just go over to the Text tab of the Edit Calculated Item window and type in the expression.
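For reference, here is roughly the same logic expressed as a SAS DATA step. This is a sketch of the equivalent code, assuming the mean of Income is the 40.59 value we found earlier.

data develop_calc;
   set develop;
   if Income = . then Imp_Income = 40.59;    /* impute missing income with the mean */
   else Imp_Income = Income;
   Log_Imp_Income = log10(Imp_Income + 1);   /* add 1 before taking the base-10 log */
run;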
Click OK to close the window and you will find that the frequency distribution plot has been automatically updated with the new calculated expression. Good news: the transformed version of the imputed income looks more normally distributed than the original. Of course, we have a peak where we imputed the missing values, but that was expected.
If you are already dealing with data in the real world, you know that we’ve just scratched the surface of data exploration, data cleaning, and data management. Hopefully, I’ve given you an idea of what is involved in the data portion of the analytics life cycle. In my next post, I plan to move into both unsupervised and supervised analysis. If you would like to learn more about exploring and managing your data in SAS Visual Analytics, I suggest you check out SAS® Visual Analytics 1 for SAS® Viya®: Basics. In this course you will learn how to perform data discovery and analysis, access and investigate data, and view and interact with reports using SAS Visual Analytics. Never stop learning!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- data science
- GEL
- SAS Visual Analytics
- SAS Viya
02-10-2024
10:19 PM
Definitely looks like a problem with the image. I would request a fresh image and start from scratch, especially if technical support has not been able to help.
01-17-2024
02:23 PM
1 Like
The purpose of this blog is to show how easy it is to classify a nominal target in SAS Viya. There are many examples of binary classification in both real life and in published literature. For example, will a customer default on a loan or not? Will a citizen vote Democratic or Republican? Will a patient return to the hospital or recover at home? But there are times when the outcome is non-binary. For example, which one of these 5 cars will a customer purchase? Which one of these 3 fruits matches the image? Which one of these 4 diseases matches the patient's symptoms? When our target has multiple (more than 2) independent levels, we will want to perform multinomial classification. When data points can belong to multiple classes, multinomial classification allows data scientists to model and address real-world problems that are inherently multi-faceted in nature.
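To make the idea concrete, here is a hedged sketch of one classical multinomial approach, a generalized logit model in PROC LOGISTIC. It is not part of today's point-and-click workflow, the table name assumes the data is available as a SAS data set, and the input names are hypothetical placeholders for a few of the student attributes.

proc logistic data=ssf;
   class Course;                               /* hypothetical categorical input */
   model Target(ref='Dropout') = Admission_Grade Age_At_Enrollment Course
         / link=glogit;                        /* generalized logit for a nominal target */
run;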
In this blog, I'll be using a dataset created from a higher education institution to address the challenge of academic dropout and failure. The dataset consolidates information from a variety of databases on students pursuing undergraduate degrees in multiple fields. It includes details at the time of enrollment such as demographics, socio-economic factors, and academic performance. The target has three levels: dropout, enrolled, and graduate. The challenge is to build a machine learning model that can accurately predict a student's classification and hopefully reduce academic failure in higher education. Here is a quick snapshot of the data which contains 37 columns and just over 4400 rows.
Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
In one of my previous blogs, An Introduction to Machine Learning in SAS Model Studio, I showed you how easy it was to build machine learning models in SAS Viya. We are going to use SAS Model Studio today to build several models that can handle our challenge of a multinomial target. From SAS Drive, we open the Model Studio web application by clicking the applications menu icon and selecting Build Models.
In the Projects tab, we select New Project to create a new modeling project for our academic data. We fill out the New Project window. Let’s name this project Student Success Factors and select the Advanced template for class target. This advanced template has a nice variety of machine learning models including logistic regression, neural network, and gradient boosting. The SSF data table has already been loaded into memory in the PUBLIC library, so we can easily select it. We’ll also change the data partition by selecting Advanced -> Partition Data and changing the Training percentage to 70 and the Test percentage to 0. This eliminates the test partition from our table which now only contains two partitions, training and validation. Select Save and Save again to create the new project.
SAS Model Studio opens to the Data tab. We can scroll down to our dependent multinomial variable and see that it has been conveniently identified as the target variable because it is called Target. Select Target to examine the properties of this column. If we select Specify the Target Event Level, it will reveal that the default event level is Graduate. By default, the software selects the highest value in alphanumeric order. The level of Graduate is nearly 50% of all the target data. While this high percentage could end up affecting the performance of our model, we won’t address unbalanced target data today. We will move forward in building our model keeping the default level.
Moving over to the Pipelines tab, it is time to run the advanced template that we selected when we created the project. The Advanced template for a class target contains a pipeline with a total of six models (the purple nodes) and one ensemble model. Even though we could modify the construction of this pipeline (e.g., add or delete nodes) or change the defaults of the existing models, we simply choose to use the template unaltered. Select Run pipeline to run all the included nodes which will reveal a champion model.
Right-click the Model Comparison node to open Results. The results from the pipeline reveal that the Gradient Boosting model is the champion based on the KS (Youden) model fit statistic.
Let’s check out the results of our champion model. Right-click the Gradient Boosting node and select Results. On the Node tab expand the Output to see the results from the GRADBOOST Procedure. Scroll to the bottom and note that this machine learning model will calculate three predicted probability variables along with the predicted target variable.
If we close the procedure output and select the Output Data tab, we can examine the actual predictions. Select View output data twice to open a data table that includes the original inputs along with the scored predictions. I opted to manage the columns in order to rearrange the default order of the input variables so we could focus in on the predicted values. The first column is now the actual target value, and the second column is the predicted value for the target. Judging from the first ten rows, it appears that the gradient boosting model is doing a good job of correctly predicting the target value.
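If you are curious what an equivalent code-based approach might look like, here is a hedged PROC GRADBOOST sketch. The caslib reference (mycas) and the input names are assumptions for illustration, and the settings are not the exact hyperparameters of the pipeline's champion model.

proc gradboost data=mycas.ssf ntrees=300 seed=12345;
   target Target / level=nominal;                                  /* three-level nominal target */
   input Curricular_Units_2nd_Approved Admission_Grade GDP / level=interval;   /* assumed interval inputs */
   input Course Tuition_Up_To_Date / level=nominal;                            /* assumed nominal inputs */
   output out=mycas.ssf_scored copyvars=(Target);                  /* scored table with predictions */
run;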
Let's return to the Summary tab for a small journey into model interpretation. Just like neural networks, gradient boosting models are also known as "black box" models, or models that have little to no interpretability. Unlike a decision tree (which is very easy to interpret), a gradient boosting model, which is a combination of (in this case 300) trees, is difficult to interpret. In other words, we sometimes want to be able to say why this model decided to classify a student as a Graduate versus a Dropout. Examining the Variable Importance table attempts to answer that question. At the very least, by expanding the Variable Importance table we can see which inputs are major contributors to this classification decision. It appears that the number of curricular units approved in the 2nd semester is at the top of the list along with inputs such as course number, inflation rate, GDP, grade average in the 2nd semester, and whether tuition fees are up to date.
Closing the Variable Importance table, let’s finally look at some model assessment output. Select the Assessment tab. Many of the assessment features like ROC and cumulative lift will be based off of binomial classification or “event” versus “non-event”. In other words, since our target level is Graduate, the levels of Dropout and Enrolled will be grouped together. However, there is one chart which reveals what is happening for all three levels independently. Expand Nominal Classification which defaults to a percentage plot. By examining the validation partition results, we can see that this model is able to predict Dropouts and Graduates nearly 80-90% of the time, while only correctly classifying Enrolled about 50% of the time. This plot is based on frequency counts of the actual versus predicted target values for all three levels.
I hope you've enjoyed seeing how easy it is to build and interpret predictive machine learning models using SAS Model Studio. While I'm not an expert in higher education data, I bet with some business knowledge and some tuning of the hyperparameters we could increase the predictive accuracy of this model. Would you like to keep learning? Maybe you would be interested in taking an instructor-led course like Machine Learning Using SAS® Viya® which will get you started. In this course you will learn how to build several different models, tweak them to get better results, and learn how to interpret the results. In fact, this course can prepare you to get certified as a SAS Certified Specialist: Machine Learning Using SAS Viya.
Never stop learning!
- Find more articles tagged with:
- GEL
- machine learning
- regression
- sas model studio
- SAS Viya
11-09-2023
01:38 PM
3 Likes
The purpose of this post is to show how easy it is to automatically explain a target variable in SAS Viya in just a couple of clicks. Being able to understand the relationship between a target and its explanatory variables is a key step towards building predictive models. An automated explanation will quickly build a series of easily interpretable visualizations along with automatically generated storylines. Business analysts, data scientists, and even high-level executives can get a head start in answering everyday business problems with an automated explanation.
To get started, I’ll be using a data set which consists of observations taken from account holders at a large financial services firm. The accounts represent consumers of home equity lines of credit, automobile loans, and other short- to medium-term credit instruments. If you’ve been reading my posts, you’ll already be familiar with one of my favorite data tables. Since several of the continuous inputs are skewed and contain missing values, I’ll clean up the data with transformations and imputation before the automated explanation. As a result, all of the cleaned variables start with the prefix logi_, which indicates inputs that have been log transformed and had their missing values imputed.
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
The binary target variable indicates if an account contracts for at least one product during the campaign season. A straightforward way to think of this is that a target value of 1 indicates a purchaser and a 0 indicates a non-purchaser. The data set contains more than one million rows and (after filtering) 16 columns. We will see more detail on the variables in our exploration, but they contain demographic information, account activity level, customer value level, and various purchase behaviors.
If I want to use the automated explanation feature, I’ll first need to open the table in SAS Visual Analytics and filter out (hide) a few columns.
A quick and easy way to begin exploring my data would be to create an automated explanation. The automated explanation reveals the most important underlying features for a target variable. In this example, I’m trying to understand whether an account will make a purchase (or not). Let’s easily create that explanation in a report with one click. I right-click on the target variable tgt Binary New Product and select Explain > Explain on current page.
The resulting report reveals that my target variable has an 80% chance of being a 0. In other words, the majority of my customers were non-purchasers.
Much of the report is aimed at explaining the most common value of 0 (non-purchasers), but honestly, we are more interested in the behavior of the purchasers (value of 1). Let’s update the chart by selecting 1 in the button bar.
From the resulting report, I begin my data exploration and discover all kinds of interesting information about the target variable of customer purchase. You’ll notice that the summary bar along the top has been updated to show that approximately 20% of the customers made a purchase (value of 1). Then we can see under “What factors are most related to tgt Binary New Product?” that the following three variables are the most related factors: count purchased over the past 3 years, average sales over the lifetime, average sales over the past 3 years in response to a direct promotion. Of course, it makes sense that these three factors could have a large effect upon whether a customer would make a purchase or not. Notice that the top bar is already selected.
It would be interesting to understand the relationship between this top factor (count purchased over the past 3 years) and our binary target variable. Fortunately, we already have an automatically generated chart to help us. Let’s examine “What is the relationship between tgt Binary New Product and logi_rfm5 Count Purchased Past 3 years?”
We can see that for purchasers, the average number of products bought over the past 3 years is about 1.6. Keep in mind that our data was transformed, so in reality customers bought approximately 5 products on average.
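Where does "approximately 5" come from? If the prep used a natural-log transform, you can simply invert it. I am assuming the exact form of the transform here; if it were log(x + 1) instead, the back-transformed value would be closer to 4.

/* Back-transform the log-scale mean of roughly 1.6 to the original */
/* count scale. Which transform was used is an assumption here.     */
data _null_;
   log_mean   = 1.6;
   back_log   = exp(log_mean);        /* about 4.95 if log(x)     */
   back_log1p = exp(log_mean) - 1;    /* about 3.95 if log(x + 1) */
   put back_log= back_log1p=;
run;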
To investigate the relationship between the account activity level and our target, select the category 1 factor. It makes sense that the accounts with the highest activity (value of "x") contain the majority of our purchasers. Before we complete our investigation of the automated explanation, let's turn on an option to show us the most likely and least likely groups of purchasers. In the Options pane, turn on High and low groups.
A new visual appears on the canvas of our automated explanation. By default, we see the top three groups that have the highest predicted probability of making a purchase.
On the High tab we are presented with the three groups that are most likely to make a purchase. Let's examine the first group, which has an almost 80% predicted probability of making a purchase. If the count purchased over the past 3 years is greater than or equal to 1.6 and it has been less than 2.6 months since the last purchase, then a customer is very likely (79.70%) to make a purchase. In case you are curious, a decision tree is being created in the background to give us all this wonderful information.
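If it helps to see that rule spelled out, here is a hedged sketch of the first high group written as DATA step scoring logic. logi_rfm5 is the factor named in the report, but logi_months_since_last is a placeholder name for the months-since-last-purchase input, and the real background tree has many more leaves than the one branch shown here.

/* The highest-probability group expressed as simple scoring logic.  */
/* logi_months_since_last is a placeholder name; the cutoffs (1.6,   */
/* 2.6) and the 0.797 probability come from the report above.        */
data work.high_group_scored;
   set work.accounts_prepped;
   if logi_rfm5 >= 1.6 and logi_months_since_last < 2.6 then
      p_purchase = 0.797;
   /* the remaining leaves of the background tree would assign */
   /* probabilities to the other groups here                   */
run;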
We got a great start in understanding the relationship between our target variable and the explanatory inputs. A great next step would be to build a supervised model such as a logistic regression. If you would like to keep learning, you might be interested in taking an instructor-led course. SAS® Visual Statistics in SAS® Viya®: Interactive Model Building will get you started. In this course you will learn how to build several different models, tune them to get better results, and interpret those results. If you would like to learn more about automated explanations, I suggest reading this paper by Rick Styll.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
09-08-2023
04:21 PM
4 Likes
The purpose of this post is to introduce how to develop machine learning models quickly and easily in SAS Model Studio. SAS Viya allows us to generate impactful insights by transforming data into value. While the analytics life cycle consists of three phases (data, discovery, and deployment), this post will focus on the middle of that journey. SAS Viya’s machine learning capabilities are a great way to develop models during the discovery phase of the analytics life cycle, and SAS Model Studio is the perfect interface for quickly building pipelines of machine learning models.
To get started, I’ll be using financial services data. The accounts in the data represent consumers of home equity lines of credit, automobile loans, and other short- to medium-term credit instruments. The target variable (b_tgt) relates to whether an account holder purchased a new product from the bank in the past year. The data set contains almost 53,000 rows and 22 columns. We will see more detail on the variables in our exploration, but they contain demographic information, account activity level, and various purchase behaviors. These features along with the target will be used to train our prediction models to identify which future customers might be purchasers.
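Before building any pipelines, it never hurts to glance at the table yourself. A quick PROC CONTENTS and PROC FREQ confirm the dimensions and the event rate; the libref below is a placeholder for wherever BANKPART_HOW lives in your environment.

/* Quick look at the modeling table before building pipelines.   */
/* mylib is a placeholder libref for the location of the table.  */
proc contents data=mylib.bankpart_how;   /* rows, columns, types */
run;

proc freq data=mylib.bankpart_how;
   tables b_tgt / nocum;                 /* event rate of the target */
run;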
From SAS Drive, we open the Model Studio web application by clicking the applications menu icon and selecting Build Models.
The Model Studio application is where we can create our pipelines of models. We select New Project to create a new modeling project. We then fill out the New Project window, giving our project an appropriate Name and selecting our data table, BANKPART_HOW. Conveniently, the data table has already been loaded into memory in the PUBLIC library.
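If your copy of the table is not already in memory, here is one way you might load it into the Public caslib yourself with PROC CASUTIL. The session name and source libref are placeholders, and promoting the table (so other sessions can see it) is optional.

/* One way to load the table into the Public caslib if it is not  */
/* already in memory. mysess and mylib are placeholder names.     */
cas mysess;                              /* start a CAS session   */
proc casutil;
   load data=mylib.bankpart_how
        outcaslib="public" casout="BANKPART_HOW" promote;
quit;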
Select Save to finish creating the new modeling project. SAS Model Studio opens to the Data tab and requests that we select a Target before we can run any pipelines. Remembering that our target variable is named b_tgt, we select it and set its role to Target. For tables that are used often across projects, you can easily save metadata properties to a global table. (By storing metadata configurations in the Global Metadata repository, properties will apply to new data tables that contain variables with the same names.)
Let’s take advantage of some of the pre-built pipelines that are included with Model Studio. Click the Add new pipeline icon to open the New Pipeline window. We give our new pipeline an appropriate name and select Browse to open the Browse Templates window.
Model Studio templates are pre-populated pipelines with configurations of various models. In addition to the three levels (basic, intermediate, and advanced) of included templates, customized pipelines can be saved to the Exchange where they become accessible to other users. Select Advanced template for class target to create a data mining pipeline that includes some sophisticated machine learning models like neural networks and gradient boosting machines. Then select Save to see the pipeline in Model Studio.
The Advanced template for class target contains a pipeline with a total of six models (purple nodes) and one ensemble model. Even though we could modify this pipeline (for example, add another model) or change the defaults of the existing models, we simply choose to use the template unaltered. Select Run pipeline to run all the included models and reveal the champion. Right-click the Model Comparison node to open the Results. The pipeline results reveal that the Forest model is the champion based on the KS (Youden) model fit statistic.
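A quick note on that statistic: the KS (Youden) value is essentially the largest separation between the cumulative distributions of the predicted probabilities for events and non-events, which is the same as the maximum of sensitivity plus specificity minus 1 across cutoffs. If you export scored data from the pipeline, you can approximate that separation yourself with a two-sample Kolmogorov-Smirnov test; the table and column names below are placeholders for whatever your scored output contains.

/* Approximate the KS separation on a scored table.                 */
/* work.scored, b_tgt, and p_b_tgt1 are placeholder names for the   */
/* actual target and the predicted event probability.               */
proc npar1way data=work.scored edf;
   class b_tgt;       /* event vs. non-event groups           */
   var p_b_tgt1;      /* predicted probability of the event   */
run;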
I've just shown you how quick and easy it is to build predictive machine learning models using SAS Model Studio. I bet that if we fine-tune or even autotune the hyperparameters of these models, we could increase their predictive accuracy. Would you like to keep learning? Well, you have some great options. First, you might be interested in taking an instructor-led course. Machine Learning Using SAS® Viya® will get you started. In this course you will learn how to build several different models, tune them to get better results, and interpret those results. In fact, this course can prepare you to get certified as a SAS Certified Specialist: Machine Learning Using SAS Viya.
Second, maybe you would want to attend the SAS Explore conference, September 11-14 in Las Vegas, NV, where I am presenting two Hands-On Workshops on SAS Model Studio. As a business analyst or data scientist at SAS Explore, you can learn new techniques for deriving reliable insights for any of your business challenges. Trust me, even though I am presenting at the conference, I will be looking for opportunities to increase my knowledge base, just like you. Never stop learning!
I hope to see you in Las Vegas or in one of our classes!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
Labels: