In today’s post, we’ll take a look at how to interpret the results of a logistic regression model built in SAS Viya. In my third post of this series, I showed you just how easy it is to build a logistic regression model in SAS Visual Statistics, and I discussed the origins of regression models along with the details of logistic regression. Moving forward, we will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the various pieces of output from the logistic regression model that was built using variable annuity (insurance product) data.
Remember, the business challenge is to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information and product usage captured over a three-month period. The target variable, named Ins, is binary.
Since we’ve already created a logistic regression model, let’s dive right into exploring the various pieces of output that Visual Statistics provides. You may recall that my last post ended with a high-level overview of the following output: the Fit Summary, the Odds Ratio Plot, the Residual Plot, and the Confusion Matrix.
We want to take a deep dive into each of these items, but let’s make it a little easier on ourselves by taking advantage of two options. First, open the Options pane of the logistic regression and scroll down to the Model Display options. Under the General category, change the Plot layout from Fit (the default) to Stack. This model display option specifies how the subplots are displayed on the canvas. By default, all the output subplots are shown together on one page. Changing the Plot layout to Stack enhances viewability by letting each subplot fill the canvas, and a control bar enables you to move between subplots.
Second, scroll back up to the Logistic Regression options. Under the General category, change the Variable selection method from the default of None to the Fast Backward method, and keep the default Significance level of 0.01. You might recall that we built this logistic regression model with a total of 37 explanatory variables. There are many reasons to avoid including too many effects in a model, including the possibility of overfitting, multicollinearity, and loss of interpretability. Parsimonious models, which achieve a good balance between simplicity and accuracy, are generally preferred. As I like to say: “simpler is better.” Fast backward is a backward elimination technique that uses a numeric shortcut to compute each selection iteration more quickly than the regular backward method. Applying it here will most likely reduce the number of input variables: any effects that do not meet our significance level of 0.01 will be removed from the model.
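If you’re curious about what backward elimination is actually doing, here is a minimal sketch in Python using statsmodels. To be clear, this illustrates plain backward elimination, not the numeric shortcut that makes fast backward quicker, and it is not how Visual Statistics implements the method internally; X is assumed to be a pandas DataFrame of inputs and y the binary target.

```python
import statsmodels.api as sm

def backward_eliminate(X, y, sls=0.01):
    """Plain backward elimination: fit, drop the least significant
    effect, and refit until every remaining effect meets sls."""
    cols = list(X.columns)
    while cols:
        fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = fit.pvalues.drop("const")   # ignore the intercept
        worst = pvals.idxmax()              # least significant effect
        if pvals[worst] <= sls:             # everything left is significant
            return fit, cols
        cols.remove(worst)                  # remove it and refit
    return None, []                         # nothing survived selection
```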
Now that we have a large footprint of the Fit Summary plot and we’ve removed the insignificant effects, it is time to see exactly what this plot is telling us.
The Fit Summary window reveals the most significant predictor variables affecting the response variable. It displays the effects on the Y axis and their p-values (on a negative log scale) on the X axis. Variable importance is based on the negative log of the p-value: the larger this value, the more important the variable. You can determine the importance of an effect by examining the color, length, and location of its horizontal bar. The most important effects appear at the top of the plot. A blue bar indicates that the effect is significant at the chosen significance level, and the longer the blue bar, the more meaningful the variable. It appears that in this logistic regression model, there are a total of 19 (out of 37) significant effects. At the very bottom of the plot, the first couple of inputs (Age and Amount Deposited) have no bars, indicating they have been removed from the model by the variable selection option we engaged.
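In other words, the bar lengths are simply transformed p-values. Here is a quick sketch of that transformation; the p-values below are hypothetical, and I’m assuming a base-10 log (the plot may use a natural log):

```python
import math

# Hypothetical p-values for a few effects (not taken from the actual model)
pvalues = {"Certificate of Deposit": 1e-40,
           "Checking Account": 2e-9,
           "Age": 0.30}

alpha = 0.05
for name, p in sorted(pvalues.items(), key=lambda kv: kv[1]):
    importance = -math.log10(p)         # bar length in the Fit Summary
    status = "significant" if p < alpha else "not significant"
    print(f"{name:25s} -log10(p) = {importance:5.1f}  ({status})")
```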
It is also interesting to note that degrees of significance are indicated by the lightness or darkness of the color: dark blue is most significant, while light blue is less significant because its p-value is very close to the default significance level of .05. The default significance level (also known as alpha) is set to .05 and is plotted as a black vertical line in the pane. Hover your mouse pointer over the line to see the alpha and -log(alpha) values. You can move this line to change the significance level, but doing so will not affect the variable reduction option; it only affects the degree of significance reflected in the lightness or darkness of the colored bars. The histogram bars at the very bottom of the graph display the percent of the data that falls within the displayed range.
The next plot that is listed on the control bar is the Odds Ratio Plot.
Odds ratio estimates compare the odds of an event occurring in one group to the odds of it occurring in another group. Once the parameter estimates for the logistic regression predictor variables have been computed, it is very easy to calculate the odds ratios: mathematically speaking, the odds ratio for a predictor variable is obtained by simply exponentiating its corresponding parameter estimate. Odds ratios are particularly useful for interpreting the effects of both categorical and continuous inputs on the target of a logistic regression model, because an odds ratio quantifies the change in the odds of the outcome for a one-unit change in the predictor variable. Let’s pull a specific example from the details table. Select the Maximize button on the object toolbar, which opens the details table at the bottom of the canvas. Scroll over and select the Odds Ratio tab to display the odds ratio estimates for each effect in the model. Finally, click the Odds Ratio Estimate column header twice to sort the odds ratios in descending order.
In a logistic regression, an odds ratio greater than 1 indicates increased odds of the target event for a one-unit increase in the predictor, an odds ratio less than 1 indicates decreased odds, and an odds ratio close to or equal to 1 indicates that the predictor has no effect on the target. The odds ratio for Certificate of Deposit (let’s call that COD) is almost 2.5. COD is an input that indicates whether a customer owns a certificate of deposit and is coded 1 for “yes” and 0 for “no”. An odds ratio of 2.5 tells us a couple of things. First, customers who have a COD are more likely to respond to our campaign than those who do not. Second, the odds of responding to the campaign are increased by approximately 150% for customers who have a COD (since 2.5 – 1 = 1.5).
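As a quick check on that arithmetic, here is the exponentiation in Python. The parameter estimate of 0.91 is a hypothetical value chosen so that the odds ratio lands near the 2.5 reported in the details table:

```python
import math

beta_cod = 0.91                       # hypothetical parameter estimate for COD
odds_ratio = math.exp(beta_cod)       # exponentiate the estimate: about 2.48
pct_change = (odds_ratio - 1) * 100   # about a 148% increase in the odds
print(f"Odds ratio: {odds_ratio:.2f}, change in odds: {pct_change:.0f}%")
```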
The next plot on the control bar is the Residual Plot.
As you are probably aware, residuals are the differences between the observed values and the values predicted by the model. There are a variety of important reasons to examine plots of residual values during model building. These include, but are not limited to, the following: assessing how well the model fits the data, detecting patterns that suggest non-linearity or missing input variables, and identifying outliers or unusual observations.
The residual plot from our logistic regression does not appear to show any patterns or trends. Patterns in the residuals can be an indication of poor fit, non-linearity, or missing input variables. Ideally, residuals are randomly scattered around the zero line. The only issue in this plot is a set of very large residual values at a predicted probability of 1; you may be able to detect the faint blue line at the lower-right corner of the plot. While we don’t have time to dig into those observations right now, in the real world we would investigate these outliers.
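If you want to reproduce the idea outside of Visual Statistics, raw residuals for a logistic model are just the observed target minus the predicted probability. A minimal sketch with made-up values:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 0])                      # hypothetical observed target (Ins)
p_hat = np.array([0.62, 0.08, 0.45, 0.91, 0.30])   # hypothetical predicted probabilities
residuals = y - p_hat                              # observed minus predicted
print(residuals)   # ideally scattered randomly around zero, with no pattern
```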
The final plot we will discuss today is the Confusion Matrix.
The confusion matrix is a starting point for evaluating model performance, and several plots, charts, and statistics are based on its four key frequency counts. We start by choosing a cutoff probability (the default in Visual Statistics is 50%, or 0.5) such that all cases with a predicted probability greater than the cutoff are classified as events and all others are classified as non-events. The confusion matrix is computed by comparing these predictions to the actual target values. True positives are events that the model correctly predicted to be events (in our data, purchasers classified as purchasers). True negatives are non-events that the model correctly predicted to be non-events. False positives are non-events that the model incorrectly predicted to be events (in our data, a non-purchaser classified as a purchaser by our model), and false negatives are events that the model incorrectly predicted to be non-events.
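The bookkeeping behind the matrix is straightforward. Here is a minimal sketch, with hypothetical arrays standing in for our actual scored data:

```python
import numpy as np

def confusion_counts(y_true, p_hat, cutoff=0.5):
    """Tally the four cells of the confusion matrix at a given cutoff."""
    pred = (p_hat > cutoff).astype(int)             # classify events vs. non-events
    tp = int(np.sum((pred == 1) & (y_true == 1)))   # events predicted as events
    tn = int(np.sum((pred == 0) & (y_true == 0)))   # non-events predicted as non-events
    fp = int(np.sum((pred == 1) & (y_true == 0)))   # non-events predicted as events
    fn = int(np.sum((pred == 0) & (y_true == 1)))   # events predicted as non-events
    return tp, tn, fp, fn

y_true = np.array([1, 0, 0, 1, 0, 1])                   # hypothetical actual targets
p_hat = np.array([0.72, 0.18, 0.55, 0.41, 0.09, 0.66])  # hypothetical predictions
print(confusion_counts(y_true, p_hat))                  # (2, 2, 1, 1)
```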
The color of the four cells of the confusion matrix is also meaningful. A darker color indicates that the cell’s count is a higher proportion of the number of observations for that observed level. Ideally, the darker colors (higher proportions) fall in the correctly predicted cells. Note that the upper-left cell of true negatives is the darkest in color for our logistic regression, and it is much darker than the (diagonal) lower-right cell of true positives. Instead of discussing color shades, let’s open the details table and look at the actual counts and percentages. Select the Maximize button on the object toolbar, which opens the details table at the bottom of the canvas. Scroll over and select the Confusion Matrix tab to display the frequency counts of the correct and incorrect events and non-events in the model.
Upon preliminary examination, it appears that this model is doing a much better job of correctly classifying non-events (non-purchasers, in our case) than of correctly classifying events. Approximately 88% of non-purchasers are correctly classified by our logistic regression, while only about 44% of purchasers are. Does this mean that our model is “bad”? Or could there be something else affecting these classifications?
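As an aside, those two percentages are what statisticians call specificity (the rate of correctly classified non-events) and sensitivity (the rate of correctly classified events). Here is the arithmetic, with hypothetical counts chosen purely to reproduce the percentages above; the actual counts live in the details table:

```python
# Hypothetical cell counts chosen to reproduce the percentages above
tn, fp = 17600, 2400   # non-purchasers: 17600 / 20000 = 88% correct
tp, fn = 5280, 6720    # purchasers:      5280 / 12000 = 44% correct

specificity = tn / (tn + fp)   # correctly classified non-events
sensitivity = tp / (tp + fn)   # correctly classified events
print(f"Specificity: {specificity:.0%}, Sensitivity: {sensitivity:.0%}")
```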
To answer those questions, you’ll need to read my next post! We’ve covered many of the output features and plots that are available for a logistic regression built in SAS Visual Statistics, but we still have a few more items to discuss. In my next post, we’ll finish up with logistic regression by addressing the fact that this model (at this point in time) does a much better job of predicting non-events than events. In the real world, we typically want a model that does a good job of predicting events; in the business case for our data, we’re much more interested in identifying a purchaser than a non-purchaser! I also want to show you the remaining assessment plots that are available with a logistic regression.
Thank you for continuing this journey with me through developing models in the AI and Analytics lifecycle. As I’ve mentioned before, if you are ready to learn more about logistic regression, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Predictive Modeling Using Logistic Regression. See you next time, and never stop learning!
Find more articles from SAS Global Enablement and Learning here.