
Data-Driven Analytics in SAS Viya – Logistic Regression Model Assessment


 

In today’s post, we'll take a look at how to assess a logistic regression model built in SAS Viya. In the third and fourth posts of this series, I showed you just how easy it is to build and interpret a logistic regression model in SAS Visual Statistics. We'll cap off that discussion by returning to a set of outputs from that model known as assessment plots. We will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, let’s examine the remaining pieces of output from the logistic regression model that was built using variable annuity (insurance product) data.

 

Remember, the business challenge is to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table that was introduced previously contains just over 32,000 banking customers. The input variables reflect both demographic information and product usage captured over a three-month period. The target variable, named Ins, is binary: a value of 1 identifies a purchaser and a value of 0 identifies a non-purchaser.

 

[Figure: A sample of the develop_final data]


 

Since we’ve already created a logistic regression model and begun to examine the output, let’s dive right into exploring the various assessment plots that Visual Statistics provides. You may recall that my last post ended with a high-level overview of the Confusion Matrix. The confusion matrix reveals the correct and incorrect classifications of the model based on a .50 cutoff probability, which is the default for SAS Visual Statistics.

 

[Figure: Confusion Matrix]

 

[Figure: Confusion Matrix summary table]

 

The confusion matrix is computed by comparing our predictions to the actual target values. True positives are events that the model correctly predicted to be events (in our data, purchasers classified as purchasers). True negatives are non-events that the model correctly predicted to be non-events. False positives are non-events that the model incorrectly predicted to be events (in our data, non-purchasers classified as purchasers by our model), and false negatives are events that the model incorrectly predicted to be non-events.

 

As we discussed in the previous post, it appears that this model is doing a much better job of correctly classifying a non-event (or non-purchaser, in our case) than correctly classifying an event. Approximately 88% of non-purchasers are correctly classified by our logistic regression, while only about 44% of the purchasers are correctly classified. Does this mean that our model is “bad”? Or could something else be affecting these classifications? Before we answer that question directly, let's look at another assessment plot which is very closely related.
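
If you'd like to see the arithmetic behind these four cells and two rates, here is a minimal Python sketch. The arrays y_true and p_hat are hypothetical stand-ins for the actual Ins values and the model's predicted probabilities, not the develop_final data; Visual Statistics performs this bookkeeping for you behind the scenes.

```python
# Minimal sketch: confusion matrix cells at the default .50 cutoff.
# y_true and p_hat are hypothetical stand-ins, not the develop_final data.
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)                         # actual 0/1 target (like Ins)
p_hat = np.clip(0.35 * y_true + 0.6 * rng.random(1000), 0, 1)  # predicted event probabilities

cutoff = 0.50
y_pred = (p_hat >= cutoff).astype(int)      # classify as an event when probability >= cutoff

tp = np.sum((y_true == 1) & (y_pred == 1))  # purchasers correctly classified
tn = np.sum((y_true == 0) & (y_pred == 0))  # non-purchasers correctly classified
fp = np.sum((y_true == 0) & (y_pred == 1))  # non-purchasers classified as purchasers
fn = np.sum((y_true == 1) & (y_pred == 0))  # purchasers classified as non-purchasers

print(f"Events correctly classified:     {tp / (tp + fn):.1%}")  # sensitivity
print(f"Non-events correctly classified: {tn / (tn + fp):.1%}")  # specificity
```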

 

Right-click on the confusion matrix to discover there are a total of 5 assessment plots:

 

  • Confusion Matrix
  • Lift
  • ROC
  • Cutoff plot
  • Misclassification

 

Let's select the Misclassification plot to examine the results.

 

[Figure: Misclassification plot]

 

The Misclassification plot displays how many observations were correctly and incorrectly classified for each value of the response variable. Honestly, this is the same information contained in the confusion matrix, only here it is grouped by value of the response variable. Examining the first bar (the bar representing a target value of 1) clearly shows more incorrect classifications (approximately 6,000) than correct classifications (approximately 4,000). Another way to think of that is that the yellow portion of the bar is larger than the blue portion. If we look at the target value of 0, we can see that the blue portion of the bar is much larger than the yellow. The Misclassification plot is based on the exact same values as the Confusion Matrix we saw above, so it should be no surprise that it further confirms this logistic regression model is doing a much better job of correctly identifying the non-purchasing customers than the purchasing customers.
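
Because the plot is just the confusion matrix regrouped by the actual target value, its counts are easy to reproduce. Here is a hedged pandas sketch using the same kind of hypothetical y_true and y_pred arrays as before:

```python
# Sketch: the misclassification plot's counts are the confusion matrix
# cells regrouped by the actual target value. Hypothetical arrays again.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)   # actual 0/1 target
y_pred = rng.integers(0, 2, size=1000)   # stand-in classifications

counts = pd.crosstab(
    index=pd.Series(y_true, name="actual value"),
    columns=pd.Series(y_pred == y_true, name="correctly classified"),
)
print(counts)   # one row per target value: incorrect (False) vs. correct (True)
```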

 

Did you know that both of the plots we are discussing are highly sensitive to the cutoff probability? That's right. If we change the default cutoff value of .50 to something higher or lower, the frequency counts of the true and false positives and negatives will most likely change as well. This raises the question: can we change the cutoff value to improve the performance of this model? And how would we define or quantify that improvement? Would we settle for just improving the number of true positives (identifying a purchaser as a purchaser)? But what if that causes us to have fewer true negatives? The answer lies in the business objective stated for the analysis. In our case, remember, we are trying to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. Another way of thinking about this is that we are trying to maximize the total number of "true" or correct classifications. Whether that classification is negative or positive is perhaps not as important in this scenario. We want to find a model that is going to identify a purchaser as a purchaser and a non-purchaser as a non-purchaser. A great way to find that optimal cutoff value is to examine the Cutoff plot.

 

[Figure: Cutoff plot]

 

The Cutoff plot enables you to visualize how the cutoff value affects model performance. On the x-axis, the full range of cutoff values is plotted from 0 to 1. The cutoff value is specified in the Prediction cutoff option and is represented by the vertical line in the plot. You can drag the vertical line to adjust the cutoff value, which reassesses the model. The cutoff value is currently set at the default of .50. Remember, if we were to move this cutoff all the way down to 0, then every row of the data would be classified as an event (a value of 1, or in our case, a purchaser). That also means we would capture 100% of the purchasers correctly, but we would not correctly identify any of the non-purchasers. At the other extreme, we could move the cutoff all the way up to 1. In that scenario, we correctly capture all of the non-purchasers, but none of the purchasers. So, where should we set the cutoff? The chart can give us a clue.

 

Let's also note that on the y-axis of the Cutoff plot, we have three different statistics represented by lines. The Accuracy rate is the total number of correct predictions divided by the total number of predictions. Sensitivity (also known as recall or the true positive rate) is a measure of the model's ability to correctly identify the positive class, or the 1's. Specificity (also known as the true negative rate) is a measure of the model's ability to correctly identify the negative class, or the 0's. As the cutoff moves from 0 to 1, you can see that the sensitivity is always decreasing and the specificity is always increasing. In our case, we'd like to maximize both the sensitivity and the specificity, which is reflected in their intersection on this plot. This strategy works well for our business objective and gives equal importance to both. This method of selecting a cutoff works when both error types are equally costly.
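
To make what the plot is tracing concrete, the sketch below sweeps a grid of cutoffs and computes the three statistics at each one. As before, y_true and p_hat are hypothetical stand-ins for the real target and model scores.

```python
# Sketch: accuracy, sensitivity, and specificity across a grid of cutoffs,
# i.e., the three lines of the Cutoff plot. Hypothetical arrays, as before.
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
p_hat = np.clip(0.35 * y_true + 0.6 * rng.random(1000), 0, 1)

for cutoff in np.linspace(0.1, 0.9, 9):
    y_pred = (p_hat >= cutoff).astype(int)
    acc = np.mean(y_pred == y_true)            # correct predictions / all predictions
    sens = np.mean(y_pred[y_true == 1] == 1)   # true positive rate (falls as cutoff rises)
    spec = np.mean(y_pred[y_true == 0] == 0)   # true negative rate (rises with the cutoff)
    print(f"cutoff={cutoff:.1f}  accuracy={acc:.3f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```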

 

[Figure: Cutoff plot with the updated cutoff value]

 

Dragging the vertical line to the intersection of sensitivity and specificity changes the cutoff value from .50 to .34. Since both the Confusion Matrix and the Misclassification plot are affected by this change in cutoff value, it will be interesting to see whether they look "better" than they did originally.
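
Rather than dragging the line by eye, you could also locate that intersection programmatically. One possible approach, sketched with scikit-learn's ROC utilities on the same hypothetical arrays (sensitivity is the true positive rate, and specificity is one minus the false positive rate):

```python
# Sketch: find the cutoff where sensitivity and specificity are closest.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
p_hat = np.clip(0.35 * y_true + 0.6 * rng.random(1000), 0, 1)

# roc_curve evaluates every distinct score as a candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, p_hat)
best = np.argmin(np.abs(tpr - (1 - fpr)))   # sensitivity ~ specificity here
print(f"Cutoff at the intersection: {thresholds[best]:.2f}")
```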

 

[Figure: Confusion Matrix with the updated cutoff]

 

Examining the Confusion Matrix with the updated cutoff reveals that the diagonals are similar in color. Remember, a darker color indicates a higher proportion of the value in that cell relative to the number of observed values for that level (or row). Previously, the cell of true negatives was much darker than the cell of true positives. Here, we're seeing a very nice balance. In other words, we see a similar proportion in both of those diagonal cells. This is what we were aiming for when updating the cutoff: we wanted to increase the number of true positives, but not at too great a cost to the true negatives.

 

[Figure: Misclassification plot with the updated cutoff]

 

We see a similar improvement in the Misclassification plot. Previously, the first bar (the bar representing a target value of 1) revealed more incorrect classifications than correct classifications. Now, the number of correct classifications has improved. Examining both bars at the same time, we see a relatively proportional number of correct classifications. In the original plot, the proportions were imbalanced, with the second bar holding a much larger share of the correct classifications.

 

In general, when trying to select an appropriate cutoff value, you can start by examining the percentage of events (also known as the base rate) in the original population of the target variable. In our sample of data in the develop_final table, the percentage of events (purchasers) is approximately 35%. Many consider the percentage of events in the original population to be an excellent starting point for selecting a cutoff. By examining the Cutoff plot from the Visual Statistics logistic regression model, we ended up selecting a cutoff of 34%, nearly identical. This should be an indication that we are in the right neighborhood, and there is always time for further fine-tuning of the model and the cutoff. Choosing a probability cutoff level will always depend on the specific goals of the analysis, the costs of classification errors, and the characteristics of the data. A quick code sketch of this base-rate idea follows the list below. Other approaches that were not discussed in this post include:

 

  • Cost of Errors
  • Using the ROC (Receiver Operating Characteristic) Curve
  • Precision-Recall Tradeoff
  • Maximizing F1 Score
  • Cross-Validation
  • Maximizing Profit/Utility
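
As promised, here is the base-rate idea in code form. The proportion of events is simply the mean of the binary target; y_true is a hypothetical stand-in for the Ins column.

```python
# Sketch: the event percentage (base rate) as a starting cutoff candidate.
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)   # hypothetical stand-in for Ins

base_rate = y_true.mean()   # proportion of events; about .35 in the real develop_final table
print(f"Starting cutoff candidate: {base_rate:.2f}")
```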

 

Interestingly enough, the final two assessment plots remain unaffected by a changing cutoff. The ROC chart actually encompasses the full range of cutoff values, while the Lift chart is created independently of the cutoff.

 

To see those final two pieces of assessment, you’ll need to read my next post! We’ve done a great job today of discussing three of the five assessment plots that are available for a logistic regression built in SAS Visual Statistics. But we still need to discuss the ROC chart and the Lift Chart. In my next post we’ll finish up with logistic regression by discussing these two plots. We want to chat about how, when, and where they can be useful when assessing models. Thank you for continuing your journey with me of developing models in the AI and Analytics lifecycle. As I’ve mentioned before, if you are ready to learn more about logistic regression, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Predictive Modeling Using Logistic Regression. See you next time and never stop learning!

 

 

Find more articles from SAS Global Enablement and Learning here.
