This post is the third in a series on building machine learning models with a rare target. The first two posts covered different approaches for dealing with rare target events, while this post focuses on how to evaluate and compare models when the target event level is rare, looking at both cutoff-dependent and cutoff-independent metrics. We continue to work with the credit card fraud data and will compare the same modeling approaches discussed in the previous two posts. We will look at the Precision-Recall curve and the area under the Precision-Recall curve (AUPRC), along with some methods for using business knowledge to create customized, business-value-based metrics for rare target event models.
When building models with rare target events, the misclassification rate can be a misleading metric because it implicitly treats false positives and false negatives as equivalent: each contributes equally to the overall misclassification rate. As an example, this means that when we try to minimize the misclassification rate, we are implicitly claiming that we would prefer "Model A," which outputs 5 false positives and 5 false negatives (10 total errors), over "Model B," which outputs 15 false positives and 0 false negatives (15 total errors), since Model A has the lower overall misclassification rate. This implicit assumption makes a big difference when we work with rare target events, since the goal is generally to identify the rare events rather than to minimize overall error.
If we are predicting credit card fraud, a false positive is a normal transaction that is flagged as fraud, triggering some additional security measures before the transaction is processed. A false negative is a fraudulent transaction that we miss, allowing the fraudsters to complete their transaction. A business leader would likely prefer Model B, which identifies all fraud cases but triggers 15 unnecessary fraud security reviews, over Model A, which triggers only 5 fraud security reviews but allows 5 fraudulent transactions to proceed. A detailed accounting of the business cost of false positives and false negatives would allow us to create a custom metric to maximize business value, but we can do better than the misclassification rate even without this information.
The misclassification rate is a cutoff-dependent metric and is often used in machine learning models to select an optimal cutoff that minimizes model error. Given the issues described above, we could instead select the optimal cutoff based on the F1 score (the harmonic mean of precision and recall, described in a previous post in this series), but an even better option is to create a custom metric based on the relative cost of false positives versus false negatives and then select the cutoff that maximizes business value (or minimizes cost) under that metric.
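As a concrete illustration, here is a minimal sketch of cost-based cutoff selection. The unit costs below are hypothetical placeholders (real values should come from a business accounting exercise), and the code assumes an OUTROC-style table like the ones created later in this post, which stores false positive (_FALPOS_) and false negative (_FALNEG_) counts at each cutoff (_PROB_):

/* a minimal sketch of cost-based cutoff selection; the unit costs are hypothetical */
data work.cost_by_cutoff;
   set work.baseline_roc_logistic; /* OUTROC-style table created later in this post */
   cost_fp = 5;   /* assumed cost of one unnecessary security review   */
   cost_fn = 100; /* assumed cost of one missed fraudulent transaction */
   total_cost = _FALPOS_ * cost_fp + _FALNEG_ * cost_fn;
run;

/* the cutoff (_PROB_) with the smallest total cost maximizes business value */
proc sort data=work.cost_by_cutoff;
   by total_cost;
run;
proc print data=work.cost_by_cutoff(obs=1);
   var _PROB_ total_cost _FALPOS_ _FALNEG_;
run;

With real cost estimates from the business, this pattern turns cutoff selection into a direct business-value optimization.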
When comparing many models and modeling approaches, it can be helpful to separate the problem into an estimation problem (using models to estimate predicted probabilities) and a decision problem (selecting a cutoff to make business decisions based on the model-generated probabilities). In this case we want to select the model that does the best job of estimating the probabilities, independent of the cutoff we eventually select. This means we cannot judge and compare models using cutoff-dependent metrics like misclassification or the F1 score. For traditional machine learning models, the area under the ROC curve (often called the ROC index, c-statistic, or just area under the curve, AUC) is used as a cutoff-independent measure of model performance. The ROC curve itself plots the false positive rate on the X-axis and the true positive rate on the Y-axis, so the area under the curve measures how many true positives the model can capture without also capturing too many false positives. This metric once again treats false positives and false negatives as equivalent: a missed event (a false negative) counts against the model in the same way that a false positive does. Model A described earlier would have a higher ROC index than Model B since it made fewer incorrect predictions overall, even though Model B might achieve better business value.
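For reference, the ROC curve and its area are easy to produce in SAS; here is a minimal sketch using the same model specification as the PR-curve call shown later in this post:

/* request the ROC curve and AUC for the baseline model on the training partition */
proc logistic data=casuser.creditcardfraud(where=(_partind_=0)) plots=ROC;
   model class(event="1") = V1-V28 amount;
run;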
Given the problems with the ROC curve for rare target events, we instead look at the Precision-Recall curve, which plots precision on the Y-axis and recall on the X-axis. Precision and recall are both commonly used metrics for anomaly detection and rare target event models and are discussed in the previous post in this series (precision = TP / (TP + FP), the share of flagged cases that are true events; recall = TP / (TP + FN), the share of true events that are flagged). Precision and recall are cutoff-dependent metrics, so the Precision-Recall curve plots a range of precision and recall values corresponding to different cutoff values. We can calculate the area under the Precision-Recall curve (AUPRC) and then use this as a cutoff-independent measure of model performance when working with rare target events. We will look at how to create these metrics and plots in SAS for the credit card fraud modeling task described in the previous posts and then discuss some details about calculating and interpreting the AUPRC.
We start with the data already prepared, partitioned, and modeled based on the work in the previous posts, and we look at a few different ways to plot the Precision-Recall curve in SAS. We start with PROC LOGISTIC, which has a built-in option to plot the curve: all we have to do is specify the plots=PR option in the PROC LOGISTIC statement:
/* fit the baseline logistic regression on the training partition and */
/* request the Precision-Recall curve with the plots=PR option        */
proc logistic data=casuser.creditcardfraud(where=(_partind_=0)) plots=PR;
   model class(event="1") = V1-V28 amount;
run;
In this case the AUPRC is 0.7713 (like the area under the ROC curve, it is a number between 0 and 1). An ideal model would have an AUPRC of 1, with the curve hugging the top right of the plot, implying that there is a cutoff we could select that would yield a value of 1 for both precision and recall (perfect classification). Realistic models involve a tradeoff between precision and recall, and the Precision-Recall curve can help identify good cutoffs that achieve high values of both.
This automatically generated plot is convenient if we are building a logistic regression, but if we don't use PROC LOGISTIC (say, because we built a more complex nonlinear machine learning model), we need another way to visualize the Precision-Recall curve and calculate the area under it. If we have precision and recall values across a range of cutoffs, we can plot the curve ourselves, although calculating the area under the curve is more complicated. Let's start by manually plotting the curve.
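The work.precision_recall table used below is not created automatically. One plausible sketch for building it, assuming the OUTROC-style tables created later in this post (which store counts of true positives, false positives, and false negatives at each cutoff) and hypothetical approach labels:

/* stack the OUTROC-style tables and compute precision and recall per cutoff */
data work.precision_recall;
   length approach $32;
   set work.baseline_roc_logistic           (in=a)
       work.eventbasedsampling_roc_logistic (in=b)
       work.smote_roc_logistic              (in=c);
   if a then approach = "Baseline";
   else if b then approach = "Event-Based Sampling";
   else approach = "SMOTE";
   /* OUTROC= tables store counts at each cutoff: _POS_ (true positives), */
   /* _FALPOS_ (false positives), and _FALNEG_ (false negatives)          */
   precision = _POS_ / (_POS_ + _FALPOS_);
   recall    = _POS_ / (_POS_ + _FALNEG_);
run;

/* PROC SGPLOT's BY statement requires the data to be sorted by approach */
proc sort data=work.precision_recall;
   by approach;
run;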
/* plot precision versus recall at each cutoff, one curve per modeling approach */
proc sgplot data=work.precision_recall;
   by approach;
   scatter x=recall y=precision / markerattrs=(size=5px symbol=circlefilled);
run;
The BY statement means that the curve is plotted separately for each modeling approach, so we can compare the Precision-Recall curves (note that the BY statement requires the input data to be sorted by approach).
Visually inspecting the plots, the baseline approach appears to do the best job of maximizing both precision and recall, although both the event-based sampling approach and the SMOTE approach had a few cutoff values that yielded perfect precision (with very poor recall), whereas the baseline model did not. Another thing to note about the plot is that the curve itself is "spiky," or non-monotonic, unlike the ROC curve. This spikiness means that estimating the area under the Precision-Recall curve using the simple trapezoidal rule is not quite correct, and we must use a more sophisticated interpolation scheme to calculate the area accurately (another option is to calculate the Average Precision instead of the area under the PR curve, which will be the topic of a future post in this series). Rather than calculating this area manually, we will use a SAS macro designed for the task, which also takes into consideration the challenge of interpolating in precision-recall space. See the reference to Davis and Goadrich (2006) for more details on calculating the area under the Precision-Recall curve.
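To see why linear interpolation fails, here is a small sketch of the Davis and Goadrich (2006) interpolation between two PR points, using hypothetical confusion counts: false positives are assumed to grow linearly with true positives between the two cutoffs, which makes precision a nonlinear function of recall.

data work.pr_interpolation;
   /* hypothetical confusion counts at two adjacent cutoffs A and B */
   tp_a = 50; fp_a = 10;
   tp_b = 80; fp_b = 40;
   n_pos = 100;  /* hypothetical total number of actual events */
   slope = (fp_b - fp_a) / (tp_b - tp_a);
   do x = 0 to (tp_b - tp_a);
      tp = tp_a + x;              /* step true positives one at a time    */
      fp = fp_a + x * slope;      /* false positives grow linearly in TP  */
      recall    = tp / n_pos;
      precision = tp / (tp + fp); /* nonlinear in recall */
      output;
   end;
   keep recall precision;
run;

In this hypothetical example, the interpolated segment sags below the straight line connecting the two points, which is exactly why the trapezoidal rule overestimates the area.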
%include '/path/to/prcurve.sas';

/* first, create the outROC table the macro expects using PROC LOGISTIC: */
/* NOFIT skips model fitting, and the ROC statement evaluates the        */
/* pre-computed predicted probabilities in P_class1                      */
proc logistic data=casuser.baseline_scored(where=(_partind_=1)) noprint;
   model class(event="1") = / nofit outroc=work.baseline_roc_logistic;
   roc pred=P_class1;
run;

/* then call the macro using the dataset output by PROC LOGISTIC */
title "Precision and Recall for Baseline Model";
%prcurve(data=work.baseline_roc_logistic);
The steps for using the %prcurve() macro (see the references to download the macro and for a description of its parameters) are straightforward, as shown above: first create the outROC table with PROC LOGISTIC, then call the macro with that table as input.
When we use the macro to plot the Precision-Recall curve it looks a bit better (the macro performs the non-trivial interpolation), and we also get the area under the Precision-Recall curve (AUPRC) along with the positive proportion (in this case, the percentage of fraud events in the data). The positive proportion is a useful reference point because it is the expected AUPRC of a no-skill model that flags cases at random; the smaller the positive proportion, the more imbalanced the dataset and the more important it is to consider precision and recall instead of just misclassification. We can create the same plot for our other modeling approaches from the previous posts (event-based sampling and SMOTE).
/* Event-Based Sampling */
proc logistic data=casuser.eventbasedsampling_scored(where=(_partind_=1)) noprint;
   model class(event="1") = / nofit outroc=work.eventbasedsampling_roc_logistic;
   roc pred=P_class1;
run;

title "Precision and Recall for Event-Based Sampling Model";
%prcurve(data=work.eventbasedsampling_roc_logistic);
/* SMOTE */
proc logistic data=casuser.smote_scored(where=(_partind_=1)) noprint;
   model class(event="1") = / nofit outroc=work.smote_roc_logistic;
   roc pred=P_class1;
run;

title "Precision and Recall for Synthetic Minority Oversampling Model";
%prcurve(data=work.smote_roc_logistic);
Using the AUPRC, we still select the baseline model as our champion, outperforming our naïve attempts to use event-based sampling or SMOTE to improve model performance on this very imbalanced data. Remember that the AUPRC is a cutoff-independent measure (just like the AUROC), so we would still have to select an ideal cutoff that maximizes business value (instead of minimizing misclassification, as we might with simpler balanced data).
In this post we learned how to compare models using the Precision-Recall curve instead of the traditional ROC curve, and how to calculate the area under the Precision-Recall curve using a SAS macro and the LOGISTIC procedure (in either SAS 9 or SAS Viya). The AUPRC is a popular metric for evaluating models with rare target events and is convenient because it is independent of the probability cutoff selected when deploying the model. In the next post in this series we will stick with the AUPRC to compare models, but we will try a few different options for improving model performance for rare target event models over the baseline. We will also discuss some limitations of the AUPRC and introduce the Average Precision calculation.
References:
Davis, J., and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning (ICML).
Find more articles from SAS Global Enablement and Learning here.