This post is the second in a series on building machine learning models with a rare target. The first post focused on building a baseline model and an event-based sampling model on data with a rare target event (in this case credit card fraud, which represented about 0.17% of the labeled data). We compared the performance of the two approaches and found that event-based sampling did not improve model performance. In this second part we focus on enhancing the data with synthetic minority samples (fake fraud examples) and compare this approach to the others from Part 1. We also explore the impact that cutoffs have on rare target event models and discuss metrics for evaluating model performance that make more sense than naively comparing models using the misclassification rate.
Synthetic Minority Oversampling Technique (SMOTE)
In this approach we create synthetic data representing additional fraudulent transactions. We do this by applying the synthetic minority oversampling technique (SMOTE) to create new interpolated cases near the existing fraud cases in the input space. This artificially increases the representation of fraud in the dataset, hopefully helping the model learn the decision boundary between fraud and non-fraud cases. The SMOTE Procedure allows us to generate as many synthetic samples as we want, so we can add 198,676 synthetic fraud cases to the data to yield a 50/50 sample with all the original data and a lot of new synthetic events. This might not be sensible since most of our fraud events would be synthetic, but it allows us to compare the two extreme options for creating a balanced sample: dropping non-fraud cases until the classes balance (the event-based sampling approach from Part 1), or keeping every original observation and adding synthetic fraud cases until the classes balance (the SMOTE approach shown here).
We can explore a wide range of options between these two extremes, from performing event-based sampling to create a 20/80 unbalanced sample (dropping fewer non-fraud cases) to performing SMOTE to create a balanced sample with many fewer observations (adding a smaller number of synthetic fraud events and then also dropping non-fraud events). It's probably a good idea to consider these less extreme options for handling rare events, and the best choice will often depend on the overall amount of data and the number of real events in the data.
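For example, here is a minimal sketch of the 20/80 event-based option. The table name casuser.train_2080 is hypothetical, the step assumes a numeric 0/1 target, and the keep rate is derived from this example's approximate training counts (roughly 340 fraud cases and roughly 199,000 non-fraud cases), so treat those numbers as assumptions:
data casuser.train_2080;
   call streaminit(919);
   set casuser.train;
   /*keep every fraud case (assumes a numeric 0/1 target)*/
   if &target = 1 then output;
   /*keep roughly 1,360 of the ~199,000 non-fraud cases (about 0.7%)
     so that fraud makes up about 20% of the result*/
   else if rand('uniform') < 0.007 then output;
run;
With that range of options in mind, the code below performs SMOTE for the fully balanced 50/50 extreme: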
proc smote data=casuser.train seed=919;
   input &inputs / level=interval;
   input &target / level=nominal;
   output out=casuser.smote_samples;
   sample numSamples=198676 augmentvar=&target augmentlevel="1";
run;
The SMOTE Procedure generates synthetic fraud events from the training data, in this case enough to create a 50/50 balanced training sample when we merge the synthetic data back with the original data.
/*we have to merge back the new samples with the original training data*/
data casuser.train_smote;
   set casuser.train casuser.smote_samples;
run;
proc freq data=casuser.train_smote;
   table class;
run;
After merging the synthetic samples back with the original we can train a gradient boosting model on the augmented training data, score the original data using the trained model, and evaluate model performance on the original validation data. Note that we must adjust the predicted probabilities just like we did with the event-based sampling approach.
/*now we train a gradient boosting model on the SMOTE modified sample*/
proc gradboost data=casuser.train_smote outmodel=casuser.smote_model seed=919 noprint;
   input &inputs / level=interval;
   target &target / level=nominal;
run;
/*score the full dataset (both partitions) using the trained SMOTE model*/
proc gradboost data=casuser.creditcardfraud inmodel=casuser.smote_model noprint;
   output out=casuser.smote_scored copyvars=(class _partind_);
run;
/*adjust the model's predicted probabilities based on the SMOTE sampling we
  performed; this correction rescales the predicted probabilities back toward
  the original event rate*/
%let zerorate   = 0.9983; /*non-fraud rate in the original data*/
%let onerate    = 0.0017; /*fraud rate in the original data*/
%let zerosample = 0.5;    /*non-fraud rate in the SMOTE training sample*/
%let onesample  = 0.5;    /*fraud rate in the SMOTE training sample*/

data casuser.smote_scored;
   set casuser.smote_scored;
   /*scale each probability by the ratio of its original rate to its sample rate*/
   tempA = P_class1 / (&onesample / &onerate);
   tempB = P_class0 / (&zerosample / &zerorate);
   /*renormalize so the adjusted probabilities sum to one*/
   P_class1_adj = tempA / (tempA + tempB);
   P_class0_adj = tempB / (tempA + tempB);
   drop tempA tempB;
run;
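A quick, optional sanity check (our own addition, not a required step): after the adjustment, the average predicted event probability should land much closer to the original fraud rate of roughly 0.17% than the raw probabilities produced by the model trained on the balanced sample.
/*compare the average raw and adjusted event probabilities*/
proc means data=casuser.smote_scored mean;
   var P_class1 P_class1_adj;
run;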
/*assess model performance on the training and validation partitions*/
proc assess data=casuser.smote_scored ROCout=casuser.smote_roc;
   var p_class1_adj;
   target class / event="1" level=nominal;
   by _partind_;
run;
/*for now let's start by evaluating the model at the 0.5 cutoff*/
data casuser.smote_eval;
   set casuser.smote_roc;
   length approach $30 partition $5;
   approach = 'smote';
   if _partind_ = 1 then partition = 'valid';
   else partition = 'train';
   where (_Cutoff_ <= 0.505 and _Cutoff_ >= 0.495);
   keep approach partition _Cutoff_ _TP_ _FP_ _FN_ _TN_ _Sensitivity_ _Specificity_ _FPR_ _ACC_ _MiscEvent_ _C_ _F1_;
run;
/*let's do a running comparison of our models*/
data casuser.combined_eval;
   set casuser.combined_eval casuser.smote_eval;
run;

proc print data=casuser.combined_eval;
run;
We can see that the augmented training data didn't help the model capture any more true positives than the baseline model, and in fact it seemed to reduce the overall accuracy of the model. We might find better performance using fewer synthetic samples, and testing that idea only requires changing the numSamples value, as sketched below. Still, it's hard to imagine that adding synthetically generated data will dramatically improve model performance on real data.
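A minimal sketch of a 20/80 version (the table name casuser.smote_samples_2080 is hypothetical, and the numSamples value is derived from this example's approximate training counts, so treat it as an assumption):
proc smote data=casuser.train seed=919;
   input &inputs / level=interval;
   input &target / level=nominal;
   output out=casuser.smote_samples_2080;
   /*~199,000 non-fraud cases imply ~49,750 fraud cases for a 20/80 mix,
     so we generate ~49,400 synthetic events on top of the ~340 real ones*/
   sample numSamples=49400 augmentvar=&target augmentlevel="1";
run;
The downstream steps would be unchanged, except that the probability adjustment would use &onesample = 0.2 and &zerosample = 0.8.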
Baseline Model with Cutoff Adjustments
We can also use the baseline model, but instead of accepting the default 0.5 probability cutoff for fraud versus non-fraud events, we can choose the cutoff probability that best achieves our goal of identifying fraud cases. There will be meaningful tradeoffs associated with changing the cutoff (increasing true positives often increases false positives), but this is often worth it when building models on rare target events. We don't even have to refit the baseline model; instead, we can dig deeper into the results from the ASSESS Procedure.
/*now let's return to the baseline model and explore the impact of using different cutoffs*/
data work.baseline_eval_cutoff;
   set casuser.baseline_roc;
   length approach $30;
   approach = 'baseline';
   where _partind_ = 1;
   keep approach _Cutoff_ _TP_ _FP_ _FN_ _TN_ _Sensitivity_ _Specificity_ _FPR_ _ACC_ _MiscEvent_ _C_ _F1_;
run;

/*choose the cutoff with the lowest misclassification*/
proc sort data=work.baseline_eval_cutoff;
   by _MiscEvent_;
run;
/*we could do this same analysis on the event-based sampling and SMOTE approaches as well*/
data work.eventbasedsampling_eval_cutoff;
   set casuser.eventbasedsampling_roc;
   length approach $30;
   approach = 'event based sampling';
   where _partind_ = 1;
   keep approach _Cutoff_ _TP_ _FP_ _FN_ _TN_ _Sensitivity_ _Specificity_ _FPR_ _ACC_ _MiscEvent_ _C_ _F1_;
run;

/*choose the cutoff with the lowest misclassification*/
proc sort data=work.eventbasedsampling_eval_cutoff;
   by _MiscEvent_;
run;
data work.smote_eval_cutoff;
   set casuser.smote_roc;
   length approach $30;
   approach = 'smote';
   where _partind_ = 1;
   keep approach _Cutoff_ _TP_ _FP_ _FN_ _TN_ _Sensitivity_ _Specificity_ _FPR_ _ACC_ _MiscEvent_ _C_ _F1_;
run;

/*choose the cutoff with the lowest misclassification*/
proc sort data=work.smote_eval_cutoff;
   by _MiscEvent_;
run;
/*keep the best (lowest misclassification) cutoff from each approach*/
data work.eval_cutoff;
   set work.baseline_eval_cutoff(obs=1)
       work.eventbasedsampling_eval_cutoff(obs=1)
       work.smote_eval_cutoff(obs=1);
run;
/*final summary of performance for validation data*/
data work.combined_cutoff;
   set work.eval_cutoff casuser.combined_eval(where=(partition='valid'));
run;

proc sort data=work.combined_cutoff;
   by approach;
run;

title 'Model Performance on Validation Data';
proc print data=work.combined_cutoff(drop=partition);
run;
In this example we choose the cutoff that leads to the lowest misclassification rate. This is a simple way to select a reasonably good cutoff value, but it makes an implicit assumption that False Negatives and False Positives are equally problematic for the business (it treats them as equal failures). When working with rare target events this is almost certainly an invalid assumption, and thus misclassification is a naïve metric for these models. In our credit card fraud example, incorrectly flagging a transaction as fraud is generally going to be much cheaper than incorrectly ignoring a fraudulent transaction (hopefully a thorough fraud verification costs less than what we'd lose through the fraud). The best practice is to identify the business costs of False Negatives and False Positives and then choose a cutoff value that minimizes the total business cost. We will explore this strategy in a subsequent post, but note that it does require more information about the business use case than most predictive modeling tasks.
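As a rough illustration of that strategy, here is a minimal sketch using hypothetical unit costs (the 50:1 cost ratio and the table name work.baseline_cost_eval are assumptions for illustration, not numbers from this analysis):
/*score each cutoff by an assumed business cost where a missed fraud
  (False Negative) costs 50 times as much as a needless review (False Positive)*/
data work.baseline_cost_eval;
   set work.baseline_eval_cutoff;
   cost = 50 * _FN_ + 1 * _FP_;
run;

/*after sorting, the first row holds the cutoff with the lowest total cost*/
proc sort data=work.baseline_cost_eval;
   by cost;
run;

proc print data=work.baseline_cost_eval(obs=1);
   var _Cutoff_ cost _TP_ _FP_ _FN_ _TN_;
run;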
Notice in the results that adjusting the cutoff from 0.5 improves misclassification (that's how we selected the best cutoff), but looking at the confusion matrix we see that it may not achieve our business goals. For the baseline model, changing the cutoff from 0.5 to 0.85 decreases misclassification from 0.046% to 0.042%, but it also reduces the number of True Positive fraud cases we identify from 120 to 117. It's important to be aware of these tradeoffs when evaluating models without full knowledge of the costs of False Positives and False Negatives.
Comparison of Strategies
A takeaway from this analysis is that the misclassification rate is often a misleading metric for models with rare target events, and focusing only on misclassification can lead to situations where a model is selected as champion even if it doesn't do the best job of identifying the rare target event. Instead, a business-motivated balance between False Positives and False Negatives is generally a better focus when working with rare target events.
In this example the baseline model at the 0.85 cutoff looks like the winner, with the lowest misclassification on validation data and the lowest number of False Positives, only 5 (although there are 31 False Negatives). The SMOTE model at the 0.09 cutoff minimizes the number of False Negatives at 23, at the cost of 21 False Positives. This could be the preferred model for the business; perhaps the cost of the 16 additional False Positives from the SMOTE model is less than the cost of the 8 additional False Negatives from the baseline model.
If we don't have the detailed business information to identify the relative cost of False Negatives versus False Positives, we can still evaluate models based on how well they identify the rare target event by calculating Precision and Recall, two metrics commonly used in anomaly detection: Precision = TP / (TP + FP), the share of predicted events that are true events, and Recall = TP / (TP + FN), the share of true events that the model successfully flags.
It's easiest to think about these in the context of the fraud modeling example we have been working on: Precision is the fraction of the transactions we flag as fraud that really are fraudulent, and Recall is the fraction of all fraudulent transactions that we manage to flag.
data work.precision_recall;
   set work.baseline_eval_cutoff work.eventbasedsampling_eval_cutoff work.smote_eval_cutoff;
   precision = _TP_ / (_TP_ + _FP_);
   recall = _TP_ / (_TP_ + _FN_);
   /*F1 is the harmonic mean of precision and recall*/
   F1 = 2 / (1/precision + 1/recall);
run;
The ASSESS Procedure does not automatically generate Precision and Recall values, but they are easy to calculate from the confusion matrix. From them we can also calculate the F1 score, which is the harmonic mean of Precision and Recall. The F1 score is automatically calculated by the ASSESS Procedure (it's listed as _F1_), so we could also select models based on that column directly.
Choosing the cutoff that yields the best F1 score is a good choice for rare target event models when we don't have an explicit cost balance between False Positives and False Negatives. This provides a better way to select a model champion (and, just as important, an ideal cutoff for that champion) than using misclassification.
proc sort data=work.precision_recall;
   by approach descending F1;
run;

/*keep the best (highest F1) cutoff for each approach*/
data work.F1_cutoff_eval;
   set work.precision_recall;
   by approach;
   if first.approach then output;
run;

proc print data=work.F1_cutoff_eval;
   var approach _cutoff_ precision recall F1 _TP_ _FP_ _FN_ _TN_ _MiscEvent_;
run;
Using the F1 score as the metric of interest selects very different cutoff values than using misclassification. In both cases the baseline model outperforms the approaches based on modifying the training sample to emphasize the rare target event. When working with rare target events without information about the relative costs of the different kinds of failure (False Positives versus False Negatives), the F1 score can be a better indicator of a model's ability to identify the rare target event than the misclassification rate. On the other hand, if we know the costs of failure, it's best to calculate model performance metrics based on those costs.
Building machine learning models to predict rare target events can be a bit trickier than building models on balanced datasets, but the fundamentals remain the same. The main consideration is how to best evaluate the model and select a probability cutoff to successfully identify the rare events. There are several methods for trying to improve the performance of models with rare target events, but these methods are no substitute for thinking through the problem and choosing appropriate metrics and a good cutoff. In this post we explored Event-Based Sampling and the Synthetic Minority Oversampling Technique (SMOTE) as methods to improve models with rare target events. We didn't find much improvement from using these methods, but they provide a way to explore different modeling options against a specific (and intentionally chosen) performance metric. Using popular techniques is no substitute for thinking through the problem details and the business value of the machine learning predictions.
Find more articles from SAS Global Enablement and Learning here.