This post compares different ways to improve the performance of machine learning models when the target event is rare. Machine learning models are often used to model binary targets, and in an ideal scenario the training data has an equal balance of target levels (half ones and half zeroes). Many realistic applications, however, deal with rare events such as fraudulent transactions or accidents, so the original data is very unbalanced. There are many strategies for handling unbalanced data, including sampling techniques, augmentation with synthetic data, and even simple cutoff adjustments. The value of these techniques seems clear in theory, but it can be hard to tell from the literature which approach is best for a given modeling application. We will look at examples of modeling rare target events, focusing on apples-to-apples comparisons of the different approaches to improving model performance. Part 1 of this series covers Event-Based Sampling, while Part 2 explores Synthetic Minority Oversampling along with some better ways to evaluate models with rare target events.
Credit Card Fraud Data
We start with a dataset containing historical credit card transactions, some of which are fraudulent and some of which are not. Our goal is to predict fraud, but the dataset contains only 492 fraud cases out of a total of 284,807 transactions. The first step in modeling is to split the data into a training sample and a validation sample. We will use the validation sample for honest assessment when comparing the different ways of managing the rare target event level.
cas;
caslib _all_ assign;
proc import datafile='/path/to/creditcardfraud_normalised.csv'
out=casuser.creditcardfraud
dbms=dlm
replace;
delimiter=',';
run;
/*the fraud target is class, and we only have 492 fraud cases, 0.17% of the data*/
proc freq data=casuser.creditcardfraud;
table class;
run;
/*partition the data into 70% training and 30% validation samples*/
proc partition data=casuser.creditcardfraud samppct=30 seed=919 partind;
by class;
output out=casuser.creditcardfraud;
run;
/*now in the training data we have even fewer fraud cases (344), it's still 0.17% of the data*/
proc freq data=casuser.creditcardfraud(where=(_partind_=0));
table class;
run;
Overall, we have a target event rate of 0.17%, with 492 fraud cases. Once we split the data into 70% training data and 30% validation data we end up with 344 fraud cases in the training data.
Combined Training and Validation Data:
Training Data Only:
To make a fair comparison, we need to use the same inputs for each model and do the same basic data preprocessing (for the comparison we won’t do any generic preprocessing, although some of our strategies will involve data preprocessing). In this case we define the target as credit card fraud and specify relevant input variables.
/*we have a bunch of PCA transformed columns representing historical customer information*/
proc contents data=casuser.creditcardfraud;
run;
/*we plan to fit a tree based model so let's just use all of the input variables*/
%let inputs=V1-V28 Amount;
%let target=class;
In this data we have 28 columns (V1 through V28) that have been derived from the original transaction features using principal components analysis (PCA). There is also an Amount variable representing a transformed version of the amount of money spent in the transaction. The target variable is class, which indicates whether the transaction was credit card fraud (class=1) or non-fraudulent (class=0). We will use gradient boosting models to predict the target, and since tree-based models have built-in variable selection we will include all of the V1-V28 variables and the Amount variable as inputs.
The Baseline Model
We start by fitting our ‘baseline’ model, which is just a naïve machine learning model where we don’t worry about the rare target event level and simply accept the default results. We will use gradient boosting models for this comparison because they often work well with the default settings, and we hope to see good performance for some of our modeling strategies. Of course, we could do this same comparison using any kind of model, from simple logistic regression models to complex nonlinear neural networks.
/*fit the baseline gradient boosting model with the default settings*/
proc gradboost data=casuser.creditcardfraud(where=(_partind_=0)) outmodel=casuser.baseline_model seed=919 noprint;
input &inputs / level=interval;
target &target / level=nominal;
run;
/*score the data using the trained baseline model*/
proc gradboost data=casuser.creditcardfraud inmodel=casuser.baseline_model noprint;
output out=casuser.baseline_scored copyvars=(class _partind_);
run;
Model evaluation is important in this comparison, so we will evaluate our models using a few different metrics. We often evaluate machine learning models by judging the average squared error (which is based on the predicted probabilities) and the misclassification rate (which is based on how many cases we predicted correctly). In a real-world setting the cost of missing a fraudulent transaction differs from the cost of flagging a non-fraudulent transaction as fraud, so we will also look specifically at the confusion matrix for the models, evaluating how well the models identify the fraud events.
/*assess model performance on the data, with separate evaluations for training and validation data*/
proc assess data=casuser.baseline_scored ROCout=casuser.baseline_roc;
var p_class1;
target class / event="1" level=nominal;
by _partind_;
run;
/*for now let's start by evaluating the model at the 0.5 cutoff*/
data casuser.baseline_eval;
set casuser.baseline_roc;
length approach $30;
approach='baseline';
if _partind_ = 1 then partition='valid';
else partition='train';
where (_Cutoff_ <= 0.505 and _Cutoff_ >= 0.495);
keep approach partition _Cutoff_ _TP_ _FP_ _FN_ _TN_ _Sensitivity_ _Specificity_ _FPR_ _ACC_ _MiscEvent_ _C_ _F1_;
run;
proc print data=casuser.baseline_eval;
run;
We are mostly interested in model performance on the validation data. There, the model found 120 true cases of fraud and missed 28. It also incorrectly flagged 11 legitimate transactions as fraudulent. This yields a misclassification rate on the validation data of 0.0456%, which is very low but can be misleading with such a rare target: a model that never predicted fraud would misclassify only about 0.17% of cases. Note that we chose a naïve probability cutoff of 0.5, so we should keep in mind that changing this cutoff will change the accuracy and the confusion matrix.
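Because the misclassification rate hides most of what matters here, it can help to recompute a few event-focused metrics directly from the validation counts quoted above. The DATA step below is a minimal sketch that uses only those three counts; the variable names are chosen for illustration and are not part of the PROC ASSESS output.

```sas
/* illustration only: recompute event-focused metrics from the validation counts quoted in the text */
data _null_;
  tp = 120;  /* fraud cases correctly flagged */
  fn = 28;   /* fraud cases missed */
  fp = 11;   /* legitimate transactions flagged as fraud */
  sensitivity = tp / (tp + fn);                /* share of fraud cases caught */
  precision   = tp / (tp + fp);                /* share of fraud flags that are real fraud */
  f1          = 2 * tp / (2 * tp + fp + fn);   /* harmonic mean of the two */
  put sensitivity= 6.4 precision= 6.4 f1= 6.4;
run;
```

These work out to a sensitivity of about 0.81, a precision of about 0.92, and an F1 score of about 0.86, which give a much clearer picture of the baseline model than a misclassification rate below 0.05%.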
Event-Based Sampling
In this approach we undersample the data, selecting all the fraud cases while randomly selecting a subset of the non-fraud cases so that we have a balanced sample of 50 percent fraud and 50 percent non-fraud cases. This dramatically reduces the size of our training data but ensures that the model can learn the decision boundary between fraud cases and non-fraud cases.
/*next we perform event-based sampling, selecting all fraud cases and an equal number of non-fraud cases*/
/*start by sampling 344 random class 0 cases*/
proc surveyselect data=casuser.creditcardfraud(where=(_partind_=0 and class=0))
out=casuser.class0_sample
method=srs
n=344
seed=919
noprint;
run;
/*combine the randomly selected class 0 cases with all of the class 1 cases to create a 50/50 sample for training*/
data casuser.train_balanced;
set casuser.class0_sample casuser.creditcardfraud(where=(_partind_=0 and class=1));
run;
We use the SURVEYSELECT procedure to take a simple random sample of 344 non-fraud cases (zeroes). We combine the 344 sampled zeroes with all 344 ones (fraud cases) to create a balanced training dataset with equal numbers of fraud and non-fraud cases. Note that we only apply the event-based sampling to the training data, and doing this dramatically reduces the amount of available training data. When we validate and deploy the model it will not see data with a 50/50 split like in the training data, so we should use the original data distribution for the validation sample.
/*now we can train a gradient boosting model using this 50/50 balanced sample, note that we did lose a lot of data here*/
proc gradboost data=casuser.train_balanced outmodel=casuser.eventbasedsampling_model seed=919 noprint;
input &inputs / level=interval;
target &target / level=nominal;
run;
/*score the full dataset (training and validation rows) using the trained 50/50 balanced model*/
proc gradboost data=casuser.creditcardfraud inmodel=casuser.eventbasedsampling_model noprint;
output out=casuser.eventbasedsampling_scored copyvars=(class _partind_);
run;
We train the gradient boosting model on the 50/50 training sample, but we score the original dataset using the trained model. This way we evaluate model performance on the validation data in a way that mimics the deployment scenario. Since we trained the model on a different data distribution, we must adjust the predicted probabilities generated by the model to correct for the sampling used in training.
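To correct the predicted probabilities, we need the following quantities: the proportions of non-fraud and fraud cases in the original data (zerorate and onerate in the code below, 0.9983 and 0.0017) and the proportions of each in the balanced training sample (zerosample and onesample, both 0.5).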
And some derived quantities based on these rates, with p1 being the predicted probability of the target being 1 and p0 being the predicted probability of the target being 0. Since p1 + p0 = 1 we only need to calculate the adjustment to p1, but we do both anyway.
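Written out in the notation of the code that follows, the correction appears to take the form below. The Greek symbols are introduced here for illustration only: \pi_1 and \pi_0 stand for the original fraud and non-fraud rates (onerate and zerorate), and \rho_1 and \rho_0 for the rates in the balanced sample (onesample and zerosample).

```latex
\begin{aligned}
\mathrm{Num}_1 &= \frac{p_1}{\rho_1 / \pi_1}, &
\mathrm{Num}_0 &= \frac{p_0}{\rho_0 / \pi_0}, \\[4pt]
p_1^{\mathrm{adj}} &= \frac{\mathrm{Num}_1}{\mathrm{Num}_1 + \mathrm{Num}_0}, &
p_0^{\mathrm{adj}} &= \frac{\mathrm{Num}_0}{\mathrm{Num}_1 + \mathrm{Num}_0} = 1 - p_1^{\mathrm{adj}}.
\end{aligned}
```

With a 50/50 sample (\rho_1 = \rho_0 = 0.5) this simplifies to p_1^{adj} = p_1 \pi_1 / (p_1 \pi_1 + p_0 \pi_0), so each predicted probability is simply reweighted by the original class rate and renormalized.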
We calculate the correction to the predicted probabilities:
/*adjust model predicted probabilities based on the event-based sampling we performed, this correction will scale the predicted probabilities*/
%let zerorate = 0.9983;
%let onerate = 0.0017;
%let zerosample = 0.5;
%let onesample = 0.5;
data casuser.eventbasedsampling_scored;
set casuser.eventbasedsampling_scored;
Num1 = P_class1 / (&onesample / &onerate);
Num0 = P_class0 / (&zerosample / &zerorate);
P_class1_adj = Num1 / (Num1 + Num0);
P_class0_adj = Num0 / (Num1 + Num0);
drop Num1 Num0;
run;
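To get a feel for what the correction does, the same arithmetic can be applied to a handful of raw probabilities. The sketch below is an illustration only: the table name work.correction_demo and the chosen raw probabilities are made up for the example, and the rates are the rounded values used above.

```sas
/* illustration only: apply the same correction to a few hypothetical raw probabilities */
%let zerorate = 0.9983;
%let onerate = 0.0017;
%let zerosample = 0.5;
%let onesample = 0.5;
data work.correction_demo;
  /* hypothetical raw scores from the balanced model */
  do P_class1 = 0.5, 0.9, 0.99, 0.9983, 0.999;
    P_class0 = 1 - P_class1;
    Num1 = P_class1 / (&onesample / &onerate);
    Num0 = P_class0 / (&zerosample / &zerorate);
    P_class1_adj = Num1 / (Num1 + Num0);
    output;
  end;
  keep P_class1 P_class1_adj;
run;
proc print data=work.correction_demo;
run;
```

With these rates, a raw probability of 0.5 maps to an adjusted probability of 0.0017 (the original fraud rate), and the adjusted probability only reaches 0.5 when the raw probability is 0.9983. In other words, applying a 0.5 cutoff to the adjusted probabilities is equivalent to applying a 0.9983 cutoff to the raw scores of the balanced model, which is worth keeping in mind when reading the confusion matrix for this approach.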
Now we use the ASSESS Procedure like we did with the baseline model to evaluate the event-based sampling model performance on the original validation data.
/*assess model performance on the validation data, using the adjusted probabilities*/
proc assess data=casuser.eventbasedsampling_scored ROCout=casuser.eventbasedsampling_roc;
var p_class1_adj;
target class / event="1" level=nominal;
by _partind_;
run;
/*for now let's start by evaluating the model at the 0.5 cutoff*/
data casuser.eventbasedsampling_eval;
set casuser.eventbasedsampling_roc;
approach='event based sampling';
if _partind_ = 1 then partition='valid';
else partition='train';
where (_Cutoff_ <= 0.505 and _Cutoff_ >= 0.495);
keep approach partition _Cutoff_ _TP_ _FP_ _FN_ _TN_ _Sensitivity_ _Specificity_ _FPR_ _ACC_ _MiscEvent_ _C_ _F1_;
run;
/*let's do a running comparison of our models*/
data casuser.combined_eval;
set casuser.baseline_eval casuser.eventbasedsampling_eval;
run;
proc print data=casuser.combined_eval;
run;
It looks like using event-based sampling to balance the data doesn’t really help us in identifying true fraud cases in the data (we have fewer TP and more FN in validation data), but it does improve the model’s C statistic (the area under the ROC curve). Misclassification is not a great way to judge models when working with rare target events, but we can see that it is also degraded when using event-based sampling. In general, when working with rare-event models it’s best to judge the model performance on business value, where generally False Positives and False Negatives will have different business consequences.
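One way to make the business-value comparison concrete is to attach a cost to each kind of error. The sketch below assumes a hypothetical cost of 100 for each missed fraud case and 1 for each false alarm. These numbers are placeholders and do not come from the data, and the table name casuser.combined_cost is likewise chosen for illustration. It reuses the _FN_ and _FP_ counts already kept in the comparison table.

```sas
/* hypothetical costs per error type; replace with values that reflect the business */
%let fn_cost = 100;  /* assumed cost of missing one fraud case */
%let fp_cost = 1;    /* assumed cost of flagging one legitimate transaction */

/* weight the confusion-matrix counts already stored in the comparison table */
data casuser.combined_cost;
  set casuser.combined_eval;
  total_cost = &fn_cost * _FN_ + &fp_cost * _FP_;
run;

/* compare the approaches on validation data only */
proc print data=casuser.combined_cost;
  where partition = 'valid';
  var approach _Cutoff_ _FP_ _FN_ total_cost;
run;
```

Under any cost ratio like this one, where a missed fraud case is far more expensive than a false alarm, the ranking of the approaches is driven mainly by the number of false negatives rather than by overall accuracy.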
This concludes Part 1 of this series on sampling techniques for models with rare target events. In Part 2 we explore augmenting the data with synthetic rare events, and we look at the important impact that selecting a cutoff has on model performance. We also spend a bit more time detailing how to evaluate model performance for rare target event models.