Do you know how much fraud your model is detecting?

Started ‎07-19-2022 · Modified ‎07-19-2022

Machine Learning models and advanced analytics can be a key part of a fraud detection system, but their complexity makes it hard to understand what is under the hood, and why and how much we should trust them.

In order to give the model the weight in the process it deserves, we must learn how good its performance is: how good is it at detecting fraud?

Introduction: The performance of a model is the evaluation of its results, and it can be measured with a variety of methods. We should understand not only how much fraud we are detecting, but also how much we are missing and how many false positives we are generating. Monitoring the performance of a model answers this combination of questions: we become aware of its strengths and weaknesses, can train new models if necessary, or adjust the threshold to suit our business needs.

Confusion matrix: The confusion matrix is the base for many performance checks. Once you have trained your model, you should use it to score a validation data set with a known fraud flag: cases that have been investigated and confirmed as fraudulent or not. With that, you will be able to compare the prediction of the model (whether a transaction is flagged as fraudulent) with the reality of that transaction: was it really fraud? Following that logic, you will have:

                      Observed Fraud         Observed Non-Fraud
Predicted Fraud       True Fraud: 100        False Fraud: 5
Predicted Non-Fraud   False Non-Fraud: 15    True Non-Fraud: 210

• True Fraud: transactions that were predicted to be fraud and were fraud. Ex: 100
• False Fraud: transactions that were predicted to be fraud but weren't fraud. Ex: 5
• False Non-Fraud: transactions that were predicted not to be fraud but were fraud. Ex: 15
• True Non-Fraud: transactions that were predicted not to be fraud and weren't fraud. Ex: 210

The top left and the bottom right cells hold the correctly predicted results, and the other two the incorrect predictions. For a model to be considered good, we would expect most of our transactions to fall on the diagonal (as in the example above).
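The confusion matrix above can be built with a few lines of code. A minimal sketch follows; the labels and predictions are made up purely for illustration (1 = fraud, 0 = legitimate):

```python
def confusion_matrix(actual, predicted):
    """Count each (predicted, observed) combination for binary fraud labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if p and a)          # true fraud
    fp = sum(1 for a, p in zip(actual, predicted) if p and not a)      # false fraud
    fn = sum(1 for a, p in zip(actual, predicted) if not p and a)      # false non-fraud
    tn = sum(1 for a, p in zip(actual, predicted) if not p and not a)  # true non-fraud
    return tp, fp, fn, tn

# Tiny illustrative validation set (1 = fraud, 0 = legitimate)
actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
print(tp, fp, fn, tn)  # 3 1 1 3
```

In practice the `predicted` labels would come from applying your threshold to the model's scores on the validation set.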

Sensitivity and Specificity: Once you have the confusion matrix and the total number of transactions in your validation data set, you can build very interesting ratios.

The sensitivity is the amount of fraud the model is detecting: from all the known fraudulent cases, the percentage that our model correctly identified as fraud.

Sensitivity = True Fraud / (True Fraud + False Non-Fraud) = 100 / (100 + 15) ≈ 0.87

The specificity is the amount of non-fraud detected: from all the non-fraudulent cases, the percentage that our model correctly identified.

Specificity = True Non-Fraud / (True Non-Fraud + False Fraud) = 210 / (210 + 5) ≈ 0.98

Other relevant measures:

False Positive Rate = False Fraud / (False Fraud + True Non-Fraud) = 1 − Specificity

False Negative Rate = False Non-Fraud / (False Non-Fraud + True Fraud) = 1 − Sensitivity
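The ratios above follow directly from the four cells of the confusion matrix. A short sketch, using the example counts from this article (TP = 100, FP = 5, FN = 15, TN = 210):

```python
# Example counts from the confusion matrix in this article
tp, fp, fn, tn = 100, 5, 15, 210

sensitivity = tp / (tp + fn)          # fraud detected out of all real fraud
specificity = tn / (tn + fp)          # non-fraud correctly left alone
false_positive_rate = fp / (fp + tn)  # 1 - specificity
false_negative_rate = fn / (fn + tp)  # 1 - sensitivity

print(round(sensitivity, 2))  # 0.87
print(round(specificity, 2))  # 0.98
```

Note that the two rates are just the complements of specificity and sensitivity, which is why monitoring the full set of metrics together gives a complete picture.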

These values are not only going to help you understand the performance of the model you have, but also give you the possibility to adapt the threshold to your business needs.

The threshold is the probability of fraud at which we start alerting our investigators. The model gives each new transaction a score from 0 to 1 indicating the probability that it is fraudulent. We could decide, for instance, that whenever the probability reaches 0.7 we generate an alert, and below it we do not. Once we have checked the model's performance and trust that the score is reasonable, we can change the threshold to suit our needs. For example, if our strategy is to detect as much fraud as possible, we could lower the threshold and receive alerts whenever a transaction has a 0.5 probability of being fraud, or even less; our false positives will increase, but so will our detection rate. In the opposite case, we may have very limited resources to undertake investigations, due to budget or other reasons. Then we can raise the threshold and alert only on transactions that have a 0.9 probability of being fraud; our false positives will decrease and the percentage of true fraud among all alerts will increase, at the cost of a lower detection rate. We can balance our triage system with such changes, always keeping in mind that we are modifying our tolerance to fraud, not the performance of the model.
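The trade-off described above can be seen by sweeping the threshold over a set of scored transactions. The scores and labels below are hypothetical, invented only to illustrate the effect:

```python
# Hypothetical model scores and confirmed fraud labels (1 = fraud)
scores   = [0.95, 0.85, 0.72, 0.65, 0.55, 0.40, 0.30, 0.10]
is_fraud = [1,    1,    0,    1,    1,    0,    0,    0]

def alert_stats(threshold):
    """Alerts raised and detection rate at a given probability threshold."""
    alerted = [f for s, f in zip(scores, is_fraud) if s >= threshold]
    detection_rate = sum(alerted) / sum(is_fraud)
    return len(alerted), detection_rate

for t in (0.9, 0.7, 0.5):
    n_alerts, detection = alert_stats(t)
    print(f"threshold={t}: {n_alerts} alerts, detection rate {detection:.0%}")
# threshold=0.9: 1 alerts, detection rate 25%
# threshold=0.7: 3 alerts, detection rate 50%
# threshold=0.5: 5 alerts, detection rate 100%
```

Lowering the threshold raises both the alert volume (and with it the investigators' workload) and the detection rate, exactly the balance the text describes.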

ROC curve: If we bring the sensitivity and specificity together in a graph, we can create the ROC curve. It visualizes the true fraud rate (sensitivity) against the false fraud rate (1 − specificity). If the model were making random decisions (without any criteria), our ROC curve would be the diagonal (orange line), while a curve reaching a true fraud rate (sensitivity) of 1 at a false fraud rate (1 − specificity) of 0 would mean that our model is perfect (green line) and predicts every transaction correctly. The closer to the green line of perfection, and the farther from the orange line of randomness, the better our model is.

This chart is very useful for two reasons:

• Evaluation of the performance of the model: the closer the curve is to a square, the better the model is predicting. You can calculate the Area Under the Curve (AUC), also called Area Under the ROC, and use this value to evaluate your model (it ranges from 0.5 to 1, with 1 being a model that predicts perfectly).

• Threshold decision-making: for business reasons we can increase or decrease the threshold used to flag fraud, and the ROC curve shows the impact of each choice, so we can easily find the point that best balances the rates.
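Both uses can be sketched without a plotting library: sweep the threshold over every observed score, record one (false fraud rate, sensitivity) point per threshold, and integrate the area with the trapezoidal rule. The scores and labels here are illustrative only:

```python
def roc_points(scores, labels):
    """(false fraud rate, sensitivity) pairs, one per candidate threshold."""
    pos = sum(labels)               # total real fraud
    neg = len(labels) - pos         # total real non-fraud
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Illustrative validation scores and confirmed labels (1 = fraud)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]

pts = roc_points(scores, labels)
print(round(auc(pts), 3))  # 0.875
```

Each point on the curve corresponds to one possible threshold, which is why the same chart supports both model evaluation (via the AUC) and threshold selection.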

Conclusion: Understanding the strengths and weaknesses of your models is going to help you improve your decision engine or triage system when looking for fraud. The key is in the combination of the available metrics, as explained (such as sensitivity, specificity, false positive rate, or the ROC curve), together with an understanding of your business needs, so that you can adapt the threshold or the weight assigned to the model. Machine Learning models are here to help!