The purpose of this blog is to explain the basic concepts of anomaly detection and to compare supervised and unsupervised anomaly detection methods. This blog will use a supervised gradient boosting model (PROC GRADBOOST) and an unsupervised support vector data description model (PROC SVDD) to detect anomalies in a sample dataset.
Basic Introduction to Anomaly Detection
Anomalies are rare events that deviate significantly from most of the data. These anomalies could be extreme events that are generated in the same way as the normal data (extreme weather events are caused by the same underlying process as any other weather event), but they could also be events that are generated in a completely different way from the normal data (fraudulent bank transactions are not performed for the same reasons as normal bank transactions). Detecting these anomalies can be useful for a variety of reasons, and the value in detecting them will be different depending on the type of data: a fraudulent transaction is something we want to flag and investigate, an unusual sensor reading on a manufacturing line may warn of an impending failure we want to prevent, and an extreme outlier in a statistical analysis may simply be something we want to exclude.
One key takeaway from these different examples is that the goal of anomaly detection isn’t always just to remove the anomalies from the data; sometimes we are more interested in the anomalies than in the normal data. This contrasts with the traditional statistical treatment of outliers, which is usually to exclude them from the analysis. Sometimes we do want to exclude anomalies from our data, but there are many situations where we don’t care much about the “normal” data and are mostly interested in characterizing the anomalies.
Supervised and Unsupervised Anomaly Detection
If we have historical data with a labeled target indicating the presence of anomalies in the data, we can use any supervised learning algorithm to perform anomaly detection. This is no different from traditional predictive modeling with a labeled target, although since anomalies are generally rare in the data, we often need to perform event-based sampling to ensure that the predictive model captures the relationship between the inputs and the target.
Consider a manufacturing line example where we have historical data about conveyor belt failures: the manufacturing line gets clogged, and our product starts to break and fail. These failures are rare (let’s say 99.9% of the data corresponds to normal conveyor belt operation and 0.1% corresponds to the failure mode), but we plan to increase profitability by predicting these conveyor belt failures before the product starts to break. We could try to fit a predictive model like a neural network or a random forest to predict failure on the historical data, but a model that simply predicts that failure never occurs would be 99.9% accurate in the training sample. That sounds like a good accuracy score for a predictive model, but in this case the model is useless because it fails to identify a single anomaly. A better approach would be to take an event-based sample of the data: we select all 0.1% of failure cases and then randomly sample the remaining 99.9% of non-failure cases to get another 0.9% of the data. Overall, we use only 1% of the original data, but now anomalies make up 10% of the sample and there is a much better chance that the model can capture the relationship between the inputs and the anomalies. The key idea is that our goal is to identify anomalies, so all the extra data about the non-failure cases does not help much in achieving that goal.
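As a minimal sketch of that event-based sampling step, the DATA step below keeps every failure and roughly 0.9% of the non-failures. The table work.history and the binary flag failure are hypothetical names standing in for the historical conveyor belt data.
/* Hypothetical event-based sampling: keep all rare events, subsample the rest */
data work.history_sampled;
   set work.history;
   if _n_ = 1 then call streaminit(919);        /* fix the random seed for reproducibility */
   if failure = 1 then output;                  /* keep every failure (the rare event)     */
   else if rand('uniform') < 0.009 then output; /* keep roughly 0.9% of the normal cases   */
run;
After this step the failure cases make up roughly 10% of the modeling data instead of 0.1%.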
Predictive modeling algorithms are powerful tools for capturing relationships between inputs and targets, but when performing anomaly detection, we don’t always have labeled data and it isn’t necessarily feasible to collect this kind of data. In the previous example we had historical data about manufacturing line failures, but if we didn’t have this kind of data we probably don’t want to go out and break the manufacturing line to collect nice examples of anomalies. Without a labeled target, we must use unsupervised approaches to anomaly detection. The fundamental idea with unsupervised anomaly detection is to build a model that characterizes the normal data (through clustering, model building approaches, or semi-manual approaches), and then classify anything that deviates too much from the normal data as an anomaly. This approach is especially convenient for two related reasons:
1. It’s difficult to characterize all the ways that a system can break or all the methods that malign actors use to perpetrate fraud and malfeasance.
2. On the other hand, it is usually easy to collect data and characterize the normal operation of a system.
Unsupervised anomaly detection methods model the normal operational state and allow us to identify unusual behavior as anomalous even when this behavior is brand new.
Training and Deploying Anomaly Detection Models and Evaluating Performance
For supervised anomaly detection models, we must evaluate model performance on a validation sample that was not used to train the model. This means that we must first partition our historical data (which in this case consists of inputs and a target variable) into a training sample for building the model and a validation sample for evaluating model performance. Once we have built the model using the training sample, we score the validation sample to calculate model performance metrics. Standard assessment measures for binary targets like misclassification, accuracy, and area under the ROC curve can be useful for assessing model performance, but it is important to remember our modeling goal, which is to detect anomalies. As mentioned earlier, high accuracy is not useful if the model is not identifying anomalies. Two assessment measures that are particularly useful for anomaly detection are precision and recall:
Precision = TP / (TP + FP), the fraction of detected anomalies that are truly anomalous.
Recall = TP / (TP + FN), the fraction of true anomalies that the model detects.
(Here TP, FP, and FN are the counts of true positives, false positives, and false negatives in the confusion matrix.)
High values of precision mean that most of the detected anomalies were truly anomalous, while high values of recall mean that our model doesn’t miss many anomalies in the data. Ideally, we would want high values of both precision and recall, but in practice we often must make a trade-off between models that detect a lot of anomalies and models that do a good job avoiding false positives. Deciding how to make this trade-off usually depends on the problem context, especially the cost associated with missing anomalies (too many false negatives) versus identifying more anomalies than exist (too many false positives).
Let’s go back to our conveyor belt manufacturing line example. In this case we want to use anomaly detection to identify conveyor belt failures before they cause a clog and destroy our products. For each anomaly we detect, we will temporarily shut down the conveyor belt and manually inspect the products to ensure that they won’t get clogged. If we fail to detect an anomaly, the products will get clogged and start to break, requiring manual intervention to fix the clog and remove the waste. In this case false positives waste money by forcing us to shut down the conveyor belt and inspect it for no reason, while false negatives lead to clogged, broken products and waste. The relative cost of these two outcomes (manual inspection and downtime vs. broken product and waste) determines whether we care more about maximizing precision or recall. More generally, the business costs associated with false positives and false negatives can help determine how sensitive the model should be when detecting anomalies. If we are detecting anomalies to avoid catastrophic failure, we might be willing to accept a precision of 0.5 (so 50% of the detected anomalies are just normal cases) to achieve a recall of 0.99 (we capture 99% of all anomalies with the model).
If we have a labeled target and are using supervised learning techniques, we can just calculate the precision and recall on the validation data to evaluate model performance. If we are using unsupervised anomaly detection techniques, we will just have normal operating data with no labeled anomalies. In this case we need to deploy the model on real data or find examples of anomalies to see how the model performs. We can still calculate precision and recall, but only if we have new data containing known anomalies that we can use to judge model performance.
The main advantage of the unsupervised approach is that we don’t need labeled data to train the model, which lets us build anomaly detection models in situations where supervised predictive modeling simply isn’t possible. If we do have labeled data, we can still perform honest assessment of an unsupervised model by checking how it performs on validation data that contains anomalies. Because training uses only the non-anomalous data, the validation sample should include all the labeled anomalies along with some normal data.
Demonstration: Comparing Supervised and Unsupervised Methods for Anomaly Detection
Let’s make this discussion more tangible by exploring a couple of examples: one using a gradient boosted decision tree model (a popular machine learning algorithm) to perform supervised anomaly detection, and one using a support vector data description model (a popular unsupervised anomaly detection algorithm). We will skip the details of the algorithms for now and focus on the data setup for training and evaluating model performance. For this example, we will use the ionosphere data from the UCI machine learning repository, which contains antennae readings from radar signals used to probe the ionosphere and map its structure. Some of the radar signals bounce off the ionosphere and return to the antennae, providing information about its structure, but other signals pass right through the ionosphere and reveal nothing. These useless radar signals are the anomalies we want to detect. The dataset has a labeled target called class, indicating whether each radar signal was anomalous. The inputs are 32 variables, var_0000 through var_0031, corresponding to the real and imaginary components of 16 complex electromagnetic waves (radar signals) measured by 16 different antennae. We will use these radar variables to classify signals as anomalous or normal.
cas;
caslib _all_ assign;
proc casutil;
load file="/greenmonthly-export/ssemonthly/homes/Ari.Zitin@sas.com/ionosphere.csv"
casout="ionosphere" replace;
quit;
proc freq data=casuser.ionosphere;
table class;
run;
This is a bit of a toy example since this dataset only has 351 observations, 126 of which are anomalies. So already we have a good balance of normal observations and anomalies, and of course we have a labeled target variable indicating anomalous cases.
proc partition data=casuser.ionosphere
   samppct=20
   partind;
   by class;
   output out=casuser.ionosphere copyvars=(_ALL_); /* keep the inputs and target alongside _PartInd_ */
run;
We start by partitioning the data into a training sample and a validation sample. This is possible because we have a labeled target variable (class), and it is necessary to perform honest assessment when building the supervised predictive model.
proc gradboost data=casuser.ionosphere(where=(_PartInd_ = 0))
outmodel=casuser.ionosphere_gradboost_model
seed=919;
input var_0000-var_0031 / level=interval;
target class / level=nominal;
run;
We fit the gradient boosting model using the GRADBOOST procedure, using only the training data to fit the model. There are some hyperparameters we can choose for gradient boosting models, and we can even autotune these models. For this example, we will stick with the default settings on both the gradient boosting algorithm and the support vector data description algorithm, but when building models to deploy it is important to tune the hyperparameters to improve model performance. We output a model table that we use for scoring the validation data separately.
proc gradboost data=casuser.ionosphere(where=(_PartInd_ = 1))
inmodel=casuser.ionosphere_gradboost_model;
output out=casuser.gb_valid_scored copyvar=class;
run;
This second run of the GRADBOOST procedure scores the validation data using the trained model and then outputs the scored data into an in-memory CAS table. We will come back to this table and calculate precision and recall after we fit and score the unsupervised model.
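As an aside before we move on to the unsupervised model, the autotuning mentioned above could look something like the hedged sketch below. The AUTOTUNE statement asks the procedure to search over hyperparameter settings such as the number of trees and the learning rate; the MAXTIME= value is just an illustrative time budget in seconds, not a recommended setting.
/* Sketch only: autotune the gradient boosting hyperparameters instead of using defaults */
proc gradboost data=casuser.ionosphere(where=(_PartInd_ = 0))
   outmodel=casuser.ionosphere_gradboost_model
   seed=919;
   input var_0000-var_0031 / level=interval;
   target class / level=nominal;
   autotune maxtime=300; /* illustrative cap on the tuning search (seconds) */
run;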
proc svdd data=casuser.ionosphere(where=(_PartInd_ = 0 and class=0));
id id;
input var_0000-var_0031 / level=interval;
kernel rbf / bw=mean;
savestate rstore=casuser.ionosphere_svdd_astore;
run;
We use the SVDD procedure to fit the unsupervised support vector data description model. At the most basic level, this model tries to draw a boundary tightly around the normal data (in this example, in the 32-dimensional input space); observations inside the boundary are classified as normal and observations outside the boundary are classified as anomalous. Notice that when training this model we use the training data (_PartInd_ = 0), but we also train only on the normal data (class = 0). This is a situation where we could train the model even without a labeled target, although we would not be able to evaluate performance without at least some examples of anomalies. Here we do have a labeled target, but we use it only for model performance evaluation. We output a model ASTORE that we can use to score the validation data separately.
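If we truly had no labels at all, we could still train the same kind of model by pointing PROC SVDD at all of the training rows and treating them as (mostly) normal data. The sketch below is hypothetical and is not part of the workflow we score next; the output ASTORE name is made up for illustration.
/* Hypothetical variant: train SVDD without using the class label at all */
proc svdd data=casuser.ionosphere(where=(_PartInd_ = 0));
   id id;
   input var_0000-var_0031 / level=interval;
   kernel rbf / bw=mean;
   savestate rstore=casuser.ionosphere_svdd_nolabel_astore; /* illustrative name */
run;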
proc astore;
score data=casuser.ionosphere(where=(_PartInd_ = 1))
rstore=casuser.ionosphere_svdd_astore
out=casuser.svdd_valid_scored copyvar=class;
run;
We use the ASTORE procedure to score the validation data using the unsupervised SVDD model. In this case we include both the normal data and the anomaly data to see how the model does at distinguishing the anomalies from the normal data. This contrasts with the training process, where we only provided the normal data and did not include any examples of anomalies. Now that we have scored data for both the supervised and unsupervised modeling approaches, we can calculate precision and recall for each model.
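As an optional sanity check (not part of the original workflow), we could crosstabulate the raw SVDD score against the true class on the scored validation table before running the formal assessment:
/* Optional check: _SVDDSCORE_ = -1 flags points inside the normal-data boundary, */
/* 1 flags points outside the boundary (anomalies)                                */
proc freq data=casuser.svdd_valid_scored;
   table _SVDDSCORE_ * class;
run;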
ods output ROCInfo=gb_roc;
proc assess data=casuser.gb_valid_scored;
var P_class1;
target class / event="1" level=nominal;
run;
data work.gb_stats;
   set work.gb_roc;
   /* keep only the ROCInfo row corresponding to the 0.5 probability cutoff */
   where (CutOff < 0.51 and CutOff > 0.50);
   precision = TP / (TP+FP);
   recall = TP / (TP+FN);
   keep precision recall;
run;
data casuser.svdd_valid_scored;
   set casuser.svdd_valid_scored;
   /* recode the SVDD score to a 0/1 predicted class: -1 (inside the boundary) = normal, 1 (outside) = anomaly */
   if _SVDDSCORE_ = -1 then PredClass = 0;
   if _SVDDSCORE_ = 1 then PredClass = 1;
run;
ods output ROCInfo=svdd_roc;
proc assess data=casuser.svdd_valid_scored;
var PredClass;
target class / event="1" level=nominal;
run;
data work.svdd_stats;
   set work.svdd_roc;
   /* keep only the ROCInfo row corresponding to the 0.5 cutoff */
   where (CutOff < 0.51 and CutOff > 0.50);
   precision = TP / (TP+FP);
   recall = TP / (TP+FN);
   keep precision recall;
run;
title 'Supervised Gradient Boosting Anomaly Detection';
proc print data=work.gb_stats noobs;
run;
title 'Unsupervised Support Vector Data Description Anomaly Detection';
proc print data=work.svdd_stats noobs;
run;
For each modeling approach we use the ASSESS procedure to calculate the confusion matrix across a range of probability cutoffs. The SVDD model doesn’t produce predicted probabilities (just a 0/1 prediction of whether each observation is an anomaly), but we can still use PROC ASSESS to calculate its confusion matrix. Once we have the confusion matrix for the 0.5 probability cutoff, we can calculate precision and recall for each model. (The choice of cutoff matters for the gradient boosting model, where changing it would shift the balance between precision and recall, but it has no effect on the SVDD model because its predictions are already 0 or 1.) We get the following results for the two models:
The gradient boosting model seems to outperform the SVDD model overall, although the SVDD model does capture all the anomalies in the data. The gradient boosting model does a better job of distinguishing between the anomalies and the normal data while the SVDD model is more sensitive and detects more anomalies. We could change the probability cutoff for the gradient boosting model to make it more sensitive, increasing recall at the cost of precision. The supervised learning approach ends up yielding a better model overall, but we can’t always choose this approach since we don’t always have labeled examples of anomalies. The advantage of the SVDD model is that we can still make progress in detecting anomalies even when we haven’t collected data about those anomalies. If we have a new conveyor belt on a manufacturing line and it has never broken, we don’t want to have to break it in a variety of ways to collect data on various kinds of anomalous failures. Instead, we can use an unsupervised anomaly detection method to look for deviations from the normal behavior of the new conveyor belt.
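Returning to the probability cutoff mentioned above, here is a rough sketch of how we could make the gradient boosting model more sensitive. It reuses the work.gb_roc ROCInfo table created earlier and treats the 0.99 recall target as an arbitrary example rather than a recommendation.
/* Sketch: compute precision and recall at every cutoff in the ROCInfo table, */
/* then report the largest cutoff that still achieves recall >= 0.99          */
data work.gb_cutoffs;
   set work.gb_roc;
   if TP + FP > 0 then precision = TP / (TP + FP);
   if TP + FN > 0 then recall = TP / (TP + FN);
run;
proc sql;
   select max(CutOff) as sensitive_cutoff
   from work.gb_cutoffs
   where recall >= 0.99;
quit;
The reported cutoff could then be used in place of the default 0.5 threshold when classifying scored observations, trading some precision for higher recall.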
To build your own supervised and unsupervised anomaly detection models using SAS Viya, check out the documentation below on the procedures used in this blog.
References/Useful Links: