The purpose of this post is to survey the variety of time series anomaly detection approaches available in the literature. There are many different time series anomaly detection algorithms, and many of them vary only in small ways. To avoid getting mired in the details, we will explore the basic ideas behind these algorithms and how they are used to detect time series anomalies. A subsequent post in this series will explore the details of some of these methods and present examples using SAS software.
We can separate the algorithms for time series anomaly detection into groups based on how they approach looking for anomalies. These algorithms come from diverse research communities (forecasting, data science, statistics), and they each have their own benefits and drawbacks. The main thing they all have in common is that they allow us to create some kind of ‘anomaly score’ for points or subsequences in a time series. We can then set a threshold value for this score to identify anomalous subsequences in the time series data. This article addresses the following approaches:

- Reconstruction based methods
- Forecasting based methods
- Distance based methods
- Encoding based methods
- Distribution based methods
- Isolation tree based methods
An important difference between the approaches is that some expect the training data to contain both normal and anomalous data, while others are trained only on normal data, meaning that we don’t need any examples of anomalies to train the anomaly detection models. Approaches that expect only normal data can be especially useful when anomalies are rare or represent some catastrophic failure; we wouldn’t want to force the system to fail spectacularly, at real cost to the business, just to generate training data for anomaly detection. Alternatively, we might have a long historical record of anomalies in our data; if we want to use that information productively, we should choose an approach that doesn’t expect only normal data. Having labeled historical anomalies is helpful for evaluating the effectiveness of models, but in general anomaly detection algorithms don’t require labeled anomalies for training.
Reconstruction-based methods are designed to build a model that captures the normal behavior of the time series and thus are trained on data that we expect does not include any anomalies. The basic idea of these methods is to encode (compress) the normal time series (the training data) into a lower-dimensional latent space and, in the process, also train a model to reconstruct (or decode) the original input time series from the latent representations.
Model deployment involves scoring new data using the model, so the new data is compressed and then reconstructed. The model was trained on only the normal time series data, so it should be less effective at reconstructing the anomalous time points. We can calculate the reconstruction error as the difference between the original inputs (the data we want to score) and the reconstructed version of these inputs after passing through the model (both the encoder and the decoder that reconstructs the inputs from the compressed versions). The model was trained to minimize the reconstruction error on the training data, which included only the normal data. Anomalous data will be harder to reconstruct since it wasn’t present in the training data, and thus the reconstruction error is used as the ‘anomaly score’ for this time series anomaly detection approach.
Although the model is trained on normal data only, it can be useful to have labeled examples of anomalies in the time series data to test the model’s performance and establish a good threshold value of the reconstruction error for identifying anomalies. The actual magnitude of the reconstruction error will depend on the data used to train the model and the training process of the model, so threshold values for detecting anomalies will be different for every problem and every model (changing the model hyperparameters can change the magnitude of the reconstruction error on the normal data). Calculating the reconstruction error for known anomalies and comparing it to the reconstruction error for the normal data can help in choosing good threshold values for anomaly detection.
Neural network autoencoders are a good example of a reconstruction-based method for anomaly detection. The choice of architecture and layers in the autoencoder is flexible (recurrent or convolutional layers could be useful for processing time series data), but as long as the autoencoder is trained on normal data, it can be used to score new data for anomalies, flagging anomalous time points where the autoencoder does a poor job reconstructing the input data.
The circled bottleneck layer represents a compression of the information from the original inputs. When the autoencoder is trained on the normal data, it learns to do a good job reconstructing this normal data, but it does a poor job reconstructing anomalous data points that don’t look like the normal data. Autoencoders are used for a variety of tasks; one example is feature extraction, where the values from the bottleneck layer are used as features in other models. When using autoencoders for anomaly detection we don’t care about the bottleneck layer (the compressed values); instead, we are interested in the output (the predicted inputs), since we compare it to the original inputs to calculate the reconstruction error. Note that in this post we are focused on time series anomaly detection, but autoencoders can be used for anomaly detection on a variety of datasets, including image data and tabular data.
When thinking about time series data the most common data analysis task is often forecasting future values of the time series. Forecasting time series has a long history and there are many different approaches that range from simple exponential smoothing models to complex nonlinear recurrent neural network models. Regardless of what models we use for forecasting, we get predictions for future values of the time series based on past values.
To build anomaly detection models based on forecasting, we train a forecasting model on normal data (so it is only designed to do a good job forecasting normal conditions; it isn’t trained to forecast data with anomalies). We can then compare the true value of the time series to our forecasted value. If the forecast is accurate, it’s likely that the data is in the normal condition, but the less accurate the forecast is, the more likely it is that the time point or subsequence being forecasted is an anomaly. Like the reconstruction error described in the previous section, we can calculate a forecast error as the difference between the true series value and the forecasted value. Since the forecast is designed to be accurate only on normal data, this forecast error is used as our anomaly score. Of course, we must choose a useful threshold value to detect anomalies, which can be set by investigating known anomalies in the data and choosing a threshold forecast error value that separates the anomalies from the normal data.
One thing to note about using forecasting for time series anomaly detection is that we are not actually forecasting future values of the time series. We calculate the forecast error and use that as an anomaly score; thus we need both the forecasted value of the time series, and the actual true value (analogous to how in the reconstruction methods we need the original inputs and the reconstructed inputs to calculate the reconstruction score). This doesn’t pose any problems with deployment, because we are not trying to use the forecast to predict future anomalies, we are using it to detect anomalies as they occur. When deploying forecasting-based anomaly detection models we use the forecast to predict each incoming time series value (or sequence of values), calculating the forecast error at each time as it occurs. Large forecast errors indicate a deviation from the pattern of normal data used to build the forecasting model.
In this graphical example we see the normal time series in the top panel, with a forecast calculated on the normal time series. In this toy example the forecast does a great job capturing the sinusoidal pattern in the data, but when we apply the forecast to data with a simple contextual anomaly the forecast completely fails to accurately predict this anomaly (bottom panel).
We can calculate the forecast error as the absolute value of the difference between the time series with the anomaly and the forecast trained on the normal data. In the example above we use the absolute value of the error, but of course we can use any error metric we want, and this choice will depend on the application, the kinds of anomalies you intend to detect, and the forecast model you choose to use. This ‘Absolute Forecast Error’ value is used as our anomaly score, and a threshold of 0.5 is chosen based on the ability to detect our one known anomaly. Choosing a good anomaly score threshold is really a separate problem from building the anomaly detection model to generate the anomaly score; often we don’t need examples of anomalous data to build the model, but we will always need examples (or at least simulations) of anomalies to choose good threshold values for the anomaly score.
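The toy example above can be sketched in a few lines. Here the ‘trained’ forecast is simply the sinusoidal pattern assumed to have been learned from the normal data, and the 0.5 threshold on the absolute forecast error flags the injected anomaly; the series, anomaly location, and threshold are all illustrative assumptions.

```python
import numpy as np

period = 50
t = np.arange(300)

# The "trained" forecast: the model has learned the sinusoidal
# pattern from the normal data
forecast = np.sin(2 * np.pi * t / period)

# Incoming data with a simple contextual anomaly (a flattened segment)
observed = np.sin(2 * np.pi * t / period)
observed[180:195] = 0.0

# Anomaly score: absolute forecast error, thresholded at 0.5
score = np.abs(observed - forecast)
flagged = np.where(score > 0.5)[0]
print(flagged.min(), flagged.max())   # flagged indices span the injected anomaly
```

Note that the forecast error is computed as each observation arrives, matching the deployment pattern described above: we are detecting anomalies as they occur, not predicting future anomalies.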
Distance based methods calculate specialized distance metrics between points or subsequences in time series. These distances can be compared to one another to identify points or subsequences that are unusually far apart, indicating anomalies. Time series distance metrics are used for other tasks as well, like clustering and classifying time series. Distance measures can be calculated between whole time series (usually for clustering or classification), or between points or subsequences in a single time series (usually for anomaly detection).
Time series anomaly detection with distances is usually performed by creating many subsequences from the original time series using a sliding window with a stride of 1 (so there is overlap between the subsequences). Once we have these subsequences, we calculate the distance between them, looking for the ‘nearest neighbor’ (the smallest distance to another subsequence) for each subsequence. Subsequences that are unusually far from their nearest neighbor (above some specified distance threshold) are determined to be anomalous; thus the distance to the nearest neighbor subsequence is used as the anomaly score for this approach. This is generally called the ‘matrix profile’ technique, because it produces a profile, or series of distance values, with one nearest-neighbor distance for each subsequence. This matrix profile sequence can be used as the anomaly score, and there are computationally efficient algorithms to calculate it.
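A brute-force version of this idea is easy to sketch on synthetic data. This is not one of the efficient matrix profile algorithms, just a direct nearest-neighbor computation with an exclusion zone to avoid trivial matches between overlapping subsequences; the series, window length, and injected anomaly are illustrative assumptions.

```python
import numpy as np

def matrix_profile(x, m):
    # Brute-force matrix profile: for each length-m subsequence, the
    # Euclidean distance to its nearest non-overlapping neighbor.
    # (Efficient algorithms exist; this is O(n^2) for illustration.)
    subs = np.lib.stride_tricks.sliding_window_view(x, m)
    n = subs.shape[0]
    profile = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        # Exclusion zone: ignore trivial matches with overlapping windows
        lo, hi = max(0, i - m // 2), min(n, i + m // 2 + 1)
        d[lo:hi] = np.inf
        profile[i] = d.min()
    return profile

t = np.arange(300)
x = np.sin(2 * np.pi * t / 30)
x[150:160] += 1.5                 # injected anomalous bump
mp = matrix_profile(x, 20)
# Normal subsequences have near-identical copies one period away, so their
# profile is near zero; the anomalous subsequences have no close neighbor
print(int(np.argmax(mp)))
```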
When working with these distance-based approaches the major choices involve the length of the subsequences analyzed and the choice of distance metric to use when calculating distance between series. The subsequence lengths are selected by the analyst when breaking the time series into subsequences, but there is often limited guidance in the data for the choice of subsequence length. In general, shorter subsequences make it easier to detect extreme anomalies in a short period of time, but longer subsequences can make it easier to find hidden anomalies (anomalous patterns in the data that don’t necessarily take on extreme values). One good way to think about choosing the subsequence length is to think about how we plan to deploy the anomaly detection model. If we use subsequences that are too long, we may not be able to detect anomalies until it is too late, but we do want the subsequences to be long enough to represent the anomalies we hope to detect. The distance metric is important in determining how we compare the subsequences. There are many different time series distance measures we can use (too many to list here), but the basic considerations are often about how the time series are aligned when comparing them.
The traditional Euclidean distance is the easiest to understand: the alignment is one-to-one, so the time series values at the first time point in each subsequence are compared to one another, and likewise for the second, third, and all remaining time points. This means that series containing the same information, just offset in time, can have a large distance between them, depending upon how fast the series changes. A big advantage of Euclidean distance is how easy it is to calculate (both conceptually and computationally).
Dynamic time warping is another distance metric (really a family of associated distance metrics) that accounts for misalignment in the time series. The basic idea is that distance is calculated between points in a window around each time point, so the first time point in one subsequence can be compared to multiple time points in another subsequence. This allows misaligned but similar series to have a smaller distance between them, reflecting similarities in the series that are not captured by the Euclidean distance metric. A big disadvantage of dynamic time warping is the computational cost of calculating the distances, especially in big data settings (like time series anomaly detection). In general, it makes more sense to use dynamic time warping distances for time series anomaly detection, but we might need to fall back on Euclidean distance if computing dynamic time warping distances exhausts our computational resources.
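To make the alignment difference concrete, here is a minimal dynamic-programming implementation of DTW (with squared point costs, so the result is comparable to a Euclidean distance) applied to two time-shifted copies of the same sine wave; the shift and series length are illustrative assumptions.

```python
import numpy as np

def dtw(a, b):
    # Classic O(n*m) dynamic-programming DTW with squared point costs
    # and no warping-window constraint, returning a Euclidean-comparable
    # distance (square root of the accumulated cost)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

t = np.linspace(0, 2 * np.pi, 50)
a = np.sin(t)
b = np.sin(t - 0.4)               # same shape, slightly offset in time
# DTW can warp the alignment, so it reports a much smaller distance
# between the shifted-but-similar series than Euclidean distance does
print(float(np.linalg.norm(a - b)), float(dtw(a, b)))
```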
The reconstruction approaches discussed earlier involved encoding the time series into a compressed representation and then reconstructing the inputs from the compressed representation. The reconstruction error is often used as the anomaly score. Encoding based approaches follow a similar idea and project subsequences into a lower-dimensional latent space, but they don’t include the reconstruction step and look for anomalies directly from the derived encodings. A key difference is that since we don’t have a decoder (no reconstruction step), the strategy to encode the input time series must yield encodings that are useful in detecting anomalies (encodings where we can easily identify some kind of useful anomaly score).
There are a few different kinds of encoding anomaly detection algorithms, each with very different low-dimensional latent spaces, ranging from string encodings like the SAX encoding, to encoding the time series into a directed cyclic graph. One approach, the TARZAN algorithm, uses a suffix tree calculated from the string encodings for the normal data. Each subsequence is thus represented by a string of characters, and these strings can be compared to find differences between subsequences. An anomaly score is calculated as the difference between the expected number of occurrences of a subsequence in the suffix tree for the normal data (the reference tree) and the observed number of occurrences of the subsequence in the suffix tree for the potentially anomalous series of interest. The basic idea is that a time series is in an anomalous state if unusual subsequences are appearing at a rate much greater than in the normal state.
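The string-encoding idea can be sketched with a simplified SAX-style discretization. This is not the TARZAN algorithm itself (there is no suffix tree), just an illustration of comparing observed word counts against counts from the normal data; the window length, alphabet, breakpoints, and the crude ‘surprise’ rule are all illustrative assumptions.

```python
import numpy as np
from collections import Counter

def sax_words(x, w, paa, breakpoints):
    # SAX-style encoding: z-normalize each window, reduce it with
    # piecewise aggregate approximation (PAA), map segments to letters
    words = []
    for s in np.lib.stride_tricks.sliding_window_view(x, w):
        z = (s - s.mean()) / (s.std() + 1e-9)
        segs = z.reshape(paa, w // paa).mean(axis=1)
        words.append("".join("abcd"[np.searchsorted(breakpoints, v)] for v in segs))
    return words

bp = np.array([-0.67, 0.0, 0.67])        # breakpoints for a 4-letter alphabet
t = np.arange(400)
normal = np.sin(2 * np.pi * t / 40)
test = normal.copy()
test[200:280] = 0.0                      # anomalous flat segment

ref = Counter(sax_words(normal, 40, 4, bp))
obs = Counter(sax_words(test, 40, 4, bp))
# Crude stand-in for TARZAN's comparison: flag words that occur far more
# often in the scored series than the normal reference would suggest
surprising = {word: c for word, c in obs.items() if c > 2 * ref.get(word, 0) + 5}
print(sorted(surprising))
```

Here the flat anomalous segment produces a word that never appears in the normal reference, which is exactly the kind of frequency mismatch these encoding methods look for.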
The basic idea of distribution-based approaches to time series anomaly detection is to estimate the distribution of the normal (non-anomalous) data and then compare time series points or subsequences to this distribution, identifying anomalies as those points that fail to match it. This is similar to the distance-based approaches discussed earlier, but instead of comparing points or subsequences based on their similarity to one another, they are judged on their frequency of occurrence in the data. Infrequently occurring points or subsequences thus get flagged as anomalies by distribution methods.
There is significant overlap between distribution and distance-based approaches to time series: using the Mahalanobis distance (based on the distribution of the data) as the distance metric in a distance-based anomaly detection algorithm is essentially a distribution method. The Mahalanobis distance is a normalized statistical distance based on the mean and covariance of the data, so calculating this distance for a point essentially locates the point within the statistical distribution of the data.
The anomaly score for these approaches is usually either a probability, a likelihood, or some kind of statistical distance measure. In all three cases we are comparing time series points or subsequences to a reference distribution learned on normal data to identify anomalies that do not match the learned distribution. Our anomaly scores are either probabilities or likelihoods that the point/subsequence does not match the distribution, or distance measures from the mean or other central measure of the distribution.
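Here is a minimal sketch of a distribution-based score: fit a mean and covariance to windows of normal data, then use the Mahalanobis distance of new windows from that distribution as the anomaly score. The synthetic series, window length, and the small ridge added for numerical stability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10
t = np.arange(500)
normal = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(t.size)

# Fit a Gaussian (mean and covariance) to length-m windows of normal data
X = np.lib.stride_tricks.sliding_window_view(normal, m)
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(m)   # small ridge for stability
prec = np.linalg.inv(cov)

def mahalanobis(series):
    # Mahalanobis distance of each window from the learned distribution
    Z = np.lib.stride_tricks.sliding_window_view(series, m) - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", Z, prec, Z))

test = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(t.size)
test[250] += 3.0            # injected point anomaly
scores = mahalanobis(test)
# The highest score should fall on a window containing the spike
print(int(np.argmax(scores)))
```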
Isolation tree approaches to time series anomaly detection are generally ensemble models (often called isolation forests) made up of ‘overfit’ decision trees that isolate samples in the leaf nodes of the tree. The basic idea is that a decision tree is grown on a collection of time series features, splitting the data until each point in the time series is ‘isolated’ in its own leaf node. Anomalous time points or subsequences will be easier to reach from the root node, because they are notably different from the normal time points. Our anomaly score is thus the path length down the tree from the root node to the leaf node that isolates the time point. Shorter paths correspond to more anomalous data, while longer paths correspond to normal data. The idea is that the normal data is all pretty similar to itself, so it takes many splits to isolate any one normal point, but the anomalous data is different in a variety of ways, so it shouldn’t take many splits to isolate the anomalous points.
To move from an individual isolation tree to an isolation forest we randomly sample time series features and split points when building trees, so each tree in the forest is different (has different splits and uses different features), but each one is designed to isolate time series points. The anomaly score is then calculated as the average path length to isolate a time series point across all of the trees in the isolation forest. This ensemble approach has the benefit of being more flexible when detecting a variety of different kinds of anomalies, and ensures that we are not just looking for anomalies with respect to a single time series feature, but instead with respect to a variety of time series features. The ensemble models are also computationally easy to fit, since each tree in the isolation forest can be trained independently of the others.
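As a quick sketch, scikit-learn's IsolationForest can be applied to simple per-point time series features. Note that, unlike the reconstruction and forecasting methods above, the forest here is fit on data that already contains the anomaly, which is fine for isolation forests. The features chosen (value, first difference, rolling mean) and the synthetic series are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
t = np.arange(500)
x = np.sin(2 * np.pi * t / 50) + 0.05 * rng.standard_normal(t.size)
x[300] = 4.0                                  # injected point anomaly

# Simple per-point time series features: value, first difference, rolling mean
diff = np.r_[0.0, np.diff(x)]
roll = np.convolve(x, np.ones(5) / 5, mode="same")
features = np.column_stack([x, diff, roll])

forest = IsolationForest(n_estimators=100, random_state=0).fit(features)
# score_samples returns higher (less negative) values for normal points;
# the most anomalous point has the lowest score (shortest average path)
scores = forest.score_samples(features)
print(int(np.argmin(scores)))
```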
The breakdown presented here is an overview of the large number of time series anomaly detection algorithms developed in the field. The goal is to separate the approaches into broad groups to make it easier for practitioners to select the approach that is either most relevant to their problem or most familiar given their background. There are certainly other ‘taxonomies’ used to organize these methods as well, although there is a lot of overlap between the different ways of organizing them. Both the reconstruction and forecasting based approaches can be seen as ‘prediction-based’ anomaly detection methods, while the distribution, encoding, and isolation tree methods can be seen as density-based methods. We will explore some examples of using SAS tools to perform time series anomaly detection in the third post in this series.
Find more articles from SAS Global Enablement and Learning here.