The purpose of this post is to learn how to build, improve, and compare a few different time series anomaly detection models. It is a follow-up to a series of previous posts on time series anomaly detection, which covered the basics of the topic and a simple example using an isolation forest. This post focuses on detecting harder anomalies by selecting different algorithms and tuning model hyperparameters. The goal is to examine the different modeling approaches and how they succeed or fail in detecting a few challenging time series anomalies. Part of this goal is to emphasize that we are unlikely to find a single model that can detect all kinds of anomalies in a wide variety of data, so trying out different anomaly detection approaches can be helpful when working with new data.
The data used in this post all comes from the UCR Anomaly Detection archive (mentioned in previous posts), but instead of just using the Asian Citrus Psyllid dataset, we will also use another dataset with a significantly less obvious anomaly. Let’s briefly explore the two datasets to build some intuition about the anomalies and how hard they are to detect.
The anomaly in the psyllid EPG data analyzed in the previous post is circled in red in the plot above, around time point 17250. This anomaly is so obvious that we probably didn’t need to put a red circle around it; it stands out from the other data in an unambiguous way. In fact, we didn’t really need an isolation forest to detect this anomaly. We could instead use the simple rule that all EPG values outside of the range (0.1, 0.3) are anomalous:
data casuser.psyllid_anomaly_manual;
   set casuser.psyllid_anomaly;
   where (epg >= 0.3 or epg <= 0.1);
run;

proc print data=casuser.psyllid_anomaly_manual;
run;
This simple rules-based approach succeeds in finding the anomaly, and in fact it finds the anomaly 8 time points earlier than the isolation forest approach in the previous post (which first flagged an anomaly at timeID 17234, compared to timeID 17226 with the manual rule above). This isn’t a reflection on the isolation forest algorithm, which can find anomalies in more complex data (especially when provided more input variables/extracted time series features), but is due to the simplicity of the data. A key takeaway is to investigate the kinds of anomalies the business wants to detect in detail before committing to a specific approach or implementing a complex and computationally intensive anomaly detection algorithm.
This electrocardiogram (ECG) data represents a person’s heartbeat, and although the baseline ECG reading ‘wanders’ a bit (the scale seems to change throughout the series), there is only one real anomaly (circled in red). Each of the spikes in the data is a heartbeat, and the circled beat is an anomalous V beat indicative of premature ventricular contraction (PVC), which can increase the patient’s risk of heart problems. Although this anomaly is visually clear when circled in the plot above, it isn’t immediately obvious how to create an easy rule to separate it from the regular heartbeat data. Just from looking at the plot, it doesn’t necessarily look any more anomalous than the changes in the heartbeat baseline, but it is the only medically significant deviation from a normal heartbeat in this data. Let’s try to detect this anomaly using the isolation forest from the previous post:
cas;
caslib _all_ assign;

proc casutil;
   load file='/home/AnomalyDetection/Data/194_UCR_Anomaly_sddb49_20000_67950_68200.txt'
        importoptions=(filetype="csv" getnames="false")
        casout="heartbeatV_anomaly" replace;
quit;

data casuser.heartbeatV_anomaly / single=yes;
   set casuser.heartbeatV_anomaly;
   timeID = _N_;
   rename Var1=ecg;
run;

data casuser.heartbeatV_anomaly_train casuser.heartbeatV_anomaly_test;
   set casuser.heartbeatV_anomaly;
   if timeID <= 20000 then output casuser.heartbeatV_anomaly_train;
   else output casuser.heartbeatV_anomaly_test;
run;
/*normal heartbeat ECG data*/
title 'Normal heartbeat ECG';
proc sgplot data=casuser.heartbeatV_anomaly(obs=1000);
   series x=timeID y=ecg;
run;

/*single anomaly, V beat (premature ventricular contraction), from 67950 to 68200*/
title 'Anomalous heartbeat ECG';
proc sgplot data=casuser.heartbeatV_anomaly(firstobs=67500 obs=68500);
   series x=timeID y=ecg;
run;
One way we can try to account for the wandering baseline ECG reading is to standardize the data based on the mean and standard deviation of our samples. This can be more complex with time series data than with tabular data because we must choose a window for the standardization. In this example we will standardize the training and test datasets separately, but in a deployment scenario it is worth standardizing a segment of time series values that represents an interval in which we would want to detect anomalies. Things get more complicated with streaming data, where we would need to consider rolling windows for both standardizing and scoring; a rough sketch of that idea follows the standardization code below.
/*data preprocessing*/
proc stdize data=casuser.heartbeatV_anomaly_train out=casuser.heartbeatV_anomaly_train;
   var ecg;
run;

proc stdize data=casuser.heartbeatV_anomaly_test out=casuser.heartbeatV_anomaly_test;
   var ecg;
run;
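If we wanted to handle the rolling-window case mentioned above, one option is to compute a trailing rolling mean and standard deviation directly in a DATA step and use them to form a rolling z-score. The sketch below is an illustration only and is not used in the rest of this post; the 1000-point window, the output table name, and the ecg_z variable are assumptions, and the step relies on the rows arriving in timeID order (as in the earlier single-threaded DATA step that assigned timeID from _N_).

/*illustrative only: trailing-window z-score of the ECG reading*/
%let win = 1000;
data casuser.heartbeatV_anomaly_rolling / single=yes;
   set casuser.heartbeatV_anomaly;
   array buf{&win} _temporary_;          /* circular buffer of the most recent ecg values */
   buf{mod(_N_ - 1, &win) + 1} = ecg;    /* overwrite the oldest value in the buffer */
   mean_w = mean(of buf{*});             /* window mean (unfilled slots are missing and ignored) */
   std_w  = std(of buf{*});              /* window standard deviation */
   if std_w > 0 then ecg_z = (ecg - mean_w) / std_w;   /* rolling z-score */
run;

A scoring model could then be applied to ecg_z rather than the raw reading; for the rest of this post we stick with the simpler separate standardization of the training and test sets shown above.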
Now we fit the isolation forest model on the training data and apply it to the test data, skipping time series feature extraction for simplicity (including more relevant features could improve performance).
proc forest data=casuser.heartbeatV_anomaly_train isolation seed=919;
   input ecg / level=interval;
   id timeID;
   output out=casuser.heartbeatV_anomaly_train_score copyvars=(_ALL_);
   savestate rstore=casuser.isolation_forest_astore;
run;

proc astore;
   score data=casuser.heartbeatV_anomaly_test
         rstore=casuser.isolation_forest_astore
         out=casuser.heartbeatV_anomaly_test_score;
quit;

proc sgplot data=casuser.heartbeatV_anomaly_test_score;
   series x=timeID y=_Anomaly_;
run;
Based on this plot, it appears that the model has identified some anomalous points between time points 45000 and 60000. We are looking for an anomaly around time point 68000, so it doesn’t seem like the isolation forest has captured the anomaly of interest. Let’s look in detail at the anomaly scores for the known anomaly and compare them to the anomaly scores for the normal heartbeat:
/*normal heartbeat data*/
title 'Normal heartbeat ECG Anomaly Score';
proc sgplot data=casuser.heartbeatV_anomaly_test_score(where=(timeID<=21000));
   series x=timeID y=_Anomaly_;
run;

/*single anomaly, V beat*/
title 'Anomalous heartbeat ECG Anomaly Score';
proc sgplot data=casuser.heartbeatV_anomaly_test_score(where=(timeID>=67500 and timeID<=68500));
   series x=timeID y=_Anomaly_;
run;
Although the patterns in the normal and anomalous anomaly score plots look different, their peaks have very similar magnitudes, so we cannot choose a threshold on this anomaly score that distinguishes the normal heartbeat from the potentially dangerous V beat anomaly. Standardizing the data was not enough to enable the simple isolation forest model to identify the anomaly. We could potentially improve the isolation forest model by adding more time series variables, like moving averages, seasonal/cycle indicators, and even lagged values of the ECG reading; a sketch of what that kind of feature engineering might look like is shown below.
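The following sketch is not from the original analysis: the lag depths, the 5-point moving average, and the new table and variable names are assumptions chosen purely for illustration. It assumes the rows arrive in timeID order (as in the earlier single-threaded DATA step), and the first few rows will have missing lag values.

/*illustrative feature engineering for the isolation forest (hypothetical names and window sizes)*/
data casuser.heartbeatV_anomaly_train_feat / single=yes;
   set casuser.heartbeatV_anomaly_train;
   ecg_lag1 = lag1(ecg);                                               /* previous reading */
   ecg_lag2 = lag2(ecg);                                               /* reading two points back */
   ecg_ma5  = mean(ecg, lag1(ecg), lag2(ecg), lag3(ecg), lag4(ecg));   /* 5-point trailing moving average */
run;

proc forest data=casuser.heartbeatV_anomaly_train_feat isolation seed=919;
   input ecg ecg_lag1 ecg_lag2 ecg_ma5 / level=interval;
   id timeID;
   output out=casuser.heartbeatV_anomaly_train_feat_score copyvars=(_ALL_);
   savestate rstore=casuser.isolation_forest_feat_astore;
run;

Even with richer features like these, it’s a good idea to explore some other time series anomaly detection approaches, so our next step is to try a distance-based method.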
The SAS TSMODEL Procedure has a Time Series Motif Discovery Package with an object called MTFANOM that allows us to perform distance-based subsequence anomaly detection. We choose a subsequence length (LENGTH in the code) and a subsequence overlap value (OVERLAP in the code), and the method breaks the time series apart into subsequences of the specified LENGTH and calculates the distance between each pair of subsequences, excluding subsequences within the length of the specified OVERLAP. At the end of this process we have a set of calculated distances for each subsequence, so for each subsequence we can find its nearest subsequence based on distance. The subsequences with the largest distance to their nearest subsequence are identified as anomalies. A top K hyperparameter (TOPK) specifies how many possible anomalies to detect: the method returns the top K subsequences with the greatest distance to their nearest subsequence. Ideally the kind of anomalies we want to detect will have the largest distances to their nearest subsequences, but these distance-based methods can be easily confused by abrupt changes in scale. Unlike the isolation forest described above, this distance-based anomaly detection method only works for univariate series. The method uses the Euclidean distance to compare subsequences, so we don’t run into the computational slowdown associated with calculating dynamic time warping distances. This approach is very similar to the Matrix Profile method described in a previous post surveying time series anomaly detection approaches but doesn’t calculate and output the full matrix profile.
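For reference, these hyperparameters are all passed through the MTFANOM object’s setOption method (shown in full in the code below). A fragment like the following, with purely illustrative values, sets all three together; note that we do not set OVERLAP in the runs that follow:

/*illustrative values only; we rely on the default OVERLAP behavior in the runs below*/
/*LENGTH = subsequence length, OVERLAP = exclusion window around each subsequence, TOPK = number of candidates returned*/
rc = f.setOption(
   "LENGTH", 200,
   "OVERLAP", 100,
   "TOPK", 5
);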
The LENGTH of the subsequences to analyze is the most important hyperparameter for this method; choosing a value that matches the length of the expected anomalies will dramatically improve its performance. In many practical settings we are unlikely to know exactly what kinds of anomalies we are trying to detect, so it can be useful to run the method with a few different lengths and explore the results. For the heartbeat ECG data we are working with, we know that the anomaly starts at time point 67950 and ends at time point 68200, so the anomaly is 250 observations long. But even without this knowledge we could select a sensible value for the LENGTH hyperparameter by looking at the length of the normal heartbeats. In this case we are looking for health anomalies (rather than sensor anomalies, which could be a single time point of failure), so we expect the anomalies to be subsequence anomalies that last the duration of a single heartbeat. We can look at the first few cycles of the normal data (the plot earlier in the post titled “Normal Heartbeat ECG”) to see that the period of a heartbeat cycle is between 200 and 300 time points. We will try subsequence anomaly detection with a LENGTH value of 200 and a LENGTH value of 300 and compare the results. Note that unlike with the isolation forest method, we are not really fitting a model to the training data and then looking for anomalies in the test data; instead we are just comparing distances between subsequences. We don’t split the data into training and test samples because we want to compare all available subsequences to one another, rather than only comparing training data to training data and test data to test data. In a deployment scenario, we would rerun the code on a window of relevant/recent data at a fixed interval as new data arrives, looking for anomalies at that fixed interval.
title 'Motif Based Subsequence Anomaly Detection';
proc tsmodel data=casuser.heartbeatV_anomaly outobj=(of=casuser.motif_anomaly_score(replace=YES));
   var ecg;
   id timeID interval=obs;
   require mtf;
   submit;
      declare object f(MTFANOM);
      declare object of(OUTMTFANOM);
      rc = f.Initialize();
      rc = f.setX(ecg);
      rc = f.setOption(
         "LENGTH",200,
         "TOPK",5
      );
      rc = f.run();
      rc = of.Collect(f);
   endsubmit;
run;

proc print data=casuser.motif_anomaly_score;
run;
We use the TSMODEL procedure and the MTF Package (Time Series Motif Discovery Package) to build a motif-based subsequence anomaly detection model with the MTFANOM object. We specify the time series variable of interest, ecg, and the time ID variable, timeID, using an ‘obs’ interval (each observation is treated as a separate time point with no specific anchoring to real time) since this is sub-second data and we don’t need any intervals with default seasonality. In the submit block we provide the TSMODEL script code to create our MTFANOM object and set its properties. We set the X variable to ecg, the time series in which we want to detect anomalies, and set the options for the anomaly detection method, looking for the top five (TOPK) anomalous subsequences of length 200 (LENGTH). Finally, we collect our results in the output dataset, an important but easy-to-forget part of working with PROC TSMODEL.
It took about 40 seconds to calculate all the subsequence distances for the anomaly detection method, which isn’t too long but reveals that this might not be a suitable algorithm for time series with millions of observations. The time it takes to perform the calculations primarily depends on the number of subsequences we must compare (since we must calculate a distance for each pair of subsequences), which is determined by a combination of the LENGTH hyperparameter (the length of the subsequences) and the overall length of the time series. Choosing a LENGTH value that is too low, especially when working with a long series, can dramatically increase the processing time for this algorithm. The interesting results live in the output dataset, so we print out the entire output dataset, motif_anomaly_score:
This table shows the top five most anomalous subsequences of length 200. It’s important to remember that these subsequences have length 200, since the output only provides the starting time and the anomaly rank/distance. We do detect our PVC anomaly in the heartbeat, but it’s the fifth-ranked anomaly on the list and we don’t detect it until time point 68025, which is 75 time points after it began at 67950. This is a consequence of choosing a length that is a bit shorter than the actual anomaly length, so we detect an anomaly from 68025 to 68225 instead of from 67950 to 68200. Let’s run this again using a LENGTH value of 300 and explicitly skipping the normalization of the time series (which was performed by default without us asking). Normalization can help if the time series dramatically changes scale, but it can also push the anomalies closer to the normal data, making them harder to separate. It’s worth experimenting with normalization, since it can either help or hurt anomaly detection depending on the nature of both the time series and the anomalies.
title 'Motif Based Subsequence Anomaly Detection attempt 2';
proc tsmodel data=casuser.heartbeatV_anomaly outobj=(of=casuser.motif_anomaly_score(replace=YES));
   var ecg;
   id timeID interval=obs;
   require mtf;
   submit;
      declare object f(MTFANOM);
      declare object of(OUTMTFANOM);
      rc = f.Initialize();
      rc = f.setX(ecg);
      rc = f.setOption(
         "LENGTH",300,
         "NORMALIZE","NO",
         "TOPK",5
      );
      rc = f.run();
      rc = of.Collect(f);
   endsubmit;
run;

proc print data=casuser.motif_anomaly_score;
run;
This time we detect the anomaly at time point 67938, 12 observations before it begins. Although this doesn’t matter too much for the heartbeat anomaly (the patient has likely had many PVC beats before getting their heartbeat monitored on an ECG), detecting anomalies early can be very useful when monitoring machines for faults and failures. This anomaly is also ranked higher: it has an anomaly rank of 2, so only one subsequence of length 300 is farther from its nearest neighbor than our anomalous subsequence. This is better than the previous performance, but it still means we are detecting false positive anomalies with our algorithm (there is no ‘real’ anomaly at time point 11662 even though the algorithm has identified an unusual subsequence at that time). We will get the best performance if we choose the length to be exactly equal to the length of the expected anomaly, so let’s try one more time using a value of 250 for LENGTH, since that is the exact duration of the anomaly we hope to detect. Keep in mind that this is likely unrealistic, since we don’t usually know exactly what kinds of anomalies we hope to detect in deployment scenarios.
Instead of using PROC TSMODEL a third time, let’s explore the very cool Subsequence Anomaly Detection SAS Step. This is an easy way to get started detecting anomalies on time series data and can be incorporated into SAS Studio Flows. We find this Step under Forecasting in the SAS Studio Step panel:
We must create a flow to use this Step, but we can simply double-click it and SAS Studio will prompt us to create a new flow containing the Step. We start with just the Subsequence Anomaly Detection Step, but we must specify some metadata settings in the Step and add Table nodes for the input and output data. It’s easy to add a table as an input node by selecting Add -> Table, specifying the input table in the Table Properties pane, and then connecting the input table to the Subsequence Anomaly Detection Step. Adding an output table is a bit trickier, since we first need to create an output port by right-clicking the Subsequence Anomaly Detection Step and selecting Add output port:
With the output port added we can then add another Table node and specify an output table containing the detected anomalies. We must specify the Time series variable and the Time ID variable before running the flow, and we will want to change some hyperparameters for this algorithm.
The only options we change on this Step are highlighted in the red box: we turn off the normalization and set the Anomaly subsequence length to 250, the known length of the anomaly in the dataset. Note that the Step automatically creates an Anomaly subsequence plot, something we would have had to build manually when using the TSMODEL Procedure (and skipped). After running the Flow, we can explore the detected anomalies and the plotted results:
With the LENGTH hyperparameter set to the exact length of the known anomaly, we identify the anomaly starting at time point 67946, just 4 time points before the labeled start. We are detecting anomalies of length 250, so this is close to a full overlap between the detected and expected anomaly. Additionally, this anomaly has a much larger distance value than any of the other detected anomalies in the top five most anomalous subsequences. This means that the PVC heartbeat anomaly is farther from its nearest neighbor subsequence than any other subsequence of length 250 in the data, and by a large margin. In the Anomaly Subsequence Plot we can see that the orange highlighted anomaly (“Anomaly1At67946”) clearly reveals the V beat in the ECG.
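For readers who prefer code to the Flow interface, the same analysis can be written with PROC TSMODEL. The call below is a sketch assembled from the earlier attempts, assuming the Step uses the same MTFANOM options we set before, with LENGTH changed to 250 and normalization turned off:

title 'Motif Based Subsequence Anomaly Detection attempt 3';
proc tsmodel data=casuser.heartbeatV_anomaly outobj=(of=casuser.motif_anomaly_score(replace=YES));
   var ecg;
   id timeID interval=obs;
   require mtf;
   submit;
      declare object f(MTFANOM);
      declare object of(OUTMTFANOM);
      rc = f.Initialize();
      rc = f.setX(ecg);
      /* LENGTH matches the known anomaly length; normalization is turned off, as in the Step */
      rc = f.setOption(
         "LENGTH",250,
         "NORMALIZE","NO",
         "TOPK",5
      );
      rc = f.run();
      rc = of.Collect(f);
   endsubmit;
run;

In a deployment scenario, the same call could be pointed at a table containing only the most recent window of observations and rerun at a fixed interval as new data arrives.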
The main goal of this post was to introduce and compare some anomaly detection methods, highlighting that different methods work better for different kinds of data. We also illustrated that some anomalies are much easier to detect than others, and that it can be useful to explore and compare a few different approaches to anomaly detection. The isolation forest did a great job detecting the easy Asian Citrus Psyllid stylet-moving anomaly in the EPG data but struggled to find the V beat heartbeat anomaly in the ECG data. The motif-based subsequence anomaly detection method worked much better on the ECG data but required some tuning of hyperparameters to detect the anomaly exactly as we would like. Finally, we showed how to build anomaly detection models using both PROC TSMODEL and SAS Studio Flows/Steps. A lot of anomaly detection tools require coding, so SAS Studio Flows provide a great resource for non-coders when it comes to performing basic anomaly detection on time series data.