The purpose of this post is to learn about the different kinds of anomalies that occur in time series data and how to evaluate the algorithms that detect them. This is an introduction to the topic, so the focus is on the basic approach to anomaly detection rather than the details of individual algorithms. Specifically, we will define the main types of time series anomalies and the evaluation metrics used to judge time series anomaly detection models.
A second post in this series will explore the broad families of algorithms used for time series anomaly detection.
Anomalies in Time Series Data – Point, Subsequence, and Contextual Anomalies
In tabular data analysis an anomaly is a single observation (a row of data) with unusual or unexpected values for some or all the variables (columns) in the dataset. This definition is intentionally vague because anomalies are context-dependent, but the basic idea is that the anomaly doesn’t fit with the pattern of the “normal” data. This means that characterizing the pattern of “normal” (non-anomalous) data is an important task in any anomaly detection approach. A common approach to find statistical outliers in datasets is to look for points that are more than three standard deviations from the mean (a three-sigma rule) and declare these points as outliers. This approach is only sensible if the underlying data is approximately normally distributed, which represents our “characterization of the pattern of normal data” mentioned previously. Any attempt to define or detect anomalies will require assumptions about the “normal” data and how it is generated.
Time series data have additional structure beyond tabular data, and thus anomalies in time series data can be more complex than anomalies in tabular data. Time series data are ordered, with the time ID variable defining the time step and the ordering for the data. This means that collections of adjacent points in time series data can be used to define meaningful subsequences in the time series. Time series anomalies can be broadly categorized into the following three types of anomalies:
Point Anomalies: a single observation with an unusual value, such as a sudden spike or dip in the series.
Subsequence Anomalies: a consecutive run of observations that is anomalous as a group, even if each individual value might look normal on its own.
Contextual Anomalies: observations that fall within the normal range of the series overall but are unusual for their context, for example a value that would be normal at the trough of a seasonal cycle appearing at its peak.
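A minimal Python sketch can make the three types concrete. The series below is made up for illustration (the time indices loosely mirror the simulated dataset used later in this post, but the values are hypothetical):

```python
import math

# Hypothetical toy series: a smooth cycle with one of each anomaly type injected
n = 200
series = [math.sin(2 * math.pi * t / 50) for t in range(n)]

series[20] += 5.0           # point anomaly: a single extreme spike
for t in range(60, 71):     # subsequence anomaly: the cycle flatlines for 11 steps
    series[t] = 0.0
series[112] = -1.0          # contextual anomaly: -1 is within the normal range of
                            # the series, but t=112 falls near a cycle peak (~+1)
```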
Evaluating Anomaly Detection Models – Range-Based Precision and Recall
Precision and recall are traditionally used to evaluate the effectiveness of anomaly detection algorithms, but they have limitations when working with time series data containing subsequence anomalies.
High values of precision and recall indicate an effective anomaly detection algorithm, but there is usually a tradeoff between the two when working with real data. A model with high precision rarely raises false alarms: most of the points it flags are genuine anomalies, though it may still miss many real anomalies if its recall is low. A model with high recall finds most of the real anomalies in the data, though it may also raise many false alarms if its precision is low. Many anomaly detection algorithms produce a continuous anomaly score, and a threshold on that score determines which points are flagged as anomalies. Tuning this threshold helps balance precision and recall: raising the threshold typically increases precision (fewer false positives), while lowering it typically increases recall (fewer missed anomalies).
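The precision/recall tradeoff is easy to see with a small Python sketch. The anomaly scores and labels below are hypothetical:

```python
def precision_recall(scores, labels, threshold):
    """Point-wise precision and recall for a score-threshold detector."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical anomaly scores and ground-truth labels (1 = real anomaly)
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6]
labels = [0,   1,   0,   1,   0,   1,   0,   0  ]

for thr in (0.5, 0.75):
    p, r = precision_recall(scores, labels, thr)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold from 0.5 to 0.75 drops the one false positive (precision rises from 0.75 to 1.00) but also drops one real anomaly (recall falls from 1.00 to 0.67).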
The basic definitions of precision and recall don’t include any consideration of anomalies that span more than a single data point (subsequence anomalies). We can instead use range-based precision and recall, which calculate the overlap between real anomaly ranges and predicted anomaly ranges. The basic idea is to calculate a precision/recall value for each anomaly range and then average these values across all ranges to get overall values for precision and recall. This approach balances existence (detecting any portion of the anomalous range) with size, position, and cardinality (detecting the right length and alignment of the anomalous range).
The details of calculating range-based precision and recall are a bit complex, so let’s look at an example of manually calculating recall from a toy anomaly detection model containing both point and subsequence anomalies. We start with a simulated dataset containing 3 point anomalies and 2 subsequence anomalies and we use a simple threshold value to detect anomalies in the data (this isn’t an effective algorithm for anomaly detection, but it will help us learn about calculating range-based recall).
/*import the simulated time series dataset with fake anomalies*/
proc import datafile="path_to_data/simulated_timeseries_with_anomalies.csv"
out=simTS
dbms=csv
replace;
run;
/*plot the data to inspect the anomalies*/
proc sgplot data=simTS;
series x=time y=value;
xaxis grid;
yaxis grid;
run;
/*note the following "True" anomalies in the data:
Point Anomalies at time=20, time=100, and time=150
Subsequence Anomalies at time=(60-70) and time=(170-180)*/
data toy_anomaly_detection;
   set simTS;
   detected_anomaly = 0;
   true_anomaly = 0;
   if abs(value) > 2 then detected_anomaly = 1;
   if time in (20, 100, 150) then true_anomaly = 1;
   if (60 <= time <= 70) or (170 <= time <= 180) then true_anomaly = 1;
run;
title "Detected Anomalies";
proc sgplot data=toy_anomaly_detection;
styleattrs DATACOLORS=(verylightgrey lightred)
DATALINEPATTERNS=(solid dot);
block x=time block=detected_anomaly / transparency=0.75;
series x=time y=value;
xaxis grid;
yaxis grid;
run;
title "True Anomalies";
proc sgplot data=toy_anomaly_detection;
styleattrs DATACOLORS=(verylightgrey lightgreen)
DATALINEPATTERNS=(solid dot);
block x=time block=true_anomaly / transparency=0.75;
series x=time y=value;
xaxis grid;
yaxis grid;
run;
Evidently our anomaly detection algorithm does not capture all the ‘real’ anomalies in the data. Now we calculate range-based recall for this ‘algorithm’. (Precision is calculated similarly, but it omits the existence-reward term and computes the overlap reward over the predicted ranges rather than the real ones.) Note that the SAS DATA step code below is an inefficient way to calculate range-based recall, but it illustrates the core concepts. We also make some simplifying assumptions, setting the cardinality factor to 1 and the positional bias factor to 1 (basically we ignore some tunable parameters that allow us to adjust range-based recall). For more details please see the paper on range-based precision and recall in the references.
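Under these simplifying assumptions (cardinality factor = 1, flat positional bias), the per-range recall reduces to alpha × existence + (1 − alpha) × overlap fraction. A minimal Python sketch of that reduced formula, using hypothetical ranges rather than the simulated dataset:

```python
def range_based_recall(real_ranges, detected_times, alpha=0.5):
    """Simplified range-based recall (cardinality factor = 1, flat bias).

    real_ranges    -- list of (start, end) inclusive time ranges of true anomalies
    detected_times -- set of time steps flagged as anomalous by the detector
    """
    per_range = []
    for start, end in real_ranges:
        length = end - start + 1
        hits = sum(1 for t in range(start, end + 1) if t in detected_times)
        existence = 1.0 if hits > 0 else 0.0   # reward for detecting anything at all
        overlap = hits / length                # reward for how much of the range is covered
        per_range.append(alpha * existence + (1 - alpha) * overlap)
    return sum(per_range) / len(real_ranges)

# Hypothetical example: a point anomaly fully detected, plus a subsequence
# anomaly at times 60-70 detected only at times 60-65
real = [(20, 20), (60, 70)]
detected = {20, 60, 61, 62, 63, 64, 65}
print(round(range_based_recall(real, detected), 4))  # → 0.8864
```

The point anomaly contributes a per-range recall of 1.0; the partially detected subsequence contributes 0.5 × 1 + 0.5 × (6/11) ≈ 0.77, and the overall recall is the average of the two.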
/*now we calculate the range-based recall*/
/*first we identify detected vs real anomalies*/
data range_based_recall;
set toy_anomaly_detection;
if _N_ = 1 then anomaly_count = 0;
if true_anomaly = 0 then anomaly_id = 0;
if (true_anomaly=1 and lag(true_anomaly)=0) then anomaly_count+1;
if true_anomaly = 1 then anomaly_id = anomaly_count;
call symputx('num_anomalies',anomaly_count);
run;
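The range-labeling step above (assign an increasing ID to each consecutive run of 1s in the true-anomaly flag, and 0 elsewhere) can be sketched in Python as:

```python
def label_ranges(flags):
    """Assign an increasing ID to each consecutive run of 1s; 0 elsewhere."""
    ids, count, prev = [], 0, 0
    for f in flags:
        if f == 1 and prev == 0:   # a new anomaly range starts here
            count += 1
        ids.append(count if f == 1 else 0)
        prev = f
    return ids

print(label_ranges([0, 1, 0, 1, 1, 1, 0, 1]))  # → [0, 1, 0, 2, 2, 2, 0, 3]
```

The final ID equals the total number of anomaly ranges, which plays the role of the `num_anomalies` macro variable in the SAS code.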
/*next we calculate the assessment*/
data range_based_recall;
set range_based_recall;
/*recall includes rewards for existence and overlap*/
/*alpha determines the balance between existence and overlap in the calculation*/
alpha = 0.5;
existence_reward = 0;
overlap_set = 0;
retain max_overlap;
retain total_recall 0;
do i=1 to &num_anomalies;
if i=anomaly_id then do;
if detected_anomaly=true_anomaly then do;
existence_reward = 1;
overlap_set = anomaly_count;
end;
end;
end;
/*calculate overlap reward*/
if (anomaly_id ^=0 and lag(anomaly_id) = 0) then max_overlap = 1;
if (anomaly_id = 0 and lag(anomaly_id) ^= 0) then max_overlap = 0;
if (anomaly_id ^=0 and lag(anomaly_id) ^= 0) then max_overlap+1;
if (overlap_set ^=0 and lag(overlap_set) = 0) then overlap = 1;
if (overlap_set = 0 and lag(overlap_set) ^= 0) then overlap = 0;
if (overlap_set ^=0 and lag(overlap_set) ^= 0) then overlap+1;
if max_overlap ^= 0 then overlap_reward = overlap / max_overlap;
else overlap_reward = 0;
recall = alpha*existence_reward + (1-alpha)*overlap_reward;
run;
data range_based_recall;
set range_based_recall;
where anomaly_id ^= 0;
by anomaly_id;
if last.anomaly_id then do;
final_recall = recall;
end;
run;
proc sql;
create table recall as
select sum(final_recall) as total
from range_based_recall;
quit;
data recall;
set recall;
keep recall;
recall = total / &num_anomalies;
run;
proc print data=recall;
run;
This yields a range-based recall value of 0.7636. If you are interested in using range-based precision and recall to evaluate time series anomaly detection models you can also use the Python package “PRTS” (see references) to calculate it for you and avoid the manual calculation above.
Now that we have established what kinds of anomalies we are looking for and how we will judge the performance of the models used to find them, the next step is to explore time series anomaly detection algorithms. That will be the focus of the next post in this series, which details the broad approaches used to detect anomalies in time series.
References:
Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., and Gottschlich, J. "Precision and Recall for Time Series." Advances in Neural Information Processing Systems (NeurIPS), 2018.
PRTS: a Python package for computing range-based precision and recall for time series.
Find more articles from SAS Global Enablement and Learning here.