
Detecting Point Anomalies in Time Series Data using the Isolation Forest

Started ‎08-08-2025
Modified ‎08-08-2025

The purpose of this post is to learn how to build an isolation forest to detect point or subsequence anomalies in a time series dataset using SAS Studio. This is a follow-up to a series of previous blog posts on time series anomaly detection. Previous posts focused on the basic concepts of time series anomaly detection and an overview of some different algorithmic approaches used to detect anomalies in time series data. This post will focus on a specific example, using SAS code to build an isolation forest to detect anomalies in the electrical penetration graph (EPG) calculated from an Asian Citrus Psyllid (a small insect that feeds on plants) as it inserts its stylet (its mouthpart) into different sections of a leaf.

 

Before we talk about the anomaly detection algorithm, let’s briefly explore this data, which comes from the UCR Anomaly Detection archive. This is a curated archive of anomaly detection data, so this dataset is known to have a single anomaly, where the psyllid moves its stylet to a new vein in the leaf on which it is feeding. The dataset contains 30,000 records, with the single anomaly occurring between observation 17210 and observation 17260. We start by loading the data into memory on the CAS server and then partitioning it into training and test datasets. Because this is time series data, the training data will be the first 5,000 observations (which contain no anomalies, so we train only on normal data), and the test data will be the remaining 25,000 observations (which include the one anomaly). This is designed to simulate how we would build a system to detect anomalies in manufacturing machines, where we have historical data for the ‘normal’ operating condition of a machine and want to build an algorithm that can detect anomalies that differ from this ‘normal’ condition.

 

cas;
caslib _all_ assign;

proc casutil;
	load file='/home/student/145_UCR_Anomaly_Lab2Cmac011215EPG1_5000_17210_17260.txt' 
	importoptions=(filetype="csv" getnames="false")
	casout="psyllid_anomaly" replace;
quit;

data casuser.psyllid_anomaly / single=yes;
	set casuser.psyllid_anomaly;
	timeID = _N_;
	rename Var1=epg;
run;

data casuser.psyllid_anomaly_train casuser.psyllid_anomaly_test;
	set casuser.psyllid_anomaly;
	if timeID <= 5000 then output casuser.psyllid_anomaly_train;
	else output casuser.psyllid_anomaly_test;
run;

 

Now we can use the SGPLOT Procedure to visualize a segment of normal data from the training data, and the segment with the anomaly in it from the test dataset. Of course, in real life, when building these anomaly detection algorithms, we may not have access to a known (labeled) anomaly, but it helps in building and testing these algorithms to have examples of known anomalies.

 

/*normal psyllid EPG data*/
title 'Normal Psyllid EPG';
proc sgplot data=casuser.psyllid_anomaly(obs=200);
	series x=timeID y=epg;
run;

/*single anomaly, moved stylet from 17210 to 17260*/
title 'Anomalous Psyllid EPG';
proc sgplot data=casuser.psyllid_anomaly(firstobs=17150 obs=17350);
	series x=timeID y=epg;
run;

 

The known anomaly between timeID 17210 and 17260 will be used to help us set an anomaly threshold score that identifies this anomaly in the data but classifies all other time points as normal.

 

Our next step is to build our anomaly detection model so that we can generate an anomaly score. In this case we will use an isolation forest, which is a variation of a random forest model that is designed specifically for anomaly detection. We will use the FOREST Procedure in SAS Studio, which is usually used to train normal random forest models but also has an option for producing an isolation forest.

 

An isolation forest is an ensemble of decision trees, each trained using the usual split-search algorithm, but instead of training the trees to classify points like we would with a binary target, we train the trees to isolate each individual data point. This is analogous to training a maximal decision tree, where we keep splitting leaves until each leaf node only has one observation. When doing binary classification this maximal tree would almost certainly overfit the training data, but that isn’t a problem for anomaly detection because we are not actually using the leaf nodes for classification. The anomaly score for any given observation is calculated based on the path length through the tree to get to the leaf node containing the observation. The basic intuition is that observations that are easier to isolate are more likely to be anomalies, and observations that are harder to isolate are more likely to be normal data. Thus, the anomaly score can be thought of as an inverse of the path length through the tree to get to the observation (the calculation is a bit more complex, but this is the basic idea). When this path length is short (i.e. a small number of splits to get to a leaf node) we have a higher value for the anomaly score, whereas when the path length is long (we must traverse many splits deep into the tree to get to a leaf node) then we will have a lower value for the anomaly score.

 

The anomaly score generated by an isolation forest model for observation x is calculated based on the path length from the root of each tree to the leaf node containing x. Since the model is an ensemble of trees, we calculate the average path length, h(x), over all of the trees in the forest. The anomaly score s(x) for observation x is then given by:

 

s(x) = 2^(-h(x)/c(n))

where c(n) is a normalizing constant (the average path length of an unsuccessful search in a binary search tree built from n observations), which keeps the score comparable across sample sizes.

 

This anomaly score will always be between 0 and 1, with values closer to 1 being more likely to be an anomaly, and values below 0.5 being likely to be normal data.
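To make the path-length intuition concrete, here is a small, illustrative Python sketch (not how PROC FOREST is implemented internally; the function names are made up for this example). It isolates a point with random splits on a single variable, averages the path length over many random "trees", and maps it to a simplified score 2^(-h(x)) without the normalizing constant, so the absolute values differ from the real score, but the ordering of points is the same:

```python
import random

def isolation_path_length(x, data, max_depth=50, rng=None):
    """Count how many random splits it takes to isolate x from data."""
    rng = rng or random.Random(0)
    pts = list(data)
    depth = 0
    while len(pts) > 1 and depth < max_depth:
        lo, hi = min(pts), max(pts)
        if lo == hi:
            break
        split = rng.uniform(lo, hi)                           # random split point
        pts = [p for p in pts if (p < split) == (x < split)]  # keep x's side only
        depth += 1
    return depth

def anomaly_score(x, data, n_trees=100):
    """Average path length h(x) over the trees, mapped to the simplified 2**(-h)."""
    h = sum(isolation_path_length(x, data, rng=random.Random(t))
            for t in range(n_trees)) / n_trees
    return 2 ** (-h)

rng = random.Random(1)
normal = [rng.gauss(0, 1) for _ in range(200)]
print(anomaly_score(0.0, normal))   # typical point: long paths, low score
print(anomaly_score(8.0, normal))   # outlier: isolated in a few splits, higher score
```

Because the outlier sits far from the bulk of the data, almost any random split separates it immediately, so its average path length is short and its score is higher.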

 

We use the FOREST Procedure with the ISOLATION option to train an isolation forest on the psyllid stylet data. We will use the default hyperparameters for this model, but it’s important to note that the isolation forest has slightly different hyperparameters from the regular forest model. The most important difference is that the INBAGFRACTION (percentage of training observations to sample for each tree) and VARS_TO_TRY (number of variables to consider for each split) options don’t have any impact on the isolation forest; instead, there is a SAMPLEN option (seen in the code below with its default value of 100) that determines the number of observations sampled for each tree, and only one variable is selected at random for each split.

 

proc forest data=casuser.psyllid_anomaly_train ISOLATION(SAMPLEN=100) seed=919;
	input epg / level=interval;
	id timeID;
	output out=casuser.psyllid_anomaly_train_score copyvars=(_ALL_);
	savestate rstore=casuser.isolation_forest_astore;
run;

 

01_arziti_forestOutput.png


 

The isolation forest is trained on the first 5000 observations of the data (the normal training data), and we use the SAVESTATE statement to create an Analytic Store (ASTORE) model artifact that we can use to deploy the model. In this example we will use the ASTORE to score the test data, but in real-world scenarios we could use it to deploy the model into a production environment. We also specify an output dataset, but this is just the calculated anomaly scores on the training data, which we know contains no anomalies. We must provide an ID variable that uniquely identifies rows in the data; since we are doing time series anomaly detection, this will always be the time ID variable that defines the time series. Forests have plenty of other hyperparameters that can be tuned to adjust performance, but for now we accept the default values on all hyperparameters, which means we end up training a forest with 100 trees. If an anomaly detection model is failing to capture known anomalies, or if it is classifying large segments of normal data as anomalous, changing the model hyperparameters is a good starting point when trying to improve it. One thing to note about isolation forests is that some random forest hyperparameters are ignored when training them, including INBAGFRACTION (which determines what percentage of observations is used to train each tree); it is replaced by the SAMPLEN option on the ISOLATION statement, which determines how many samples are used to train each tree and has a default value of 100.

 

The isolation forest algorithm can include any number of input variables, but in this example we have a single time series of EPG data, so we include just one interval input variable, the EPG signal. Adding more relevant variables can always help improve model performance. These relevant variables can be either contemporaneous time series (for example, the body temperature of the psyllid insect, although this is something we didn’t collect) or variables derived from the existing time series (lagged values of the EPG signal, or seasonal dummy variables). For simplicity in this example, we have skipped any data preprocessing to create new variables, but it is a good idea to consider some time series feature extraction before building an isolation forest. Any time series feature that could be useful as an exogenous variable in building forecasts is likely to be useful in anomaly detection as well. A good place to start learning about creating these features is a SAS course on time series feature mining and creation, linked in the references.
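As a sketch of the kind of derived inputs mentioned above, here is a small Python example (illustrative only; in a SAS DATA step this would typically be done with the LAG function) that builds lagged copies of a signal as extra input columns. The `add_lags` helper and the column names are made up for this illustration:

```python
def add_lags(series, lags=(1, 2, 3)):
    """Build lagged copies of a signal as extra inputs; early rows lack history."""
    rows = []
    for t, y in enumerate(series):
        row = {"epg": y}
        for k in lags:
            row[f"epg_lag{k}"] = series[t - k] if t >= k else None
        rows.append(row)
    return rows

features = add_lags([10.0, 10.2, 9.9, 15.0, 10.1])
print(features[3])  # {'epg': 15.0, 'epg_lag1': 9.9, 'epg_lag2': 10.2, 'epg_lag3': 10.0}
```

Lagged inputs let the model see each point in the context of its recent history, so a value that is normal in isolation but abnormal given what preceded it becomes easier to isolate.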

 

proc astore;
	score data=casuser.psyllid_anomaly_test
		  rstore=casuser.isolation_forest_astore
		  out=casuser.psyllid_anomaly_test_score;
quit;

proc sgplot data=casuser.psyllid_anomaly_test_score;
	series x=timeID y=_Anomaly_;
run;

 

We use the ASTORE procedure to score the test data using the isolation forest model trained on the first 5000 observations. The output dataset contains a variable, _Anomaly_, with the anomaly score between 0 and 1 for each observation in the test data (this is s(x) from the discussion earlier). We can then plot the anomaly score for the test data to look for anomalies and to identify a threshold value for this anomaly score.

 

02_arziti_AnomalyScoreTestData.png

 

Most of the data has an anomaly score between 0.46 and 0.56, but there are many time points with an anomaly score between 0.55 and 0.6. Only a few time points have an anomaly score above 0.60, so based on a visual inspection of the results, that would be a good starting point for an anomaly threshold. Of course, with this data we know there is a single anomaly in the entire test dataset, so we can inspect the anomaly scores for a section of normal data and compare them to the score for the known anomaly.

 

/*normal psyllid data*/
title 'Normal Psyllid EPG Anomaly Score';
proc sgplot data=casuser.psyllid_anomaly_test_score(where=(timeID<=5200));
	series x=timeID y=_Anomaly_;
run;

/*single anomaly, moved stylet from 17210 to 17260*/
title 'Anomalous Psyllid EPG Anomaly Score';
proc sgplot data=casuser.psyllid_anomaly_test_score(where=(timeID>=17150 and timeID<=17350));
	series x=timeID y=_Anomaly_;
run;

 

03_arziti_NormalAnomalyScore-1024x770.png

 

04_arziti_AnomalousAnomalyScore-1024x770.png

 

Although the normal data has a few spikes that approach 0.60, the anomalous data has some time points with an anomaly score above 0.61. Knowing that we only have one anomaly in the test data, we can select an anomaly threshold value of 0.61. With this threshold the only anomaly detected by this model in the test data will be the one plotted above, which we know is between time points 17210 and 17260.

 

data anomaly;
	set casuser.psyllid_anomaly_test_score;
	where _Anomaly_ >= 0.61;
run;
 
proc print data=anomaly;
run;

 

05_arziti_DetectedAnomalies.png

 

This model is not able to reliably detect the anomaly right when it begins (timeID 17210), but is able to detect the anomaly at time point 17234. This is a situation where range-based precision and recall could be used to evaluate model performance in a way that rewards models for predicting anomalies earlier in the anomalous subsequence, although in this case with a single anomaly the value of precision and recall is limited.
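To make that evaluation idea concrete, here is a simplified Python sketch of range recall: the fraction of the labeled anomalous range covered by detected time points. This is an unweighted simplification; the full range-based metrics in the literature add positional weighting so that detections earlier in the range count for more:

```python
def range_recall(anomaly_range, detected_points):
    """Fraction of the labeled anomalous range covered by detected time points."""
    start, end = anomaly_range
    covered = sum(1 for t in detected_points if start <= t <= end)
    return covered / (end - start + 1)

# Detecting only a few points near timeID 17234 covers a small slice
# of the labeled 17210-17260 range.
print(range_recall((17210, 17260), [17234, 17235, 17236]))
```

A model that fires at the very start of the anomalous subsequence and keeps firing throughout would score 1.0, while a model that catches only a few points partway through scores much lower.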

Real-world data will likely have many anomalies, and we are unlikely to know exactly what they look like or when they occur. In many cases when building anomaly detection models, we only have access to normal data and don’t have any examples of anomalies. This makes it more challenging to identify an effective anomaly threshold, and we must instead choose a threshold to match an expected percentage of anomalous data. If we expect anomalies in the data 0.1% of the time (perhaps based on a machine specification or prior knowledge from the business), we could choose an anomaly threshold that flags the 0.1% of the test data with the largest anomaly scores as anomalies and classifies the rest of the data as normal.
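A minimal Python sketch of that thresholding strategy (the function name is made up for illustration): sort the scores and take the value at the expected-rate cutoff, flagging everything at or above it:

```python
def threshold_for_rate(scores, anomaly_rate=0.001):
    """Pick a threshold so that roughly anomaly_rate of points score at or above it."""
    ordered = sorted(scores)
    n_flag = max(1, round(len(ordered) * anomaly_rate))   # how many points to flag
    return ordered[-n_flag]

# Pretend scores: one point in a thousand stands out.
scores = [0.50] * 999 + [0.70]
threshold = threshold_for_rate(scores, anomaly_rate=0.001)
flagged = [s for s in scores if s >= threshold]
print(threshold, len(flagged))   # 0.7 1 -> only the single highest score is flagged
```

In practice the expected rate comes from domain knowledge, and the resulting threshold should still be sanity-checked against plots of the score like the ones above.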

The main goal of this post was to illustrate the basic process of building an anomaly detection model for time series data. We used the isolation forest model because it is reasonably simple and easy to understand, especially since random forests are a well-known and popular machine learning algorithm. The Asian Citrus Psyllid anomaly dataset has a particularly easy anomaly to detect, so we didn’t have to do any time series feature engineering to detect it with the isolation forest. We also didn’t have to try multiple models, or multiple hyperparameter settings for different models, to identify a champion anomaly detection model, although that is good practice in general, since different kinds of anomalies will be easier or harder to detect with different algorithms. We will explore identifying harder anomalies using a few different approaches in a future post, comparing how well the different approaches generate anomaly scores that separate the normal data from the anomalous data.

 

References:

 

 

 

Find more articles from SAS Global Enablement and Learning here.
