The purpose of this post is to show how to detect anomalous time periods in continuous sensor data using support vector data description (SVDD). We extract time series features from windows of the data and then use these features as inputs to a model that detects anomalies in the sensor data. Here we focus on the unsupervised SVDD algorithm, but if we had labels indicating when anomalies or failures occurred in the devices monitored by the sensors, we could also use the extracted features as inputs to a supervised learning algorithm. We will use the open-source Python package tsfresh to extract the time series features, illustrating SAS Viya integration with Python. Time series features can also be extracted in SAS Viya using the TSMODEL procedure; a SAS paper with the details is included as a reference. This post focuses on extracting the features and fitting the SVDD model, while a subsequent post will cover deploying this approach using SAS Event Stream Processing.
Extracting Time Series Features from Sensor Data using Python and TSFRESH
Traditional machine learning methods and unsupervised approaches like SVDD both require tabular data: a collection of rows corresponding to observations and columns corresponding to features of each observation. This makes these methods challenging to apply to sensor or signal data, which is usually formatted as a time series with a single measured value at each time point (or, of course, multiple measured values at each time point). One way to convert time series data into something more tabular is to extract a collection of numeric features from small windows of the series. This is somewhat analogous to applying the short-time Fourier transform to signal data, but instead of extracting frequency components in each window, we extract a vector of numeric features such as the maximum or average value of the series within the window. In fact, some of the features we extract are FFT coefficients, so we are really extracting both time-domain and frequency-domain information from each window of data. All these extracted features will be used as inputs to an SVDD model to detect anomalous windows of sensor data.
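To make this concrete, here is a minimal sketch (using NumPy and a made-up signal, not the turbine data analyzed below) of the kinds of features that get extracted from a single window:

import numpy as np

# Hypothetical example: extract a few simple features from one 50-point window.
rng = np.random.default_rng(0)
window = rng.normal(loc=100, scale=5, size=50)   # stand-in for one window of sensor data

features = {
    'maximum': window.max(),                                   # time-domain
    'mean': window.mean(),                                     # time-domain
    'standard_deviation': window.std(),                        # time-domain
    'fft_coefficient_1_abs': np.abs(np.fft.rfft(window))[1],   # frequency-domain
}
print(features)

tsfresh automates exactly this idea, computing hundreds of such features per window.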
We will use the open-source Python package tsfresh to extract these features, so we will analyze our data in Python and connect to the SAS Viya CAS server to build the SVDD model. We start by importing the necessary Python packages and loading the data:
import swat                            # SAS Viya CAS interface
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from tsfresh import extract_features   # time series feature extraction

# Load the simulated turbine data and keep only Turbine 4
turbine_data = pd.read_csv('turbine.csv')
turbine_data = turbine_data.drop(['Turbine1','Turbine2','Turbine3'], axis=1)
turbine_data.head()
The dataset in this example is simulated wind turbine data (a link to the data is available in the references), with each column recording the energy generated per hour by one turbine, in kilowatts. We focus our analysis on Turbine 4 because it is known to experience an anomaly near the end of the measurement period.
sns.lineplot(turbine_data, x='time', y='Turbine4');
plt.xlabel('time index')
plt.ylabel('Turbine 4 Energy Produced Per Hour (kW)');
Visualizing the time series data, we can see that the output seems to decrease toward the end of the measurement period. This is the anomaly we want to detect using SVDD. Although the anomaly is easy to detect with PROC EYEBALL in this example (the drop in energy output toward the end of the series is plain to see), a model-based approach like SVDD is far more scalable, and it can also detect subtler anomalies that might not be obvious in a time-domain plot. It is also useful to have examples of real historical anomalies in the data in order to validate automated detection systems.
We want to extract features from a collection of windows applied to the signal data, so we must first split the data into windows. In this example we use a window size of 50 time points, and we use "jumping" windows instead of "sliding" windows, meaning each window begins on the time point immediately after the previous window ends. A sliding window (with a step size of 1) would mean that the first window begins on the first time point and covers 50 time points, while the second window begins on the second time point and covers 50 time points, 49 of which overlap with the first window. Sliding windows would give us more observations for training the SVDD model, but because the windows share many time points, an anomalous time point could appear in multiple windows.
turbine_data["Window_ID"] = np.ceil((turbine_data['time'])/50)
This Python syntax defines a column (named "Window_ID") that identifies the first 50 observations as part of the first window, the next 50 observations as part of the second window, and so on through the 900 time points in the data. The CEIL function creates an integer-valued "Window_ID" by dividing the time variable by 50 and rounding up to the nearest integer. On this small dataset, the process yields 18 windows of 50 time points each. Real signal data will often have more observations and thus more windows, and we can choose larger or smaller window sizes depending on how much data we have. Smaller windows give better resolution in the time domain when locating anomalous time points, while larger windows provide more robust extracted features that better represent the data in the window.
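For comparison, here is a minimal sketch of how sliding windows could be constructed instead (an illustrative alternative, not used in the rest of this post); because the windows overlap, each row must be replicated once per window that contains it, rather than assigned a single window ID:

# Illustrative alternative: assign overlapping sliding windows by
# replicating each row once per window that contains it.
window_size, step = 50, 1
frames = []
for start in range(0, len(turbine_data) - window_size + 1, step):
    chunk = turbine_data.iloc[start:start + window_size].copy()
    chunk['Window_ID'] = start // step + 1   # one ID per sliding window
    frames.append(chunk)
sliding_data = pd.concat(frames, ignore_index=True)

Returning to the jumping windows defined above, we extract features from each window: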
turbine_features = extract_features(turbine_data, column_id="Window_ID", column_sort="time")
We use the extract_features() function in the tsfresh Python package to extract 783 time series features from each window of data. The column_id option specifies that separate features should be extracted for each value of the 'Window_ID' column, while the column_sort option specifies that values within each window should be ordered by the 'time' column.
The turbine_features dataframe contains 18 rows, one for each of the 18 windows we created, and 783 columns, one for each time series feature extracted by the tsfresh package. With so many features, it can be a good idea to select the most useful ones as inputs to the anomaly detection model. Supervised variable selection is only possible with a labeled target, but unsupervised variable selection approaches, or simply subject matter expertise about the features and the system of interest, can be useful for reducing the number of time series features. We will stick with using as many of them as possible in this demo, but on larger datasets, using all the extracted features can incur unwanted computational costs associated with storing all these values for each window.
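As an aside, tsfresh also provides smaller predefined feature dictionaries that can be swapped in when the full default set is too costly. Here is a minimal sketch (an optional alternative, not used in the rest of this post) using MinimalFCParameters, which computes only a handful of basic features per window:

from tsfresh.feature_extraction import MinimalFCParameters

# Optional alternative: extract only a small set of basic features
# (mean, max, min, variance, etc.) instead of the full default dictionary.
minimal_features = extract_features(
    turbine_data,
    column_id="Window_ID",
    column_sort="time",
    default_fc_parameters=MinimalFCParameters()
)

Continuing with the full feature set: many of the extracted features have NaN values, since not all of them can be calculated on the short windows we provided, so we filter the columns to eliminate any NaN values.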
turbine_features.replace([np.inf, -np.inf], np.nan, inplace=True)
turbine_features.dropna(axis=1, inplace=True)
#keep the Window_ID as a variable in the dataframe
turbine_features.reset_index(inplace=True)
#change column names to conform to ESP field name requirements (preparation for deployment)
new_names = []
new_names.append('Window_ID')
for i in range(1, len(list(turbine_features))):
    new_names.append(list(turbine_features)[i].replace('"','')
                                              .replace(' ','_')
                                              .replace(',','_')
                                              .replace('.','pt')
                                              .replace('(','')
                                              .replace(')','')
                                              .replace('-','minus'))
turbine_features.columns = new_names
The code above removes the NaN and infinite values from the dataframe (by dropping any column that contains them), leaving 470 features. It also stores the 'Window_ID' as a column instead of a Pandas index, and it cleans up the variable names by removing unusual characters from the extracted feature names. This step will be necessary later when we deploy the model using SAS Event Stream Processing, which has strict requirements for field names. We can now use these extracted features as inputs to an unsupervised anomaly detection model.
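Before moving on, a quick optional sanity check can confirm that the cleaned dataframe is ready for modeling:

# Optional sanity checks before uploading the features to CAS.
# The expected shape comes from this run; exact feature counts can
# vary with the tsfresh version.
assert not turbine_features.isna().any().any(), "NaN values remain"
print(turbine_features.shape)                   # expect (18, 471): Window_ID + 470 features
print(turbine_features.columns[:5].tolist())    # spot-check the cleaned column names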
Building Support Vector Data Description Models from Extracted Features using SAS Viya
So far, all our analysis has been in Python using the pandas and tsfresh packages. Now we want to upload our data to the SAS Viya Cloud Analytic Services (CAS) server so we can use SAS to build an SVDD model. We start by using the swat Python package to connect to CAS, and then we upload the 'turbine_features' dataframe into memory on the CAS server.
conn = swat.CAS("server.demo.sas.com", 30570, 'student', 'Metadata0')
cas_turbine_features = conn.upload_frame(turbine_features, casout=dict(name='turbine_features', replace=True))
We use the svddTrain CAS action to build an SVDD model from the in-memory turbine_features data.
conn.loadactionset('svDataDescription')
conn.svDataDescription.svddTrain(
table='turbine_features',
inputs=list(cas_turbine_features)[1:],
id='Window_ID',
seed=137,
tuneMethod="MEAN",
solver="ACTSET",
savestate=dict(name='SVDD_ASTORE', replace=True),
fraction=0.05
)
The inputs are all of the time series features we extracted from the data, with one row of features per window, so the 'Window_ID' variable uniquely identifies the observations used to train the SVDD model. Rather than manually specifying the Gaussian kernel bandwidth for SVDD, we let the software choose it for us by setting tuneMethod="MEAN". The outlier fraction is set to 0.05 because we expect at least 1 anomalous window among the 18 windows created from the original data (1/18 ≈ 0.056). The choices of bandwidth and outlier fraction affect the number of outliers detected and the shape of the boundary that surrounds the normal data. We save an analytic store (ASTORE) after training the model; we will use the ASTORE for scoring and for future deployment in SAS Event Stream Processing.
The RBF Kernel Bandwidth is the value selected by the tuner, while the Threshold R Square value determines the threshold for detecting anomalies: if the _SVDDDISTANCE_ value for an observation is greater than the Threshold R Square value, we classify that observation as an anomaly. Real-world problems involve more data, and thus more observations and a longer training and tuning time for the SVDD model.
conn.loadactionset('astore')
conn.astore.score(
table='turbine_features',
rstore='SVDD_ASTORE',
casout=dict(name='svdd_scored', replace=True)
)
We use the ASTORE generated during training to score the data and see what anomalies are detected. (Note that in this example we are skipping the important step of splitting our data into training and validation sets, which helps ensure the model is useful on new data.)
display(conn.CASTable('svdd_scored').head(20))
conn.CASTable('svdd_scored')['_SVDDSCORE_'].value_counts()
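To connect the scored output back to the decision rule described above, here is a minimal sketch of re-deriving the anomaly flag by hand in pandas; the threshold value shown is a hypothetical placeholder, and you would substitute the Threshold R Square value reported by svddTrain for your run:

# Illustrative only: re-derive the anomaly flag from the scored distances.
# THRESHOLD_R2 is a hypothetical placeholder; substitute the Threshold R Square
# value from the svddTrain output. Window_ID should be carried through scoring
# because it was specified as the id= variable in training.
THRESHOLD_R2 = 0.9
scored = conn.CASTable('svdd_scored').to_frame()
scored['is_anomaly'] = scored['_SVDDDISTANCE_'] > THRESHOLD_R2
print(scored[['Window_ID', '_SVDDDISTANCE_', 'is_anomaly']])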
The model identifies one anomaly in the time series, corresponding to the final window, window number 18. We don't know exactly when the anomaly occurs within that window, but we know there is an anomaly in the data between time points 851 and 900.
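Because the window IDs were assigned with CEIL(time/50), mapping a flagged window back to the time points it covers is simple arithmetic:

# Map a flagged window back to the time points it covers.
# Window k covers time points (k - 1)*50 + 1 through k*50.
window_id, window_size = 18, 50
start = (window_id - 1) * window_size + 1   # 851
end = window_id * window_size               # 900
print(f"Window {window_id} covers time points {start} through {end}")

The next step in the process is to deploy this model on streaming data, so we download the ASTORE as a file on the operating system for deployment using SAS Event Stream Processing.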
svdd_astore = conn.astore.download(rstore='SVDD_ASTORE')
with open('turbine_svdd_astore.sasast','wb') as file:
    file.write(svdd_astore['blob'])
Deploying this model using SAS Event Stream Processing will be a mild challenge, because we will need to deploy the feature extraction that prepares the data in addition to the SVDD model itself. This requires using SAS Event Stream Processing functionality to run Python code against the streaming data. A future post will extend this example by illustrating how to deploy the model in SAS Event Stream Processing.
This example uses an unsupervised SVDD algorithm to identify anomalies in the signal data, but if we had any historical information about when anomalies occurred (i.e., if we had a labeled target) we could just as easily fit a supervised machine learning method like gradient boosting to detect anomalies using the extracted time series features.
References: