Anomaly Detection - Technology and Applications - Spotfire

INTRODUCTION

What are Anomalies?

Anomaly detection is a way of detecting abnormal behavior. One definition of anomalies is data points that do not conform to the pattern expected from the other items in the data set. Put differently, anomalies come from a different distribution than the rest of the data. Anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. The figure below shows a simple example of anomalies (o1, o2, o3) in a 2D dataset. The autoencoder technique described here first uses machine learning models to specify expected behavior and then monitors new data to match and highlight unexpected behavior.

(Anomalies are similar, but not identical, to outliers. Outliers are points with a low probability of occurrence within a given data set. They are observation points that are distant from other observations. However, they don't necessarily represent abnormal behavior. Outliers in data warrant attention because they can distort predictions and affect model accuracy if you don't detect and handle them. For more information on detecting outliers in Spotfire, see this article: Top 11 methods for Outlier Detection)

Overview Webinars

Webinar: Real-Time Anomaly Detection, Slides - Data Science Central
Webinar: GeoSpatial Anomaly Detection

Overview Whitepapers and Solution Briefs

INDUSTRY USE CASES

Here are a few examples from our practice:

Baseball

Baseball is one of the oldest sports in the United States, with a history dating back to the 19th century. Since 1880, there have been 101 different teams that have played a grand total of 2,829 different seasons. By looking at the data, we wanted to statistically uncover which of these 2,829 seasons were anomalies, and which teams had seasons unlike any other. To accomplish this, we utilized a method called SAX (Symbolic Aggregate Approximation) encoding. The advantage of using SAX is that it is able to act as a dimensionality reduction tool, it is tolerant of time series of different lengths, and it makes trends easier to find. For details, see this blog: Using Time Series Encodings to Discover Baseball History's Most Interesting Seasons
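To make the encoding step concrete, here is a minimal sketch of SAX in plain Python. It is an illustration only, not the code used in the blog: the function name `sax_encode`, the four-letter alphabet, and the toy series are assumptions. The series is z-normalized, averaged into a fixed number of segments, and each segment mean is mapped to a letter using equal-probability breakpoints of the standard normal distribution.

```python
from statistics import NormalDist, mean, pstdev

def sax_encode(series, n_segments, alphabet="abcd"):
    """Encode a numeric series as a SAX word (hypothetical helper)."""
    # 1. z-normalize so series on different scales become comparable
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma if sigma > 0 else 0.0 for v in series]
    # 2. piecewise aggregate approximation: average each of n_segments chunks,
    #    which is what makes series of different lengths comparable
    n = len(z)
    paa = []
    for s in range(n_segments):
        start = s * n // n_segments
        end = max((s + 1) * n // n_segments, start + 1)
        paa.append(mean(z[start:end]))
    # 3. map each segment mean to a symbol using breakpoints that split the
    #    standard normal distribution into equally likely regions
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

# Two steadily improving toy series of different lengths get the same word.
short_series = list(range(1, 9))
long_series = list(range(1, 17))
print(sax_encode(short_series, 4), sax_encode(long_series, 4))
```

Once every season is reduced to a short word, unusual seasons can be found by looking for rare words or words that are far from all others.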

Preventing Machine Breakdowns with Connected Sensor Data

Many different types of equipment, vehicles, and machines are now instrumented with sensors. Monitoring these sensor outputs can be crucial to detecting and preventing breakdowns and disruptions. Unsupervised learning algorithms like autoencoders can be used to detect anomalous data signatures that may predict impending problems. When sensor time series traces exhibit repeating patterns, special techniques can be used, such as MASS (download the document from Resources below) or the one used in the Sensor Anomaly Detection at the Edge solution on this page (shown in the image below).

Listening for Abnormalities in the Sounds Machines Make

A good mechanic can tell whether your car is OK - or not - by listening to the sounds it makes. A really good one can tell you what is wrong with it.

Abnormal sounds can be an indicator that a machine needs maintenance. The video below shows an example of an application that uses audio data from any device and learns to identify anomalous sounds made by machines. Datasets of known abnormalities can then be created and the models can be deployed for real-time scoring.

Identifying Abnormal Product

Many manufactured products undergo some form of testing to determine suitability for use. Univariate and linear multivariate Statistical Process Control methods can be used to detect anomalous products based on this data. However, with increasing component and system complexity, multivariate anomalies that also involve significant interactions and nonlinearities may be missed by these more traditional methods. These anomalies can be implicated in reliability and system failures. AI-based algorithms, such as autoencoders, can often be used to identify these complex anomalies. Once the anomalies are detected, their fingerprints can be generated so they can be classified and clustered, enabling investigation of the causes of the clusters. As new data streams in, it can be scored in real-time to identify new anomalies, assign them to clusters, and respond to mitigate potential problems.

Defects and Abnormalities in Images

Connected digital cameras today capture large amounts of raw image data. People are very good at rapidly identifying abnormalities in images. However, it is expensive and time-consuming for humans to extract critical information from large numbers of images; they often remain unprocessed. AI algorithms are increasingly used to automate this process. These use cases often involve some combination of unsupervised learning (where similar images are clustered together), human verification that images contain abnormalities, and supervised learning, to train models that automate the identification of abnormalities of interest. Examples include the identification of cancer cells and manufacturing defects in images. An example of how this is done with semiconductor wafermap spatial test and fail patterns can be found here.

Cyber Threat Detection

Networked computers today are under constant threat of ransomware and other forms of cyber-attack. System threats can be detected through analysis of computer log data, utilizing unsupervised learning models such as LSTM autoencoders for anomaly detection. LSTM autoencoders identify anomalies in the sequence of log events.

Bank Stress Test

Economic and performance data can be used for "stress testing" the capital reserves of bank holding companies to identify data anomalies. Details of one implementation can be found here:

Data Quality Management and Anomaly Detection - A Bank Stress Test Use Case

Fighting Financial Crime

In the financial world, trillions of dollars' worth of transactions happen every minute. Identifying suspicious ones in real time can provide organizations with the necessary competitive edge in the market. Over the last few years, leading financial companies have increasingly adopted big data analytics to identify abnormal transactions, clients, suppliers, or other players. Machine learning models are used extensively to make these predictions more accurate. Learn about and download the Risk Management Accelerator.

Healthcare claims fraud

Insurance fraud is a common occurrence in the healthcare industry. It is vital for insurance companies to identify claims that are fraudulent and ensure that no payout is made for those claims. An economist recently published an article that estimated $98 Billion as the cost of insurance fraud and the expenses involved in fighting it. This amount would account for around 10% of annual Medicare & Medicaid spending. In the past few years, many companies have invested heavily in big data analytics to build supervised, unsupervised, and semi-supervised models to predict insurance fraud. Learn about and download the Risk Investigation App.

TECHNIQUES for Anomaly Detection

Companies around the world have used many different techniques to fight fraud in their markets. While the list below is not comprehensive, three anomaly detection techniques have been popular.

Visual Discovery

Anomaly detection can also be accomplished through visual discovery. In this process, a team of data analysts, business analysts, and others builds bar charts, scatter plots, and other visualizations to find unexpected behavior in their business. This technique often requires prior business knowledge in the industry of operation and a lot of creative thinking to use the right visualizations to find the answers.

Supervised Learning

Supervised Learning is an improvement over visual discovery. In this technique, persons with business knowledge in a particular industry label a set of data points as normal or anomalous. An analyst then uses this labeled data to build machine learning models that will be able to predict anomalies on unlabeled new data.

Unsupervised Learning

Another very effective technique is unsupervised learning. In this technique, unlabeled data is used to build unsupervised machine learning models. These models are then used to score new data. Since the model is tailored to fit normal data, the small number of data points that are anomalies stand out. Some examples of unsupervised learning algorithms are:

Autoencoders

Unsupervised neural networks, or autoencoders, are trained to replicate the input dataset while the number of hidden units in the network is restricted. A reconstruction error is generated upon prediction. The higher the reconstruction error, the higher the possibility of that data point being an anomaly.

Clustering

In this technique, the analyst groups the data points into clusters by minimizing the within-cluster variance. Models such as K-means clustering are used for this purpose, and nearest-neighbor (KNN) distances can play a similar role. Data points that lie far from every cluster center, or that end up in small, isolated clusters, do not look similar to normal data and are treated as candidate anomalies.
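As an illustration of the clustering approach, the sketch below fits K-means to data assumed to be mostly normal and scores new points by their distance to the nearest cluster center. It is a simplified, pure-Python example under stated assumptions: the synthetic two-cluster data, the function names, and the choice of threshold (the largest distance seen in training) are hypothetical and not taken from any Spotfire asset.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest cluster center; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)

# Synthetic "normal" data: two well-separated groups (an assumption).
rng = random.Random(42)
normal = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
normal += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

centroids = kmeans(normal, k=2)
threshold = max(anomaly_score(p, centroids) for p in normal)

new_points = [(0.2, -0.1), (5.1, 4.8), (2.5, -3.0)]
flags = [anomaly_score(p, centroids) > threshold for p in new_points]
print(flags)
```

In practice the threshold would more likely be a percentile of the training scores, so that a single extreme training point does not mask later anomalies.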

One-class support vector machine

In a support vector machine, the effort is to find a hyperplane that best divides a set of labeled data into two classes. For this purpose, the margin, that is, the distance between the hyperplane and the nearest data points on either side of it, is maximized. For anomaly detection, a One-class support vector machine is trained on unlabeled data to learn a boundary around the bulk of the data, and those data points that fall outside this boundary are considered anomalies.

Time Series techniques

Anomalies can also be detected through time series analytics by building models that capture the trend, repeated patterns (such as seasonality, machine cycles), and levels in time series data. Here is an introduction to the Detection of Anomalies in Repeating Time Series using the MASS algorithm. It includes a Spotfire example.

Download this document from the Resources below: matrix_profiles_and_mass_v2.pdf
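For intuition, the quantity that MASS computes is a distance profile: the z-normalized Euclidean distance between a query pattern and every window of the series. The sketch below computes that profile with a naive loop rather than the FFT-based method MASS uses for speed, so it illustrates the idea and is not the algorithm from the document. The synthetic signal and all names are assumptions.

```python
import math
from statistics import mean, pstdev

def znorm(window):
    """Scale a window to zero mean and unit variance (zeros if flat)."""
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0:
        return [0.0] * len(window)
    return [(v - mu) / sigma for v in window]

def distance_profile(query, series):
    """Naive z-normalized Euclidean distance of query to every window."""
    m = len(query)
    q = znorm(query)
    profile = []
    for start in range(len(series) - m + 1):
        w = znorm(series[start:start + m])
        profile.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w))))
    return profile

# Synthetic machine signal: five cycles, the fourth one is distorted.
period = 20
series = [math.sin(2 * math.pi * t / period) for t in range(5 * period)]
for t in range(3 * period, 4 * period):
    series[t] = math.sin(4 * math.pi * t / period)

query = series[:period]  # a known good cycle
profile = distance_profile(query, series)
cycle_scores = [profile[c * period] for c in range(5)]
worst_cycle = max(range(5), key=lambda c: cycle_scores[c])
print(worst_cycle, [round(s, 3) for s in cycle_scores])
```

Cycles that match the reference pattern score close to zero, while the distorted cycle stands out with a large distance.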

A Design Pattern for Human-Centered Anomaly Detection and Classification

For many applications, it is not enough to determine that an item is an anomaly; it is also important to know how it is anomalous. It is important to enable the subject matter expert (SME) to remain in control throughout this process. Aided by AI, they use their knowledge of the business to help determine how anomalies will be classified and how accurate the models will be. Human-Centered AI (HCAI) provides a framework for balancing computer automation and human control. Here is a Design Pattern that we use for generating anomaly detection models consistent with HCAI principles. It achieves this by using a combination of Visual Discovery, Supervised, and Unsupervised learning techniques.

Detect anomalies
Determine a unique 'fingerprint' for each anomaly
Cluster anomalies together with similar fingerprints
- SME refines the assignment of items to clusters to determine the Classes of practical significance for the use case
Train supervised learning model for each Class of interest
- SME reviews false positives and false negatives and refines the model until it achieves the desired accuracy
Deploy supervised learning models to Classify new items that belong to each class of interest.
Monitor model health and re-train if accuracy degrades or new classes of anomalies are detected. This process can be automated or guided by the SME.

This design pattern is used in the Spotfire Anomaly Detection template and our Wafermap Pattern Recognition solution.

Autoencoders Explained

Autoencoders are unsupervised neural networks that are both similar to and different from a traditional feed-forward neural network. They are similar in that they use the same principles (i.e., backpropagation) to build a model. They are different in that they do not use a labeled dataset containing a target variable for building the model. Instead, an autoencoder uses the training dataset as both input and target, and attempts to replicate the input at its output while the hidden layers/nodes are restricted.

The focus of this model is to learn an identity function, or an approximation of it, that allows it to predict an output that is similar to the input. The network is prevented from simply copying the input by restricting the number of hidden units. For example, if we have 10 columns in a dataset (L1 in the above diagram) and only five hidden units (L2 above), the neural network is forced to learn a more restricted representation of the input. By limiting the hidden units, we can force the model to learn a pattern in the data if there indeed exists one.

Not restricting the number of hidden units and instead specifying a 'sparsity' constraint on the neural network can also find an interesting structure.

Each of the hidden units can be either active or inactive and an activation function such as 'tanh' or 'Rectifier' can be applied to the input at these hidden units to change their state.
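The following is a small, self-contained sketch of the idea described above: 10 input columns, five 'tanh' hidden units, a linear output layer, and training by backpropagation so that the output approximates the input. It is written in plain Python for readability and is not the TensorFlow implementation used in the Spotfire assets. The synthetic data, the learning rate, the number of epochs, and the function names are all assumptions.

```python
import math
import random

def forward(x, W1, b1, W2, b2):
    # Encoder: tanh hidden units. Decoder: linear output layer.
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, y

def train_autoencoder(rows, n_hidden, epochs=100, lr=0.05, seed=0):
    rng = random.Random(seed)
    n_in = len(rows[0])
    W1 = [[rng.uniform(-0.3, 0.3) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.3, 0.3) for _ in range(n_hidden)] for _ in range(n_in)]
    b1, b2 = [0.0] * n_hidden, [0.0] * n_in
    for _ in range(epochs):
        for x in rows:
            h, y = forward(x, W1, b1, W2, b2)
            err = [yi - xi for yi, xi in zip(y, x)]  # gradient of 0.5 * squared error
            # Backpropagate the error through the decoder into the hidden units.
            dh = [(1 - h[j] ** 2) * sum(err[i] * W2[i][j] for i in range(n_in))
                  for j in range(n_hidden)]
            for i in range(n_in):
                b2[i] -= lr * err[i]
                for j in range(n_hidden):
                    W2[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                b1[j] -= lr * dh[j]
                for k in range(n_in):
                    W1[j][k] -= lr * dh[j] * x[k]
    return W1, b1, W2, b2

def reconstruction_error(x, model):
    _, y = forward(x, *model)
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x))

# Synthetic normal data: 10 correlated columns driven by 2 hidden factors.
rng = random.Random(1)
mix = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(10)]

def normal_row():
    z1, z2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return [a * z1 + b * z2 + rng.gauss(0, 0.02) for a, b in mix]

train = [normal_row() for _ in range(100)]
model = train_autoencoder(train, n_hidden=5)

normal_errors = [reconstruction_error(x, model) for x in train]
anomaly = [1.0 if k % 2 == 0 else -1.0 for k in range(10)]  # breaks the learned pattern
anomaly_error = reconstruction_error(anomaly, model)
print(max(normal_errors), anomaly_error)
```

Because the network only learns to reproduce rows that follow the pattern in the training data, a row that breaks the pattern is reconstructed poorly and receives a high reconstruction error.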

Some forms of autoencoders are as follows:

Undercomplete Autoencoders
Regularized Autoencoders
Representational Power, Layer Size, and Depth
Stochastic Encoders and Decoders
Denoising Autoencoders

A detailed explanation of each of these types of autoencoders is available here.

SPOTFIRE SOLUTIONS

Spotfire Anomaly Detection Template - Autoencoders using TensorFlow

This template uses an autoencoder machine learning model to specify expected behavior and then monitors new data to match and highlight unexpected behavior. It features automated machine learning to optimize model-tuning parameters. The Time Series release includes time series analysis, so it can be used as a form of 'control chart'. It also has an input component drill-down to find the most important features influencing a reconstruction error, and a clustering analysis to group and analyze similar anomalies. Download the template from the Spotfire Exchange. See the documentation in the download distribution for details on how to use this template.

A deep learning autoencoder method is deployed using a Python data function. See this page for more information on how to build a good autoencoder model that will generalize to new datasets.

Click on the image below to see a demo of the Autoencoder deployed to our Hi Tech Manufacturing Accelerator for real-time monitoring:

Autoencoder Model deployed for real-time monitoring

Spotfire Python Data Function - Autoencoder using TensorFlow

Spotfire allows for inbuilt Python and R data functions. An autoencoder is a versatile deep learning model that is used in multivariate regression, anomaly detection, and dimension reduction. This implementation uses TensorFlow with the Keras API; both are popular Python deep learning libraries. The data function allows a user to configure different datasets, configure different neural network architectures, train and save the neural network model, and score new data using the trained models. The Spotfire DXP includes further analysis of model features contributing toward reconstruction errors and uses reconstruction errors to find a statistical golden batch of data. More information on this asset is available here. It can be downloaded from the Spotfire Exchange here.

Isolation Forest Python Data Function for Spotfire

Isolation Forests are known to be powerful, cost-efficient models for anomaly detection. They isolate anomalies using binary trees and work well in high-dimensional problems that have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies. This data function will train and execute an Isolation Forest machine learning model on a given input dataset. It can be downloaded from the Spotfire Community Exchange here.
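To show how isolation with binary trees works, here is a compact, pure-Python sketch of the idea. It is not the code inside the data function: the function names, the tree representation, the parameters, and the synthetic data are assumptions. Each tree repeatedly picks a random feature and a random split value, and points that are separated from the rest after only a few splits receive a higher anomaly score.

```python
import math
import random

def build_tree(rows, depth, max_depth, rng):
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def avg_path(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(row, node, depth=0):
    if "size" in node:  # leaf: add the expected depth of the unbuilt subtree
        return depth + avg_path(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def fit_forest(rows, n_trees=100, sample_size=64, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size

def anomaly_score(row, trees, sample_size):
    """Score in (0, 1]; short average paths give scores closer to 1."""
    mean_depth = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / avg_path(sample_size))

# Train on synthetic data without anomalies, then score two new points.
rng = random.Random(7)
training = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
trees, sample_size = fit_forest(training)
typical_score = anomaly_score((0.0, 0.0), trees, sample_size)
unusual_score = anomaly_score((6.0, 6.0), trees, sample_size)
print(typical_score, unusual_score)
```

The point far from the training data is isolated in fewer splits than the point at the center, so its score is higher.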

Local Outlier Factor Python Data Function for Spotfire

This data function uses the unsupervised local outlier factor method to perform anomaly detection on a dataset. The local outlier factor is based on the concept of local density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density. Points that have a substantially lower density than their neighbors are considered to be outliers. The data function can be downloaded from the Spotfire Community Exchange here.
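The density comparison can be sketched in a few lines of plain Python. This is a simplified illustration, not the data function's code: the function name and the toy data are assumptions, and each neighborhood uses exactly k neighbors even when there are ties at the k-th distance. A score near 1 means a point is about as dense as its neighbors, while a score well above 1 marks an outlier.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of every point, using exactly k neighbors each."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])  # distance to the k-th neighbor

    def local_reachability_density(i):
        # Reachability distance to j is at least j's own k-distance.
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

# A regular 5 x 5 grid of points plus one isolated point.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
points = grid + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(points[most_anomalous], round(scores[most_anomalous], 2))
```

The isolated point has a much lower local density than its nearest grid neighbors, which is what drives its score up.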

Autoencoder with AWS Sagemaker using Team Studio

Autoencoders are deep learning models that can be efficiently designed and trained using cloud services. This Team Studio workflow takes sensor data, performs data preprocessing, stores the data in S3, trains a model, and outputs model results into Spotfire and other data sources. It uses AWS CLI, Boto3 Python SDK, and Sagemaker Python SDK to access AWS resources via Python notebooks.

Risk Management Accelerator

The Spotfire Risk Management Accelerator identifies potentially risky activities, such as financial crime or insurance fraud, in a high-frequency event stream using machine learning. Supervised and/or unsupervised models can be built and hot deployed to the streaming event processing platform, where events are scored in real time. Alerts are then raised when potentially risky behavior is detected.

Risk Investigation App

Risk Investigation App identifies anomalous and suspicious transactions. It includes a case management framework that strengthens collaboration across the enterprise. With a centralized view of the investigation process, this highly customizable application provides an analytics-based framework with clear lines of accountability.

Hot Paths to Anomaly Detection: Sensor data on the event stream can be voluminous. In NAND manufacturing, there are millions of columns of data that represent many measured and virtual metrics. These sensor data can arrive with considerable velocity. In this session, learn about developing cross-sectional and longitudinal analyses for anomaly detection and yield optimization using deep learning methods, as well as super-fast subsequence signature search on accumulated time-series data and methods for handling very wide data in Apache Spark on Amazon EMR. The trained models are developed in Spotfire Data Science and Amazon SageMaker and applied to event streams using services such as Amazon Kinesis to identify hot paths to anomaly detection. This presentation is brought to you by Spotfire, an APN Partner.

AI and Data Science Innovation with Amazon SageMaker. Spotfire products can interact with the data on the cloud and build any type of neural network using TensorFlow. Specifically, Spotfire Data Science working with cloud resources like AWS allows users to build unsupervised neural networks for anomaly detection on data of any size. In this example, we use AWS products (s3, EMR, Redshift, and Sagemaker) to build an autoencoder using multiple nodes in a cluster. Spotfire brings real-time AI to business challenges with the Connected Intelligence Cloud. In this session, we show real-time AI in action; utilizing Amazon SageMaker, Connected Intelligence Cloud, and open source with at-scale, in-database compute; visual composition and notebooks; Slack-style collaboration among users; and model lifecycle deployment via low-code tooling such as Live Apps. We include case studies in equipment surveillance, dynamic pricing, risk management, route optimization, and customer engagement. Here are the slides

AWS ML Marketplace

TIBCO Data Science for AWS
TIBCO Autoencoder algorithm on AWS
Data Science Competencies - TIBCO is the only partner with dual competency in Data Services and ML Platform

Microsoft Collaboration - Sensor Anomaly Detection at the Edge

In collaboration with Microsoft, we have developed a containerized solution for Anomaly Detection. The Spotfire anomaly detection solution includes Microsoft Cognitive Services container deployment with anomaly detection, text mining, and root cause analysis.

Watch a presentation and demo of this solution:

For a description of the solution see this article - Anomaly Detection and Root cause analysis using Spotfire Analytics and Microsoft Cognitive Services
Microsoft Tech Community Blog - written by MicrosoftPress

Business News around Spotfire presence at MSFT Build:

Statistical Process Control

Control charts are widely used in Manufacturing, Energy, Telco, Technology, and many other sectors. They are a form of anomaly detection used to monitor key metrics, detect deviations from the baseline, and generate automated alerts. Spotfire supports many types of Shewhart (univariate) and multivariate charts; integrated limits generation, storage and deployment; selection of rules to detect out-of-control points; tagging and annotation; management and operations dashboards; periodic or real-time alerts; process capability studies and root cause drill-downs. More details about Spotfire SPC solutions can be found here:

Process Control & Anomaly Detection section on the Manufacturing Solutions page.
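As a simple illustration of how a univariate control chart flags deviations, the sketch below computes limits for a Shewhart individuals chart from baseline data and applies the basic rule of flagging points outside the three-sigma limits. It is not Spotfire's implementation: the function names and the measurement values are made up for the example. Sigma is estimated from the average moving range divided by 1.128, the standard constant for ranges of two consecutive points.

```python
from statistics import mean

def individuals_chart_limits(baseline):
    """Return (lower limit, center line, upper limit) from in-control data."""
    center = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    sigma_hat = mean(moving_ranges) / 1.128  # d2 constant for subgroups of size 2
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

def out_of_control(values, lower, upper):
    """Indices of points outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical baseline measurements from a stable process.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0]
lower, center, upper = individuals_chart_limits(baseline)

# New measurements to monitor against the stored limits.
new_values = [10.1, 9.9, 10.6, 10.0, 9.4]
alerts = out_of_control(new_values, lower, upper)
print(round(lower, 3), round(upper, 3), alerts)
```

Additional run rules, such as several consecutive points on one side of the center line, can be layered on top of the same limits.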
