  • Top 11 methods for Outlier Detection in Spotfire®


    Overview

    Mathematically, any observation far removed from the mass of data is classified as an outlier. In practice, outliers can come from incorrect or inefficient data gathering, industrial machine malfunctions, fraudulent retail transactions, and so on. It is therefore essential to detect and isolate outliers so that corrective treatment can be applied. You can use Spotfire to identify and label outliers in the following ways. This article started out as "Top 10 methods", but the release of the Violin Plot Mod took the number 1 spot, so now we have 11!

    1. Use the Violin Plot Mod for Spotfire

    image.png.0a67182d2b00b93683e82d5c82a6ed03.png

    The Violin Plot Mod builds on Spotfire's own Box Plot (below) - allowing you to easily view and compare the distribution of any data along with key statistical measures shown by the included box plot and displayed in the statistics table. View the complete article: Violin Plot Mod for Spotfire for more details and for a link to download the mod!

    2. Use a box plot

    boxplot_0.png.210626e32276b5c8a1bcb49d3b42e76c.png

    Box and whisker plot (box plot) shows the relationship between a numerical y-variable and a grouping x-variable using the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. In addition, Spotfire provides the lower adjacent value (LAV) and upper adjacent value (UAV), which are derived from the lower and upper inner fences:

    Lower inner fence = Q1 - 1.5 * IQR

    Upper inner fence = Q3 + 1.5 * IQR

    where IQR = Q3 - Q1 is the interquartile range. The LAV is the smallest data value at or above the lower inner fence, and the UAV is the largest data value at or below the upper inner fence. Any point falling outside the inner fences is marked as an outlier. The tooltip label for an outlier includes additional information compared with the other data points in the plot.
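
    The 1.5 * IQR rule above can be reproduced outside the visualization, for example in a TERR data function. The following is a minimal sketch in base R, assuming made-up data and R's default quantile method, which may not match the quartile calculation Spotfire uses internally. All variable names are illustrative.

```r
# Illustrative data: nine typical readings plus one extreme value (made up).
x <- c(10, 12, 11, 13, 12, 11, 14, 12, 13, 40)

q1  <- unname(quantile(x, 0.25))  # first quartile (R default, type 7)
q3  <- unname(quantile(x, 0.75))  # third quartile
iqr <- q3 - q1                    # interquartile range

# Limits at 1.5 * IQR below Q1 and above Q3.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Flag every value outside the two limits.
is_outlier <- x < lower_limit | x > upper_limit
outliers   <- x[is_outlier]
print(outliers)
```

    The logical vector is_outlier could be returned as a new column and then used for coloring or filtering.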

    3. Configure other plots

    Other plots from Spotfire quick access menu that are commonly used to identify outliers:

    • Bar Chart in histogram configuration to identify univariate outliers
    • Scatter plot in QQ plot configuration to identify bivariate outliers in distributions
    • Combination plot in Pareto chart configuration to identify outliers based on cumulative value
    • Parallel Coordinate Plot (PCP) for multivariate outlier detection

    4. Data Panel Histogram

    data_panel_histogram.png.c8b898e01b0b54cf77051b03a99cc744.png

    The column overview data panel for in-memory as well as in-database (external) data shows a histogram of distribution for numerical columns. 

    Users can also insert custom lines for isolating outliers in multimodal data. Consider data drawn from a standard normal distribution: about 5% of the data falls beyond two standard deviations and will therefore be picked up as outliers by common statistical tests, even though this is simply the nature of the distribution that the points follow. For such cases, Spotfire gives you the flexibility to insert lines from custom expressions rather than depending entirely on predefined methods of outlier detection.

    histogram_for_outlier_detection.png.5b5ad8d55095147a9888a5f01c1ad64d.png

    Fig. shows a histogram with outliers identified as points beyond 2 standard deviations from the mean
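
    The two-standard-deviation threshold shown in the figure can be sketched as follows. This is an illustration in base R with made-up data, not the custom expression used to draw the lines in the figure. The cutoff of 2 is a parameter you would tune to your data.

```r
# Illustrative data: values near 5 plus one extreme reading (made up).
x <- c(5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 9.0)

n_sd <- 2                 # number of standard deviations used as the cutoff
center <- mean(x)
spread <- sd(x)           # sample standard deviation

lower_line <- center - n_sd * spread
upper_line <- center + n_sd * spread

# Points beyond either line are treated as outliers.
is_outlier <- x < lower_line | x > upper_line
print(which(is_outlier))
```

    Because the mean and standard deviation are themselves pulled toward extreme values, this rule can miss outliers in small samples. The quartile-based limits from the box plot section are less sensitive to that effect.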

    5. Use Column Aggregation Functions

    column-aggregation.png.cf7b0b9467195015f746026af09a77de.png

    The y-variables for visualization types available in Spotfire can be aggregated to display Outlier Counts, Percent Outliers, Percentiles and Quartiles. These measures can be passed to configuration properties, such as the color schemes described in point 7 below, to visually separate outliers from the rest of the data.
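
    To make the meaning of these aggregations concrete, the sketch below computes an outlier count and a percent-outliers value per group in base R. It is an assumption-laden illustration: the data is made up, the helper flag_outliers is hypothetical, and outliers are defined with the 1.5 * IQR rule from the box plot section using R's default quantile method. It does not call Spotfire's built-in aggregation functions.

```r
# Illustrative data: two groups of eight readings each (made up).
readings <- data.frame(
  group = rep(c("A", "B"), each = 8),
  value = c(10, 11, 12, 11, 10, 12, 11, 30,   # group A has one extreme value
            20, 21, 22, 21, 20, 22, 21, 23)   # group B has none
)

# Hypothetical helper: TRUE for values more than 1.5 * IQR outside Q1..Q3.
flag_outliers <- function(v) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  margin <- 1.5 * (q[2] - q[1])
  v < q[1] - margin | v > q[2] + margin
}

# Aggregate per group: number of outliers and percentage of outliers.
outlier_count <- tapply(readings$value, readings$group,
                        function(v) sum(flag_outliers(v)))
outlier_pct   <- tapply(readings$value, readings$group,
                        function(v) 100 * mean(flag_outliers(v)))
print(outlier_count)
print(outlier_pct)
```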

    6. Use TERR to detect outliers

    Custom expressions, expression functions and data functions all allow the user to extend Spotfire capabilities by integrating it with 10,000+ packages from CRAN using TERR or open source R. One example of combining a TERR expression with color is to choose a gradient color scheme based on outlier scores calculated by a one-line expression:

    outlier.score <- Rlof::lof(datacolumn, k=5)

    Here the Rlof package contains the lof function, an implementation of the widely used Local Outlier Factor algorithm for detecting outliers. These scripts map Spotfire data elements (tables, columns, properties, etc.) to R function inputs and can be saved and reused across columns, visualization configurations and so on. Such flexibility and extensibility in Spotfire is unmatched by any market contemporaries.

    For more extensive analysis, such as Mahalanobis distance analysis for outlier detection, TERR data functions can be leveraged. Output from the data functions can be automatically plotted onto interactive, brush-linked visualizations.
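
    As an illustration of the Mahalanobis distance analysis mentioned above, here is a minimal sketch that could form the body of a data function. It uses only base R. The data is made up, and the chi-square cutoff at the 97.5% level is one common convention chosen here as an assumption, not a Spotfire default.

```r
# Illustrative two-column data: the last row breaks the x-y trend of the others.
measurements <- cbind(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 5),
  y = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1, 9.0, 1.0)
)

center <- colMeans(measurements)   # column means
covmat <- cov(measurements)        # sample covariance matrix

# mahalanobis() returns the squared distance of each row from the center.
d2 <- mahalanobis(measurements, center, covmat)

# Assumed cutoff: 97.5% quantile of a chi-square distribution
# with degrees of freedom equal to the number of columns.
cutoff <- qchisq(0.975, df = ncol(measurements))
is_outlier <- d2 > cutoff
print(which(is_outlier))
```

    The last row is unremarkable in x and y taken separately, which is why a multivariate distance is used here instead of per-column limits. The d2 values could also be returned as a score column and mapped to a gradient color scheme.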

    7. Enable Color Scheme Rules

    colour_scheme_rules.png.f0797dd95d46e109c508bb90e2ad70e4.png

     

    Fig. shows all available color schemes and highlights the out-of-the-box outlier color scheme

    Outliers can be smartly identified using dynamic outlier color schemes based on dynamic rules that the user can enable. These rules include:

    • Exclude Outlier color scheme in predefined color schemes
    • Simple built-in conditional color options for points less than the Lower Inner Fence or greater than the Upper Inner Fence
    • Threshold by mean, median, custom user specified expression
    • Use gradient color scheme with dynamic Outlier Scores created in TERR as above

    8. Leverage Curve Fit or Regression

    Lines and Curves in the Spotfire visualization properties let you insert a curve fit or a line fit to the data. This fit can then be used to identify points that deviate strongly from it, i.e. outliers!
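
    The idea of using a fit to expose deviating points can be sketched in base R with a straight-line fit and standardized residuals. This is an illustration with made-up data. It does not reproduce how Spotfire's Lines and Curves feature computes its fits, and the threshold of 2 is an assumed convention.

```r
# Illustrative data: a linear trend with small noise and one disturbed point.
x <- 1:12
noise <- c(0.2, -0.1, 0.1, -0.2, 0.0, 0.1, -0.1, 6.0, 0.1, -0.2, 0.2, -0.1)
y <- 2 * x + 1 + noise

fit <- lm(y ~ x)              # straight-line fit

# Standardized residuals: residuals scaled by their estimated standard error.
std_resid <- rstandard(fit)

# Assumed threshold: flag points more than 2 standardized units from the fit.
is_outlier <- abs(std_resid) > 2
print(unname(which(is_outlier)))
```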

    9. Similarity or Clustering

    Spotfire provides out-of-the-box functionality to apply Line Similarity and K-means clustering to visualizations from the Tools menu. The user can choose the similarity metric (Euclidean or Correlation) and other parameters, such as the number of clusters, to create a line similarity or clustering label column in the data. This column can then be used in the color or trellis options.

    A stable number of clusters can be found by applying hierarchical clustering to the data. Hierarchical clustering is also available from the Tools menu in Spotfire and results in a heat map visualization with a dendrogram based on the distance metric. Sliding the cutoff point to the desired position in the dendrogram helps decide on a stable number of clusters.

    clustering.png.a85807ed8230514899facdb331af0a11.png

     

    Fig. shows out-of-the-box K-means clustering on data. The grey line in the Empty cluster is an outlier.

    If the data has outliers, they will fall into their own cluster when the number of clusters is greater than the stable number.
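
    The effect described above can be sketched in base R. Note that this example uses hierarchical clustering (hclust with average linkage) rather than K-means, because its result does not depend on random starting points. The data is made up: it has two natural groups plus one isolated point, so the stable number of clusters is 2 and requesting 3 isolates the outlier.

```r
# Illustrative data: two tight groups and one isolated point (made up).
profiles <- rbind(
  c(1.0, 1.1), c(1.2, 0.9), c(0.9, 1.0), c(1.1, 1.2),   # group 1
  c(5.0, 5.1), c(5.2, 4.9), c(4.9, 5.0), c(5.1, 5.2),   # group 2
  c(12.0, 0.5)                                           # isolated point
)

# Build the dendrogram from Euclidean distances.
tree <- hclust(dist(profiles), method = "average")

# Cut the tree into one more cluster than the stable number (2 + 1 = 3).
cluster_id <- cutree(tree, k = 3)

# Clusters containing a single row are treated as outliers.
cluster_sizes <- table(cluster_id)
singletons    <- as.integer(names(cluster_sizes)[cluster_sizes == 1])
outlier_rows  <- which(cluster_id %in% singletons)
print(outlier_rows)
```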

    10. Explore advanced Configurations

    We discussed creating new calculations and columns with expressions, expression functions and data functions. These can be connected to configuration options that automatically label outliers. Advanced configurations from visualization properties extend beyond the color feature and can be applied similarly to markings, filters, subsets and labels across visualizations.

    11. Templates on Community Exchange

    To aid the Citizen Data Scientist, the Spotfire Data Science team makes several plug-and-play templates available for free on community.spotfire.com/extensions under the "Analytics" tag. These templates allow the user to plug in their data with the push of a button and explore the insights with minimal configuration.

    Anomaly detection using deep learning neural nets is one such template. It analyses the input data to find anomalies based on reconstruction error during unsupervised learning. Another domain-specific use case is the Statistical Process Control Template for Spotfire using Spotfire Data Science Workbench, which identifies violations or outliers relative to the established control limits of the individual points, the moving average and the variance. These templates allow the user to extend the definition from a common outlier to a domain-specific outlier and to identify and label the same.
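
    To illustrate what a control-limit violation means, the sketch below computes limits for an individuals chart in base R. It is not the template's implementation. It assumes the textbook convention of estimating sigma from the average moving range divided by 1.128 and placing limits at three sigma, it uses made-up data, and it covers only the individual points, not the moving average or variance charts.

```r
# Illustrative process readings with one disturbed measurement (made up).
x <- c(10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 13.0, 10.0, 10.1, 9.9)

center_line  <- mean(x)
moving_range <- abs(diff(x))           # range between consecutive readings

# Estimate sigma from the average moving range (1.128 is the d2 constant
# for subgroups of size 2).
sigma_hat <- mean(moving_range) / 1.128

ucl <- center_line + 3 * sigma_hat     # upper control limit
lcl <- center_line - 3 * sigma_hat     # lower control limit

# Readings outside the control limits are violations.
violations <- which(x > ucl | x < lcl)
print(violations)
```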

    How do I learn more?

    This briefly summarizes the top 11 methods for outlier detection. Watch the page and vote up to get notified about detailed updates. You could also request a featured session on any specific method from above on Dr. Spotfire by:

    See Also

    1. Adjacent Values and Outliers
    2. Custom expressions and Expression Functions in Spotfire
    3. How to configure Pareto chart from Combination chart in Spotfire
    4. Shewhart Control Charts & Trend Charts in Spotfire
    5. Clustering made simple with Spotfire
    6. Anomaly Detection - Technology and Applications 
