    Data Quality Management and Anomaly Detection


    This article illustrates common methods for blending, transforming, and cleansing data, and for identifying anomalies, using Spotfire Data Science - Team Studio. The data is a simulated set of economic and performance data for a bank, such as might be used for "stress testing" the capital reserves of bank holding companies in the U.S.

    Overview

    The three template workflows described below profile the data, cleanse it, and monitor its quality. Their output is both a cleansed dataset and a set of reports that provide metrics of data quality. In another article, Stress-Test Financial Modeling, we use these outputs to build a predictive model and a Touchpoint for what-if scenarios related to financial stress-testing.

    Data Requirements

    The inputs are quarterly balance sheet data for several different business units of a bank. The dependent variables measure non-interest expenditures and capital reserves. Features represent major economic indicators (e.g. U.S. GDP) and bank profile data (e.g. number of employees in the bank, aggregated loan amounts).

    Data Discovery and Profiling


    First template workflow, for reviewing the structure of the data, distributions of each field, and interesting correlations.

    The data has almost 1,000 economic metrics, many of which may contain sparse, erroneous, or irrelevant information. In this first workflow, we review the overall quality of the data. We generate Summary Statistics for all of the input fields, noting that many fields contain an overwhelming number of null values, and that some contain unexpectedly high, low, or negative values. We then review the variables that appear most correlated with the dependent variable (which in this case is a measure of expenditures, but might also be a measure of total reserves). Applying a Row Filter to the Correlation output, we can determine a list of about 250 variables that seem most likely to be predictive of the dependent variable. We might also choose to retain variables that are generally known to have predictive power. In a more advanced analysis, we might apply standard power transformations to find more subtle correlations.
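
    As a rough illustration of the same screening logic outside the visual workflow, the pandas sketch below computes the correlation of every numeric field with the dependent variable and keeps those above a cutoff. The file name, the column name non_interest_expense, and the 0.3 threshold are illustrative assumptions, not values from the Playbook.

        import pandas as pd

        # Illustrative file and column names only; the Playbook's data differs.
        df = pd.read_csv("bank_quarterly.csv")
        target = "non_interest_expense"

        # Absolute Pearson correlation of every numeric feature with the dependent variable.
        corr = (
            df.corr(numeric_only=True)[target]
              .drop(target)
              .abs()
              .sort_values(ascending=False)
        )

        # Keep the features that clear an arbitrary cutoff, mimicking the Row Filter
        # applied to the Correlation output.
        selected = corr[corr > 0.3].index.tolist()
        print(f"{len(selected)} candidate predictors retained")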

    Key Technique - Summary Statistics for Data Discovery

    The Summary Statistics operator will count the number of empty or missing values in each field, and also provide information about the distribution of data, including medians, quartiles, and most common values. This information may then be used to identify variables that need to be excluded from analysis or that need to be cleansed by removing outliers or filling in missing values.
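
    For reference, roughly equivalent profiling can be sketched in a few lines of pandas, as shown below; the file name is a placeholder and the 50% null threshold is an arbitrary choice.

        import pandas as pd

        df = pd.read_csv("bank_quarterly.csv")   # placeholder file name

        # Per-field profile: null counts and a few distribution summaries.
        profile = pd.DataFrame({
            "null_count": df.isna().sum(),
            "null_pct": df.isna().mean() * 100,
            "min": df.min(numeric_only=True),
            "median": df.median(numeric_only=True),
            "max": df.max(numeric_only=True),
        })

        # Surface fields that are mostly empty or that contain suspicious negative values.
        print(profile[(profile["null_pct"] > 50) | (profile["min"] < 0)])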

    Data Cleansing and Integration


    Second template, for cleansing data, filling in blanks, and removing unnecessary rows and columns

    During the data discovery phase, we found that many variables and rows contained a large number of missing values, and that several variables contained unexpected values. In this workflow, we attempt to clean the data without discarding valuable business information or predictive power.

    First, we use Summary Statistics to identify the fields that contain a large percentage of empty, missing, or null values, which can then be removed with a Column Filter. For the remaining variables, we mostly choose to replace missing values with zero or a simple average. But in the case of the dependent variable, we choose to use the average value for the corresponding year. (We could also use a Window function to fill with something like a running average.)
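
    A minimal pandas sketch of these replacement rules follows; the column names quarter_end and non_interest_expense and the 50% cutoff are assumptions made for the example, not the Playbook's actual settings.

        import pandas as pd

        df = pd.read_csv("bank_quarterly.csv", parse_dates=["quarter_end"])  # placeholder names
        target = "non_interest_expense"

        # Drop columns that are mostly empty (the Column Filter step).
        mostly_null = df.columns[df.isna().mean() > 0.5]
        df = df.drop(columns=mostly_null)

        # Fill remaining numeric gaps with a simple column mean (zero is another option).
        numeric_cols = df.select_dtypes("number").columns.drop(target)
        df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

        # For the dependent variable, fill with the average of the corresponding year.
        year = df["quarter_end"].dt.year
        df[target] = df[target].fillna(df.groupby(year)[target].transform("mean"))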

    Next, we select those variables for which there appeared to be unexpectedly high or low values, and we use Replace Outliers to trim the bottom and top percentiles.

    However, we still have over 1,000 columns remaining, and this may lead to problems with predictive models (e.g. over-fitting, or collinearity between variables leading to meaningless coefficients). We decide to further reduce the dataset with two different dimensionality reduction techniques. The first is to look for correlations among the features themselves: the Correlation Filter identifies highly correlated features and then removes those that are least correlated with the dependent variable. The other technique is Principal Component Analysis, which uses concepts from linear algebra to define transformations of the original variables that retain as much information as possible in a smaller number of variables. (In a later workflow, we will also apply the Variable Selection operator, which uses statistical tests to remove variables with less predictive power.)
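
    The sketch below shows a simplified version of both reductions in pandas and scikit-learn. Note that this toy correlation filter drops one member of each highly correlated pair arbitrarily, whereas the Team Studio operator keeps the member more correlated with the dependent variable; the 0.95 cutoff, the file name, and the 95% explained-variance target are illustrative assumptions.

        import numpy as np
        import pandas as pd
        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        # Assume this holds the cleansed, numeric feature matrix from the steps above.
        features = pd.read_csv("bank_quarterly_clean.csv").select_dtypes("number")

        # Simplified correlation filter: drop one column from every highly correlated pair.
        corr = features.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
        reduced = features.drop(columns=to_drop)

        # PCA on standardized data, keeping enough components for 95% of the variance.
        scaled = StandardScaler().fit_transform(reduced)
        components = PCA(n_components=0.95).fit_transform(scaled)
        print(components.shape)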

    The output of the Correlation Filter and the PCA are saved for use by the Data Science team.

    Key Technique - Replace Outliers

    In real-world datasets, unusually large or negative values may be signs of dirty data. For example, a value of "1000" or "-1" in an "age" field might be a code used for unknown values, or it might be a sign of a data entry problem. In other cases, outliers may be genuine observations representing chance events that are not useful for building predictive models. In such cases, the Replace Outliers operator may be used to clean the data.
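
    As a rough pandas analogue of trimming extreme values, the snippet below caps a couple of assumed columns at their 1st and 99th percentiles; the file name, column names, and percentile bounds are all assumptions for illustration.

        import pandas as pd

        df = pd.read_csv("bank_quarterly.csv")            # placeholder file name
        suspect_cols = ["total_loans", "employee_count"]  # hypothetical column names

        # Cap each suspect field at its 1st and 99th percentiles, roughly what the
        # Replace Outliers operator does when trimming the bottom and top percentiles.
        low = df[suspect_cols].quantile(0.01)
        high = df[suspect_cols].quantile(0.99)
        df[suspect_cols] = df[suspect_cols].clip(lower=low, upper=high, axis=1)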

    Data Quality Monitoring and Reporting


    Third template, for monitoring and reporting on data quality

    Besides cleansing the input data, it may be useful to generate measurements of the quality of the data and make them available as reports. This workflow creates simple metrics of data anomalies and sparseness, and generates frequency charts and reports that may be scheduled to run regularly on the latest data.

    To simplify our calculations, we begin by creating an indicator or counter variable for each of the input variables - the counter is 1 when the input variable is null, and 0 otherwise. We can then sum these up to provide an overall measure of the sparseness of each row of data. Creating these counter variables is easy using the "multi" option in the Variable operator: we define a counter for just one of the input variables, and the operator automatically generates a counter for every other input variable. In the following Pig Execute operator, the counters are summed across the entire row of data. (This could also be done with the Variable operator, but the Pig Execute operator provides a simple Pig editor and a convenient template that lists all of the incoming variables.)
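
    Outside Team Studio, the same indicator-and-sum idea can be sketched in a few lines of pandas (the file name is a placeholder):

        import pandas as pd

        df = pd.read_csv("bank_quarterly.csv")   # placeholder file name

        # One 0/1 indicator per input column, mirroring the "multi" Variable step.
        null_flags = df.isna().astype(int).add_suffix("_isnull")

        # Row-wise total of the indicators, mirroring the Pig Execute summation.
        df["null_count"] = null_flags.sum(axis=1)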

    Next, we use Row Filters to separate those rows that contain missing values, and those that have complete data. For the latter dataset, we normalize the data and then identify anomalies by counting rows that are more than a specified number of standard deviations from the mean. Again, we use the "multi" option in the Variable operator.
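
    A compact pandas sketch of this anomaly check is shown below; the three-standard-deviation threshold is an arbitrary illustration, as is the file name.

        import pandas as pd

        df = pd.read_csv("bank_quarterly.csv")   # placeholder file name
        complete = df.dropna().copy()            # rows with no missing values
        numeric = complete.select_dtypes("number")

        # Standardize each field, then flag rows in which any value lies more than
        # three standard deviations from its mean.
        zscores = (numeric - numeric.mean()) / numeric.std()
        complete["is_anomaly"] = zscores.abs().gt(3).any(axis=1)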

    Finally, we merge the three data streams together into a single result set that has labels indicating whether rows contain missing, anomalous, or 'good' data. We also generate a frequency chart that indicates the relative frequency of these rows.
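
    Continuing the sketch above (df and complete come from the previous snippet), the labeling and the relative frequencies behind the chart could be produced as follows:

        import numpy as np

        # Label every row as missing, anomalous, or 'good'.
        df["quality"] = "missing"
        df.loc[complete.index, "quality"] = np.where(
            complete["is_anomaly"], "anomalous", "good"
        )

        # Relative frequencies of the three categories, i.e. the frequency chart's data.
        print(df["quality"].value_counts(normalize=True))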


    Frequency chart for categorizing unclean data

    Key Technique - The "multi" Option in the Variable Operator

    Creating new variables commonly involves using the same variable expression repeatedly on a large number of input variables. For example, a dataset may contain hundreds of variables that measure pressure at different points in a factory process, or that measure the number of visits to each page on a web property. In these cases, it may be desirable to apply the same custom transformation to each input variable. The Variable operator has a 'multi' option which allows the user to define a single expression on one variable, and then to apply it to all variables, automatically via a simple substitution.
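
    The same fan-out idea can be illustrated in pandas: write the expression once as a function and apply it to every numeric column. The log transformation and file name here are arbitrary examples.

        import numpy as np
        import pandas as pd

        df = pd.read_csv("sensor_readings.csv")   # hypothetical wide dataset

        def log_transform(col: pd.Series) -> pd.Series:
            # The single expression, written once.
            return np.log1p(col.clip(lower=0))

        # Apply it to every numeric column, analogous to the "multi" substitution.
        transformed = df.select_dtypes("number").apply(log_transform).add_suffix("_log")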

    Key Technique - Pig Execute for Complex Transformations

    While the Variable operator offers great flexibility for defining transformations of input variables, it can be difficult to use for complex transformations that involve many lines of code, or that don't easily operate row-by-row. The Pig Execute operator offers complete flexibility for ad hoc transformations, custom code, complex expressions, and non-standard operations.
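
    As an illustration of the kind of multi-step, not-strictly-row-wise logic that fits better in a script than in a single expression, the pandas sketch below compares each quarter's expenditure to a trailing four-quarter average per business unit; the column names are hypothetical.

        import pandas as pd

        df = pd.read_csv("bank_quarterly.csv", parse_dates=["quarter_end"])  # placeholder names

        # A trailing four-quarter average per business unit, then a ratio against it:
        # awkward as a single row-by-row expression, natural as a short script.
        df = df.sort_values(["business_unit", "quarter_end"])
        rolling_avg = (
            df.groupby("business_unit")["non_interest_expense"]
              .transform(lambda s: s.rolling(4, min_periods=1).mean())
        )
        df["expense_vs_trailing_avg"] = df["non_interest_expense"] / rolling_avg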

    Check It Out!

    For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.

