Use Case Overview
This article follows on from the article Data Quality Management and Anomaly Detection, which illustrates methods for modeling the effects of economic conditions on bank reserves and other high-level bank indicators. The data is a simulated set of economic and performance data for a bank, such as might be used for "stress testing" the capital reserves of bank holding companies in the U.S. We use these data to build a predictive model and a Touchpoint for what-if scenarios related to financial stress-testing.
Data Requirements
The inputs are quarterly balance sheet data for several different business units of a bank. The dependent variables measure non-interest expenditures and capital reserves. Features represent major economic indicators (e.g. U.S. GDP) and bank profile data (e.g. number of employees in the bank, aggregated loan amounts).
'Stress Test' Modeling
First template workflow, for predicting risk to a bank based on changing economic conditions
In the Playbook on Data Quality Management we prepared a dataset that contained economic and business indicators for a U.S. bank, as well as general measures of the bank's reserves and expenditures. In this workflow, we will build predictive models that use major economic indicators (e.g. U.S. GDP) and bank profile data (e.g. number of employees in the bank, aggregated loan amounts) to predict the impact on the bank's capital reserves.
The dataset originally had almost 1,000 economic metrics, many of which contained sparse, erroneous, or irrelevant information. We used Null Value Replacement, Replace Outliers, and other techniques to cleanse the data, and we used two dimensionality reduction techniques, Correlation Filter and PCA, to reduce the number of variables to a much smaller set.
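The two reduction strategies can be sketched in Python. This is an illustrative stand-in, not the Spotfire operators themselves: the column names and thresholds are invented, and scikit-learn's PCA substitutes for the platform's PCA operator.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Illustrative data: 40 quarters x 6 economic metrics (the real dataset
# had ~1,000 metrics; the column names here are invented).
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 3))
X = pd.DataFrame(
    np.hstack([base, base + rng.normal(scale=0.01, size=(40, 3))]),
    columns=["gdp", "cpi", "unemp", "gdp_dup", "cpi_dup", "unemp_dup"],
)

# Correlation filter: drop any column highly correlated with an earlier one.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_filtered = X.drop(columns=to_drop)

# PCA: instead of dropping columns, project onto the top components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
```

The correlation filter keeps original, interpretable columns; PCA produces compact but opaque linear combinations. The workflow below evaluates models built on both.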
In this workflow, we will use linear regressions and regression forests to build predictive models. We will then evaluate combinations of different models against the two different dimensionality reduction techniques, to see which models work best on out-of-sample data.
First, for each of the two inputs (one from PCA, one from the Correlation Filter), we generate a training and testing sample. In this case, the resulting subsamples are quite small, so we want to avoid overfitting. We apply Linear Regression and Alpine Forest Regression, using all of the input variables. Forests are generally good at avoiding overfitting. For the Linear Regression we use elastic net penalties to simplify the model. We also train a second linear regression, having first applied Variable Selection to reduce the number of input variables further.
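The modeling step above can be sketched with scikit-learn. This is a hedged approximation: `make_regression` stands in for the prepared bank data, `ElasticNet` for the penalized Linear Regression operator, and `RandomForestRegressor` for Alpine Forest Regression; all parameter values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Stand-in data; the real inputs are the PCA / Correlation Filter outputs.
X, y = make_regression(n_samples=60, n_features=8, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Linear regression with elastic net penalties (L1 + L2) to keep the model simple.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train, y_train)

# Forest regression; ensembles of trees tend to resist overfitting.
forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)
```

Scoring both models on the held-out `X_test` (rather than on `X_train`) is what reveals overfitting in the next step.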
Next, we apply the trained models to the test sample, and then use the Regression Evaluator to compute standard statistics like R-squared, RMSE (root mean squared error), and MAPE (mean absolute percent error). Notice that, in the results below, the linear regression is very accurate on the training sample, but much less accurate than the Forest on the testing sample: a clear sign of overfitting. By contrast, the Alpine Forest model seems like something that can be used to make reasonable predictions.
Model results on the training sample:
Model results on the testing sample:
Key Technique: Avoid Overfitting
A model is overfit when it is very accurate on the dataset where it was trained, but much less accurate on a holdout sample. This may happen when a model is more complex than it needs to be, or when a dataset has too few observations (rows) compared to features (columns). Here are some ways to avoid overfitting:
- Hold out a test sample, and evaluate every model on it rather than only on the training data.
- Apply regularization, such as elastic net penalties, to simplify linear models.
- Reduce the number of input variables, using Variable Selection or dimensionality reduction techniques such as the Correlation Filter and PCA.
- Prefer ensemble methods such as forests, which are generally resistant to overfitting.
Model Scoring and Simulation
Second template workflow, for simulating changes in bank reserves based on changing economic conditions
This workflow illustrates how simulations may be performed with Spotfire Data Science workflow variables and Touchpoints. The idea is to simulate a scenario with fabricated data, substituting new values into historical data, and then to run the trained model over the modified data.
In the previous workflow, we found that the Alpine Forest model on the output of the Correlation Filter produced the most accurate predictions. In this workflow, we will apply this model to a simulation of the next quarter, with certain key economic indicators replaced by alternative values, so that we can predict the effect of a changing economy on the bank's reserves.
First, we select one of the models that we trained in the previous workflow, and we use the Load Model operator so that we can apply it via the Predictor operator. To run a simulation, we need to apply the model to one or more rows of simulated data. We could do this by adding an entirely new row of data representing the next quarter, but instead we will modify a few values of an existing row, namely the one representing the most recent quarter.
So we use Row Filter to select the last quarter, and then the Variable operator to substitute simulated values for three of the economic factors. We use workflow variables, as illustrated below, so that we can expose them in the Touchpoint:
Variable definitions with workflow variables for substituting simulated values into historical data
We then run the model against this single row of (partially) simulated data, and then merge it into the original dataset (with the last quarter removed). We end up with a dataset containing the historical and simulated values and a chart displaying the predicted and actual values over time.
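The filter-substitute-score-merge sequence above can be sketched in pandas. This is an illustrative stand-in for the Row Filter, Variable, Predictor, and merge operators: the column names, scenario values, and the use of a scikit-learn model are all invented for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy historical data; the column names are invented for this sketch.
hist = pd.DataFrame({
    "quarter": ["2023Q1", "2023Q2", "2023Q3", "2023Q4"],
    "gdp": [100.0, 102.0, 104.0, 106.0],
    "unemp": [5.0, 4.8, 4.6, 4.5],
    "reserves": [50.0, 52.0, 54.0, 55.0],
})
model = LinearRegression().fit(hist[["gdp", "unemp"]], hist["reserves"])

# "Workflow variables" that a Touchpoint would expose to the user.
scenario = {"gdp": 101.0, "unemp": 6.0}

# Row Filter: copy the most recent quarter; Variable: substitute scenario values.
sim = hist.tail(1).copy()
for col, value in scenario.items():
    sim[col] = value

# Predictor: score the single (partially) simulated row.
sim["reserves"] = model.predict(sim[["gdp", "unemp"]])

# Merge the simulated row back with the history (last quarter removed).
result = pd.concat([hist.iloc[:-1], sim], ignore_index=True)
```

The final `result` plays the role of the merged historical-plus-simulated dataset that feeds the chart.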
What-if Simulation (Touchpoint)
Touchpoint for simulating the effects of changing economic conditions on bank reserves
Exposing the inputs and outputs of the workflow above is now easy, using Touchpoints. We create a Touchpoint that maps the workflow variables to three visual input parameters, and which then displays the merged historical and simulated data and the chart.
The user may now enter a new scenario in the form of hypothetical values for the macroeconomic factors (e.g. U.S. GDP) and then hit 'Run' to generate the result set and the charts. The output is displayed as follows:
Bar chart showing historical and simulated values of bank reserves
Key Technique: Using Touchpoints, Models, and Workflow Variables for Running Simulations
Touchpoints make it very easy to run simulations using user-provided scenarios. If you have a model that makes predictions based on historical data, you can construct future-looking scenarios by copying historical data forward (under the assumption that the future will generally look like the past) and then modifying the historical data by adding trends and new values. These trends and values can be governed by user inputs: initially as workflow variables, and then exposed via the Touchpoint's user interface.
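The copy-forward-with-a-trend idea can be sketched as follows. This is a hypothetical example: the column names and the quarterly growth rate stand in for whatever factors and user inputs a real Touchpoint would expose.

```python
import pandas as pd

# Toy history; in practice this is the prepared quarterly dataset.
hist = pd.DataFrame({"quarter": [1, 2, 3, 4], "gdp": [100.0, 101.0, 102.5, 104.0]})

# Hypothetical user input (a workflow variable): +1% GDP growth per quarter.
gdp_growth = 0.01

# Copy the last observed quarter forward, compounding the user-supplied trend.
future = []
last = hist.iloc[-1]
for step in range(1, 4):
    future.append({
        "quarter": last["quarter"] + step,
        "gdp": last["gdp"] * (1 + gdp_growth) ** step,
    })
scenario = pd.concat([hist, pd.DataFrame(future)], ignore_index=True)
```

The trained model would then be scored over the `future` rows, exactly as the single-quarter simulation above scores its one substituted row.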
Check it out!
For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.