Data Requirements
We use public loan performance data from Lending Club to illustrate the concepts in this Playbook: roughly 42,000 loans originated between 2007 and 2011, with 56 features per loan (such as the applicant's length of employment, home ownership, and credit history) plus historical information on how the offered loans performed. Your dataset should contain similar data on loan applicants.
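For orientation, here is a minimal sketch of loading such a file with pandas; the file name LoanStats3a.csv and the skipped banner row are assumptions based on the commonly distributed public export, not part of the Playbook:

```python
import pandas as pd

# Load the 2007-2011 Lending Club loan file. The file name and the
# skipped banner row are assumptions about the public export.
loans = pd.read_csv("LoanStats3a.csv", skiprows=1, low_memory=False)
print(loans.shape)  # on the order of 42K rows
```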
Feature Engineering
First template workflow for data cleansing and feature engineering.
Our first step is to clean the dataset and create features that support our risk model. We filter out three main classes of data (a minimal code sketch follows the list):
- Loans where the delinquency status is unknown (for example, loans with little or no payment history)
- Features that are irrelevant to the creditworthiness of an application (for example, ID columns) or present in very few applications
- Features that are captured only after the loan is offered (for example, payment history)
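A minimal sketch of these three filters in pandas, assuming the public Lending Club column names (loan_status, id, member_id, and the post-origination payment fields); the exact status strings, column lists, and the 10% sparsity threshold are illustrative assumptions:

```python
# 1. Keep only loans whose delinquency status is known.
known_status = ["Fully Paid", "Charged Off", "Default",
                "Late (16-30 days)", "Late (31-120 days)"]
loans = loans[loans["loan_status"].isin(known_status)]

# 2. Drop identifier columns and columns populated in very few applications.
loans = loans.drop(columns=["id", "member_id", "url"], errors="ignore")
loans = loans.dropna(axis="columns", thresh=int(0.1 * len(loans)))

# 3. Drop features captured only after the loan is offered (leakage risk).
post_origination = ["total_pymnt", "total_rec_prncp", "total_rec_int",
                    "recoveries", "last_pymnt_d", "last_pymnt_amnt"]
loans = loans.drop(columns=post_origination, errors="ignore")
```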
Second, we create new features that support the modeling process. Most importantly, we create a binary classification label, is_bad, where all performing loans are assigned a 0 and all non-performing loans (whether late, in default, or charged off) are assigned a 1. In addition, we bucket continuous variables where appropriate, such as mapping employment duration into < 1 year, 1-10 years, and > 10 years buckets.
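A minimal sketch of the label and the employment-length buckets; the loan_status and emp_length value strings follow the public data and are assumptions about how the Playbook encodes them:

```python
import numpy as np
import pandas as pd

# Binary label: 1 for non-performing loans (late, in default, or charged off).
bad_status = ["Charged Off", "Default", "Late (16-30 days)", "Late (31-120 days)"]
loans["is_bad"] = loans["loan_status"].isin(bad_status).astype(int)

# Bucket employment duration into < 1 year, 1-10 years, and > 10 years.
def emp_bucket(raw):
    if pd.isna(raw):
        return np.nan
    if raw.startswith("< 1"):
        return "<1 year"
    if raw.startswith("10+"):
        return ">10 years"
    return "1-10 years"

loans["emp_length_bucket"] = loans["emp_length"].map(emp_bucket)
```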
Exploratory Visualization
Second template workflow for exploratory data visualization.
The second step is to visually explore the characteristics of your dataset. This flow performs various aggregations (by loan status, region, and home ownership, among others) and connects those aggregations to visualization operators that help surface correlations and trends in the data. Features that have a particularly strong impact on loan outcomes are good candidates for use in the modeling stage.
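As one illustrative example, using pandas and matplotlib rather than the Playbook's own visualization operators, here is a bad-loan rate aggregated by home ownership:

```python
import matplotlib.pyplot as plt

# Share of non-performing loans by home ownership status.
bad_rate = loans.groupby("home_ownership")["is_bad"].mean().sort_values()
ax = bad_rate.plot(kind="barh")
ax.set_xlabel("Share of non-performing loans")
plt.tight_layout()
plt.show()
```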
Modeling
Third template workflow for predictive modeling.
The final step is to build a classification model on loan features that predicts those loans where is_bad is 1 (loans likely not to perform). For this step, we copy the cleansed dataset into Hadoop for some final transformations (replacing nulls and outliers), then apply three classification techniques to our training sample, which we've resampled to over-represent bad loans.
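A minimal sketch of the resample-then-train step, with scikit-learn standing in for the Playbook's operators; since Alpine Forest is a random-forest-style learner, RandomForestClassifier is used here as an assumed analogue (only one of the three techniques is shown), and the 3x upsampling ratio is illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One-hot encode categoricals and crudely replace remaining nulls.
features = pd.get_dummies(loans.drop(columns=["is_bad", "loan_status"]))
features = features.fillna(features.median())

X_train, X_test, y_train, y_test = train_test_split(
    features, loans["is_bad"], test_size=0.3, random_state=0,
    stratify=loans["is_bad"])

# Over-represent bad loans in the training sample by upsampling with replacement.
train = X_train.assign(is_bad=y_train)
bad = train[train["is_bad"] == 1].sample(frac=3.0, replace=True, random_state=0)
train = pd.concat([train[train["is_bad"] == 0], bad])

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train.drop(columns=["is_bad"]), train["is_bad"])
```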
Key Technique - Correlation Filter
Several of the variables in the Lending Club dataset are highly correlated with one another, such as fico_range_low and fico_range_high, or loan_amnt and funded_amnt. The Correlation Filter operator automatically identifies groups of columns that are correlated with one another beyond a configurable threshold and filters out all but one column from each group. The retained column is the one most highly correlated with the dependent variable (is_bad in this case).
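A minimal sketch of the same idea in pandas; this is an illustrative reimplementation under assumed behavior (pairwise absolute correlation, a 0.9 threshold), not the Correlation Filter operator itself:

```python
def correlation_filter(df, target, threshold=0.9):
    """Drop all but one column from each correlated group, keeping the
    column most correlated with the target."""
    numeric = df.select_dtypes("number").drop(columns=[target])
    pairwise = numeric.corr().abs()
    with_target = numeric.corrwith(df[target]).abs()

    drop = set()
    for a in pairwise.columns:
        for b in pairwise.columns:
            if a < b and pairwise.loc[a, b] > threshold:
                # Of each correlated pair, drop the weaker predictor of the target.
                drop.add(a if with_target[a] < with_target[b] else b)
    return df.drop(columns=list(drop))

loans = correlation_filter(loans, target="is_bad")
```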
In our analysis of the Lending Club data, the Alpine Forest model was best at predicting bad loans, achieving 99% accuracy and 91% recall on the hold-out set. We provide a variety of other evaluation measures and export this high-performing model to the workspace as PMML.
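A minimal sketch of how such hold-out measures can be computed with scikit-learn (the figures above come from the Playbook itself; a PMML export would additionally require a library such as sklearn2pmml, which is an assumption here):

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("recall:  ", recall_score(y_test, pred))  # share of bad loans caught
print("AUC:     ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```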
Check It Out!
For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.