    Credit Default Risk


    This article demonstrates how to build a classification model for predicting credit default risk in Spotfire Data Science - Team Studio. Risk models are useful for determining the creditworthiness of individual loan applicants and for adjusting the offered interest rate or denying the loan accordingly.

    Data Requirements

    We use public loan performance data from Lending Club to illustrate the concepts in this Playbook. The dataset covers 42K loans originated between 2007 and 2011, with 56 features per loan, such as the applicant's length of employment, home ownership, and credit history, as well as historical information on how the offered loans have performed. Your dataset should contain similar data on loan applicants.

    Feature Engineering

    Figure: First template workflow for data cleansing and feature engineering.

    Our first step is to clean the dataset and create features that support our risk model. We filter out three main classes of data (a brief code sketch of this filtering follows the list):

    1. Loans where the delinquency status is unknown (for example, loans with little or no payment history)

    2. Features that are irrelevant to the creditworthiness of an applicant (for example, ID columns) or that are present in very few applications

    3. Features that are captured after the loan is offered (for example, payment history)
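
    In Team Studio these filters are implemented with workflow operators. Purely as an illustration, the pandas sketch below applies the same three filters; the file name and column names (loan_status, id, member_id, total_pymnt, and so on) are assumptions about the Lending Club export, not part of the Playbook.

    ```python
    import pandas as pd

    # Minimal sketch of the three filtering steps; file and column names are assumptions.
    loans = pd.read_csv("lending_club_2007_2011.csv", low_memory=False)

    # 1. Drop loans whose delinquency status is unknown (assumed to be a blank loan_status).
    loans = loans[loans["loan_status"].notna() & (loans["loan_status"].str.strip() != "")]

    # 2. Drop ID-style columns and columns populated in very few applications.
    loans = loans.drop(columns=["id", "member_id", "url"], errors="ignore")
    loans = loans.dropna(axis=1, thresh=int(0.5 * len(loans)))  # keep columns at least half filled

    # 3. Drop features that are only captured after the loan is offered (payment history).
    post_origination = ["total_pymnt", "total_rec_prncp", "total_rec_int", "recoveries"]
    loans = loans.drop(columns=post_origination, errors="ignore")
    ```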

    Second, we create new features in the dataset that support the modeling process. Most importantly, we create a binary classification label, is_bad, where all performing loans are assigned a 0 and all non-performing loans (whether late, in default, or charged off) are assigned a 1. In addition, we bucket continuous variables where appropriate, such as mapping employment duration to < 1 year, 1-10 years, and > 10 years buckets.
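
    A rough sketch of that labeling and bucketing logic, continuing the pandas example above. The status strings and the numeric emp_length_years column are assumptions about how the raw data is coded.

    ```python
    import pandas as pd

    # Assign the binary label: performing loans -> 0, non-performing loans -> 1.
    # The exact status strings are assumptions about the Lending Club export.
    bad_statuses = {"Late (31-120 days)", "Default", "Charged Off"}
    loans["is_bad"] = loans["loan_status"].isin(bad_statuses).astype(int)

    def bucket_employment(years):
        """Map a numeric employment duration onto the three buckets used in the Playbook."""
        if pd.isna(years):
            return None
        if years < 1:
            return "< 1 year"
        if years <= 10:
            return "1-10 years"
        return "> 10 years"

    # emp_length_years is a hypothetical numeric column derived from the raw employment field.
    loans["emp_length_bucket"] = loans["emp_length_years"].apply(bucket_employment)
    ```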

    Exploratory Visualization

    Figure: Second template for exploratory data visualization.

    The second step is to visually explore the characteristics of your dataset. This flow performs various aggregations (by loan status, region, and home ownership, among others) and connects those aggregations to visualization operators that help surface correlations and trends in the data. Features with a particularly strong impact on loan outcomes are good candidates for the modeling stage.
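
    As a rough stand-in for those aggregation and visualization operators, the sketch below groups the cleansed data by home ownership and by state and plots the share of bad loans in each group; home_ownership and addr_state are assumed column names.

    ```python
    import matplotlib.pyplot as plt

    # Share of bad loans by home ownership and by state; column names are assumptions.
    bad_rate_by_home = loans.groupby("home_ownership")["is_bad"].mean()
    bad_rate_by_state = loans.groupby("addr_state")["is_bad"].mean().sort_values(ascending=False)

    bad_rate_by_home.plot(kind="bar", title="Fraction of bad loans by home ownership")
    plt.ylabel("Mean of is_bad")
    plt.tight_layout()
    plt.show()

    print(bad_rate_by_state.head(10))  # states with the highest bad-loan rates
    ```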

    Modeling

    Figure: Third template for predictive modeling.

    The final step is to build a classification model on loan features that predicts those loans where is_bad is 1 (loans likely not to perform). For this step, we copy the cleansed dataset into Hadoop for some final transformations (replacing nulls and outliers), then apply three classification techniques to our training sample, which we've resampled to over-represent bad loans.
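
    The Playbook performs these steps with Team Studio operators running on Hadoop. As a purely illustrative stand-in, the scikit-learn sketch below fills nulls, over-samples bad loans in the training split, and fits three common classifiers; the particular algorithms chosen here are an assumption, not the Playbook's exact operators.

    ```python
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Simple null handling on numeric features; the Playbook also treats outliers.
    features = loans.drop(columns=["is_bad"]).select_dtypes("number").fillna(0)
    labels = loans["is_bad"]
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=42
    )

    # Over-represent bad loans by sampling minority-class rows with replacement.
    train = X_train.assign(is_bad=y_train.values)
    bad = train[train["is_bad"] == 1]
    balanced = pd.concat([train, bad.sample(n=len(train) // 2, replace=True, random_state=42)])
    X_bal, y_bal = balanced.drop(columns=["is_bad"]), balanced["is_bad"]

    # Three classification techniques applied to the resampled training sample.
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "gradient_boosting": GradientBoostingClassifier(random_state=42),
    }
    for name, model in models.items():
        model.fit(X_bal, y_bal)
    ```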

    Key Technique - Correlation Filter

    Several of the variables in the Lending Club dataset are highly correlated with one another, such as fico_range_low and fico_range_high, or loan_amnt and funded_amnt. The Correlation Filter operator automatically identifies groups of columns that are correlated with one another beyond a configurable threshold and filters out all but one; the column kept is the one most correlated with the dependent variable (is_bad in this case).
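
    A toy version of that behaviour, written against the pandas frame from the earlier sketches; the helper name and threshold are ours, not the operator's configuration.

    ```python
    def correlation_filter(df, target="is_bad", threshold=0.9):
        """For each pair of numeric columns whose absolute correlation exceeds the
        threshold, drop whichever column is less correlated with the target."""
        numeric = df.select_dtypes("number").drop(columns=[target])
        pairwise = numeric.corr().abs()
        with_target = numeric.corrwith(df[target]).abs()

        to_drop = set()
        columns = list(pairwise.columns)
        for i, col_a in enumerate(columns):
            for col_b in columns[i + 1:]:
                if pairwise.loc[col_a, col_b] > threshold:
                    to_drop.add(col_a if with_target[col_a] < with_target[col_b] else col_b)
        return df.drop(columns=sorted(to_drop))

    # Example: collapse near-duplicate columns such as fico_range_low / fico_range_high.
    filtered = correlation_filter(loans, target="is_bad", threshold=0.9)
    ```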

    In our analysis of the Lending Club data, the Alpine Forest model was best at predicting bad loans, with 99% accuracy and 91% recall on the hold-out set. We provide a variety of other evaluation measures and export this high-performing model to the workspace as PMML.
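
    For comparison, hold-out accuracy and recall for the illustrative scikit-learn models above can be computed as follows; these stand-ins will not reproduce the Playbook's Alpine Forest numbers.

    ```python
    from sklearn.metrics import accuracy_score, recall_score

    # Evaluate each illustrative model on the hold-out split from the earlier sketch.
    for name, model in models.items():
        preds = model.predict(X_test)
        print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
              f"recall={recall_score(y_test, preds):.3f}")
    ```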

    Check It Out!

    For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.



