  • Airline Delay Prediction


    Use Case Overview

    This article demonstrates how to build a what-if scenario Touchpoint in Spotfire Data Science - Team Studio for predicting airline delays. Given flight details, including weather, origin and destination airports, and carrier, the model predicts the likelihood that a flight is delayed by more than 15 minutes.

    Data Requirements

    We use actual airline departure and arrival data from 2007-2008, along with metadata on airports, carriers, and airplanes. In addition, we associate historical weather data with each flight at both the origin and destination airports.

    Initial Transformations and Joins

    Figure: First template workflow for joining the disparate tables and applying basic transformations.

    Our first workflow contains basic transformations for the flight metadata. We remove diverted and canceled flights, calculate plane age at the time of each flight, and cleanse nulls. We also perform some exploratory data visualization of delay counts by flight distance, carrier, and day of week.
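
    The transformations above are built with Team Studio operators, not hand-written code. As a rough illustration of the same logic outside the product, here is a minimal Python sketch. The column names (`diverted`, `cancelled`, `tail_num`, `dep_delay`), the sample rows, and the policy of dropping rows with nulls are all assumptions made for this example, not the playbook's actual schema or settings.

```python
from datetime import date

# Hypothetical sample rows; the real playbook schema is not shown in this article.
flights = [
    {"flight_date": date(2008, 3, 14), "tail_num": "N101", "diverted": 0, "cancelled": 0, "dep_delay": 22},
    {"flight_date": date(2008, 3, 14), "tail_num": "N202", "diverted": 1, "cancelled": 0, "dep_delay": None},
    {"flight_date": date(2007, 7, 2), "tail_num": "N303", "diverted": 0, "cancelled": 1, "dep_delay": None},
    {"flight_date": date(2007, 7, 2), "tail_num": "N404", "diverted": 0, "cancelled": 0, "dep_delay": None},
]

# Hypothetical airplane metadata: tail number -> year built (N404 is unknown).
planes = {"N101": 1998, "N202": 2001, "N303": 1995}


def prepare_flights(flights, planes):
    """Drop diverted/canceled flights, add plane age, and drop rows with nulls."""
    cleaned = []
    for row in flights:
        if row["diverted"] or row["cancelled"]:
            continue
        year_built = planes.get(row["tail_num"])
        if year_built is None or row["dep_delay"] is None:
            continue  # assumed null policy: drop incomplete rows
        out = dict(row)
        # Plane age at the time of the flight, in whole years.
        out["plane_age"] = row["flight_date"].year - year_built
        cleaned.append(out)
    return cleaned


cleaned = prepare_flights(flights, planes)
```

    Only the first sample row survives here: the others are diverted, canceled, or missing a value.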

    Key Technique - Copy to Hadoop/Database

    The data for this playbook live in both database tables (metadata) and HDFS files (individual flight data). To arrive at a final feature set for modeling, we need to join across these distinct data sources, and the Copy operators make that possible. Because we want the final dataset to live on HDFS for Spark-based modeling, we use the Copy to Hadoop operator to move the database data into Hadoop before joining. For the airplane data, we perform some ETL before the copy, which is a good general practice: do the ETL in the home data source, copy the transformed data to the destination data source, then perform joins and modeling in the destination.

    Adding Weather Context

    Figure: Second template for adding in historical weather data.

    The second step is to bring in weather data from NOAA over the same time period as our flight dataset. These data have a few issues that we remedy with transformation operators:

    • Sky condition has a large number of distinct values that we simplify to overcast and clear.
    • Visibility is measured from 0 - 10 but there are occasional outliers measured on a 0 - 100 scale.
    • Temperature and precipitation data are sometimes missing.
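
    The three fixes listed above could look roughly like the following Python sketch. The specific rules are assumptions for illustration: the article does not say which sky codes map to overcast, how the 0 - 100 visibility values are rescaled, or how missing values are filled. Here, codes starting with OVC or BKN are treated as overcast, visibility above 10 is divided by 10, and missing values are replaced with the mean of the observed values.

```python
def simplify_sky(condition):
    """Collapse a raw sky-condition code to 'overcast' or 'clear' (assumed mapping)."""
    if condition is None:
        return None
    # Assumption: overcast (OVC) and broken (BKN) layers count as overcast.
    if condition.upper().startswith(("OVC", "BKN")):
        return "overcast"
    return "clear"


def fix_visibility(value):
    """Rescale readings that look like they are on a 0-100 scale to 0-10."""
    if value is None:
        return None
    # Assumption: anything above 10 was recorded on the 0-100 scale.
    return value / 10.0 if value > 10 else value


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


sky = [simplify_sky(c) for c in ["OVC015", "CLR", "SCT040", "BKN020"]]
visibility = [fix_visibility(v) for v in [10, 70, 3.5, None]]
temperature = fill_missing([10.0, None, 20.0])
```

    Note that a threshold rule cannot catch a 0 - 100 reading that happens to fall at or below 10, so the rescaling step is only a heuristic.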

    Modeling

    Figure: Modeling flow split to create models that work both with and without weather data.

    In the modeling flow, we normalize all the final numeric features (such as precipitation and plane age) to a 0 - 1 scale, and perform k-means clustering to add a cluster membership variable for use by the classification algorithms (flights in natural clusters likely experience similar delays). We then split into with- and without-weather data modeling branches. The models using the weather data are significantly more accurate, but of course those data are not always available.
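
    To make the normalization and clustering step concrete, here is a small self-contained Python sketch. It is not the Team Studio implementation: it uses a plain version of Lloyd's k-means with a deterministic start (the first k points), and the feature values and the choice of k = 2 are invented for the example.

```python
def min_max_scale(column):
    """Scale a numeric column linearly so its minimum is 0 and its maximum is 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical feature values for four flights.
precipitation = [0.0, 0.1, 2.0, 1.8]
plane_age = [2, 3, 20, 18]

scaled_precip = min_max_scale(precipitation)
scaled_age = min_max_scale(plane_age)
points = list(zip(scaled_precip, scaled_age))

# The cluster label becomes an extra feature for the classifiers.
cluster = kmeans(points, k=2)
```

    Scaling first matters because k-means relies on distances: without it, plane age in years would dominate precipitation simply because its raw values are larger.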

    What-if Delay Prediction Touchpoint

    Figure: Touchpoint for predicting flight delays based on flight and environment characteristics.

    The delay prediction models created above drive a Touchpoint that lets users enter flight and weather details and receive a delay prediction, plus a bar chart showing how similar flights performed in the historical dataset.
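
    The historical comparison behind the bar chart can be pictured as a simple filter and count. The Python sketch below is an assumption about how such a summary might be computed, not the Touchpoint's actual logic: it treats flights as "similar" when they share carrier, origin, and destination, and the `delayed` flag is assumed to mean a delay of more than 15 minutes. The sample records are invented.

```python
from collections import Counter

# Hypothetical historical records; "delayed" means delayed by more than 15 minutes.
history = [
    {"carrier": "AA", "origin": "ORD", "dest": "LGA", "delayed": True},
    {"carrier": "AA", "origin": "ORD", "dest": "LGA", "delayed": False},
    {"carrier": "AA", "origin": "ORD", "dest": "LGA", "delayed": True},
    {"carrier": "UA", "origin": "SFO", "dest": "DEN", "delayed": False},
]


def similar_flight_summary(history, carrier, origin, dest):
    """Count delayed vs. on-time outcomes for historical flights matching the inputs."""
    counts = Counter(
        "delayed" if row["delayed"] else "on_time"
        for row in history
        if (row["carrier"], row["origin"], row["dest"]) == (carrier, origin, dest)
    )
    total = sum(counts.values())
    return {
        "delayed": counts["delayed"],
        "on_time": counts["on_time"],
        # Share of similar flights that were delayed; None if there are no matches.
        "delayed_share": counts["delayed"] / total if total else None,
    }


summary = similar_flight_summary(history, "AA", "ORD", "LGA")
```

    The two counts map directly onto the bars of a chart, while the share gives a quick sanity check against the model's predicted likelihood.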

    Check It Out!

    For access to this Playbook, including its workflows, sample data, a PowerPoint summary, and expert support from Spotfire® Data Science data scientists, contact your Spotfire® Data Science sales representative.

