Template for using XGBoost in Spotfire® 1.0.0


Summary

Extreme Gradient Boosting, or XGBoost, is a supervised machine-learning algorithm used to predict a target variable Y from a set of features Xi.

Overview

Extreme Gradient Boosting, or XGBoost, is a supervised machine-learning algorithm used to predict a target variable Y from a set of features Xi. It combines several weak learners into a strong learner to produce a more accurate and generalizable ML model. This template can be used to build a regression, binary classification or multi-class classification model, and then to use that model to make predictions on a new dataset.
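As a rough illustration of how the three problem types differ, the sketch below maps a response column to an XGBoost objective string. The objective names are standard XGBoost parameter values, but the helper `choose_objective` and its selection rule are hypothetical and are not taken from the template's data function.

```python
# Hypothetical helper: maps a response column to an XGBoost objective string.
# The selection rule is an assumption for illustration, not the template's
# documented behavior.

def choose_objective(response_values):
    """Pick an objective from the values of the response column."""
    # Numeric (non-boolean) responses are treated as regression targets.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in response_values):
        return "reg:squarederror"
    # Otherwise the response is categorical: count its distinct levels.
    n_classes = len(set(response_values))
    if n_classes < 2:
        raise ValueError("a categorical response needs at least two levels")
    if n_classes == 2:
        return "binary:logistic"
    return "multi:softprob"


print(choose_objective([10.5, 12.0, 9.8]))      # numeric response
print(choose_objective(["yes", "no", "yes"]))   # two-level response
print(choose_objective(["a", "b", "c", "a"]))   # multi-level response
```

Under this rule, class labels stored as the numbers 0 and 1 would be treated as a regression target, which shows why the data type of the response variable matters.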

XGBoost belongs to the same class of ML algorithms as the Gradient Boosting Machine (GBM). Both use an iterative technique known as boosting, which builds decision trees one after another, with each new tree focusing on the data points that the previous trees predicted poorly.

While both models follow the same principle, XGBoost improves on traditional GBM in the following ways:

  • Regularization: XGBoost applies regularization while training the model. This controls over-fitting, which would otherwise lead to poor predictions on unseen data.
  • Performance: XGBoost is faster because it supports multi-core (parallel) processing.
  • Handling sparse data sets: XGBoost handles missing values in the data better.
  • Cache optimization: Data structures and algorithms are cache-optimized to make the best use of the system's hardware resources.
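The effect of the regularization point can be sketched with one formula from the published XGBoost formulation: with an L2 penalty lambda, the optimal value of a tree leaf is -G / (H + lambda), where G and H are the sums of the loss gradients and Hessians of the points in that leaf. The snippet below assumes the squared-error loss 0.5 * (y - prediction)^2, for which this reduces to sum(residuals) / (n + lambda). The function name `leaf_weight` is illustrative only.

```python
def leaf_weight(residuals, reg_lambda):
    """Optimal leaf value for squared-error loss with an L2 penalty.

    Assumes loss = 0.5 * (y - prediction)**2, so each point has gradient
    g = prediction - y (the negative residual) and Hessian h = 1.
    Then -G / (H + lambda) equals sum(residuals) / (n + lambda).
    """
    return sum(residuals) / (len(residuals) + reg_lambda)


residuals = [4.0, 6.0]
print(leaf_weight(residuals, reg_lambda=0.0))  # 5.0, the plain mean
print(leaf_weight(residuals, reg_lambda=2.0))  # 2.5, shrunk toward zero
```

A larger lambda shrinks every leaf value toward zero, so no single tree can fit the training data too aggressively.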

 

Spotfire Template for XGBoost

Spotfire's XGBoost template supports both training an advanced ML model and using it to make predictions on unseen data. The template offers the following features:

  • Ease of navigation: Icon-based navigation on each page of the template walks users through all the steps required before building an XGBoost model.
  • Grid search capability: Users can specify multiple comma-separated values for each tuning parameter. The data function behind the template builds a grid of all combinations of these values and selects the best-performing model from that grid.
  • One-hot encoding: The XGBoost data function performs one-hot encoding on the training dataset to create new columns from factor variables. Note that a factor variable with many levels will considerably slow down model training.
  • Regression/Classification/Multi-class classification capability: The template is designed to solve these three types of problems. Users must specify the correct data type for the response variable before creating the model.
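The grid search and one-hot encoding steps in the list above can be sketched in a few lines. This is a minimal illustration, not the template's actual data function: the helper names `parse_values`, `build_grid` and `one_hot` are hypothetical, and the parameter names `eta` and `max_depth` are used only as familiar XGBoost examples.

```python
from itertools import product


def parse_values(text):
    """Split a comma-separated entry such as "0.1, 0.3" into numbers."""
    values = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        try:
            values.append(int(part))      # keep whole numbers as ints
        except ValueError:
            values.append(float(part))
    return values


def build_grid(user_input):
    """Return one dict per combination of tuning-parameter values."""
    names = list(user_input)
    value_lists = [parse_values(user_input[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


grid = build_grid({"eta": "0.1, 0.3", "max_depth": "3, 6, 9"})
print(len(grid))  # 2 * 3 = 6 combinations

rows = [{"color": "red", "x": 1}, {"color": "blue", "x": 2}]
print(one_hot(rows, "color"))
```

The grid size is the product of the number of values per parameter, and one-hot encoding adds one column per factor level, so both can grow quickly.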

The goal of a supervised learning algorithm is to accurately predict a label y from patterns in the other features Xi of the data set. XGBoost constructs several decision trees iteratively for this purpose. First, a single decision tree is built and used to predict the label. The next decision tree then focuses on the data points that the first tree predicted poorly. This process continues until the user-specified number of trees is reached.

The focus on poorly predicted points comes from fitting each new tree to the errors left by the trees built so far (for squared-error loss, the residuals), so the points with the largest errors have the most influence on the next tree. The final prediction combines the outputs of all the trees, which is why such a model is also called an ensemble model.
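The iterative process described above can be sketched as a toy gradient-boosting loop. This is a simplified illustration under stated assumptions, not XGBoost's implementation: it uses squared-error loss, one-split trees (stumps) on a single feature, no regularization, and a hypothetical `learning_rate` that scales each tree's contribution.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Build n_trees stumps, each fit to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean of the target
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        # Residuals are largest where the current ensemble is most wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.1, 6.0, 6.2]
one_tree = boost(xs, ys, n_trees=1)
many_trees = boost(xs, ys, n_trees=20)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

On this small data set the training error falls as trees are added, because each new stump corrects part of what the earlier ones missed.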

 

Advantages

  • Produces highly accurate models by combining multiple decision trees with regularization
  • Fast model training compared with traditional GBM, thanks to parallel processing
  • Users can specify custom optimization objectives and evaluation criteria

Disadvantages

  • Outliers in the data set can affect model quality
  • Trees are built sequentially rather than independently, so training can take longer than for methods whose trees can all be built at the same time

 

General Market Landscape

Huge amounts of data are created every minute in the world today. Companies around the world have adopted different techniques to extract value out of this data. Some of these techniques are:

  • Visual Discovery - Companies employ teams of data analysts to extract the data available in their databases, build meaningful dashboards and surface useful patterns that help in making informed business decisions and formulating long-term business strategy. This process, while useful, is prone to human error and consumes a lot of time.
  • Supervised Learning - Supervised learning takes a more refined approach to finding meaningful patterns. In this process, labelled data is used to train a machine-learning model that predicts either a numerical target column (for example, customer LTV) or a categorical column (for example, the product a customer is likely to purchase). Well-known supervised learning algorithms include Random Forest, Gradient Boosting models and Extreme Gradient Boosting (XGBoost).
  • Unsupervised Learning - Unsupervised learning uses unlabeled data (data with no target column to predict) to find data points that are similar to each other. Popular unsupervised learning algorithms include K-means clustering and autoencoders.

 

 

Release P1.1

Published: July 2017

  • Added ROC curve visualization to evaluate binary responses

 

Release P1.0

Published: May 2017

Initial Release

