XGBoost Machine Learning Template for Spotfire®


    Download this template from the Exchange

    Watch a demo of this template

    XGBoost

Extreme Gradient Boosting (XGBoost) is a supervised machine-learning algorithm used to predict a target variable 'y' given a set of features Xi. It combines several weak learners into a strong learner, producing a more accurate and generalizable ML model.

    General Market Landscape

Huge amounts of data are created every minute in the world today, and companies have adopted different techniques to extract value from this data. Some of these techniques are:

    Visual Discovery

Companies employ teams of data analysts to extract the data available in their databases, build meaningful dashboards, and surface useful patterns that help in making informed business decisions and formulating long-term business strategies. This process, while useful, is prone to human error and consumes a lot of time.

    Supervised Learning

Supervised learning takes a more refined approach to finding meaningful patterns. In this process, labeled data is used with a machine-learning model to predict a numerical target column (e.g., customer lifetime value) or a categorical column (e.g., the product a customer is most likely to purchase). Popular supervised learning algorithms include Random Forest, Gradient Boosting Machines (GBM), and Extreme Gradient Boosting (XGBoost).

    Unsupervised Learning

Unsupervised learning is the process of taking unlabeled data (data with no target column to predict) and finding data points that are similar to each other. Popular unsupervised learning algorithms include k-means clustering and autoencoders.

    XGBoost Algorithm

XGBoost falls in the same class of ML algorithms as the Gradient Boosting Machine (GBM). Both models use an iterative technique known as boosting, which builds a sequence of decision trees, each one focusing on correctly predicting the data points that the previous tree got wrong.

While both models follow the same principle, XGBoost improves on the traditional GBM in four main ways (the sketch after this list shows where the corresponding parameters live):

• Regularization: XGBoost applies regularization while training, controlling over-fitting that could otherwise lead to poor predictions on unseen data.
• Performance: XGBoost trains faster because it supports multi-core (parallel) processing.
• Handling sparse data sets: XGBoost handles missing values in the data natively.
• Cache optimization: its data structures and algorithms are cache-optimized to make the best use of the system's hardware resources.
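A minimal sketch of these knobs, assuming the Python xgboost package for illustration (the Spotfire template itself drives the library through a data function that is not shown here); all parameter values and the toy data are illustrative only:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + rng.normal(size=500)
X[rng.random(X.shape) < 0.1] = np.nan   # sparse cells; XGBoost routes NaN down a learned default branch

dtrain = xgb.DMatrix(X, label=y)        # NaN is the default missing-value marker
params = {
    "objective": "reg:squarederror",
    "lambda": 1.0,     # L2 regularization on leaf weights (limits over-fitting)
    "alpha": 0.0,      # L1 regularization on leaf weights
    "nthread": 4,      # parallel (multi-core) tree construction
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```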

    Explained

The goal of a supervised learning algorithm is to accurately predict a label 'y' based on the patterns in the other features Xi of the data set. XGBoost constructs several decision trees iteratively for this purpose. Initially, a decision tree is constructed and the label is predicted. The next decision tree then focuses on correctly predicting the data points that were predicted incorrectly by the first tree. This process continues until the user-specified number of trees is reached.

The focus on wrong predictions is maintained by weighting those data points so that the next tree puts more emphasis on predicting them correctly. A model built this way is called an ensemble model.
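The boosting idea described above can be illustrated with a simplified, from-scratch sketch in which each regression tree is fit to the current residuals of the ensemble. Note this is an illustration, not the actual algorithm: XGBoost itself uses gradients and Hessians with regularized tree construction rather than raw residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

n_trees, learning_rate = 50, 0.1
prediction = np.zeros_like(y)
trees = []
for _ in range(n_trees):
    residual = y - prediction                      # where the ensemble is still wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # each tree corrects the previous errors
```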

    Advantages 

• Produces highly accurate models as a result of combining multiple decision trees with regularization
• Very good model training performance
• Users can specify custom optimization objectives and evaluation criteria (see the sketch after this list)
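As an example of the last point, here is a sketch of the custom-objective hook in the Python xgboost package: an objective is any function that returns the gradient and Hessian of the loss with respect to the predictions. The squared-log-error derivatives below are adapted from the library's documentation; the data is illustrative.

```python
import numpy as np
import xgboost as xgb

def squared_log_error(preds, dtrain):
    # Gradient and Hessian of 1/2 * (log1p(pred) - log1p(label))^2,
    # following the custom-objective example in the xgboost docs.
    y = dtrain.get_label()
    preds = np.maximum(preds, 1e-6)  # keep log1p well-defined
    grad = (np.log1p(preds) - np.log1p(y)) / (preds + 1)
    hess = (-np.log1p(preds) + np.log1p(y) + 1) / (preds + 1) ** 2
    return grad, hess

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 3))
y = X.sum(axis=1)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=20,
                    obj=squared_log_error)
```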

    Disadvantages 

• Outliers in the data set can affect model quality
• Longer training time, since trees are built iteratively

    Spotfire Template for XGBoost

TIBCO Spotfire's XGBoost template provides significant capabilities for training an advanced ML model and predicting on unseen data. The XGBoost template offers the following features:

Ease of navigation: Icon-based navigation on each page of the template walks users through all the steps necessary to build an XGBoost model.

Grid search capability: The template allows users to specify multiple comma-separated values for each tuning parameter. The data function in the back end builds a grid of all combinations of the tuning parameters and finds the best-performing model.
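The template's data function itself is not shown here; a rough Python equivalent of that grid search, assuming the xgboost package, might look like the following (grid values and data are illustrative):

```python
from itertools import product

import numpy as np
import xgboost as xgb

# Comma-separated values a user might enter in the template (illustrative):
grid = {"max_depth": [3, 6], "eta": [0.1, 0.3], "subsample": [0.8, 1.0]}

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

best = None
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo), objective="binary:logistic")
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=3,
                metrics="logloss", seed=0)
    score = cv["test-logloss-mean"].iloc[-1]
    if best is None or score < best[0]:
        best = (score, params)
print("best parameters:", best[1])
```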

One-hot encoding: The XGBoost data function performs one-hot encoding on the training data set, creating new columns from factor variables. Please note: a factor variable with many levels will considerably slow down model training.
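For illustration, the equivalent step in Python with pandas (the template performs this inside its own data function; the data frame below is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "south", "east"],
                   "spend": [120.0, 80.5, 99.9, 42.0]})
encoded = pd.get_dummies(df, columns=["region"])
# Columns now: spend, region_east, region_south, region_west.
# A factor with many levels yields many new columns, which slows training.
```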

Regression/classification/multi-class classification capability: The template is designed to solve these three types of problems. Users must specify the correct data type for the response variable before model creation.
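For reference, the corresponding objective settings in the Python xgboost package would look roughly like this (the template configures the equivalent internally based on the response variable's data type):

```python
# Illustrative objective settings for each problem type:
regression = {"objective": "reg:squarederror"}                 # numeric response
binary     = {"objective": "binary:logistic"}                  # two-class response
multiclass = {"objective": "multi:softprob", "num_class": 3}   # k-class response
```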


