Overview
This data function fits regression models with a numeric response. Predictors can be categorical or continuous. The function provides automatic handling of nonlinear relationships and variable interactions, high prediction accuracy, and automatic variable selection.
Purpose: This function makes the CRAN gbm package available within the Spotfire interface. It focuses on regression; GBM can also perform classification, but that is not addressed in this release.
GBM stands for Gradient Boosting Machine. It is a well-known machine learning technique with a number of advantages:
- Automatic handling of nonlinear relationships
- Automatic handling of variable interactions
- High prediction accuracy
- Automatic variable selection
The CRAN implementation also handles missing data automatically (although in some cases, preprocessing missing values can improve results).
There are 3 files:
- GBM Regression for TIBCO Spotfire Vx.sfd
- GBM Regression for TIBCO Spotfire Vx.dxp
- This README file
Installing the gbm package in Spotfire:
The package installs as usual from CRAN. There is one prerequisite: a Java installation, with the JAVA_HOME environment variable set.
If you are running locally (without Stat Services), you can install via the Spotfire Tools/TERR Tools interface, or use the traditional install.packages("gbm") from the TERR command line. If you are using Stat Services, the latter method, run from TERR's command line on the server, is the only way to install gbm.
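For the command-line route, the install is the standard CRAN one-liner, sketched below (run it in the TERR console; on Stat Services, run it on the server):

```r
# Install the CRAN gbm package from the TERR command line.
install.packages("gbm")

# Verify the installation by loading the package.
library(gbm)
```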
Installing the data function into a dxp containing your data: GBM Regression for TIBCO Spotfire Vx.sfd can be imported into a dxp directly using Tools/Register Data Functions or Insert/Data Function/From File. Note that most parameters are optional and will be assigned reasonable default values, which should make it easy to add this function to your own dxps. The provided dxp shows the use of most of the parameters; it can be used for analysis, or as a model for creating your own dxps.
Using the gbmRegression.dxp file with your own data:
The primary purpose of the .dxp file is to provide an example illustrating how the embedded data function can be wired up to your data in your own .dxp. It is not intended to provide a complete analysis solution, but you can replace the embedded data with your own using the following procedure:
- The input in the dxp is the UserTable. You can start with the provided dxp and simply replace the UserTable with your own data.
- Go to the Variable Selection tab and select Predictors and Response.
- Go to the GBM tab, enter Configuration Parameters, then click the Go button.
Model Inputs:

Name                 Description                                       Type    Required  Data Types
predictors.df        Model Predictor columns                           Table   Yes       Integer, Real, SingleReal, String, Date
response.df          Response Column                                   Column  Yes       Real
model.name           Enter a name                                      Value   No        String
n.trees              Number of trees                                   Value   No        Integer
holdout.sample.size  Number of holdout rows                            Value   No        Integer
n.minobsinnode       Minimum observations in tree nodes                Value   No        Integer
bag.fraction         Value between 0 and 1                             Value   No        Real
model.path           Directory path to store the model                 Value   No        String
learning.rate        Value between 0 and 1                             Value   No        Real
interaction.depth    1 = no interactions; 2 = interactions between     Value   Yes       Integer
                     at most 2 variables; 3 = interactions between
                     at most 3 variables
Notes
- Select one or more continuous or categorical predictor variables.
- Select one outcome variable. NOTE: the dxp assumes that a 0/1 outcome should be analyzed as binomial, and that other numeric distributions indicate the response should be analyzed as Gaussian. Other choices can be made available by modifying the code.
- Fill in the name of the model you want to create. If you reuse a name, the older version will be overwritten.
- Number of Trees to Build: GBM is an ensemble model which builds many tree models in sequence; each one attempts to improve the fit by analyzing the residuals of the trees built so far. Too few trees leave out details that could improve the fit to the training data (underfitting). Too many trees can overfit, which results in less accuracy when extrapolating to new data (the holdout sample). Typically we want the best results on a holdout sample, and will tune the number of trees accordingly.
- Holdout Sample Size: If there is sufficient data, it is good to use 20-50% of your data for tuning the model. Specify the number of rows here. If you specify zero, gbm will use the unused data from each step to estimate the prediction error (the OOB estimate).
- Interaction Depth: Specify the depth of interactions to search for:
  1 = no interactions
  2 = interactions between at most 2 variables
  3 = interactions between at most 3 variables
- Minimum Observations in Tree Nodes: Terminal nodes must contain at least this many observations; otherwise they are disqualified from inclusion in the model.
- Bag Fraction: To add variability, each step uses just this proportion of the data. Smaller values build trees faster but generally require more trees. Value between 0 and 1.
- Model Path: where to store the completed model on disk.
- Learning Rate: Prevent overshooting the right values by approaching them more slowly with a lower learning rate, or speed things up with a higher learning rate. Value between 0 and 1.
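As a rough sketch of how these inputs map onto the underlying gbm call that the data function wraps (the data frame mydata, the response column y, and the specific values shown are hypothetical, not the data function's actual defaults):

```r
library(gbm)

# Hypothetical training data 'mydata' with a numeric response 'y'.
fit <- gbm(
  y ~ .,                          # response and predictors (Variable Selection tab)
  data              = mydata,
  distribution      = "gaussian", # numeric response; "bernoulli" for 0/1 outcomes
  n.trees           = 500,        # Number of Trees to Build
  interaction.depth = 2,          # Interaction Depth
  n.minobsinnode    = 10,         # Minimum Observations in Tree Nodes
  bag.fraction      = 0.5,        # Bag Fraction
  shrinkage         = 0.01,       # Learning Rate
  train.fraction    = 0.8         # 1 - (holdout rows / total rows)
)
```

Note that gbm expresses the holdout as a training fraction (train.fraction), so a holdout sample size of N rows corresponds to train.fraction = 1 - N/nrow(mydata).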
Model Outputs

Name              Description                       Type
valid.error       Sample error by number of trees   Table
best.ntrees       Best number of trees message      Value
best.ntrees.int   Best number of trees              Value
model.quality     RMSE                              Value
importance.table  Variable Importance Table         Table
msg               Started and Completed Date/Times  Value
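The main outputs correspond to standard gbm accessors; a sketch of how they could be computed, assuming a fitted model 'fit' and a hypothetical holdout data frame 'holdout' with response column 'y':

```r
library(gbm)

# Best number of trees: method = "test" uses the holdout set;
# use method = "OOB" when the holdout sample size is zero.
best.ntrees.int <- gbm.perf(fit, method = "test", plot.it = FALSE)

# Model quality: RMSE of predictions on the holdout rows.
pred          <- predict(fit, newdata = holdout, n.trees = best.ntrees.int)
model.quality <- sqrt(mean((holdout$y - pred)^2))

# Variable importance (relative influence) table.
importance.table <- summary(fit, n.trees = best.ntrees.int, plotit = FALSE)
```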
[Screenshot: example .dxp page showing many of the model inputs and outputs.]
Release v1.2
Published: June 2016
Initial release