Random Forest - Data Function for Spotfire® - Data Functions

Summary

Random forests are an ensemble machine learning method, built from decision trees, for classification and regression.

Introduction

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting either the class that is the mode of the individual trees' predictions (classification) or the mean of their predictions (regression). Random decision forests correct for decision trees' habit of overfitting to their training set.
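The aggregation step described above can be illustrated with a minimal sketch in plain Python. It is not the code shipped with this data function; the function names and sample values are hypothetical, and the per-tree predictions are assumed to be already available.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the per-tree votes.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Hypothetical votes from five trees for one customer record.
votes = ["churn", "active", "churn", "churn", "active"]
print(aggregate_classification(votes))        # churn
print(aggregate_regression([0.5, 1.5, 1.0]))  # 1.0
```

A real forest also trains each tree on a bootstrap sample with a random subset of features; only the final voting or averaging step is shown here.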

Random forests can be used in many areas, for example to model and predict a binary response variable such as offer acceptance, customer churn, financial fraud, or product/equipment failure, and to help explain detected anomalies.

This data function includes the R/TERR code for the random forest model, missing-data imputation, and Random Over-Sampling Examples (ROSE) to address class imbalance in binary response variables. It uses the CRAN randomForest package within the Spotfire interface and focuses on supervised classification with a binary response variable. Random forests can also be used for unsupervised machine learning, but this is not addressed in this release.
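To show why re-balancing helps with an unbalanced binary response, here is a minimal sketch of plain random over-sampling with replacement. It is deliberately simpler than Random Over-Sampling Examples (ROSE), which generates synthetic examples rather than exact copies, and it is not the code shipped with this data function. The function name, the seed parameter, and the sample data are assumptions made for illustration.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: four "active" rows and one "churn" row.
rows = [[10], [20], [30], [40], [50]]
labels = ["active", "active", "active", "active", "churn"]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(balanced_labels.count("active"), balanced_labels.count("churn"))  # 4 4
```

Over-sampling should be applied to the training data only, so that the test set still reflects the original class proportions.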

The distribution also includes an IronPython script that filters the dependent variables based on the selection of the independent variable.

Data Function Documentation

Description of input parameters to the data function:

Name	Structure	Required
Explan.Vars	Data table with an arbitrary number of explanatory columns with arbitrary names (column type: integer/real/string). Suggested: send multiple columns using the Spotfire expression $map("[AnalysisData].[${ExplanatoryColumns}]", ","), where ExplanatoryColumns is a document property limited through (datatype:real or datatype:integer or datatype:string) and isIncluded:TRUE and not depColumn	Yes
resp.Col	Data table with a binary response in integer (1/0) or string (churn/active) format. Suggested: send the column using the Spotfire expression [AnalysisData].[${depColumn}], where depColumn is a Spotfire document property	Yes
resp.Indicator	Character string that indicates the true state (the event happens) in resp.Col	Yes
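The relationship between resp.Col and resp.Indicator can be sketched as follows. This is an illustration of the idea only, written in plain Python rather than the R/TERR code of the data function; the function name is hypothetical, and comparing values as strings is an assumption made so that an integer 1 matches the indicator "1".

```python
def encode_response(values, indicator):
    """Map each response value to 1 when it equals the indicator, else 0.

    Values are compared as strings (an assumption) because the indicator
    is described as a character string while the response may be integer.
    """
    return [1 if str(value) == str(indicator) else 0 for value in values]


# String-coded response: "churn" is the true state.
print(encode_response(["churn", "active", "churn"], "churn"))  # [1, 0, 1]

# Integer-coded response: "1" is the true state.
print(encode_response([1, 0, 0, 1], "1"))  # [1, 0, 0, 1]
```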

Description of output parameters to the data function:

Name	Structure	Description
resp.rate	value	Percentage of the true state in the response variable
rf.pred.test	Table with 4 columns	True positive/true negative/false positive/false negative counts and percentages
rf.importance	Table with 4 columns	Variable importance table with mean decrease in accuracy and mean decrease in Gini index
rf.ModelAssessment	Table with 2 columns	Information for generating the ROC curve
Msg.Error	string value	Fatal error message raised during function execution
Msg.Warn	string value	Warning message raised during the data cleansing stage
rf.model.Obj	binary value (blob)	Random forest model object
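As a rough guide to what the assessment outputs contain, the sketch below computes a response rate, confusion counts with percentages, and ROC points from a 1/0 response. It is plain Python written for illustration, not the data function's R/TERR code. The function names, the exact layout of the results, and the choice of false positive rate and true positive rate as the two ROC columns are all assumptions.

```python
def response_rate(actual):
    """Percentage of records whose response is the true state (1)."""
    return 100.0 * sum(actual) / len(actual)


def confusion_summary(actual, predicted):
    """Return {cell: (count, percentage of all records)} for TP/TN/FP/FN."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    total = len(actual)
    return {cell: (n, 100.0 * n / total) for cell, n in counts.items()}


def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per
    distinct score threshold. Requires both classes to be present."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical test-set results for four records.
actual = [1, 0, 1, 0]
predicted = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.1]  # predicted probability of the true state

print(response_rate(actual))                 # 50.0
print(confusion_summary(actual, predicted))
print(roc_points(actual, scores))
```

Plotting the second element of each ROC pair against the first gives the ROC curve; a curve that rises quickly toward the top-left corner indicates better separation of the two classes.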

Spotfire demo (.dxp) file

Using the dxp file with your own data:

The distribution also includes a Spotfire .dxp file. The primary function of the .dxp file is to provide an example illustrating how the embedded data function could be wired up to your data in your own .dxp. It is not intended to provide a complete analysis solution. However, you can still replace the embedded data with your own data using the following procedure:

1) The input in the .dxp is the AnalysisData table. Start with the provided .dxp and replace AnalysisData with your own data.

2) Go to the variable selection tab and select the independent and dependent variables.

3) Click the Refresh Model button.

Release P1.0

Published: March 2017

Initial Release
