
Random Forest - Data Function for Spotfire® 1.0



Summary

Random forests are an ensemble decision tree machine learning method for classification and regression.

Introduction

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of over-fitting to their training set.
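The aggregation step described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the R/TERR code shipped with the data function; the function names and the sample votes are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the per-tree votes.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Hypothetical votes from five trees for one observation.
votes = ["churn", "active", "churn", "churn", "active"]
print(aggregate_classification(votes))        # churn (3 of 5 votes)
print(aggregate_regression([2.0, 3.0, 4.0]))  # 3.0
```

Averaging over many trees grown on different bootstrap samples is what reduces the over-fitting of any single tree.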

Random forests can be used in many areas, for example to model and predict a binary response variable such as offer acceptance, customer churn, financial fraud, or product/equipment failure, as well as to explain detected anomalies.

This data function includes the R/TERR code for the random forest model, missing-data imputation, and Random Over-Sampling Examples to address class imbalance in binary response variables. It uses the CRAN randomForest package within the Spotfire interface and focuses on supervised classification with a binary response variable. Random forests can also be used for unsupervised machine learning, but that is not addressed in this release.
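To illustrate why over-sampling helps with an unbalanced binary response, here is a minimal sketch of plain random over-sampling with replacement. It is an assumption-laden stand-in: the data function's own R/TERR routine is not shown on this page and may work differently (for example by generating synthetic rows rather than duplicating existing ones). The function name `oversample_minority` and the toy data are hypothetical.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match in size."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    # Row indices belonging to each class.
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    minority, majority = sorted(classes, key=lambda c: len(by_class[c]))
    deficit = len(by_class[majority]) - len(by_class[minority])
    # Draw minority rows with replacement to close the gap.
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical data: four "no event" rows and one "event" row.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 0, 1]
new_rows, new_labels = oversample_minority(rows, labels)
print(new_labels.count(0), new_labels.count(1))  # 4 4
```

Balancing should be applied to the training data only, so that the test metrics still reflect the original class proportions.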

The distribution also includes an IronPython script to filter dependent variables based on the selection of the independent variable.

 

Data Function Documentation

Description of input parameters to the data function:

| Name | Structure | Required |
| --- | --- | --- |
| Explan.Vars | Data table with an arbitrary number of explanatory columns with arbitrary names (column type: integer/real/string). Suggest sending multiple columns using the Spotfire expression $map("[AnalysisData].[${ExplanatoryColumns}]", ","), where ExplanatoryColumns is a document property limited through (datatype:real or datatype:integer or datatype:string) and isIncluded:TRUE and not depColumn. | Yes |
| resp.Col | Column with a binary response in integer (1/0) or string (churn/active) format. Suggest sending the column using the Spotfire expression [AnalysisData].[${depColumn}], where depColumn is a Spotfire document property. | Yes |
| resp.Indicator | Character string that indicates the true state (the event happens) in resp.Col. | Yes |
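The relationship between resp.Col and resp.Indicator can be sketched as follows. This is a hypothetical illustration of how an indicator string might be used to turn the response column into a 1/0 target; the actual R/TERR handling inside the data function is not documented here, and `encode_response` is an invented name.

```python
def encode_response(resp_col, resp_indicator):
    """Map each response value to 1 if it matches the indicator string, else 0."""
    values = [str(v) for v in resp_col]
    distinct = set(values)
    if len(distinct) > 2:
        raise ValueError("response column must be binary")
    if resp_indicator not in distinct:
        raise ValueError("indicator value not found in response column")
    return [1 if v == resp_indicator else 0 for v in values]


# String-coded response: "churn" is the event of interest.
print(encode_response(["churn", "active", "churn"], "churn"))  # [1, 0, 1]
# Integer-coded response: the indicator is still passed as a string.
print(encode_response([1, 0, 0, 1], "1"))                      # [1, 0, 0, 1]
```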

Description of output parameters to the data function:

| Name | Structure | Description |
| --- | --- | --- |
| resp.rate | Value | Percentage of the true state in the response variable |
| rf.pred.test | Table with 4 columns | True positive, true negative, false positive, and false negative counts and percentages |
| rf.importance | Table with 4 columns | Variable importance table with mean decrease in accuracy and mean decrease in Gini index |
| rf.ModelAssessment | Table with 2 columns | Information for generating the ROC curve |
| Msg.Error | String value | Fatal error message raised during function execution |
| Msg.Warn | String value | Warning message raised during the data cleansing stage |
| rf.model.Obj | Binary value (blob) | Random forest model object |
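To make the assessment outputs more concrete, the sketch below computes a response rate, a confusion summary, and ROC points in pure Python. The column layouts are assumptions: this page only states that rf.pred.test has 4 columns and rf.ModelAssessment has 2, so the (actual, predicted, count, percent) and (false positive rate, true positive rate) shapes used here are illustrative guesses, and all function names are hypothetical.

```python
def response_rate(y_true):
    """Percentage of observations in the true state (coded as 1)."""
    return 100.0 * sum(y_true) / len(y_true)


def confusion_summary(y_true, y_pred):
    """Return (actual, predicted, count, percent) rows for the four outcome cells."""
    n = len(y_true)
    rows = []
    for actual in (1, 0):
        for predicted in (1, 0):
            count = sum(1 for t, p in zip(y_true, y_pred)
                        if t == actual and p == predicted)
            rows.append((actual, predicted, count, 100.0 * count / n))
    return rows


def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical test-set labels, hard predictions, and predicted probabilities.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6]

print(response_rate(y_true))              # 50.0
print(confusion_summary(y_true, y_pred))
print(roc_points(y_true, scores))
```

Plotting the second element of each ROC pair against the first gives the ROC curve; a curve that hugs the top-left corner indicates better separation of the two classes.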

Spotfire demo (.dxp) file

Using the dxp file with your own data:

The distribution also includes a Spotfire .dxp file.  The primary function of the .dxp file is to provide an example illustrating how the embedded data function could be wired up to your data in your own .dxp. It is not intended to provide a complete analysis solution.  However, you can still replace the embedded data with your own data using the following procedure:

1) The input in the .dxp is the AnalysisData table. Start with the provided .dxp and replace AnalysisData with your own data.

2) Go to the variable selection tab and select the independent and dependent variables.

3) Click the Refresh Model button.

Release P1.0

Published: March 2017

Initial Release

