Summary
Introduction
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or the mean of their predictions (regression). Random decision forests correct for decision trees' habit of overfitting to their training set.
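The aggregation step described above can be sketched in a few lines. This is only an illustration in plain Python, not the R/TERR code shipped with the data function; the function names and the sample votes are made up for the example.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the per-tree votes.

    Ties are broken in favour of the label that was seen first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Hypothetical votes from five trees for a single observation.
votes = ["churn", "active", "churn", "churn", "active"]
print(aggregate_classification(votes))

# Hypothetical numeric predictions from three trees.
print(aggregate_regression([1.0, 2.0, 3.0]))
```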
Random forests can be used in many areas, including modelling and predicting a binary response variable (such as offer acceptance, customer churn, financial fraud or product/equipment failure) and explaining detected anomalies.
This data function includes the R/TERR code for the random forest model, missing-data imputation, and Random Over-Sampling Examples to address class imbalance in binary response variables. It uses the CRAN randomForest package within the Spotfire interface. It focuses on supervised classification with a binary response variable. Random forests can also be used for unsupervised machine learning, but that is not addressed in this release.
The distribution also includes an IronPython script to filter dependent variables based on the selection of the independent variable.
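To make the two preprocessing ideas concrete, here is a minimal sketch in plain Python. It rests on assumptions: the source does not say which imputation method the data function uses, so median imputation stands in for it, and the over-sampler simply duplicates minority-class rows at random, which is a simplified stand-in rather than the actual over-sampling routine in the R/TERR code. All function names are hypothetical.

```python
import random
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values (assumed method)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes in the response")
    rng = random.Random(seed)
    by_class = {c: [r for r, y in zip(rows, labels) if y == c] for c in classes}
    # Order the two classes by size: smaller one first.
    minority, majority = sorted(classes, key=lambda c: len(by_class[c]))
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority] * deficit


# Hypothetical data: four "active" customers and one "churn" customer.
rows = [[1], [2], [3], [4], [5]]
labels = ["active", "active", "active", "active", "churn"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(len(balanced_rows), balanced_labels.count("churn"))
print(impute_median([1.0, None, 3.0]))
```

Over-sampling should be applied to the training portion only, so that duplicated rows do not leak into the data used to assess the model.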
Data Function Documentation
Description of Input parameters to the data function:
Name | Structure | Required
Explan.Vars | Data table with an arbitrary number and names of explanatory columns (column type: integer/real/string) | Yes
resp.Col | Data table with the binary response in integer (1/0) or string (churn/active) format | Yes
resp.Indicator | Character string that indicates the true state (the event happens) in resp.Col | Yes
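The relationship between resp.Col and resp.Indicator can be illustrated with a short sketch. It assumes the indicator is compared to each response value as text, so that an integer response of 1 matches the indicator "1"; the data function's actual comparison rule is not documented here, and the function name encode_response is hypothetical.

```python
def encode_response(resp_col, resp_indicator):
    """Return 1 where the response equals the indicator (compared as text), else 0."""
    target = str(resp_indicator)
    return [1 if str(value) == target else 0 for value in resp_col]


# String response: "churn" marks the event of interest.
print(encode_response(["churn", "active", "churn"], "churn"))

# Integer response: the indicator is still passed as a character string.
print(encode_response([1, 0, 1], "1"))
```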
Description of output parameters to the data function:
Name | Structure | Description
resp.rate | Value | Percentage of the true state in the response variable
rf.pred.test | Table with 4 columns | True positive/true negative/false positive/false negative counts and percentages
rf.importance | Table with 4 columns | Variable importance table with mean decrease in accuracy and mean decrease in Gini index
rf.ModelAssessment | Table with 2 columns | Information for generating the ROC curve
Msg.Error | String value | Fatal error message raised during function execution
Msg.Warn | String value | Warning message raised during the data cleansing stage
rf.model.Obj | Binary value (blob) | Random forest model object
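For readers who want to see what the assessment outputs represent, the quantities behind resp.rate, rf.pred.test and rf.ModelAssessment can be sketched as follows. This is an illustration in plain Python under stated assumptions: the response is already coded as 1/0, the two ROC columns are taken to be the false positive rate and the true positive rate, and the function names are hypothetical. The exact layout of the tables returned by the data function may differ.

```python
def response_rate(y_true):
    """Percentage of observations in the true state (coded as 1)."""
    return 100.0 * sum(y_true) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 1/0 labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            counts["TP" if pred == 1 else "FN"] += 1
        else:
            counts["FP" if pred == 1 else "TN"] += 1
    return counts


def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per score threshold.

    Requires at least one positive and one negative observation.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a score >= threshold is predicted as 1.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical test-set labels, predicted classes and predicted probabilities.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(response_rate(y_true))
print(confusion_counts(y_true, y_pred))
print(roc_points(y_true, scores))
```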
Spotfire demo (.dxp) file
Using the dxp file with your own data:
The distribution also includes a Spotfire .dxp file. The primary function of the .dxp file is to provide an example illustrating how the embedded data function could be wired up to your data in your own .dxp. It is not intended to provide a complete analysis solution. However, you can still replace the embedded data with your own data using the following procedure:
1) The input in the dxp is the AnalysisData table. You can start with the provided dxp and simply replace AnalysisData with your own data.
2) Go to the variable selection tab and select the independent and dependent variables.
3) Click the Refresh Model button.