Overview
Random Forest is a machine-learning algorithm that aggregates the predictions of many decision trees, each trained on a different subset of the data. Aggregating in this way typically makes the model more accurate on new data than a single decision tree.
General Market Place
Huge amounts of data are created every minute in the world today. Companies around the world have adopted different techniques to extract value from this data. Some of these techniques are:
Visual Discovery - Companies employ teams of data analysts to extract the data available in their databases, build meaningful dashboards and bring out useful patterns that help in making informed business decisions and formulating long-term business strategy. This process, while useful, is prone to human error and consumes a lot of time.
Supervised Learning - Supervised learning takes a more refined approach to finding meaningful patterns. In this process, labelled data is used with machine-learning models like Random Forest to predict a numerical column (for example, Customer LTV) or a column with categories (for example, the product a customer is likely to purchase). Well-known supervised learning algorithms include Random Forest, Gradient Boosting models and Extreme Gradient Boosting (XGBoost).
Unsupervised Learning - Unsupervised learning is the process of using unlabeled data to find structure in it, such as groups of similar data points, or to find those data points that do not fit the pattern exhibited by the rest of the data. Popular unsupervised learning algorithms include K-means clustering and autoencoders.
Random Forest Algorithms Explained
Random forests follow a technique known as bagging (short for bootstrap aggregation). This is an ensemble technique where a number of decision trees are built on subsets of the data and an aggregation of their predictions is used as the final prediction.
An illustration of this technique can be seen in the graphic below.
The above illustration shows three decision trees and a classification obtained from each of them. The final prediction is based on majority voting and will be 'Class B' in the above case.
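The majority vote described above can be sketched in a few lines of Python. This is a minimal illustration, not the template's implementation: the function name `majority_vote` and the individual tree votes are assumptions chosen to match the 'Class B' outcome in the text.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)] for the top label;
    # on a tie, the label that appeared first in the input wins.
    return counts.most_common(1)[0][0]

# Hypothetical votes from three trees (the individual votes are assumed).
votes = ["Class A", "Class B", "Class B"]
print(majority_vote(votes))  # Class B
```

Using an odd number of trees for a two-class problem avoids ties altogether.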
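When the random forest algorithm receives the data, it first subsets the data by selecting a random subset of the columns: sqrt(number of columns) of them for classification, or (number of columns)/3 for regression. In standard implementations, this column subset is redrawn at each split of a tree rather than once per tree. The algorithm also takes a bootstrap sample of the rows, that is, a sample drawn with replacement. It creates as many subsets as the number of trees specified.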
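As a rough sketch of this sampling step, the snippet below computes the number of columns from the two rules of thumb and draws one bootstrap sample of rows per tree. The names `columns_per_tree` and `draw_subset` are hypothetical, rounding down is an assumption, and columns are drawn once per tree to keep the example short.

```python
import math
import random

def columns_per_tree(n_columns, task):
    """Number of columns to sample, using the rules of thumb from the text."""
    if task == "classification":
        k = math.isqrt(n_columns)  # floor(sqrt(n_columns)); rounding is assumed
    elif task == "regression":
        k = n_columns // 3
    else:
        raise ValueError(f"unknown task: {task!r}")
    return max(1, k)  # always keep at least one column

def draw_subset(n_rows, n_columns, task, rng):
    """Draw one tree's training subset as (row indices, column indices)."""
    # Bootstrap: sample n_rows row indices WITH replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Columns: sample WITHOUT replacement.
    k = columns_per_tree(n_columns, task)
    cols = sorted(rng.sample(range(n_columns), k))
    return rows, cols

rng = random.Random(42)  # fixed seed so the run is repeatable
n_trees = 3
subsets = [draw_subset(10, 16, "classification", rng) for _ in range(n_trees)]
for rows, cols in subsets:
    print(len(rows), "rows,", len(cols), "columns:", cols)
```

Because rows are drawn with replacement, some rows appear several times in a subset and others not at all, which is what makes each tree see a slightly different dataset.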
Then, a decision tree is built using each subset of data and a prediction is computed from each tree. The final prediction is computed by aggregating the individual predictions: a majority vote for classification, or an average for regression.
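The whole procedure can be illustrated end to end with a toy example. The sketch below makes several simplifying assumptions: each "tree" is a one-split decision stump rather than a full decision tree, columns are sampled once per tree, and all function names and the data are hypothetical. It is meant to show the shape of the algorithm, not how H2O or any other library implements it.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put rows on both sides
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: fall back to a constant prediction
        m = majority(labels)
        return (None, None, m, m)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:
        return l_lab
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees, n_features, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a column subset."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap rows
        feats = rng.sample(range(n_cols), n_features)   # random column subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    return majority([predict_stump(s, row) for s in forest])

# Toy data (assumed): two numeric columns, two well-separated classes.
rows = [(1, 10), (2, 11), (3, 12), (8, 20), (9, 21), (10, 22)]
labels = ["A", "A", "A", "B", "B", "B"]
forest = fit_forest(rows, labels, n_trees=25, n_features=1)
print(predict_forest(forest, (0, 9)), predict_forest(forest, (11, 23)))
```

For regression, the same structure applies with the vote in `predict_forest` replaced by the mean of the individual predictions.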
A random forest has the following advantages:
- Reduces the overfitting that a single decision tree is prone to
- Runs efficiently on large datasets
- Handles missing data
- Produces a model that generalizes better to new data
- Outputs variable importance
Spotfire Template for Random Forest
Spotfire's Random Forest template uses a distributed random forest trained in H2O for best-in-market training performance. It can be configured with document properties on Spotfire pages and used as point-and-click functionality.
Download the template from the Component Exchange. See documentation in the download distribution for details on how to use this template.