This data function clusters data rows based on multiple numeric input columns using K-means clustering, ranks the input columns by their importance in determining the clusters using a random forest model, and applies a log transformation to the inputs if appropriate.
Description
- This TERR data function accepts an input table, and uses K-means clustering to find groups of similar rows.
The data function produces several outputs for Spotfire, including:
- A new column, "ClusterID", that indicates the cluster number of each row.
- A summary of the relative importance of the data columns in determining the clusters.
- The names of the two most influential variables.
- Validation metrics that help in choosing the number of clusters to use.
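As an illustration of the clustering step and the "ClusterID" output described above, here is a minimal sketch in plain Python. It is not the TERR code behind this data function: the function names `kmeans` and `cluster_ids`, the random initialisation, and the fixed iteration cap are all assumptions made for the example. Only the "SegmentNN" label format is taken from the output description.

```python
import random

def kmeans(rows, k, iterations=50, seed=0):
    """Basic Lloyd's algorithm on a list of equal-length numeric rows.

    Hypothetical helper for illustration; not the data function's own code.
    """
    rng = random.Random(seed)
    # Assumed initialisation: pick k distinct rows as starting centers.
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, new_labels) if lab == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers

def cluster_ids(labels):
    """Format 0-based cluster indices as strings such as "Segment01"."""
    return ["Segment%02d" % (lab + 1) for lab in labels]
```

Calling `kmeans(rows, 2)` on a list of numeric rows returns one label per row plus the fitted centers, and `cluster_ids(labels)` turns those labels into the string form used by the "ClusterID" column.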
Usage
- In practice, the columns to be analyzed can be selected on the fly in Spotfire as shown in the attached example.
Inputs to the data function

| Name | Structure | Required? | Description |
|---|---|---|---|
| AnalysisData | Numeric table | Yes | Numeric data to be analyzed. Rows represent observed data points to be clustered. |
| user.NClusters | Integer value | No (default=0) | User-specified number of clusters to find. |
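The summary says a log transformation is applied to the inputs "if appropriate", but the page does not state the criterion. The Python sketch below shows one plausible rule, purely as an assumption: transform a column only when all of its values are positive and its sample skewness exceeds a threshold. The names `skewness` and `maybe_log_transform` and the threshold of 1.0 are hypothetical and are not taken from the data function.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def maybe_log_transform(values, skew_threshold=1.0):
    """Return (column, transformed_flag).

    Assumed rule, not confirmed by the source: take natural logs only
    when every value is positive and the column is strongly right-skewed.
    """
    if min(values) > 0 and skewness(values) > skew_threshold:
        return [math.log(v) for v in values], True
    return list(values), False
```

Under this assumed rule, a column with a long right tail is compressed before clustering so that a few very large values do not dominate the distance calculation, while roughly symmetric columns and columns containing zero or negative values are passed through unchanged.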
Outputs from the data function

| Name | Structure | Description |
|---|---|---|
| ClusterID | String column | Cluster (segment) number of each row, as a string, for example "Segment07". |
| N.clusters | Integer value | Number of clusters actually used. |
| TopClusterVariable1 | String value | Name of the most significant variable. |
| TopClusterVariable2 | String value | Name of the second most significant variable. |
| status.message | String value | Overall status of the configuration. |
| VariableSummary | Table | Summary of the importance of each incoming variable in determining the clusters (found using Random Forest). |
| clusterMetrics | Table | Validation table for investigating the number of clusters to use. |
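The columns of the clusterMetrics table are not listed on this page. As a hedged illustration of the kind of validation metric commonly used to compare different numbers of clusters, the Python sketch below computes the total within-cluster sum of squares and the fraction of total variation explained by a given clustering. The function names are hypothetical, and these metrics are an assumption rather than the documented contents of clusterMetrics.

```python
def within_cluster_ss(rows, labels):
    """Total squared distance from each row to the mean of its cluster."""
    total = 0.0
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for r in members for x, m in zip(r, center))
    return total

def explained_fraction(rows, labels):
    """1 - within SS / total SS; values near 1 indicate tight clusters."""
    # Total SS is the within-cluster SS when every row shares one cluster.
    total_ss = within_cluster_ss(rows, [0] * len(rows))
    if total_ss == 0:
        return 0.0
    return 1.0 - within_cluster_ss(rows, labels) / total_ss
```

Computing such a metric for several candidate cluster counts and looking for the point where the improvement levels off is one common way to choose the number of clusters.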
References
https://en.wikipedia.org/wiki/K-means_clustering
https://en.wikipedia.org/wiki/Random_forest
https://cran.r-project.org/web/packages/randomForest/index.html
A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
External R packages used
- randomForest (developed using version 4.6-10)