Summary
Overview
Clustering groups objects so that members of the same group are more similar to one another than to members of other groups. For example, a marketing organization could use clustering to identify groups of customers with similar interests and to segment them by variables such as age, gender, and income. Clustering can also be used to classify manufactured units by their failure signatures, identify financial crime hot spots, and identify regions with similar geological characteristics. After identifying clusters, this data function ranks the variables according to their influence on cluster formation.
This data function clusters data rows based on multiple numeric input columns using K-means clustering, ranks the input columns by their importance in determining the clusters using a random forest model, and applies a log transformation to inputs where appropriate.
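The description does not say when a log transformation counts as appropriate. The following Python sketch shows one common heuristic, purely as an assumption: transform a column only when all of its values are strictly positive and it is strongly right-skewed. The function names, the skewness threshold of 1.0, and the sample columns are hypothetical and are not taken from the TERR script.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment); 0.0 for constant data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def maybe_log_transform(values, skew_threshold=1.0):
    """Return (column, was_transformed).

    Assumed rule (not taken from the data function): take logs only when
    every value is strictly positive and the column is strongly right-skewed.
    """
    if min(values) > 0 and skewness(values) > skew_threshold:
        return [math.log(v) for v in values], True
    return list(values), False

# Hypothetical columns: one with a long right tail, one roughly symmetric.
income = [20, 22, 25, 27, 30, 32, 35, 40, 400, 2500]
age = [23, 31, 35, 38, 42, 47, 51, 55, 60, 64]

income_out, income_logged = maybe_log_transform(income)
age_out, age_logged = maybe_log_transform(age)
```

Under this rule the long-tailed column is replaced by its logarithm and the roughly symmetric column is passed through unchanged, which keeps a few extreme values from dominating the distance calculations in K-means.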
Description
- This TERR data function accepts an input table, and uses K-means clustering to find groups of similar rows.
- The data function produces several outputs for Spotfire, including:
- A new column, "ClusterID", that indicates the cluster number of each row.
- A summary of the relative importance of the data columns in determining the clusters.
- The names of the top 2 most influential variables.
- Some validation metrics that can be used in evaluating the best number of clusters to use.
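To make the outputs above more concrete, here is a minimal Python sketch of K-means (Lloyd's algorithm) that produces zero-padded segment labels and one possible validation metric, the total within-cluster sum of squares for several values of k. It is an illustration under stated assumptions, not the TERR implementation: the deterministic farthest-point initialization, the choice of metric, the sample rows, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(rows, k):
    """Deterministic farthest-point start (a simplification for this sketch)."""
    centers = [list(rows[0])]
    while len(centers) < k:
        # Pick the row that is farthest from its nearest chosen center.
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(far))
    return centers

def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, total within-cluster sum of squares)."""
    centers = init_centers(rows, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, wss

# Hypothetical data: three well-separated groups of three rows each.
rows = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
        (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
        (1.0, 8.0), (1.0, 9.0), (2.0, 9.0)]

labels, _ = kmeans(rows, 3)
# Zero-padded string labels in the style of the "Segment07" example.
cluster_ids = ["Segment%02d" % (lab + 1) for lab in labels]

# One candidate validation metric: within-cluster sum of squares per k.
metrics = [(k, kmeans(rows, k)[1]) for k in range(1, 5)]
```

Plotting the second element of each `metrics` entry against k gives the familiar elbow curve: the value drops sharply until k reaches the natural number of groups and flattens afterwards.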
Usage
In practice, the columns to be analyzed can be selected on the fly in Spotfire as shown in the attached example.
Inputs to the data function

| Name | Structure | Required? | Description |
|---|---|---|---|
| AnalysisData | Numeric table | Yes | Numeric data to be analyzed. Rows represent observed data points to be clustered. In Spotfire: |
| user.NClusters | Integer value | No (default=0) | User-specified number of clusters to find. |
Outputs from the data function

| Name | Structure | Description |
|---|---|---|
| ClusterID | String column | Column containing the cluster (segment) number of each row, as a string, e.g. "Segment07". |
| N.clusters | Integer value | Number of clusters actually used. |
| TopClusterVariable1 | String value | Name of the most significant variable. In Spotfire: |
| TopClusterVariable2 | String value | Name of the second most significant variable. |
| status.message | String value | Overall status of the configuration. In Spotfire: |
| VariableSummary | Table | Summary table of the importance of each incoming variable in determining the clusters (found using Random Forest). Columns: |
| clusterMetrics | Table | Validation table to investigate the number of clusters to use. Columns: In Spotfire: |
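The data function ranks variables with a random forest model (the randomForest package), which is too long to reproduce here. As a much simpler stand-in, and explicitly not the method used by the data function, the following Python sketch scores each column by the share of its variance that lies between clusters and reports the two highest-scoring names. The function names, column names, and sample values are hypothetical.

```python
def variance_explained(values, labels):
    """Share of a column's variance that lies between clusters (0 to 1)."""
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant column cannot separate clusters
    between_ss = 0.0
    for lab in set(labels):
        members = [v for v, l in zip(values, labels) if l == lab]
        group_mean = sum(members) / len(members)
        between_ss += len(members) * (group_mean - grand_mean) ** 2
    return between_ss / total_ss

def rank_variables(table, labels):
    """Return (name, score) pairs sorted from most to least influential."""
    scores = {name: variance_explained(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical input columns and cluster labels for six rows.
table = {
    "income": [20, 22, 21, 80, 82, 81],
    "age": [30, 40, 50, 45, 55, 65],
    "noise": [5, 7, 6, 6, 5, 7],
}
labels = ["Segment01"] * 3 + ["Segment02"] * 3

ranking = rank_variables(table, labels)
top_two = [name for name, _ in ranking[:2]]
```

A random forest can capture interactions and nonlinear effects that this per-column score ignores, so the two approaches may rank variables differently on real data.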
References
https://en.wikipedia.org/wiki/K-means_clustering
https://en.wikipedia.org/wiki/Random_forest
https://cran.r-project.org/web/packages/randomForest/index.html
A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
External R packages used
- randomForest (developed using version 4.6-10)