
Clustering with Variable Importance Data Function for Spotfire® 1.04



Summary

This data function clusters objects together based on similarities between the objects in each cluster. After identifying clusters, the function then ranks the input variables according to their influence on cluster formation.

Overview

Clustering is a technique for grouping objects based on similarities between the objects in each group. For example, a marketing organization could use clustering to identify groups of customers that exhibit similar interests, and to segment them according to a variety of variables (age, gender, income). Clustering can also be used to classify manufactured units by their failure signatures, identify financial crime hot spots, and identify regions with similar geological characteristics. After identifying clusters, this data function ranks the variables according to their influence on cluster formation.

This data function clusters data rows based on multiple numeric input columns using K-means clustering, ranks the input columns by their importance in determining the clusters using a random forest model, and applies a log transformation to inputs where appropriate.

 

Description

  • This TERR data function accepts an input table, and uses K-means clustering to find groups of similar rows.
  • The data function produces several outputs for Spotfire, including:
    • A new column, "ClusterID", that indicates the cluster number of each row.
    • A summary of the relative importance of the data columns in determining the clusters.
    • The names of the top 2 most influential variables.
    • Some validation metrics that can be used in evaluating the best number of clusters to use.
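The K-means step named above can be illustrated with a short sketch. This is a minimal Python version of Lloyd's algorithm, written only to show the technique; it is not the TERR code shipped with the data function. The function name `kmeans`, the sample points, and the fixed random seed are assumptions made for the example.

```python
import random

def kmeans(rows, k, n_iter=50, seed=0):
    """Illustrative K-means (Lloyd's algorithm) on equal-length numeric rows."""
    rng = random.Random(seed)
    # Assumption: start from k distinct rows chosen at random.
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each row joins the center with the smallest
        # squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its member rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical data: two well-separated groups of three rows each.
points = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
          [10.0, 10.2], [9.8, 10.1], [10.1, 9.9]]
labels, centers = kmeans(points, k=2)
```

With this sample data, the first three rows end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random starting centers.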

Usage

In practice, the columns to be analyzed can be selected on the fly in Spotfire as shown in the attached example.

Inputs to the data function

Name

Structure

Required?

Description

AnalysisData

Numeric table

Yes

Numeric data to be analyzed.  Rows represent observed data points to be clustered.

In Spotfire:

  • Numeric columns can be sent directly to the data function.
  • A Spotfire control can be set up to populate a document property listing the columns of interest, and these columns are then passed in. See the dxp file for details.

user.NClusters

Integer value

No (default=0)

User-specified number of clusters to find. 

  • 0 = find the maximum statistically significant number of clusters.
  • Other integer: this value is used unless it exceeds the maximum statistically significant number, in which case that maximum is used instead.
  • A small number of clusters is often easier to interpret.

Outputs from the data function

Name

Structure

Description

ClusterID

String column

Column containing the cluster (segment) number of each row, as a string, e.g. "Segment07".

  • The column has the same number of rows as the incoming data (AnalysisData).
  • It can be appended back to the original table to investigate the relationship between clusters and variables.
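As an illustration of the label format, the sketch below turns integer cluster numbers into strings of the form "Segment07". The helper name `cluster_labels`, the two-digit padding width, and the sample numbers are assumptions based only on the example value shown above.

```python
def cluster_labels(cluster_numbers, prefix="Segment", width=2):
    """Format integer cluster numbers as zero-padded strings (assumed format)."""
    return [f"{prefix}{n:0{width}d}" for n in cluster_numbers]

# Hypothetical cluster numbers for three rows.
ids = cluster_labels([1, 7, 12])
```

One practical effect of zero-padding is that the labels sort in numeric order when sorted as text.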

N.clusters

Integer value

Number of clusters actually used.

  • If the input variable user.NClusters is not specified, or is 0, this will be the maximum number of statistically significant clusters.
  • Otherwise N.clusters is the smaller of user.NClusters and this maximum significant number.
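The rule for N.clusters described above can be written as a small function. This is a sketch of the stated rule only; the function name `resolve_n_clusters` and its parameter names are hypothetical, and `max_significant` stands for the maximum statistically significant number of clusters found by the data function.

```python
def resolve_n_clusters(user_n_clusters, max_significant):
    """Return the number of clusters actually used, following the stated rule."""
    # Not specified (None) or 0: fall back to the maximum significant number.
    if not user_n_clusters:
        return max_significant
    # Otherwise use the smaller of the request and the maximum.
    return min(user_n_clusters, max_significant)

unspecified = resolve_n_clusters(0, 5)   # falls back to the maximum
within_limit = resolve_n_clusters(3, 5)  # request is honored
over_limit = resolve_n_clusters(8, 5)    # request is capped at the maximum
```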

TopClusterVariable1

String value

Name of most significant variable

In Spotfire:

  • If TopClusterVariable1 and TopClusterVariable2 are stored in document properties, a scatter plot visualization can be configured where these properties control the variables appearing on the x- and y-axes. This visualization will update whenever the clustering is re-run.

TopClusterVariable2

String value

Name of second most significant variable

status.message

String value

Overall status of configuration

  • "OK" if the configuration is valid.
  • Otherwise, it lists the additional configuration steps needed (e.g. R package installation).

In Spotfire:

  • Configuring a text label containing this message provides a quick way of checking the status.

VariableSummary

Table

Summary table of the importance of each incoming variable in determining the clusters (found using Random Forest)

Columns:

  • "variable" = name of original column
  • "importance" = numeric importance of this column
  • "transform" = either "Logarithmic" or "Linear"; this value can help configure a scatter plot
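The relationship between VariableSummary and the two top-variable outputs can be sketched as a sort on the "importance" column. The rows below are made-up values for illustration, and the helper name `top_two_variables` is hypothetical; the data function itself derives importance from a random forest model.

```python
def top_two_variables(variable_summary):
    """Return the names of the two rows with the highest importance."""
    ranked = sorted(variable_summary,
                    key=lambda row: row["importance"],
                    reverse=True)
    return ranked[0]["variable"], ranked[1]["variable"]

# Hypothetical summary table with the three documented columns.
summary = [
    {"variable": "age", "importance": 12.5, "transform": "Linear"},
    {"variable": "income", "importance": 41.2, "transform": "Logarithmic"},
    {"variable": "visits", "importance": 27.9, "transform": "Linear"},
]
top1, top2 = top_two_variables(summary)
```

Here `top1` and `top2` play the role of TopClusterVariable1 and TopClusterVariable2.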

clusterMetrics

Table

Validation table to investigate the number of clusters to use.

Columns:

  • Number = integer number of clusters considered.
  • Within-Cluster SS = numeric value holding the within-cluster sum of squares.  This will decrease and level off as more clusters are used.
  • Between-Cluster SS = numeric value holding the between-cluster sum of squares.  This will increase and level off as more clusters are used.
  • Hartigan = metric proposed by Hartigan for determining the cutoff. Values above 10 are considered significant.
  • Hartigan flag = 0 or 1 flag signifying whether the given number of clusters is significant.
  • Hartigan Threshold = the Hartigan metric minus the threshold value (10), so that significance can be read from the sign of the value (>0 or <0).
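The Hartigan columns can be illustrated with a short calculation. The exact formula used by this data function is not stated here, so the sketch assumes the commonly cited form H(k) = (W(k) / W(k+1) - 1) × (n - k - 1), where W(k) is the within-cluster sum of squares for k clusters and n is the number of rows. The function name `hartigan_table`, the sample sums of squares, and the way the flag is attached to k are all assumptions.

```python
def hartigan_table(within_ss, n_rows, threshold=10.0):
    """Build Hartigan rows from within-cluster SS values.

    within_ss[i] is the within-cluster sum of squares for k = i + 1 clusters.
    """
    rows = []
    for i in range(len(within_ss) - 1):
        k = i + 1
        # Assumed form of the Hartigan index for moving from k to k + 1 clusters.
        h = (within_ss[i] / within_ss[i + 1] - 1.0) * (n_rows - k - 1)
        rows.append({
            "Number": k,
            "Hartigan": h,
            "Hartigan flag": int(h > threshold),
            "Hartigan Threshold": h - threshold,
        })
    return rows

# Hypothetical within-cluster SS for k = 1..5 on a table of 100 rows.
metrics = hartigan_table([1000.0, 400.0, 150.0, 140.0, 135.0], n_rows=100)
flags = [row["Hartigan flag"] for row in metrics]
# Under this reading, the first k whose flag is 0 is where adding
# another cluster stops being worthwhile.
suggested = next(row["Number"] for row in metrics if row["Hartigan flag"] == 0)
```

In this made-up example the within-cluster sum of squares levels off after three clusters, so the metric drops below the threshold at k = 3.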

In Spotfire:

  • It is useful to make a plot of these metrics on the y-axis, with Number on the x-axis. 
  • The document property N.clusters can be used to draw a vertical line at the number of clusters used.
  • "Hartigan flag" can be used to control the shape of the symbols to indicate statistical significance.

References

https://en.wikipedia.org/wiki/K-means_clustering

https://en.wikipedia.org/wiki/Random_forest

https://cran.r-project.org/web/packages/randomForest/index.html

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

External R packages used

  • randomForest (developed using version 4.6-10)

Spotfire Platform Release v1.04

Published: May 2016

Initial release
 

 

