  • Clustering with Variable Importance Data Function for Spotfire®


    Clustering is a technique for grouping objects based on their similarity to one another. For example, a marketing organization could use clustering to identify groups of customers with similar interests and to segment them by a variety of variables (age, gender, income). Clustering can also be used to classify manufactured units by their failure signatures, identify financial crime hot spots, and identify regions with similar geological characteristics. After identifying clusters, this data function ranks the variables according to their influence on cluster formation.

    This data function clusters data rows based on multiple numeric input columns using K-means clustering, ranks the input columns by their importance in determining the clusters using a random forest model, and applies a log transformation to inputs where appropriate.
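    The page does not say how the data function decides when a log transformation is "appropriate". As a purely illustrative sketch, the Python below assumes one plausible rule: transform a column only when every value is strictly positive and the raw values are strongly right-skewed. The function names, the skewness cutoff of 1.0, and the sample columns are all hypothetical and are not taken from the TERR implementation.

```python
import math

def skewness(values):
    """Skewness as the third standardized moment; 0.0 for constant data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def choose_transform(values, skew_cutoff=1.0):
    """Return a (label, values) pair for one numeric column.

    Assumed rule (NOT from the data function): log-transform only when all
    values are strictly positive and the skewness exceeds skew_cutoff.
    """
    if min(values) > 0 and skewness(values) > skew_cutoff:
        return "Logarithmic", [math.log(v) for v in values]
    return "Linear", list(values)

# Hypothetical columns: one heavily right-skewed, one roughly symmetric.
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 900]
ages = [23, 31, 35, 38, 41, 44, 47, 52, 58, 64]
income_label, _ = choose_transform(incomes)
age_label, _ = choose_transform(ages)
```

    The labels mirror the "Logarithmic" and "Linear" values reported in the "transform" column of the VariableSummary output.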

     

    Description

    • This TERR data function accepts an input table, and uses K-means clustering to find groups of similar rows.
    • The data function produces several outputs for Spotfire, including:
      • A new column, "ClusterID", that indicates the cluster number of each row.
      • A summary of the relative importance of the data columns in determining the clusters.
      • The names of the top 2 most influential variables.
      • Some validation metrics that can be used in evaluating the best number of clusters to use.
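
    To make the clustering step concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not the TERR code used by the data function; the function name, the seeded random initialization, and the sample rows are assumptions chosen for illustration.

```python
import random

def kmeans(rows, k, iterations=50, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for a list of numeric rows."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct rows as the starting centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid (squared distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its member rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical data: two well-separated groups of rows.
rows = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
        [8.0, 8.1], [8.2, 7.9], [7.9, 8.3]]
labels, centroids = kmeans(rows, k=2)
```

    Which group receives label 0 and which receives label 1 depends on the initialization, so only the grouping itself is meaningful.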

    Usage

    • In practice, the columns to be analyzed can be selected on the fly in Spotfire as shown in the attached example.

    Inputs to the data function

    Name

    Structure

    Required?

    Description

    AnalysisData

    Numeric table

    Yes

    Numeric data to be analyzed.  Rows represent observed data points to be clustered.

    In Spotfire:

    • Numeric columns can be directly sent to the data function
    • A Spotfire control can be set up to populate a document property listing the columns of interest, and these columns are then passed in. See the dxp file for details.

    user.NClusters

    Integer value

    No (default=0)

    User-specified number of clusters to find. 

    • 0 = find the maximum statistically significant number.
    • Other integer: this value is used unless it exceeds the maximum statistically significant number of clusters, in which case that maximum is used instead.
    • Often small numbers of clusters are easier to interpret.

    Outputs from the data function

    Name

    Structure

    Description

    ClusterID

    String column

    Column containing the cluster (segment) number of each row, as a string, e.g. "Segment07".

    • Column has same number of rows as incoming data (AnalysisData)
    • This can be appended back to the original table to investigate relation of clusters and variables.
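
    The "Segment07" example suggests zero-padded labels. A small sketch of how such strings could be built from integer cluster numbers is shown below; the prefix and the padding width of 2 are inferred from that single example, and the function name is hypothetical.

```python
def cluster_labels(cluster_numbers, prefix="Segment", width=2):
    """Format integer cluster numbers as zero-padded strings such as 'Segment07'."""
    return [f"{prefix}{n:0{width}d}" for n in cluster_numbers]

segment_ids = cluster_labels([7, 1, 12])
```

    Zero padding keeps the string labels in numeric order when they are sorted alphabetically, which is convenient for legends and filters.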

    N.clusters

    Integer value

    Number of clusters actually used.

    • If the input variable user.NClusters was not specified, or 0, this will be the maximum number of statistically significant clusters.
    • Otherwise N.clusters is the smaller of user.NClusters and this maximum significant number.
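
    The rule for N.clusters described above can be sketched in a few lines of Python. The function and argument names are hypothetical; only the logic (0 or unspecified means "use the maximum", otherwise take the smaller value) comes from the description.

```python
def resolve_n_clusters(user_n_clusters, max_significant):
    """Return the number of clusters actually used.

    user_n_clusters: requested number (None or 0 means "not specified").
    max_significant: maximum statistically significant number of clusters.
    """
    if user_n_clusters is None or user_n_clusters == 0:
        return max_significant
    return min(user_n_clusters, max_significant)

default_choice = resolve_n_clusters(0, 6)   # not specified: use the maximum
small_request = resolve_n_clusters(3, 6)    # request is within the maximum
large_request = resolve_n_clusters(9, 6)    # request is capped at the maximum
```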

    TopClusterVariable1

    String value

    Name of most significant variable

    In Spotfire:

    • If TopClusterVariable1 and TopClusterVariable2 are stored in document properties, a scatter plot visualization can be configured so that these properties control the variables on the x- and y-axes. The visualization then updates whenever the clustering is re-run.

    TopClusterVariable2

    String value

    Name of second most significant variable

    status.message

    String value

    Overall status of configuration

    • "OK" if configuration is ok
    • Otherwise it will provide additional configuration steps needed (e.g. R package installation)

    In Spotfire:

    • Configuring a text label containing this message provides a quick way of checking the status.

    VariableSummary

    Table

    Summary table of the importance of each incoming variable in determining the clusters (found using Random Forest)

    Columns:

    • "variable" = name of original column
    • "importance" = numeric importance of this column
    • "transform" = either "Logarithmic" or "Linear" that can help configure a scatter plot
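
    As an illustration of how the two top variables relate to this table, the sketch below sorts a VariableSummary-like list by importance and takes the first two names. The rows and importance values are made up for the example, and the code assumes the top variables are simply the two highest-ranked entries.

```python
# Hypothetical VariableSummary rows; the values are invented for illustration.
variable_summary = [
    {"variable": "age", "importance": 12.5, "transform": "Linear"},
    {"variable": "income", "importance": 41.2, "transform": "Logarithmic"},
    {"variable": "tenure", "importance": 27.9, "transform": "Linear"},
]

# Rank columns from most to least important.
ranked = sorted(variable_summary, key=lambda row: row["importance"], reverse=True)
top_cluster_variable_1 = ranked[0]["variable"]
top_cluster_variable_2 = ranked[1]["variable"]
```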

    clusterMetrics

    Table

    Validation table to investigate the number of clusters to use.

    Columns:

    • Number = integer number of clusters considered.
    • Within-Cluster SS = numeric value holding the within-cluster sum of squares.  This will decrease and level off as more clusters are used.
    • Between-Cluster SS = numeric value holding the between-cluster sum of squares.  This will increase and level off as more clusters are used.
    • Hartigan = metric proposed by Hartigan for determining the cutoff.  Values above 10 are considered significant.
    • Hartigan flag = 0 or 1 flag indicating whether the given number of clusters is significant.
    • Hartigan Threshold = the Hartigan metric minus the threshold value (10), so that it can be easily evaluated as being <0 or >0.
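
    The columns above can be reproduced from a list of within-cluster sums of squares. The sketch below uses the commonly cited form of Hartigan's index, H(k) = (W(k) / W(k+1) - 1) * (n - k - 1), where W(k) is the within-cluster SS for k clusters and n is the number of rows. Whether the data function uses exactly this form, and which row it attaches each value to, is not stated on this page; here H(k) is stored on the row for k + 1 clusters as an assumption. The function name and input values are hypothetical.

```python
def cluster_metrics(within_ss, n_rows, threshold=10.0):
    """Build a clusterMetrics-like table.

    within_ss[i] is the within-cluster sum of squares for i + 1 clusters.
    """
    total_ss = within_ss[0]  # with one cluster, all variation is within-cluster
    table = []
    for i in range(len(within_ss) - 1):
        k = i + 1
        hartigan = (within_ss[i] / within_ss[i + 1] - 1.0) * (n_rows - k - 1)
        table.append({
            "Number": k + 1,  # assumed: row describes moving from k to k + 1 clusters
            "Within-Cluster SS": within_ss[i + 1],
            "Between-Cluster SS": total_ss - within_ss[i + 1],
            "Hartigan": hartigan,
            "Hartigan flag": 1 if hartigan > threshold else 0,
            "Hartigan Threshold": hartigan - threshold,
        })
    return table

# Hypothetical within-cluster SS for 1 to 5 clusters on 100 rows.
metrics = cluster_metrics([1000.0, 400.0, 150.0, 140.0, 135.0], n_rows=100)
significant = [row["Number"] for row in metrics if row["Hartigan flag"] == 1]
```

    With these invented numbers the metric drops below the threshold once a fourth cluster is added, so the largest flagged number of clusters is 3.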

    In Spotfire:

    • It is useful to make a plot of these metrics on the y-axis, with Number on the x-axis. 
    • The document property N.clusters can be used to draw a vertical line at the number of clusters used.
    • "Hartigan flag" can be used to control the shape of the symbols to indicate statistical significance.

    References

    https://en.wikipedia.org/wiki/K-means_clustering

    https://en.wikipedia.org/wiki/Random_forest

    https://cran.r-project.org/web/packages/randomForest/index.html

    A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

    External R packages used

    • randomForest (developed using version 4.6-10)

     

