  • User's Guide to Self-Organizing Map Data function for Spotfire with TERR (now called Spotfire Enterprise Runtime for R)


    This article explains how to use the Self-Organizing Map data function which lets you explore variations in high-dimensional data and is available for download from the Exchange.

    Overview

    A Self-Organizing Map (SOM) is an algorithm that "maps" the high-dimensional information in your data onto a constructed two-dimensional grid in such a way that when you explore this 2D grid, you are taking a guided tour through the most "interesting" parts of your data.  Self-organizing maps have been around since the 1980s, but the dynamic combination of Spotfire with this algorithm is quite powerful.

    The main result of the algorithm is the map itself.   Conceptually this map is a series of tiles, or "buckets" arranged in a two-dimensional lattice. 

    On the main page of the Spotfire analysis file (which can be downloaded together with the data function from the Exchange), each cell in the lattice map is portrayed as a star pattern.  Each arm of the star represents one variable in the analysis data set.  For example, if you select 3 variables to analyze, each star will have 3 arms, one per variable.  The length of each arm is scaled to represent the relative value of that variable.
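
    The arm scaling can be pictured with a small sketch.  It assumes min-max scaling of each variable across all cells; the article does not state which scaling rule the data function actually uses, and the names `scale_arms` and `cells` are hypothetical.

```python
def scale_arms(cells):
    """Rescale each variable to [0, 1] across all cells (min-max, an assumption).

    cells: list of dicts mapping variable name -> summary value for one cell.
    Returns a list of dicts with the same keys and arm lengths in [0, 1].
    """
    variables = list(cells[0].keys())
    lo = {v: min(c[v] for c in cells) for v in variables}
    hi = {v: max(c[v] for c in cells) for v in variables}
    scaled = []
    for c in cells:
        arms = {}
        for v in variables:
            span = hi[v] - lo[v]
            # A constant variable gets a mid-length arm instead of dividing by zero.
            arms[v] = 0.5 if span == 0 else (c[v] - lo[v]) / span
        scaled.append(arms)
    return scaled


# Toy values invented for illustration only.
cells = [
    {"a": 1.0, "b": 10.0, "c": 5.0},
    {"a": 3.0, "b": 20.0, "c": 5.0},
    {"a": 2.0, "b": 40.0, "c": 5.0},
]
arms = scale_arms(cells)
```

    With this rule, the cell holding the largest value of a variable gets a full-length arm for that variable and the cell holding the smallest gets a zero-length arm, so arm lengths are comparable across stars but not across variables.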

    Each row of the original data will fall into exactly one of these lattice buckets.  When there are many more data rows than stars, each star will be a summary of its collection of data.  One of the properties of the map is that nearby cells (the stars) are similar to one another.  Even though neighboring cells do not share any of the original data, the stars "look" alike, and they vary fairly smoothly across the page.  In this sense, the SOM algorithm has sorted the data into an arrangement on the 2D grid.
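
    To make the bucket assignment and the smoothness property concrete, here is a minimal sketch of a generic SOM on a rectangular grid, written in plain Python.  It is not the implementation inside the data function; the function names, the linear decay schedules and the Gaussian neighborhood are all assumptions chosen for brevity.

```python
import math
import random


def best_cell(weights, x):
    """Return the grid cell whose weight vector is closest to row x."""
    return min(weights, key=lambda cell: sum((wi - xi) ** 2
                                             for wi, xi in zip(weights[cell], x)))


def train_som(rows, n_rows, n_cols, epochs=20, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Initialise every cell with a randomly chosen data row.
    weights = {(r, c): list(rng.choice(rows))
               for r in range(n_rows) for c in range(n_cols)}
    total = epochs * len(rows)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(rows, len(rows)):  # shuffled pass over the data
            frac = step / total
            lr = 0.5 * (1.0 - frac)  # learning rate decays linearly
            radius = max(n_rows, n_cols) / 2 * (1.0 - frac) + 0.5  # shrinking neighborhood
            bmu = best_cell(weights, x)
            for cell, w in weights.items():
                d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood weight
                # Pull this cell toward x; cells near the winner move the most.
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Synthetic two-column data, generated here purely for illustration.
rng = random.Random(1)
rows = ([[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)]
        + [[rng.gauss(1, 0.1), rng.gauss(1, 0.1)] for _ in range(30)])
som = train_som(rows, n_rows=3, n_cols=3)
assignment = [best_cell(som, x) for x in rows]  # exactly one cell per data row
```

    Because each update also moves the winner's grid neighbors toward the same data row, adjacent cells end up with similar weight vectors, which is why adjacent stars look alike.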

    The map can be computed with a coarse grid or a fine grid (limited by the number of data points), and can be arranged as a hexagonal or rectangular grid.  Because the grid size and layout are choices made by the user, the individual cells in the map are not significant in themselves; rather, it is the regions of the map (a neighborhood of several nearby cells) that provide insight.
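
    The practical difference between the two layouts is the number of immediate neighbors each cell has.  The sketch below uses a common plotting convention for hexagonal grids (odd rows shifted by half a cell, rows spaced sqrt(3)/2 apart); whether the data function uses this exact convention is an assumption, and the function names are hypothetical.

```python
import math


def cell_centres(n_rows, n_cols, layout="hexagonal"):
    """Return {(row, col): (x, y)} plotting coordinates for each cell."""
    centres = {}
    for r in range(n_rows):
        for c in range(n_cols):
            if layout == "hexagonal":
                # Odd rows shift half a cell; rows sit sqrt(3)/2 apart.
                centres[(r, c)] = (c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            else:
                centres[(r, c)] = (float(c), float(r))
    return centres


def neighbours(centres, cell, tol=1e-9):
    """Cells whose centres lie at unit distance from the given cell."""
    x0, y0 = centres[cell]
    return [k for k, (x, y) in centres.items()
            if k != cell and abs(math.hypot(x - x0, y - y0) - 1.0) < tol]


hex_n = neighbours(cell_centres(5, 5, "hexagonal"), (2, 2))
rect_n = neighbours(cell_centres(5, 5, "rectangular"), (2, 2))
```

    An interior cell has six equidistant neighbors in the hexagonal layout and four in the rectangular layout, so a "neighborhood of several nearby cells" covers a slightly different shape in each.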

    Analysis Plots

    The data that comes with the Spotfire dxp file is a simple synthetic data set with just 4 columns named a, b, c, and d:

    screen_shot_2020-05-11_at_13_56_48.png.0ee7b7cbdf70ab975514690b738a3be3.png

    The Spotfire dxp file contains a property control that brings up a list of names of all the numeric variables in your data, and lets you make a selection of which variables to analyze.  Here we've selected variables a, b and c:

    screen_shot_2020-05-11_at_13_57_05.png.de7b2a34790c4950e77c602e2ac27a78.png

    The SOM algorithm runs automatically and refreshes the "star" visualization.   Here we've selected the hexagonal layout, and chosen a relatively coarse grid:

    screen_shot_2020-05-11_at_13_57_36.thumb.png.25e25f2989e6d0b49058862412af7ec0.png

    Three variables (a, b and c) were selected, so each star has three arms, color coded by the variable. The top-left stars have relatively large values of all variables.  The lower-left corner of the plot has data with low values of variable c (the red arm is very short).  Neighboring stars have very similar properties.

    The concepts are probably easier to see with a real data set, so we next demonstrate how to replace the data with a more substantial data set.

    Replacing Data: Analysis of USDA Food Nutrition Data

    The United States Department of Agriculture (USDA) researches the nutritional contents of foods and provides nutritional data for a variety of foods, sampled in 100-gram amounts.  This USDA National Nutrient Database for Standard Reference data is available through their website, FoodData Central; a version has also been archived on Kaggle.

    In Spotfire, we replace the main analysis data table with this new data.   We see a warning in Spotfire that new columns are being computed (we dismiss the warnings).

    screen_shot_2020-05-11_at_14_21_47.png.fe3868adcb2b51286e04a049ff1534a0.png

    We can choose any combination of two or more variables to analyze.  Here we choose to analyze these foods according to their Protein, Total Fat and Sugar content:

     

    screen_shot_2020-05-11_at_14_28_14.png.8e1b0042f6aae1fc7727ebc52fa183cc.png

    The star plot updates with the variables we selected.  On the star map we can now mark one of the star patterns, toward the top, with a relatively long Protein arm (shown in orange):

    screen_shot_2020-05-11_at_14_24_03.thumb.png.6e15ff8c12165344ec9a20b8aa0a44c4.png

    We create a details visualization of the original data set (limited by the same "SOM Marking" used to mark the star chart), and are able to view a list of these high-protein foods:

    screen_shot_2020-05-11_at_14_25_20.png.a5593f5eddea932ff00e4b09074cb7eb.png

    We now analyze the Calcium, Carbohydrate, Protein and Saturated Fat content of foods, and increase the number of grid cells.

    The star pattern starts to reveal some interesting areas.  Here we select a small region of the map and discover we've found the cheese selection:

    screen_shot_2020-05-11_at_14_46_46.thumb.png.37d1371608c4832a5b605055c582a9b6.png

    Selecting a different area of the map now finds pastries and pie crusts:

    screen_shot_2020-05-11_at_14_47_41.thumb.png.8ab60673df7005c6d1f7c4dbef1d46f0.png

    The Spotfire dxp file also contains a tab where each variable's value is displayed across the map separately, as an individual heatmap for each variable. Again you can see the variables vary smoothly:

    screen_shot_2020-05-11_at_14_42_28.thumb.png.f5612ff61f487e3ba8f3fa447875dfd1.png

    For some variables, e.g. Calcium, there is a single distinct "hot spot" of foods with large calcium values.  For other variables, e.g. Carbohydrate, the panel indicates there are two distinct regions of foods that have high carbohydrate values.  The Self-Organizing Map has arranged these into different groups based on all of the variables together, not on each variable independently of the others.
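
    Each per-variable heatmap is simply one component of the cell summaries laid out on the grid.  A minimal sketch, assuming the cell summaries are stored as a dictionary keyed by (row, column); the function name `component_plane` and the numbers below are invented for illustration and are not taken from the USDA data.

```python
def component_plane(codebook, variables, name, n_rows, n_cols):
    """Return an n_rows x n_cols grid holding one variable's value in each cell."""
    i = variables.index(name)
    return [[codebook[(r, c)][i] for c in range(n_cols)] for r in range(n_rows)]


variables = ["Calcium", "Carbohydrate"]
# Toy 2 x 2 map: one summary vector per cell, in the order given by `variables`.
codebook = {
    (0, 0): [0.9, 0.1], (0, 1): [0.4, 0.2],
    (1, 0): [0.2, 0.8], (1, 1): [0.1, 0.7],
}
calcium = component_plane(codebook, variables, "Calcium", 2, 2)
carbs = component_plane(codebook, variables, "Carbohydrate", 2, 2)
```

    Plotting each returned grid as a heatmap gives one panel per variable, all sharing the same cell positions, so a hot spot in one panel can be compared directly with the same region in another.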

    Workflows

    The preceding investigations started with patterns in the map and examined the corresponding data.

    A complementary approach is to start with the data and use the map to identify similar "nearby" data points.  For example, there may be a small number of rows in the data that catch your interest, and you want to discover any similar data points in the data set.

    • To proceed, you would start by marking the data point or points of interest, possibly from a scatter plot or other diagnostic plot.  
    • With these data points marked, one or more star patterns would appear as marked.  (Spotfire has not marked all of the underlying data points, but the marked star pattern shows that at least one underlying data row is marked.)
    • Moving to the star plot, you would then mark the star pattern (the one that appears to be marked).  This action in fact marks all of the data points within this cell.
    • Finally, looking back at the data, you would observe additional data points highlighted, corresponding to the points in the star.
    • You could expand the range of stars (cells) marked, to expand the data to be examined.
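
    The steps above amount to a lookup from rows to cells and back.  A minimal sketch, assuming each row's cell is known as a (row, column) pair on a rectangular grid; the function name `expand_marking` and the `ring` parameter are hypothetical, and in Spotfire itself this is done interactively through marking rather than in code.

```python
def expand_marking(assignment, marked_rows, ring=0):
    """Return indices of all rows in the marked rows' cells.

    assignment: list giving the (row, col) cell of each data row.
    marked_rows: indices of the rows initially marked.
    ring: how many grid steps around each marked cell to include (0 = cell only).
    """
    cells = {assignment[i] for i in marked_rows}
    return [i for i, (r, c) in enumerate(assignment)
            if any(max(abs(r - mr), abs(c - mc)) <= ring for mr, mc in cells)]


# Toy assignment of six data rows to grid cells, invented for illustration.
assignment = [(0, 0), (0, 0), (0, 1), (2, 2), (1, 1), (0, 0)]
same_cell = expand_marking(assignment, [0])          # rows sharing row 0's cell
wider = expand_marking(assignment, [0], ring=1)      # plus adjacent cells
```

    Here `same_cell` is `[0, 1, 5]` and `wider` is `[0, 1, 2, 4, 5]`: widening the ring pulls in rows from adjacent cells, which the map has placed there because they are similar.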

    Discussion

    The Self-Organizing Map has some similarities to other unsupervised learning methods, for example k-means clustering.  One difference is that k-means presents the user with a fixed number of clusters, and the emphasis is on determining the "best number of clusters" to cover the entire data set.  The Self-Organizing Map is attractive because it simply organizes the data and lets the user choose their own groups.

