ChangeLog:
Version 1.0.0, released on 24th November 2023
Version 2.4.0, released April 2024:
- Significantly enhanced performance
- Symmetrical Log Axis implemented
- Color by axis implemented
- Individual y axis and zoom slider configuration now independent
- Display of 95% confidence interval of the mean
- Improved formatting of statistics table
- Better and more flexible formatting of numbers
- Various minor bug fixes and configuration enhancements
The Violin Plot Mod for Spotfire® is a combined Violin and Box Plot visualization. It shows the distribution of data and allows easy comparison of distributions of sets of data.
Download this Mod from the Exchange
Try this Mod in Cloud Spotfire
Why Should I Use the Violin Plot Mod for Spotfire®?
The Violin Plot Mod is a combination of a box-plot and a density plot - it combines the best features of both plots and builds on the Spotfire native box-plot. Box plots are sometimes referred to as box-and-whisker plots. The Violin Plot Mod can be used to:
- Easily compare the key statistical measures for different populations (groups) of data - max, min, median, quartiles, inter-quartile range, etc.
- Visualize the distributions of the populations of data alongside the statistical measures - revealing underlying subtleties in the data, whilst visualizing the overall shape of the data
... and more!
The Violin Plot Mod is key for many industries and use cases. For example:
- Comparing the yield of manufacturing processes, machines, or batches - easily visualize them side-by-side and determine which process, machine or batch significantly differs from others
- Comparing patient cohorts - for example, visualizing adverse event occurrence between different treatments, or different sets of patients, stratified by some measure
- Identifying trends in data - the Mod is great at visualizing data over time, and giving strong insights into trends, outliers and sudden variations in the data
Data requirement
Every mod handles missing, corrupted and/or inconsistent data in different ways. It is advised to always review how the data is visualized.
The Violin Plot Mod allows visualization of any datasets in Spotfire and is key to understanding distributions of data.
Explanation of Violin/Box Plot
In brief, the following image illustrates the main features of a violin/box plot. Although the calculations that are used in a violin/box plot may appear complicated for the uninitiated, it's not necessary to fully understand them to gain the best value from such a plot. Examples that are easy to interpret will be shown later in this article.
It is beyond the scope of this article to detail each of the statistical methods that are used, but briefly:
- Density plot - In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. (Wikipedia - Kernel Density Estimation).
- Min - the minimum value of the data set/subset
- Max - the maximum value of the data set/subset
- Quartiles are values that divide a (part of a) data table into four groups containing an approximately equal number of observations. The total of 100% is split into four equal parts: 25%, 50%, 75% and 100%. The first quartile (or lower quartile), Q1, is defined as the value that has an f-value equal to 0.25. This is the same thing as the twenty-fifth percentile. The third quartile (or upper quartile), Q3, has an f-value equal to 0.75. The interquartile range, IQR, is defined as Q3-Q1.
- Median - the middle value in a set of data
- Outliers - defined as any data point with a value greater than the Upper Adjacent Value (UAV) or less than the Lower Adjacent Value (LAV).
(The upper adjacent value (UAV) is the largest observation that is less than or equal to the upper inner fence (UIF), which is the third quartile plus 1.5*IQR. The lower adjacent value (LAV) is the smallest observation that is greater than or equal to the lower inner fence (LIF), which is the first quartile minus 1.5*IQR.)
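The definitions above can be sketched in a few lines of Python. This is only an illustration, not the Mod's own code: the function names are hypothetical, and the quartiles use linear interpolation, which is one of several common conventions and may differ slightly from the method Spotfire uses.

```python
from math import floor


def percentile(sorted_values, fraction):
    """Linearly interpolated percentile (assumed convention, see lead-in)."""
    position = (len(sorted_values) - 1) * fraction
    lower = floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def box_plot_stats(values):
    """Return the box plot measures described in the text for one group."""
    data = sorted(values)
    q1 = percentile(data, 0.25)
    q3 = percentile(data, 0.75)
    iqr = q3 - q1
    lif = q1 - 1.5 * iqr  # lower inner fence
    uif = q3 + 1.5 * iqr  # upper inner fence
    lav = min(v for v in data if v >= lif)  # lower adjacent value
    uav = max(v for v in data if v <= uif)  # upper adjacent value
    return {
        "min": data[0],
        "max": data[-1],
        "q1": q1,
        "median": percentile(data, 0.5),
        "q3": q3,
        "iqr": iqr,
        "lav": lav,
        "uav": uav,
        "outliers": [v for v in data if v < lav or v > uav],
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats["lav"], stats["uav"], stats["outliers"])
```

In this example the value 30 lies above the upper inner fence, so it is reported as an outlier and the upper whisker stops at the upper adjacent value.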
For more information on the statistical calculations/methods used, please visit the help page on aggregations used in Spotfire.
Setting up the chart
This article will be using a dataset showing avocado sales.
By default, Spotfire will apply an aggregation method to the Y (vertical) axis (e.g. Sum/Avg). The Mod calculates many statistical values on the data, so requires unaggregated data. The Mod will warn that the y-axis expression is aggregated, and thus will most likely lead to incorrect calculations:
The default, and strongly recommended action is to click "Remove aggregation" to remove the aggregation from the y-axis. It is possible to "ignore" the warning, but then the validity of the statistical calculations must be carefully evaluated.
In a similar vein, the Mod requires a value for the "Count" axis. Spotfire will, in most cases, automatically set the Count axis expression to be "count()". There should be no need to change this. However, if an expression is not provided, the following error will be shown:
If the data source you are using does not support "count()" as an expression, then you will need to find a way of supplying some form of row count to the Mod. In brief, Spotfire will supply distinct rows to the Mod. The count() expression supplies the underlying row count for each distinct row. This row count value is then used by the Mod to calculate the statistical metrics used by it.
If you do happen to choose a not-recommended expression for the Count axis, you will see a warning thus:
You can choose to keep the current setting, or set the count axis to its recommended expression. Note the recommendation to use "1" as the expression - this could be useful if you are using external data in your analysis and the data source you're using doesn't support the Count() function. In that case, you must be sure that each distinct row or result from that data source represents a single unit of information that you would like to evaluate statistically.
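To see why the row count matters, consider distinct rows supplied as (y value, count) pairs. The sketch below is an illustration under that assumption, not the Mod's implementation, and the function names are hypothetical. It computes a mean and median that respect the counts.

```python
def weighted_mean(rows):
    """Mean of the y values, with each distinct row weighted by its count."""
    total = sum(count for _, count in rows)
    return sum(value * count for value, count in rows) / total


def value_at(sorted_rows, index):
    """Value at a 0-based position in the (virtual) expanded, sorted data."""
    seen = 0
    for value, count in sorted_rows:
        seen += count
        if index < seen:
            return value
    raise IndexError(index)


def weighted_median(rows):
    """Median of the underlying rows without expanding them in memory."""
    ordered = sorted(rows)
    total = sum(count for _, count in ordered)
    low = value_at(ordered, (total - 1) // 2)
    high = value_at(ordered, total // 2)
    return (low + high) / 2


# Three distinct rows that stand for six underlying rows.
rows = [(1.0, 3), (2.0, 1), (10.0, 2)]
print(weighted_mean(rows), weighted_median(rows))
```

With these rows the count-aware median is 1.5, whereas treating each distinct row as a single observation (a count of 1) would give 2.0. This is why "1" is only appropriate when each distinct row really represents one unit of information.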
Once the Y-axis and count() expressions have been resolved (and the color axis is assigned - see the important note below), you can then continue with configuring the Mod. The horizontal (X) axis is designed to use categorical data. This allows you to compare distributions of data alongside each other. It is customary to subset the data by some kind of group, or by date. For example, here is how you would subset the data by a grouping present in the data:
Here, the data has been split into two separate cohorts - conventional vs organic. The distributions of the data are shown side-by-side, along with the box plot. Conversely, you can view data over time by specifying a date column on the X-axis:
IMPORTANT Note: The color axis is automatically assigned by Spotfire to be different from the X-axis. Upon initial configuration, or whenever the color axis is set to something that matches neither the X-axis nor the trellis axis, you will see a warning that only the outliers will be colored according to the color axis.
Further discussion on coloring - the violin and box parts of the visualization can be configured either to use the color axis settings, or to be set to fixed colors. Please see the configuration options to modify the coloring of the violin(s) or box(es).
Marking
Spotfire marking (selecting subsets of data) can be performed within the Violin plot Mod. There are two main ways to mark data:
- Clicking an element to select it: a box plot segment, outlier marker, violin, or comparison circle
- Clicking and dragging (rectangular marking). The behavior of click-drag marking depends on which elements are bounded by the marking rectangle. The most granular way of marking is to mark sections of a violin
Violin marking and no data:
If you mark a section of a violin, only the parts of the violin that contain data will be shown as highlighted. Areas with no data will be shown as gray. Recall that the violin employs a smoothing function, so some parts of the violin may not contain any data. This is an example of a partially marked violin:
- The area at the top (light blue) shows data that has not been marked
- The gray region shows that there is no data within that range
- The dark blue region (bottom) shows data that has been marked
Configuring the Violin Plot Mod
There are many options that can be used to configure the Mod. The main configuration is accessed by way of a dropdown menu in the top right-hand corner of the visualization:
Appearance - Zoom Sliders
This option is used to enable/disable zoom sliders. By default, a single zoom slider is shown for all trellis panels. If you want to have separate zoom sliders for each trellis panel, please enable the "Individual Scale per Panel" under the Y-axis Trellis settings.
Appearance - Violin
The following options can be used to configure the violin feature of the mod:
- Draw Violin Under Box - if this option is selected, the violin(s) will be drawn underneath the box(es)
- Draw Violin Over Box - the violin(s) will be drawn over the box(es)
- Bandwidth - adjust the bandwidth of the Kernel Density Estimation function - this affects the sensitivity of the smoothing function used to draw the violin plot. A smaller bandwidth will reveal more detail in the violin by dividing the data into smaller "bins". A larger bandwidth will smooth the violin plots further
- Coarse, Medium, Smooth (default) - these options alter the resolution of the violin
- Limit Violin to Data Extents - by default, the violins are drawn with smooth "tails". In certain cases, this might be misleading, as the tails might indicate that the data extends beyond the min/max. If preferred, you can constrain the violins to be within the min/max of the data
- Use Fixed Color - if this option is enabled, you can set the violins to a fixed color of your choice; otherwise, they will be colored according to the settings in the color axis of the visualization. If the color axis settings do not match the x-axis settings, or are not used to trellis by, the violins MUST be of a fixed color. In this case, you will only be permitted to set the color (not to enable/disable this setting)
Bandwidth and the violin resolution options listed above should be used to their best effect with your data. These options will produce different results depending on the shape, granularity and amount of data that you are visualizing. You should experiment with them to find the settings that show your data accurately, with sufficient detail to support accurate analysis, but not so much that the detail obscures the overall picture of the data.
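The effect of the bandwidth and of limiting the violin to the data extents can be illustrated with a minimal kernel density estimate. The sketch assumes a Gaussian kernel, a bandwidth expressed in data units, and tails that extend three bandwidths beyond the data. All of these are assumptions for illustration, and the function names are hypothetical; the Mod's actual kernel and scaling are not described here.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples (Gaussian kernel)."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def violin_profile(samples, bandwidth, points=50, limit_to_data=False):
    """Sample the density at evenly spaced y values to outline one violin."""
    lo, hi = min(samples), max(samples)
    if not limit_to_data:
        # Let the smooth tails run past the min/max of the data.
        lo -= 3 * bandwidth
        hi += 3 * bandwidth
    density = gaussian_kde(samples, bandwidth)
    step = (hi - lo) / (points - 1)
    return [(lo + i * step, density(lo + i * step)) for i in range(points)]


samples = [1.0, 2.0, 2.5, 6.0]
detailed = gaussian_kde(samples, 0.2)  # small bandwidth: gap near y = 4 is visible
smoothed = gaussian_kde(samples, 2.0)  # large bandwidth: gap is smoothed over
print(detailed(4.0) < smoothed(4.0))
```

A small bandwidth leaves the empty region between 2.5 and 6.0 almost flat, while a large bandwidth fills it in. This is the trade-off between detail and overall shape described above.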
Appearance - Box
- Show Box Plot - show/hide box plots as part of the visualization
- Box Size - the width of the box plots
- Marker Size - the size of markers used when showing outliers as part of the box plots
- Show 95% Confidence Interval of the Mean - enable this option to visualize a small box alongside the boxes to indicate the extents of the 95% Confidence Interval of the Mean
- Use Fixed Color - use this option to fix the color of the boxes, overriding the color axis settings. If you do not fix the colors, they will be colored according to the settings in the color axis of the visualization. If the color axis settings do not match the x-axis settings, or are not used to trellis by, the boxes MUST be of a fixed color. In this case, you will only be permitted to set the color (not to enable/disable this setting). IMPORTANT: also in this case, the outliers will be colored according to the color-axis settings. This could be useful in order to evaluate whether another categorical variable leads to large numbers of outliers
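As a rough guide to what the 95% Confidence Interval of the Mean represents, the sketch below computes it using a normal approximation (mean plus or minus about 1.96 standard errors). This is an assumption made for illustration: the Mod may use a different method, such as a t-distribution, which gives wider intervals for small samples. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Confidence interval of the mean under a normal approximation."""
    centre = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


low, high = mean_confidence_interval([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
print(round(low, 2), round(high, 2))
```

The interval narrows as the number of observations grows, so a short confidence box alongside a tall box plot indicates a well-determined mean within a widely spread distribution.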
Comparison Circles
- Show Comparison Circles - show/hide comparison circles
- Alpha level - adjusts the sensitivity of the comparison circles - a smaller Alpha level leads to larger circles being drawn, which makes the circles more likely to intersect, so larger differences between the distributions are still regarded as similar.
Comparison Circles are used to visualize aspects of differences between different distributions. A circle is drawn for each value of the x-axis of the visualization, i.e. one per box/violin. For a more detailed explanation on the comparison circles algorithm, please read Spotfire Comparison Circles Algorithm.
The highlighting of comparison circles behaves slightly differently to the standard Spotfire comparison circles:
Here, the top circle has been highlighted by hovering the mouse cursor over it. The gray circle shows an example of a circle that is similar to the highlighted one, and the red circles are not similar. Markers are shown above the statistics table at the bottom of the visualization:
- Black circle - this indicates the x-axis value/box/violin that corresponds to the highlighted circle
- Gray vertical dash - this indicates that the distribution is similar to the highlighted circle
- Red vertical dashes - these indicate that the distributions are significantly different from the highlighted circle
X-axis
- All values - show all values in the series of possible x-axis values, even when there is no data for a particular x-axis value - it is recommended to use this option when showing dates on the x-axis, so you can see when data is missing for certain date values
- Non-empty values - only show x-axis values where the data is not empty
Y-axis
- Linear scale - show a linear scale on the y-axis
- Symmetrical Log (experimental) - this functionality is experimental. If you select it, a warning will be shown, which you can dismiss. The statistical measures for the box plot and the violin are calculated on the raw data, and displayed on a symmetrical log y-axis. Log10(y) is undefined for y <= 0, so this axis is reflected around y = 0, in order that negative y values can be displayed. Where y = 0, a slope value of 1 is set (imagine a line chart where the slope is 1), in order that the y-axis can be continuous.
- Show gridlines - show gridlines for each tick on the y-axis
- Show P-value - display the P-value for the distributions using a one-way ANOVA calculation. Internally, the calculation is using 0.05 as the alpha. If the P-value is less than or equal to 0.05, the differences between some of the means (of the y-values) are statistically significant; if the P-value is greater than 0.05, the differences between the means are not statistically significant
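The idea behind the symmetrical log axis can be sketched with a commonly used transform, sign(y) * log(1 + |y| / c). This is an assumed form chosen for illustration, not necessarily the one the Mod uses: the function names and the constant c are hypothetical, and the natural logarithm is used here because it gives a slope of exactly 1 at y = 0 when c = 1.

```python
from math import copysign, expm1, log1p


def symlog(y, constant=1.0):
    """Map a raw y value to a position on a symmetrical log axis."""
    return copysign(log1p(abs(y) / constant), y)


def symlog_inverse(position, constant=1.0):
    """Map an axis position back to the raw y value."""
    return copysign(expm1(abs(position)) * constant, position)


for y in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    print(y, round(symlog(y), 3))
```

Negative values mirror positive ones, zero maps to zero, and values close to zero are spaced almost linearly, so the axis stays continuous across y = 0.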
Y-axis Formatting
- Exponent - show the y-axis values, summary table, tooltip values, etc. as exponents - e.g. 1.0e+0
- Floating point - show values as floating point - e.g. 1.53
- Short Number Format - use a short number format to represent values - this will use SI prefixes such as K, M, etc.
- Currency - use currency formatting (you will be able to enter a currency symbol), and SI prefixes will be used, as for the short number format, except for 1.0e+9 (Billion), which will use the B prefix instead of G
- Decimal Places/Significant Figures - this allows you to set the decimal places or significant figures used for the above number formats (it changes to one or the other based on the formatting option chosen above)
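The short number and currency formats can be approximated as follows. This is a sketch based only on the description above: the function name and thresholds are hypothetical, and only the K, M and G prefixes are handled, since the full list of prefixes the Mod supports is not stated.

```python
def format_short(value, decimals=1, currency_symbol=None):
    """Format a number with K/M/G prefixes; currency uses B instead of G."""
    prefixes = [(1e9, "G"), (1e6, "M"), (1e3, "K")]
    scaled, suffix = value, ""
    for threshold, prefix in prefixes:
        if abs(value) >= threshold:
            scaled = value / threshold
            # Currency formatting shows billions as B rather than G.
            suffix = "B" if currency_symbol and prefix == "G" else prefix
            break
    text = f"{scaled:.{decimals}f}{suffix}"
    return f"{currency_symbol}{text}" if currency_symbol else text


print(format_short(1530))
print(format_short(2_500_000, decimals=2))
print(format_short(3e9, currency_symbol="$"))
```

The decimals argument plays the role of the Decimal Places setting. A significant figures variant would round the scaled value before formatting instead.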
Statistics Measures
Each of the statistics measures can be enabled/disabled from this section of the configuration:
- Statistics Table - include this measure in the statistics table
- Reference Line - add reference lines to each box/violin to show this measure
- Trend Line - show a trend line for this measure
Reference lines and trend lines can be customized - it's possible to choose a line style and color for each of them
Trellising
A key piece of functionality is trellising. This is useful if you wish to further split the data into various cohorts. A key use case could be comparing the batches of semiconductor wafers produced by different machines. The X-axis could be used to split the data by batch number, and the trellis axis used to split by machine, or process. The Mod trellises the visualization into multiple panels.
Continuing the example of avocado sales, it could be useful to trellis by geography:
The trellis configuration is at the very bottom of the menu.
- Max Number of Columns - the maximum number of columns to show at once
- Max Number of Rows - the maximum number of rows to show at once