  • Missing Data Navigator for Spotfire® - Overview


    This article describes in detail the Missing Data Navigator for Spotfire® application.

    Introduction

    The Missing Data Navigator for Spotfire® is an interactive tool for analyzing and handling missing data. It can be downloaded from our Exchange pages. It guides you through the typical steps of missing data analysis. It is implemented as a Spotfire application, using visualization Mods and Python data functions. You can replace the default dataset with your own data and start the analysis right away.

    The following tasks are covered by the application: 

    • Investigates where and how much data is missing.
    • Generates bite-size reports.
    • Suggests an optimal strategy for cleaning the data by deleting selected rows and columns.
    • Cleans the data by applying the selected strategy.
    • Compares descriptive statistics of the original and cleaned datasets.

     

    Demo

    Here is a short demo of the application:

     

    Highlights of the main features

    This section walks through the pages of the application, with a short description of each element and its purpose. For more details, see the documentation included in the download.

    1. Loading the data and starting the analysis

    [Screenshot: Loading the data and starting the analysis]

    The purpose of this initial page is to define the inputs for the analysis, check the data, and trigger the analysis.

    • Variable selection on the left: Select the columns you would like to include in your analysis.
    • "Recalculate MD plot" button: Press to update the missing data plot.
    • Missing data plot: Shows the holes in the data, with missing values in red. Column names are on the x-axis and row indices are on the y-axis.
    • "Run MD analysis" button: Press to trigger the extensive analyses that populate the subsequent pages with results and reports.
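
As a rough illustration of what a missing data plot encodes, the sketch below builds a boolean "hole" table with pandas. It is not the application's own data function: the helper name `missing_mask` and the demo dataset are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def missing_mask(df: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Return a boolean table that is True wherever a value is missing."""
    selected = df if columns is None else df[list(columns)]
    return selected.isna()


# Small illustrative dataset (hypothetical, not the application's default data).
demo = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "income": [np.nan, np.nan, 72000.0, 41000.0],
    "city": ["Oslo", "Bergen", None, "Oslo"],
})

mask = missing_mask(demo, columns=["age", "income", "city"])

# One (row index, column name) pair per missing cell: the points that a
# plot with columns on the x-axis and row indices on the y-axis would mark.
rows, cols = np.nonzero(mask.to_numpy())
holes = [(mask.index[r], mask.columns[c]) for r, c in zip(rows, cols)]

print(mask.sum())  # number of missing values per selected column
print(holes)
```

Restricting `columns` plays the role of the variable selection: only the chosen columns enter the mask and therefore the plot.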

     

    2. Raw data results

    [Screenshot: Raw data results]

    This page summarizes the missing values in the input data. The automated procedure first removes invariant variables and then all empty columns and rows. The resulting table is called Raw1. The results on the Raw data report page refer to this preprocessed table.

    • Analysis report - insights: A verbal description of the findings from the preprocessing and raw data analysis stage.
    • Raw data characteristics table: Summarizes the overall missing value information for the table. Raw0 is the input data and Raw1 is the preprocessed data described above.
    • Two bar charts in the middle: A graphical representation of the completeness of rows and columns (how many rows or columns have a specific percentage of missing values).
    • Bottom "Column summary" table and "Column/variable MD information" bar chart: These describe the situation within individual columns and help identify the most problematic ones.
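
The preprocessing from Raw0 to Raw1 and a per-column summary can be sketched in pandas as follows. This is a minimal sketch under stated assumptions, not the application's code: the function names `preprocess` and `column_summary` are hypothetical, and "invariant" is assumed here to mean a column with exactly one distinct non-missing value.

```python
import numpy as np
import pandas as pd


def preprocess(raw0: pd.DataFrame) -> pd.DataFrame:
    """Drop invariant columns first, then fully empty columns and rows."""
    # Assumption: invariant = exactly one distinct non-missing value.
    n_unique = raw0.nunique(dropna=True)
    kept = raw0.loc[:, n_unique != 1]
    kept = kept.dropna(axis="columns", how="all")
    return kept.dropna(axis="index", how="all")


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": n_missing,
        "missing_pct": 100.0 * n_missing / len(df),
    })
    return summary.sort_values("missing_count", ascending=False)


# Hypothetical input table standing in for Raw0.
raw0 = pd.DataFrame({
    "const": [1, 1, 1, 1],             # invariant column
    "empty": [np.nan] * 4,             # fully empty column
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 2.5, np.nan],
})

raw1 = preprocess(raw0)
summary = column_summary(raw1)
print(raw1)
print(summary)
```

Note that the order of the steps matters: the last row only becomes empty once the invariant column has been removed.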

     

    3. Rows and columns removal strategies

    [Screenshot: Rows and columns removal strategies]

    The procedure tries various missing data removal scenarios (involving the removal of rows and/or columns). For each step in each removal-logic sequence, the missing data summaries are calculated. These are summarized on this page. 

    • "Scenarios summary" table: Each row of this table summarizes one scenario, in which some rows and columns are removed from the Raw1 data. It is the same summary as in the raw data summary, but here it compares different row and/or column removal strategies. After you mark some scenarios, the detailed visualization appears below.
    • Analysis report on the right: A verbal description of the findings for the row/column removal steps.
    • "Quality of strategies (MD count)" graph: This graph places the scenarios in a metric space that evaluates their quality. The best outcome is a low loss of valid points (non-missing values removed by the scenario) combined with a low MD count. The closer a strategy is to the bottom left corner, the better.
    • Heatmap "Variables used by Scenarios": Here, you can see which scenarios removed which columns. For example, you can see whether or not different approaches lead to the same conclusion.
    • Column/Variable MD info (marked strategies) visual: Gives information about the situation within columns (the count of non-missing values is green, the count of missing values is red and orange). By marking several scenarios in the upper table, you can compare different row/column removal scenarios and see how they differ from one another.
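
The two quality metrics used on this page, the remaining MD count and the loss of valid points, can be computed for a family of scenarios as sketched below. The scenario family shown (dropping columns above a missing-fraction threshold) is an assumption chosen for illustration, and the names `evaluate` and `column_threshold_scenarios` are hypothetical. The application's actual removal logic is not described in this article and may differ.

```python
import numpy as np
import pandas as pd


def evaluate(raw1: pd.DataFrame, cleaned: pd.DataFrame) -> dict:
    """Summarize one scenario relative to the preprocessed table."""
    valid_before = int(raw1.notna().sum().sum())
    valid_after = int(cleaned.notna().sum().sum())
    return {
        "rows": len(cleaned),
        "columns": cleaned.shape[1],
        "md_count": int(cleaned.isna().sum().sum()),   # missing cells left
        "valid_lost": valid_before - valid_after,       # valid cells removed
    }


def column_threshold_scenarios(raw1: pd.DataFrame, thresholds) -> pd.DataFrame:
    """One scenario per threshold: drop columns whose missing fraction exceeds it."""
    missing_fraction = raw1.isna().mean()
    records = []
    for t in thresholds:
        to_drop = missing_fraction[missing_fraction > t].index
        cleaned = raw1.drop(columns=to_drop)
        records.append({"max_missing_fraction": t, **evaluate(raw1, cleaned)})
    return pd.DataFrame(records)


# Hypothetical stand-in for Raw1.
raw1 = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0],
    "c": [np.nan, np.nan, np.nan, 4.0, np.nan],
})

scenarios = column_threshold_scenarios(raw1, thresholds=[1.0, 0.5, 0.2])
print(scenarios)
```

Row removal can be evaluated with the same `evaluate` helper by passing a table with rows dropped, so mixed row/column scenarios end up in the same summary table.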

     

    4. Best removal strategies

    [Screenshot: Best removal strategies]

    This page highlights the most interesting row and/or column removal scenarios out of all those tried (a small subset of the scenarios on the previous page). It always compares the preprocessed input data (Raw1), the optimal scenarios based on two different metrics, and several special cases in which a scenario is optimal under some constraint.

    • "Important strategies" table: Missing data information for the whole file for the different optimal scenarios. The columns reason, detail, and important relate to the optimality flag assigned to these scenarios.
    • "Quality of strategies" visual: This graph shows how far each strategy reduces the MD count while keeping the loss of valid points low. The closer to the bottom left corner, the better (closeness to the bottom left corner is equivalent to the first optimality metric).
    • "Comparison of KPIs" spider chart: This graph reacts to the marking of strategies/scenarios and lets you compare the most important KPIs in one visual. A spider chart was used because it makes differences across several dimensions easy to see.
    • Column/variable MD information: This is the same graph you have already seen on other pages. It also reacts to marking.
    • Pick a scenario: For further investigation and the creation of cleaner data, you need to pick one row/column removal scenario. Select it from the dropdown list here. This selection determines what is displayed on the next page.
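
One way to make "closeness to the bottom left corner" concrete is to rescale both axes to the range 0 to 1 and rank scenarios by their Euclidean distance to the origin. The sketch below does that. The min-max scaling and the distance measure are assumptions for illustration, since the article does not state how the application computes its first optimality metric. The scenario names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scenario summaries (same two metrics as on the quality plot).
scenarios = pd.DataFrame(
    {"md_count": [6, 2, 0], "valid_lost": [0, 1, 4]},
    index=pd.Index(["Raw1", "S1", "S2"], name="scenario"),
)


def closeness_ranking(table: pd.DataFrame,
                      x: str = "valid_lost",
                      y: str = "md_count") -> pd.Series:
    """Rank scenarios by distance to the bottom left corner after min-max scaling."""
    values = table[[x, y]].astype(float)
    span = values.max() - values.min()
    # Avoid division by zero when all scenarios share the same value.
    scaled = (values - values.min()) / span.replace(0, 1.0)
    distance = np.sqrt((scaled ** 2).sum(axis=1))
    return distance.sort_values()


ranking = closeness_ranking(scenarios)
print(ranking)
best = ranking.index[0]
```

In this toy example the untouched table keeps every valid point but all of its missing values, the most aggressive scenario removes all missing values at the cost of many valid points, and the intermediate scenario ends up closest to the corner.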

     

    5. Review the selected scenario

    [Screenshot: Review the selected scenario]

    This page shows the missing data plot and other plots you already know from previous pages. The difference is that this page is dedicated to the single selected scenario.

    • Variable selection lists on the right: Once you are happy with the selected scenario, the last step is to generate clean data based on it. On the right of the page, there are several variable selectors. Typically you will want to leave the variable selection as shown in the screenshot (None selected for the two bottom lists), but you can tweak the final choice of columns here if needed.
    • Create a clean table based on the selected scenario: Pressing this button triggers a data function that removes the columns and rows for the selected scenario and creates a final, cleaner table in your Spotfire application.
    • Last button "output file descriptives" in the bottom right corner: This button leads to the next and final page, which compares descriptive statistics of the input data and the final selected scenario.

     

    6. Compare summary stats of input and clean data

    [Screenshot: Compare summary stats of input and clean data]

    Once you have clean data, it is good practice to check that removing rows and/or columns did not introduce bias. For that purpose, the page compares the distributions of continuous variables (upper left visual and column selector), the distributions of categorical variables (bottom graph and variable selector), and the absolute difference in correlations between continuous variables (upper right graph).
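
The kind of check described here can be approximated with plain pandas, as in the sketch below. It assumes that "absolute difference correlation" means the element-wise absolute difference between the two correlation matrices. The helper names and the demo data are hypothetical, and the cleaned table is produced by a simple `dropna()` only to have something to compare.

```python
import numpy as np
import pandas as pd


def abs_correlation_difference(original: pd.DataFrame,
                               cleaned: pd.DataFrame) -> pd.DataFrame:
    """Element-wise |corr(original) - corr(cleaned)| over shared numeric columns."""
    numeric = cleaned.select_dtypes(include="number").columns
    return (original[numeric].corr() - cleaned[numeric].corr()).abs()


def category_share_difference(original: pd.DataFrame,
                              cleaned: pd.DataFrame,
                              column: str) -> pd.Series:
    """Change in the relative frequency of each category after cleaning."""
    before = original[column].value_counts(normalize=True)
    after = cleaned[column].value_counts(normalize=True)
    return after.sub(before, fill_value=0.0)


# Hypothetical input data and a cleaned version of it.
original = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.1, 5.9, np.nan, 10.2, 11.8],
    "group": ["a", "a", "b", "b", "b", None],
})
cleaned = original.dropna()

corr_diff = abs_correlation_difference(original, cleaned)
share_diff = category_share_difference(original, cleaned, "group")

print(corr_diff)
print(share_diff)
# Summary statistics side by side for the continuous variables.
print(original[["x", "y"]].describe().join(
    cleaned[["x", "y"]].describe(), lsuffix="_input", rsuffix="_clean"))
```

Large entries in `corr_diff` or large shifts in `share_diff` would suggest that the removed rows were not missing at random with respect to those variables.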

     

    The final clean data for potential further analyses can be found in the data canvas as the "clean table" data table.

     

    Reusing the content

    Several features were extracted and made available as Python functions (in our spotfire-dsml library, starting with version 1.1.3) or as ready-to-use Spotfire data functions (part of the examples for the spotfire-dsml library). You can use these data functions to easily replicate parts of the Missing Data Navigator application. Specifically, the extracted functions cover the missing data summary (section 2 of this article), missing data removal (used many times under the hood to create the row/column removal scenarios), and the comparison of distributions of two datasets (section 6 of this article).

    The missing data module of the spotfire-dsml library also offers additional features not included in this release of the Missing Data Navigator (for example, a wide range of missing data imputation features). We are planning to incorporate more features into both the Missing Data Navigator and spotfire-dsml in the future.

