Handling Missing Data with the DSML Toolkit for Python


    This article discusses the functions available to handle missing data using the DSML toolkit ("spotfire-dsml" Python library).

    Introduction

Across industries, handling missing data is a crucial step in any project. Without properly handling missing data, models can become biased and results can be inaccurate. The DSML toolkit aims to expedite and simplify the process of handling missing data, and currently contains functions to explore, remove, and impute missing values, as well as to run comparison analyses before and after missing data is handled. In this article, we dive deeper into these functions, with code examples demonstrating how to call and use each Python function, along with additional resources and example applications in Spotfire.

    Prerequisites

Before using any of these functions, it’s important to have a general understanding of your data. In some columns, a specific value may actually stand in for a missing value, so first identify whether any values other than empty cells represent missing data. Common examples include “missing”, “N/A”, “?”, and “Unknown”.
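If you are not sure which placeholder values appear in your data, a quick scan of the text columns can help. The snippet below is a minimal pandas sketch, not part of the spotfire-dsml API; the candidate token list is an assumption to adjust for your own data, and any values you find can then be passed to the DSML functions in this article through their missing_data_values argument.

import pandas as pd

# candidate tokens that often stand in for missing values (adjust for your data)
CANDIDATE_TOKENS = {"missing", "n/a", "?", "unknown", ""}

def find_missing_value_tokens(df: pd.DataFrame) -> dict:
    # for each text column, report which candidate tokens actually occur in it
    found = {}
    for col in df.select_dtypes(include=["object"]).columns:
        values = set(df[col].dropna().astype(str).str.strip().str.lower())
        hits = sorted(values & CANDIDATE_TOKENS)
        if hits:
            found[col] = hits
    return found

# example usage: print(find_missing_value_tokens(data))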

Next, please be aware that there may be different types of missingness in your data, and not all of them should be treated the same way. It is worth identifying whether your data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). These terms are defined as follows:

• MCAR: the missingness is independent of both the observed and the unobserved data.
• MAR: the missingness is related to the observed data, but is independent of the unobserved data.
• MNAR: the missingness is related to the unobserved data itself. When this occurs, it is generally more difficult to apply removal or imputation techniques without biasing the data.

    Additionally, certain imputation methods are not valid on non-numeric data. The following methods are only valid on numeric data:

    • Mean (part of simple imputation)
    • Median (part of simple imputation)
    • Minimum (part of simple imputation)
    • Maximum (part of simple imputation)
    • K-Nearest Neighbors (KNN)
    • Multiple Imputation by Chained Equations (MICE)*
      • *MICE requires numeric data, but the option to apply ordinal encoding to non-numeric data is available through the DSML toolkit. For more information, please refer to the section on MICE below.

    The constant and most frequent methods (both part of simple imputation) are suitable for all data types.
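A quick way to check which columns the numeric-only methods can be applied to is to split your columns by data type. The snippet below is a minimal pandas sketch (data is assumed to be a pandas DataFrame, as in the code examples later in this article), not part of the spotfire-dsml API:

# split columns by dtype: numeric-only methods (mean, median, min, max, KNN, MICE)
# apply to the first group; the second needs most_frequent, constant, or encoding
numeric_cols = data.select_dtypes(include="number").columns.tolist()
non_numeric_cols = [c for c in data.columns if c not in numeric_cols]

print("Numeric columns:", numeric_cols)
print("Non-numeric columns:", non_numeric_cols)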

    Exploration

    Summarize Missing Values

The DSML toolkit provides a summarization function that generates tables summarizing missing values by column, by row, and across the entire dataset, as well as a text report explaining the findings. When this function is used to create a data function in Spotfire, the output tables can be displayed like this:

[Screenshot: missing data summary tables displayed in Spotfire]

    Code example

import spotfire_dsml.missing_data.evaluation as mde

# five outputs from this function call
file_summary, row_summary, column_summary, report, preprocessed_data = mde.summarize(data, missing_data_values=None, preprocess=True)

    Preparation

    Remove Missing Values

Once missing data has been identified and better understood, the process of handling the missing values can begin. One approach is to remove rows and/or columns containing missing values; with spotfire-dsml, you can set thresholds and criteria to do this in a single function call. Typically, you would use this feature to automatically remove sparse rows and/or columns. For example, you may want to remove only rows where all data is missing, but additionally remove columns where more than 60% of the data is missing.

    Code example

import spotfire_dsml.missing_data.preprocessing as mdp

# following the example mentioned above
data_after_MDremoval = mdp.removal(data, row_threshold="all", col_threshold=0.6, priority="both", missing_data_values=None)

    Imputation

Another approach to handling missing data is imputation, for which numerous methods are available. The imputation methods available in the DSML toolkit are outlined below.

    Simple Imputation

    Simple imputation refers to replacing missing values with a single value, and there are a variety of ways to define what that single value will be. With spotfire-dsml, you can replace missing values with values specific to that column:

    • mean
    • median
    • min
    • max
    • most_frequent

    Or, you can replace missing values with a constant, fixed value of your choice (method = constant).

    Additionally, the DSML toolkit allows you to define multiple simple imputation methods across an entire dataset, with a user-defined table. With a table of two columns (“Column” and “Method”), you could have some columns be imputed with the mean values, while other columns are imputed with the most frequent values, and some remaining columns are imputed with a constant value. This is especially useful if you have numeric and non-numeric column types. The methods “mean”, “median”, “min”, and “max” are only suitable for numeric columns, while the methods “most_frequent” and “constant” are suitable for all column types.

    Check out the code sample below for example usage of this data function.

    Code examples

import pandas as pd
import spotfire_dsml.missing_data.preprocessing as mdp

# single method = mean
output_df = mdp.simple_imputation(data, methods="mean", missing_data_values=None, fixed_value=None)

# single method = constant, fixed_value required
output_df = mdp.simple_imputation(data, methods="constant", missing_data_values=None, fixed_value="Medium")

# example multiple methods table definition
methods_table = pd.DataFrame({
    "Column": ["feature_1", "feature_2", "feature_3", "feature_4"],
    "Method": ["mean", "median", "most_frequent", "constant"]})

# multiple methods, passing the table defined above
output_df = mdp.simple_imputation(data, methods=methods_table, missing_data_values=None, fixed_value="Medium")

    Random sampling

Random sampling imputation replaces a missing value with a randomly selected valid value from the same column. In this case, sampling is done with replacement. This method is valid on all data types.

    Code example

import spotfire_dsml.missing_data.preprocessing as mdp

output_df = mdp.rand_samp_imputation(data, missing_data_values=None)

    KNN

KNN imputation, or K-Nearest Neighbors imputation, uses the KNN algorithm to impute missing values with the average of a number of nearest neighbors. The number of neighbors can optionally be defined by the user, with the default being 5. Note that this method is only valid on numeric data; encoding for non-numeric data is not offered in this function, but can be done beforehand if KNN imputation is the desired method for those columns (a sketch of such pre-encoding follows the code example below).

    Code example

import spotfire_dsml.missing_data.preprocessing as mdp

output_df = mdp.knn_imputation(data, n_neighbors=7, missing_data_values=None)
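If some of your columns are non-numeric and you still want to impute them with KNN, you would need to encode them yourself first. Below is a minimal, hypothetical sketch using pandas category codes; it is not part of the spotfire-dsml API, and whether integer codes (and averages of them) are meaningful for your categories is something to judge before using this approach.

import numpy as np
import spotfire_dsml.missing_data.preprocessing as mdp

# hypothetical pre-encoding step: map each non-numeric column to integer codes,
# keeping missing entries as NaN so that knn_imputation can still fill them in
encoded = data.copy()
category_maps = {}
for col in encoded.select_dtypes(exclude="number").columns:
    as_cat = encoded[col].astype("category")
    category_maps[col] = dict(enumerate(as_cat.cat.categories))
    encoded[col] = as_cat.cat.codes.replace(-1, np.nan)  # cat.codes uses -1 for NaN

output_df = mdp.knn_imputation(encoded, n_neighbors=5, missing_data_values=None)

# if needed, imputed codes can be rounded and mapped back to the original labels:
# output_df[col] = output_df[col].round().map(category_maps[col])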

    MICE

MICE, or Multiple Imputation by Chained Equations, is an imputation method that iteratively predicts what a missing value should be based on the column’s relationships with the other columns in the dataset. This method can be computationally expensive, especially for large datasets. One way to reduce the amount of computation is to lower the number of nearest features used; by default, all features are used (setting n_nearest_features = None uses all features). Additionally, the different estimators will time out if generating the output takes more than 10 seconds. First, RandomForestRegressor is used; it is a robust estimator, but can run extremely slowly on large datasets. After 10 seconds, RandomForestRegressor is aborted in favor of BayesianRidge. If BayesianRidge also times out after 10 seconds, LinearRegression is used. LinearRegression runs much more quickly than the other estimators, but relies on the assumption that features have a linear relationship with each other. Please read more about the different estimators and MICE before proceeding.

    MICE is only valid on numeric data, but the DSML toolkit implementation contains encoding capabilities. If encode = True, ordinal encoding will be applied to all non-numeric columns prior to MICE, then they will be reverse encoded so the output table contains the original, non-numeric values. Please ensure that this encoding method is appropriate for your non-numeric data before proceeding.

    Code example

import spotfire_dsml.missing_data.preprocessing as mdp

# all features used, non-numeric variables get encoded
output_df = mdp.mice_imputation(data, missing_data_values=None, n_nearest_features=None, encode=True)
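If computation time is a concern on a wide dataset, you can lower the number of nearest features used per column; the value of 5 below is purely illustrative.

# illustrative: limit each column's model to 5 nearest features to reduce computation
output_df = mdp.mice_imputation(data, missing_data_values=None, n_nearest_features=5, encode=True)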

    Multiple methods

Just as multiple simple imputation methods can be applied to a dataset in a single function call, the same can be done with all of the imputation methods discussed above. With a table of two columns (“Column” and “Method”), you could impute some columns with simple imputation methods, others with MICE, and the remaining columns with random sampling imputation. Again, this is especially useful if you have a mix of numeric and non-numeric column types.

    Code example

import pandas as pd

# example multiple methods table definition
methods_table = pd.DataFrame({
    "Column": ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"],
    "Method": ["mean", "knn", "mice", "rand_samp", "median"]})

    Evaluation

    Compare Summary Statistics

After missing data has been handled via removal and/or imputation, it’s important to do some additional analysis to ensure that the underlying distributions of the data have not been altered too much. The DSML toolkit contains a comparison function to analyze the differences between an original dataset and a cleaned version of that dataset. The output consists of three tables: one with summary statistics for continuous variables (min, mean, max, percentiles, etc.), one with summary statistics for categorical variables (category frequencies per column), and one displaying the difference in column correlations for continuous variables. With this information, handling missing values can also be done iteratively, as you can experiment with which methods produce the best results for a given dataset and use case.

    If you use this function to create a data function in Spotfire, you can create nice summary visuals comparing distributions of all columns. For example:

[Screenshot: comparison of column distributions before and after handling missing data]

     

    Code example

import spotfire_dsml.missing_data.evaluation as mde

# original_df is the raw data containing missing values
# new_df can be any output table from any removal or imputation function listed above
descr_cont, descr_cat, cor_cont = mde.compare_summary_stats(original_df, new_df, missing_data_values=None)

    Example Workflow

Below is an example workflow demonstrating when to use each of the functions mentioned in this article. Further information and examples (including the screenshots shown in this article) of how to create data functions based on the DSML Missing Data Module are available for download as part of this Exchange item.

     

[Diagram: example workflow for handling missing data with the DSML toolkit]

     

    Additional Resources

On our Community Exchange, you can also find the Missing Data Navigator. This comprehensive Spotfire application uses many of the features introduced above and is designed to lead you through the typical steps of a missing data analysis, providing guidance along the way.

For more details on the DSML Toolkit for Python, check out this community article. Example Spotfire applications, including the example application mentioned earlier that uses different functions in the toolkit, can be downloaded from the Exchange page here.



