Introduction
Across industries, handling missing data is a crucial step in any data project. Without proper handling, models can become biased and results inaccurate. The DSML toolkit aims to expedite and simplify this process, and currently contains functions to explore, remove, and impute missing values, as well as to run comparison analyses before and after missing data is handled. In this article, we will dive deeper into these functions, with code examples demonstrating how to call and use each Python function, along with additional resources and example applications in Spotfire.
Prerequisites
Before using any of these functions, it’s important to have a general understanding of your data. Sometimes a specific value in a column is really a stand-in for a missing value. First, users need to identify whether there are any values, other than empty cells, that represent missingness. Examples include “missing”, “N/A”, “?”, “Unknown”, etc.
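As a quick illustration (using plain pandas rather than the toolkit), such placeholder values can be standardized to true missing values before any further processing. The data and placeholder list below are purely hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data where "?" and "Unknown" stand in for missing values
df = pd.DataFrame({
    "age": [34, "?", 51],
    "city": ["Paris", "Unknown", "Lyon"],
})

# Map every known placeholder to a true NaN so downstream
# tools recognize these entries as missing
placeholders = ["missing", "N/A", "?", "Unknown"]
df = df.replace(placeholders, np.nan)

print(df.isna().sum())  # one missing value per column
```

Most of the toolkit functions below accept a `missing_data_values` argument that serves the same purpose, so this manual step is often unnecessary.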
Next, please be aware that there may be different types of missingness in your data, and not all should be treated the same way. It is beneficial to identify whether your data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR):
- MCAR: the data that is missing is independent of the observed or unobserved data.
- MAR: the data that is missing is independent of the unobserved data, but is related to the observed data.
- MNAR: the data that is missing is independent of the observed data, but is related to the unobserved data. When this occurs, it is generally more difficult to apply removal or imputation techniques without biasing the data.
Additionally, certain imputation methods are not valid on non-numeric data. The following methods are only valid on numeric data:
- Mean (part of simple imputation)
- Median (part of simple imputation)
- Minimum (part of simple imputation)
- Maximum (part of simple imputation)
- K-Nearest Neighbors (KNN)
- Multiple Imputation by Chained Equations (MICE)*

*MICE requires numeric data, but the option to apply ordinal encoding to non-numeric data is available through the DSML toolkit. For more information, please refer to the section on MICE below.
The constant and most frequent methods (both part of simple imputation) are suitable for all data types.
Exploration
Summarize Missing Values
The DSML toolkit provides a summarization function, which generates tables summarizing missing values by column, missing values by row, and missing values across the entire dataset, as well as text reports explaining the findings. When this function is used to create a data function in Spotfire, the output tables can be displayed directly in Spotfire visualizations.
Code example
import spotfire_dsml.missing_data.evaluation as mde
# five outputs from this function call
file_summary, row_summary, column_summary, report, preprocessed_data = mde.summarize(data, missing_data_values = None, preprocess = True)
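For intuition about what these summary tables contain, the sketch below computes the same kinds of counts with plain pandas. The exact output schema belongs to the toolkit; this only illustrates the underlying arithmetic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, "x"]})

# Per-column and per-row missing counts, plus an overall percentage,
# roughly the information the summary tables report
by_column = df.isna().sum()
by_row = df.isna().sum(axis=1)
overall_pct = 100 * df.isna().sum().sum() / df.size

print(by_column.to_dict())  # {'a': 1, 'b': 2}
print(overall_pct)          # 50.0
```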
Preparation
Remove Missing Values
Once missing data has been identified and better understood, the process of handling the missing values can begin. One approach is to remove missing values; with spotfire-dsml, you can set thresholds and criteria to remove them with a single function call. Typically, you would use this feature to automatically remove sparse rows and/or columns. For example, you may want to remove only rows where all data is missing, but additionally remove columns where more than 60% of the data is missing.
Code example
import spotfire_dsml.missing_data.preprocessing as mdp
# following the example mentioned above
data_after_md_removal = mdp.removal(data, row_threshold = "all", col_threshold = 0.6, priority = "both", missing_data_values = None)
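The thresholds in this example map closely onto operations you could perform with plain pandas. The following sketch (illustrative only, not the toolkit implementation) applies the same two rules to hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "keep":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
})
df.loc[5] = [np.nan, np.nan]  # a fully empty row

# Rule 1: drop rows where every value is missing
df = df.dropna(how="all")
# Rule 2: drop columns where more than 60% of values are missing
df = df.loc[:, df.isna().mean() <= 0.6]

print(list(df.columns))  # ['keep']
```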
Imputation
Another approach to handling missing data is imputation, for which numerous methods are available. The imputation methods available in the DSML toolkit are outlined below.
Simple Imputation
Simple imputation refers to replacing missing values with a single value, and there are a variety of ways to define what that single value will be. With spotfire-dsml, you can replace missing values with values specific to that column:
- mean
- median
- min
- max
- most_frequent
Or, you can replace missing values with a constant, fixed value of your choice (methods = "constant").
Additionally, the DSML toolkit allows you to define multiple simple imputation methods across an entire dataset, with a user-defined table. With a table of two columns (“Column” and “Method”), you could have some columns be imputed with the mean values, while other columns are imputed with the most frequent values, and some remaining columns are imputed with a constant value. This is especially useful if you have numeric and non-numeric column types. The methods “mean”, “median”, “min”, and “max” are only suitable for numeric columns, while the methods “most_frequent” and “constant” are suitable for all column types.
Check out the code sample below for example usage of this data function.
Code examples
import pandas as pd
import spotfire_dsml.missing_data.preprocessing as mdp
# single method = mean
output_df = mdp.simple_imputation(data, methods = "mean", missing_data_values = None, fixed_value = None)
# single method = constant, fixed_value required
output_df = mdp.simple_imputation(data, methods = "constant", missing_data_values = None, fixed_value = "Medium")
# example multiple-methods table definition
methods_table = pd.DataFrame({
    "Column": ["feature_1", "feature_2", "feature_3", "feature_4"],
    "Method": ["mean", "median", "most_frequent", "constant"]})
# multiple methods, passing the table
output_df = mdp.simple_imputation(data, methods = methods_table, missing_data_values = None, fixed_value = "Medium")
Random sampling
Random sampling imputation is the method of selecting a random, valid value from a column to replace another missing value in the same column. Random sampling is done with replacement in this case. This method is valid on all data types.
Code example
import spotfire_dsml.missing_data.preprocessing as mdp
output_df = mdp.rand_samp_imputation(data, missing_data_values = None)
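For intuition, the mechanism behind random sampling imputation can be sketched in a few lines of plain pandas. This is illustrative only, not the DSML implementation; the function name and seed are made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def random_sample_impute(s: pd.Series) -> pd.Series:
    """Fill missing entries by sampling (with replacement) from the observed values."""
    s = s.copy()
    observed = s.dropna().to_numpy()
    n_missing = s.isna().sum()
    # Draw one observed value, with replacement, for each missing slot
    s[s.isna()] = rng.choice(observed, size=n_missing, replace=True)
    return s

col = pd.Series(["red", np.nan, "blue", np.nan, "red"])
filled = random_sample_impute(col)
print(filled.isna().sum())  # 0
```

Because replacements are drawn from the column's own observed values, this method works for any data type and roughly preserves the column's distribution.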
KNN
KNN imputation, or K-Nearest Neighbors imputation, utilizes the KNN algorithm to impute missing values with the average of a certain number of nearest neighbors. The number of nearest neighbors used can optionally be defined by the user, with the default being 5 neighbors. Note that this method is only valid on numeric data; encoding for non-numeric data is not offered in this function, but may be done by a user beforehand if KNN imputation is the desired method of imputation for those columns.
Code example
import spotfire_dsml.missing_data.preprocessing as mdp
output_df = mdp.knn_imputation(data, n_neighbors = 7, missing_data_values = None)
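To see the idea behind KNN imputation, here is a plain scikit-learn sketch using `KNNImputer` (illustrative only; whether the toolkit uses this class internally is not stated in this article). Each missing value is replaced by the average of that feature over the nearest rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Each missing value is replaced by the mean of that column
# across the k nearest rows (distance computed on observed features)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.loc[2, "x"])  # 3.0, the mean of the two nearest rows
```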
MICE
MICE, or Multiple Imputation by Chained Equations, is an imputation method that iteratively predicts what a missing value should be based on the column’s relationship with the other columns in the dataset. This method can be computationally expensive, especially for large datasets. One way to reduce the amount of computation is to lower the number of nearest features used; by default, all features are used (setting n_nearest_features = None uses all features). Additionally, the different estimators will time out if generating the output takes more than 10 seconds. First, RandomForestRegressor is used; it is a robust estimator, but can run extremely slowly on large datasets. After 10 seconds, RandomForestRegressor is aborted in favor of BayesianRidge. If BayesianRidge also times out after 10 seconds, LinearRegression is used. LinearRegression runs much more quickly than the other estimators, but relies on the assumption that features have a linear relationship with each other. Please read more about the different estimators and MICE before proceeding.
MICE is only valid on numeric data, but the DSML toolkit implementation contains encoding capabilities. If encode = True, ordinal encoding will be applied to all non-numeric columns prior to MICE, then they will be reverse encoded so the output table contains the original, non-numeric values. Please ensure that this encoding method is appropriate for your non-numeric data before proceeding.
Code example
import spotfire_dsml.missing_data.preprocessing as mdp
# all features used, non-numeric variables get encoded
output_df = mdp.mice_imputation(data, missing_data_values = None, n_nearest_features = None, encode = True)
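For intuition, the chained-equations approach can be sketched with scikit-learn's `IterativeImputer` and a `BayesianRidge` estimator (one of the estimators named above). This is an illustrative sketch, not the DSML implementation; the toolkit's estimator-fallback and encoding logic are not reproduced here:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [2.0, 4.0, 6.0, np.nan],
})

# MICE-style imputation: each column with missing values is modeled
# as a function of the other columns, iterating until convergence.
# n_nearest_features (not set here) would restrict the predictors used.
imputer = IterativeImputer(estimator=BayesianRidge(), random_state=0)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isna().sum().sum())  # 0
```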
Multiple methods
Similar to applying multiple simple imputation methods to a dataset with a single function call, the same can be done with all imputation methods discussed above. With a table of two columns (“Column” and “Method”), you could have some columns be imputed with simple imputation methods, while other columns are imputed using MICE, and some remaining columns are imputed using random sampling imputation. Again, this is especially useful if you have numeric and non-numeric column types.
Code example
import pandas as pd
# example multiple-methods table definition
methods_table = pd.DataFrame({
    "Column": ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"],
    "Method": ["mean", "knn", "mice", "rand_samp", "median"]})
Evaluation
Compare Summary Statistics
After missing data is handled via removal and/or imputation, it’s important to do some additional analysis to ensure that the underlying distributions of the data have not been altered too much. The DSML toolkit contains a comparison function to analyze the differences between an original dataset and a cleaned version of the dataset. The output contains three tables: one with summary statistics for continuous variables (min, mean, max, percentiles, etc.), one with summary statistics for categorical variables (category frequencies per column), and one displaying the difference in column correlations for continuous variables. With this information, the process of handling missing values can also be done iteratively, as one can experiment with which methods produce the best results for a given dataset and use case.
If you use this function to create a data function in Spotfire, you can build summary visuals comparing the distributions of all columns before and after handling missing data.
Code example
import spotfire_dsml.missing_data.evaluation as mde
# original_df is the raw data containing missing values
# new_df can be any output table from any removal or imputation function listed above
descr_cont, descr_cat, cor_cont = mde.compare_summary_stats(original_df, new_df, missing_data_values = None)
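As a minimal sketch of why this comparison matters, the plain-pandas example below contrasts summary statistics before and after mean imputation. Note a well-known effect: mean imputation preserves the column mean but shrinks the standard deviation, which is exactly the kind of distortion this evaluation step is meant to surface:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})
cleaned = original.fillna(original["x"].mean())

# Compare summary statistics before and after imputation
before = original.describe().T
after = cleaned.describe().T

# The mean is unchanged (both ~2.33), but the std shrinks,
# because imputed values sit exactly at the center of the distribution
print(float(before.loc["x", "mean"]), float(after.loc["x", "mean"]))
print(float(before.loc["x", "std"]), float(after.loc["x", "std"]))
```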
Example Workflow
Below is an example workflow demonstrating when to utilize each of the functions mentioned in this article. Further information and examples (including the screenshots seen in this article) on how to create data functions based on the DSML Missing Data Module are available for download as part of this Exchange item.
Additional Resources
On our Community Exchange, you can also find the Missing Data Navigator. This comprehensive Spotfire application utilizes many of the features introduced above, and is designed to lead you through the typical steps of missing data analysis while providing guidance along the way.
For more details on the DSML Toolkit for Python, check out this community article. Example Spotfire applications, including the example application mentioned earlier, can be downloaded from the Exchange page.