  • Distribution Fitting and Normality Testing with DSML Toolkit for Python


    This article discusses the functions available to perform distribution fitting and normality testing using the DSML Toolkit.

    Introduction

    Distribution fitting and normality testing are useful, and at times critical, processes across numerous industries. However, with large amounts of data, these processes can become repetitive and tedious. The DSML toolkit aims to simplify them with functions that can be applied to full datasets, rather than working with one column at a time. In this article, we will dive deeper into the functions currently available in the DSML toolkit, along with additional resources and example applications in Spotfire.

    Prerequisites

    To begin using any of these functions, it’s important to note that any input data must be numeric. If you are working with non-numeric data and plan on using any of the discussed functions, please use an encoding method of your choice on the data. 

    The following distributions are currently available for each of the discussed functions, unless otherwise specified. The name in parentheses is the string to use when passing that distribution as an input to a function. It matches the name of the respective distribution object in the scipy.stats Python module.

    • Continuous distributions:
      • Normal (norm)
      • Exponential (expon)
      • Gamma (gamma)
      • Beta (beta)
      • Uniform (uniform)
      • Double Weibull (dweibull)
      • Pareto (pareto)
      • T (t)
      • Lognormal (lognorm)
    • Discrete distributions:
      • Poisson (poisson)
      • Binomial (binom)
      • Geometric (geom)
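
    Because the names in parentheses match scipy.stats objects, they can be looked up directly in that module. The following is a minimal sketch using scipy only, not the DSML toolkit itself; the helper name resolve is hypothetical and introduced here for illustration.

```python
from scipy import stats

# Names exactly as listed above.
CONTINUOUS = ["norm", "expon", "gamma", "beta", "uniform",
              "dweibull", "pareto", "t", "lognorm"]
DISCRETE = ["poisson", "binom", "geom"]


def resolve(name):
    """Hypothetical helper: return the scipy.stats object with this name."""
    return getattr(stats, name)


# Every listed name resolves to a distribution object of the expected kind.
for name in CONTINUOUS:
    assert isinstance(resolve(name), stats.rv_continuous)
for name in DISCRETE:
    assert isinstance(resolve(name), stats.rv_discrete)

# `shapes` names the shape parameters each distribution expects
# (None when the distribution has no shape parameters, as for "norm").
shape_names = {name: resolve(name).shapes for name in CONTINUOUS + DISCRETE}
```

    Checking shape_names can be a quick way to see how many parameters to expect for a given distribution before fitting it.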

    The following packages are required to use this portion of the DSML toolkit: numpy, pandas, matplotlib, scipy, and distfit.

    Normality Testing

    The DSML toolkit provides two general methods of normality testing - visual and statistical.

    Visual Normality Testing

    Two types of visualizations play a key role in normality testing: the histogram and the Quantile-Quantile (QQ) plot. With the DSML toolkit, you can easily create the components needed to build these visualizations directly in Spotfire, or have premade plots returned in a Python environment of your choice.

    For histograms, the function will return a table containing probability density function (pdf) values at evenly spaced intervals in the original data for each input column, which can be used to plot the probability distribution curve. When overlaying the probability distribution curve over the histogram, you can visually determine how well your data follows a normal distribution.
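
    To illustrate the idea behind the histogram output, the sketch below computes pdf values at evenly spaced points with numpy and scipy. It is not the toolkit's implementation; the synthetic data, the number of points, and the number of bins are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=5.0, size=500)  # synthetic stand-in for one column

# For "norm", fit returns the estimated mean and standard deviation.
mu, sigma = stats.norm.fit(sample)

# Evaluate the fitted pdf at evenly spaced points spanning the data range.
x = np.linspace(sample.min(), sample.max(), 100)
pdf_values = stats.norm.pdf(x, loc=mu, scale=sigma)

# Histogram heights on the density scale, so they are comparable to the pdf curve.
heights, bin_edges = np.histogram(sample, bins=20, density=True)
```

    Plotting heights as bars and pdf_values against x as a line gives the overlay described above.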

    For QQ plots, the function will return a table containing, for each column of data, the observed quantiles and the theoretical quantiles that would be expected if the data followed a normal distribution. These quantiles can be plotted with a scatter plot to show how the observed quantiles compare to the theoretical ones. If the two sets of quantiles are similar, the points will roughly follow the line y=x, which is often overlaid on QQ plots, indicating that the data may fit a normal distribution.
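
    The quantile comparison can be sketched as follows. This is one common way to build QQ points, assuming plotting positions of (i - 0.5)/n and a normal distribution fitted by sample mean and standard deviation; the toolkit may use a different convention, and normal_qq_points is a hypothetical name.

```python
import numpy as np
from scipy import stats


def normal_qq_points(values):
    """Return (theoretical, observed) quantiles for a normal QQ plot."""
    observed = np.sort(np.asarray(values, dtype=float))
    n = observed.size
    # Plotting positions: one probability per sorted observation.
    probs = (np.arange(1, n + 1) - 0.5) / n
    mu, sigma = observed.mean(), observed.std(ddof=1)
    # Quantiles of the fitted normal, on the same scale as the data,
    # so a good fit lies close to the line y = x.
    theoretical = stats.norm.ppf(probs, loc=mu, scale=sigma)
    return theoretical, observed


rng = np.random.default_rng(1)
theo_norm, obs_norm = normal_qq_points(rng.normal(10.0, 2.0, size=300))
theo_skew, obs_skew = normal_qq_points(rng.exponential(2.0, size=300))

# Correlation between the two quantile sets: closer to 1 means closer to a straight line.
r_norm = np.corrcoef(theo_norm, obs_norm)[0, 1]
r_skew = np.corrcoef(theo_skew, obs_skew)[0, 1]
```

    In this sketch the normally distributed sample produces points much closer to a straight line than the skewed sample, which is the pattern to look for in the scatter plot.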

    The images below show example premade outputs, which can optionally be returned by the function, and an example of how the output values can be used to produce histograms and QQ plots directly in Spotfire.

    [Images: example premade plot outputs, and example histograms and QQ plots built in Spotfire from the output values]

    Statistical Normality Testing

    Normality can also be assessed with statistical tests. The DSML toolkit provides two: the Shapiro-Wilk test and the Anderson-Darling test. In summary, the Shapiro-Wilk test is a test of normality, and the Anderson-Darling test tests whether a set of data is likely to have come from a particular distribution, in this case a normal distribution. You can read more about the Shapiro-Wilk test here, and the Anderson-Darling test here. Using the DSML toolkit, you can conduct both statistical tests on several columns of data using one function call, and have the results returned in a table that can be displayed in Spotfire.
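
    For reference, both tests are available in scipy.stats, and applying them to several columns can be sketched as below. The helper normality_tests, the result keys, and the synthetic columns are assumptions for illustration and do not reflect the toolkit's output format.

```python
import numpy as np
from scipy import stats


def normality_tests(values, alpha=0.05):
    """Hypothetical helper: run Shapiro-Wilk and Anderson-Darling on one column."""
    values = np.asarray(values, dtype=float)
    sw_stat, sw_p = stats.shapiro(values)
    ad = stats.anderson(values, dist="norm")
    # Anderson-Darling reports critical values per significance level (in percent);
    # pick the level closest to alpha.
    idx = int(np.argmin(np.abs(np.asarray(ad.significance_level) - alpha * 100)))
    return {
        "shapiro_p": float(sw_p),
        "shapiro_reject": bool(sw_p < alpha),
        "anderson_stat": float(ad.statistic),
        "anderson_crit": float(ad.critical_values[idx]),
        "anderson_reject": bool(ad.statistic > ad.critical_values[idx]),
    }


rng = np.random.default_rng(2)
columns = {
    "normal_like": rng.normal(0.0, 1.0, size=500),
    "skewed": rng.exponential(1.0, size=500),
}
results = {name: normality_tests(col) for name, col in columns.items()}
```

    A "reject" flag of True means the test found evidence against normality at the chosen alpha; False only means no such evidence was found.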

    Code Examples

    import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
    
    # will return output dataframe to create qq plots and parameters, but not premade plots
    df, params = dfit.visual_normality_testing(data = input_data, plot_type = "qq", return_params = True, return_plots = False)
    
    # will return output dataframe with results
    df = dfit.statistical_normality_testing(data = input_data, alpha = 0.05)

    Parameter Estimation

    Parameter estimation is an extremely important step in the process of distribution fitting. With the DSML toolkit, you can estimate parameters not just for multiple columns, but for multiple distributions as well. For example, if one column in your data should be fit to a normal distribution, while another column should be fit to a gamma distribution, this function allows you to estimate all of the distribution parameters necessary with a single function call. Then, a table with each column and its respective distribution and estimated parameters will be returned, which can be displayed in Spotfire. The output of the parameter estimation function can also be used as an input for the statistical distribution fitting function, which will be discussed below.
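
    As a rough illustration of per-column estimation, the sketch below fits a different scipy.stats distribution to each column using each distribution's fit method (maximum likelihood by default). The column data are synthetic and the estimates dictionary is an assumed structure, not the toolkit's output table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
columns = {
    "feature_1": rng.normal(10.0, 2.0, size=1000),
    "feature_2": rng.gamma(shape=3.0, scale=1.5, size=1000),
}

# Same {column: distribution} format as the dictionary input described above.
distribution_by_column = {"feature_1": "norm", "feature_2": "gamma"}

estimates = {}
for column, dist_name in distribution_by_column.items():
    dist = getattr(stats, dist_name)
    # fit returns a tuple: shape parameters (if any), then loc, then scale.
    estimates[column] = (dist_name, dist.fit(columns[column]))
```

    Here the normal fit yields two values (loc, scale) and the gamma fit yields three (a, loc, scale), which is why the number of estimated parameters differs by distribution.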

    Code Examples

    import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
    
    # parameter estimation for one distribution, ideal for using in Spotfire
    df = dfit.estimate_population_parameters(data = input_data, distribution = "gamma")
    
    # parameter estimation for multiple distributions
    # dictionary mapping each column to the distribution to estimate parameters for, in the format {column: distribution}
    distribution_dict = {"feature_1": "norm",
                         "feature_2": "beta",
                         "feature_3": "gamma",
                         "feature_4": "lognorm",
                         "feature_5": "t",
                         "feature_6": "dweibull"}
    
    df = dfit.estimate_population_parameters(data = input_data, distribution = distribution_dict)

    Distribution Fitting

    As with the normality testing functions, the DSML toolkit provides two general methods of distribution fitting: visual and statistical.

    Visual Distribution Fitting

    Once again, histograms and QQ plots are the focus of the visual distribution fitting function. For continuous distributions, the histograms and QQ plots provide the same insights as those created to check for normality, but use a different distribution to produce the probability density function (pdf) curve values and/or the theoretical quantiles. For discrete distributions, a QQ plot cannot be created, and the histogram is overlaid with probability mass function (pmf) values instead of pdf values. The pmf values are calculated for each unique value in the data and, instead of being plotted as a curve, can be plotted as points and/or a line connecting these points. Please refer to the example Spotfire application for more instructions on how to generate these plots within Spotfire.
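
    The discrete case can be sketched with a Poisson example. This assumes the rate is estimated by the sample mean (the maximum likelihood estimate for a Poisson distribution) and uses synthetic counts; it is an illustration with scipy, not the toolkit's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
counts = rng.poisson(lam=3.0, size=400)  # synthetic stand-in for one discrete column

# For a Poisson distribution, the sample mean is the maximum likelihood estimate of the rate.
mu_hat = counts.mean()

# One pmf value per unique observed value, to overlay on the bar heights.
unique_values, frequencies = np.unique(counts, return_counts=True)
relative_freq = frequencies / counts.size
pmf_values = stats.poisson.pmf(unique_values, mu_hat)
```

    Plotting relative_freq as bars and pmf_values as points at unique_values gives the discrete counterpart of the histogram overlay.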

    Statistical Distribution Fitting

    For non-normal continuous distributions and discrete distributions, either the Kolmogorov-Smirnov test or the Chi-Square test is applied to test for a well-fitting distribution. Both test whether a set of data is likely to have come from a particular distribution: the Kolmogorov-Smirnov test is applied to non-normal continuous distributions, and the Chi-Square test is applied to discrete distributions. These tests can be applied to several columns, with the results being returned in a table that can be displayed in Spotfire. You can read more about the Kolmogorov-Smirnov test here, and the Chi-Square test here.
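
    Both tests exist in scipy.stats, and a minimal sketch of each is shown below. The binning used for the Chi-Square test (one bin per value, with the upper tail folded into the last bin) is an assumption made for this example, as is the synthetic data. Note also that Kolmogorov-Smirnov p-values tend to be optimistic when the parameters were estimated from the same data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Continuous case: Kolmogorov-Smirnov against a fitted gamma distribution.
gamma_data = rng.gamma(shape=2.0, scale=3.0, size=2000)
gamma_params = stats.gamma.fit(gamma_data)
ks_gamma = stats.kstest(gamma_data, "gamma", args=gamma_params)

# Same data against a fitted normal, as an example of a poor fit.
norm_params = stats.norm.fit(gamma_data)
ks_wrong = stats.kstest(gamma_data, "norm", args=norm_params)

# Discrete case: Chi-Square test against a fitted Poisson distribution.
poisson_data = rng.poisson(lam=4.0, size=1000)
mu_hat = poisson_data.mean()
k_max = int(poisson_data.max())
observed = np.bincount(poisson_data, minlength=k_max + 1)

# Probabilities for 0 .. k_max-1, with P(X >= k_max) folded into the last bin
# so that the expected counts sum to the sample size.
probs = np.append(stats.poisson.pmf(np.arange(k_max), mu_hat),
                  stats.poisson.sf(k_max - 1, mu_hat))
expected = probs * poisson_data.size

# ddof=1 accounts for the one parameter (the rate) estimated from the data.
chi2 = stats.chisquare(observed, f_exp=expected, ddof=1)
```

    In both tests, a p-value below the chosen alpha suggests the data are unlikely to have come from the tested distribution. Bins with very small expected counts can make the Chi-Square approximation unreliable, so merging sparse bins is often advisable.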

    Code Examples

    import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
    
    # will return output dataframe to create histograms, parameters, and premade plots
    df, params = dfit.visual_distribution_fitting(data = input_data, distribution = "poisson", plot_type = "histogram", return_params = True, return_plots = True)


    # will return output dataframe with results
    # input for dist_params matches the output from estimate_population_parameters
    df = dfit.statistical_distribution_fitting(data = input_data, dist_params = estimated_params, alpha = 0.05)

    Distribution Prediction

    If you don’t have a desired distribution that you’d like to fit your data to, you can use the DSML toolkit to predict which distribution best fits your data, along with the estimated parameters. This function uses the residual sum of squares (RSS) as the goodness-of-fit measure when comparing the input data to a variety of theoretical distributions; a lower RSS indicates a closer fit. You can read more about the residual sum of squares here.
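
    One way to compute an RSS-based comparison is sketched below: fit each candidate, then sum the squared differences between the histogram density and the fitted pdf at the bin centers. The bin count, the short candidate list, and the helper names rss_for and best_by_rss are assumptions for illustration; the toolkit's exact procedure may differ.

```python
import numpy as np
from scipy import stats


def rss_for(values, dist_name, bins=30):
    """Fit one distribution and return (RSS, fitted parameters)."""
    values = np.asarray(values, dtype=float)
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    dist = getattr(stats, dist_name)
    params = dist.fit(values)
    fitted = dist.pdf(centers, *params)
    # Residual sum of squares between empirical density and fitted pdf.
    return float(np.sum((density - fitted) ** 2)), params


def best_by_rss(values, candidates=("norm", "expon", "uniform")):
    """Return the candidate with the lowest RSS, plus all scores."""
    scores = {name: rss_for(values, name)[0] for name in candidates}
    best = min(scores, key=scores.get)
    return best, scores


rng = np.random.default_rng(6)
best_normal, scores_normal = best_by_rss(rng.normal(5.0, 1.0, size=2000))
best_expon, scores_expon = best_by_rss(rng.exponential(2.0, size=2000))
```

    With more flexible candidates in the list, several distributions can reach similar RSS values, so the top-ranked result is best read alongside the scores of the runners-up.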

    Please note that this function should only be used with continuous variables, as only the continuous distributions above are compared in this function.

    Code Example

    import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
    
    # will output dataframe with each column from input_data, the best fit distribution, RSS value, and parameters
    df = dfit.predict_best_distribution(data = input_data)

    Probability Prediction

    Once you have a well-fitting distribution and parameters for your data, you can continue to utilize that information by predicting the probability of new data occurring within that distribution. With a column of new data, along with your distribution and parameters, this function will return a table of the new data and each value’s predicted probability, which can be displayed directly in Spotfire. For continuous distributions, the probability density function is used to calculate this value, while the probability mass function is used for discrete distributions. Note that a density value is a relative likelihood rather than a true probability, and it can exceed 1.
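
    The distinction between pdf and pmf values can be seen in a short scipy sketch. It assumes a normal distribution with mean 12 and standard deviation 2 and a Poisson distribution with rate 3; these values are chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Continuous case: pdf values are densities, not probabilities.
new_values = np.array([8.0, 12.0, 16.0])
density = stats.norm.pdf(new_values, loc=12, scale=2)

# A probability for a continuous variable comes from an interval, via the cdf.
interval_prob = (stats.norm.cdf(14, loc=12, scale=2)
                 - stats.norm.cdf(10, loc=12, scale=2))

# Discrete case: pmf values are probabilities of each exact value.
new_counts = np.array([0, 3, 6])
pmf = stats.poisson.pmf(new_counts, mu=3)
```

    Here interval_prob is the probability of a value falling within one standard deviation of the mean, while each entry of pmf is the probability of observing exactly that count.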

    Code Example

    import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
    
    # will output dataframe with original column values and predicted probabilities
    preds = dfit.predict_proba(data = input_data["feature_1"], distribution = "norm", params = "12, 2")

    Example Workflow

    Below is an example workflow demonstrating when to utilize each of the functions mentioned in this article.

    [Image: example workflow diagram]

    Additional Resources

    For more details on the DSML Toolkit for Python, check out this community article. Example Spotfire applications, including the example application mentioned earlier that uses different functions in the toolkit, can be downloaded from the Exchange page here.

     

