Introduction
Distribution fitting and normality testing are useful, and at times critical, processes across numerous industries. However, with large amounts of data, they can become repetitive and tedious. The DSML toolkit aims to simplify both with functions that can be applied to full datasets, rather than to one column at a time. In this article, we take a closer look at the functions currently available in the DSML toolkit, along with additional resources and example applications in Spotfire.
Prerequisites
To begin using any of these functions, note that all input data must be numeric. If you are working with non-numeric data and plan to use any of the functions discussed here, first apply an encoding method of your choice to the data.
The following distributions are currently available for each of the discussed functions, unless otherwise specified. The name in parentheses is how each distribution should be written when it is passed as an input to a function; it matches the name of the corresponding distribution object in the scipy.stats Python module.

Continuous distributions:
 Normal (norm)
 Exponential (expon)
 Gamma (gamma)
 Beta (beta)
 Uniform (uniform)
 Double Weibull (dweibull)
 Pareto (pareto)
 T (t)
 Lognormal (lognorm)

Discrete distributions:
 Poisson (poisson)
 Binomial (binom)
 Geometric (geom)
The following packages are required to use this portion of the DSML toolkit: numpy, pandas, matplotlib, scipy, and distfit.
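Because each name in parentheses matches an object in scipy.stats, one way to check a distribution name before using it is to look it up in that module. The sketch below is illustrative only and is not part of the DSML toolkit; the helper name resolve_distribution is hypothetical.

```python
from scipy import stats

CONTINUOUS = ["norm", "expon", "gamma", "beta", "uniform",
              "dweibull", "pareto", "t", "lognorm"]
DISCRETE = ["poisson", "binom", "geom"]


def resolve_distribution(name):
    """Return the scipy.stats distribution object for a name (hypothetical helper)."""
    dist = getattr(stats, name, None)
    if not isinstance(dist, (stats.rv_continuous, stats.rv_discrete)):
        raise ValueError(f"'{name}' is not a scipy.stats distribution")
    return dist


for name in CONTINUOUS + DISCRETE:
    dist = resolve_distribution(name)
    kind = "continuous" if isinstance(dist, stats.rv_continuous) else "discrete"
    print(f"{name}: {kind}")
```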
Normality Testing
The DSML toolkit provides two general methods of normality testing: visual and statistical.
Visual Normality Testing
Two types of visualizations play a key role in normality testing: the histogram and the Quantile-Quantile (QQ) plot. With the DSML toolkit, you can easily create the components needed to build these visualizations directly in Spotfire, or have premade plots returned in a Python environment of your choice.
For histograms, the function will return a table containing probability density function (pdf) values at evenly spaced intervals in the original data for each input column, which can be used to plot the probability distribution curve. When overlaying the probability distribution curve over the histogram, you can visually determine how well your data follows a normal distribution.
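To make the idea of pdf values at evenly spaced intervals concrete, here is a minimal sketch using numpy and scipy.stats directly. It is not the toolkit's implementation; the synthetic sample, the grid size of 100, and the 20 bins are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=500)  # synthetic stand-in for one numeric column

# Estimate the normal parameters from the data (maximum likelihood).
mu, sigma = stats.norm.fit(sample)

# Evaluate the fitted pdf on an evenly spaced grid spanning the observed range.
grid = np.linspace(sample.min(), sample.max(), 100)
pdf_values = stats.norm.pdf(grid, loc=mu, scale=sigma)

# Density-scaled histogram, so that bar heights are comparable to the pdf curve.
density, bin_edges = np.histogram(sample, bins=20, density=True)
```

Plotting pdf_values against grid on top of the density-scaled bars gives the overlay described above.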
For QQ plots, the function will return a table containing, for each column of data, the theoretical quantiles expected if the data followed a normal distribution alongside the observed quantiles. These quantiles can be plotted with a scatter plot, visualizing how the observed quantiles compare to the theoretical quantiles. If the quantiles are similar, the points will roughly follow the line y=x, which is often overlaid on QQ plots, indicating that the data may fit a normal distribution.
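The theoretical and observed quantiles can be sketched as follows with numpy and scipy.stats. This is an assumption about one common way to compute them, not the toolkit's own code; in particular, the plotting positions (i - 0.5) / n are one convention among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=500)  # synthetic stand-in for one numeric column

mu, sigma = stats.norm.fit(sample)

# Observed quantiles are simply the sorted data.
observed = np.sort(sample)

# Theoretical quantiles: the fitted normal's inverse cdf at matching plotting positions.
n = sample.size
probabilities = (np.arange(1, n + 1) - 0.5) / n
theoretical = stats.norm.ppf(probabilities, loc=mu, scale=sigma)
```

Because the theoretical quantiles use the fitted location and scale, a scatter plot of observed against theoretical can be compared directly with the line y=x.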
The images below show example premade outputs, which can optionally be returned by the function, and an example of how the output values can be used to produce histograms and QQ plots directly in Spotfire.
Statistical Normality Testing
Normality can also be assessed with statistical tests; the DSML toolkit provides the Shapiro-Wilk test and the Anderson-Darling test. In brief, the Shapiro-Wilk test is a dedicated test of normality, while the Anderson-Darling test checks whether a set of data is likely to have come from a particular distribution (in this case, a normal distribution). You can read more about the Shapiro-Wilk test here, and the Anderson-Darling test here. Using the DSML toolkit, you can conduct both statistical tests on several columns of data using one function call, and have the results returned in a table that can be displayed in Spotfire.
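For readers who want to see what running both tests over several columns involves, here is a rough sketch built on scipy.stats.shapiro and scipy.stats.anderson. The column names, the synthetic data, and the layout of the results table are assumptions made for illustration and may differ from what the toolkit returns.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
data = pd.DataFrame({
    "normal_col": rng.normal(10, 2, 300),
    "skewed_col": rng.exponential(2.0, 300),
})
alpha = 0.05

rows = []
for column in data.columns:
    values = data[column].to_numpy()
    shapiro = stats.shapiro(values)
    anderson = stats.anderson(values, dist="norm")
    # anderson reports critical values at fixed significance levels (in percent);
    # pick the one closest to alpha.
    idx = int(np.argmin(np.abs(anderson.significance_level - alpha * 100)))
    critical = anderson.critical_values[idx]
    rows.append({
        "column": column,
        "shapiro_p": shapiro.pvalue,
        "shapiro_reject_normality": shapiro.pvalue < alpha,
        "anderson_statistic": anderson.statistic,
        "anderson_critical": critical,
        "anderson_reject_normality": anderson.statistic > critical,
    })

results = pd.DataFrame(rows).set_index("column")
print(results)
```

Note the different decision rules: Shapiro-Wilk rejects normality when the p-value falls below alpha, while this Anderson-Darling sketch rejects when the statistic exceeds the critical value.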
Code Examples
import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
# will return output dataframe to create qq plots and parameters, but not premade plots
df, params = dfit.visual_normality_testing(data = input_data, plot_type = "qq", return_params = True, return_plots = False)
# will return output dataframe with results
df = dfit.statistical_normality_testing(data = input_data, alpha = 0.05)
Parameter Estimation
Parameter estimation is an extremely important step in the process of distribution fitting. With the DSML toolkit, you can estimate parameters not just for multiple columns, but for multiple distributions as well. For example, if one column in your data should be fit to a normal distribution, while another column should be fit to a gamma distribution, this function allows you to estimate all of the distribution parameters necessary with a single function call. Then, a table with each column and its respective distribution and estimated parameters will be returned, which can be displayed in Spotfire. The output of the parameter estimation function can also be used as an input for the statistical distribution fitting function, which will be discussed below.
Code Examples
import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
# parameter estimation for one distribution, ideal for using in Spotfire
df = dfit.estimate_population_parameters(data = input_data, distribution = "gamma")
# parameter estimation for multiple distributions
# dictionary with multiple distributions to estimate parameters for, following format of {key=column:value=distribution}
distribution_dict = {"feature_1": "norm",
                     "feature_2": "beta",
                     "feature_3": "gamma",
                     "feature_4": "lognorm",
                     "feature_5": "t",
                     "feature_6": "dweibull"}
df = dfit.estimate_population_parameters(data = input_data, distribution = distribution_dict)
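As a point of comparison, the same column-to-distribution idea can be sketched with scipy.stats alone, where each distribution object exposes a fit method that estimates parameters by maximum likelihood. This is a sketch under the assumption that maximum likelihood is an acceptable estimator; the source does not state which method the toolkit uses, and the synthetic data and output layout are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
data = pd.DataFrame({
    "feature_1": rng.normal(12, 2, 1000),
    "feature_3": rng.gamma(shape=3.0, scale=2.0, size=1000),
})
distribution_dict = {"feature_1": "norm", "feature_3": "gamma"}

rows = []
for column, dist_name in distribution_dict.items():
    dist = getattr(stats, dist_name)
    # fit() returns (shape parameters..., loc, scale) estimated by maximum likelihood.
    params = dist.fit(data[column].to_numpy())
    rows.append({"column": column, "distribution": dist_name, "params": params})

estimates = pd.DataFrame(rows)
print(estimates)
```

The number of parameters varies by distribution: two for norm (loc, scale) and three for gamma (shape, loc, scale).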
Distribution Fitting
Similar to the normality testing functions, the DSML toolkit provides two general methods of distribution fitting: visual and statistical.
Visual Distribution Fitting
Once again, histograms and QQ plots are the focus of the visual distribution fitting function. For continuous distributions, they provide the same insights as those created to check for normality, but use a different distribution to produce the probability density function (pdf) curve values and/or the theoretical quantiles. For discrete distributions, a QQ plot cannot be created, and the histogram is overlaid with probability mass function (pmf) values instead of pdf values. The pmf values are calculated for each unique value in the data and, rather than being drawn as a curve, can be plotted as points and/or a line connecting those points. Please refer to the example Spotfire application for more instructions on how to generate these plots within Spotfire.
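The discrete case can be illustrated with a short sketch that computes pmf values at each unique value in the data. It uses scipy.stats.poisson directly and synthetic data; it is not the toolkit's implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
counts = rng.poisson(lam=4.0, size=500)  # synthetic discrete column

# For a Poisson distribution, the maximum likelihood estimate of mu is the sample mean.
mu = counts.mean()

# Observed relative frequency and fitted pmf at each unique value in the data.
unique_values, frequencies = np.unique(counts, return_counts=True)
observed = frequencies / counts.size
pmf_values = stats.poisson.pmf(unique_values, mu)

for value, obs, pmf in zip(unique_values, observed, pmf_values):
    print(f"{value}: observed={obs:.3f}, pmf={pmf:.3f}")
```

Plotting observed as bars and pmf_values as points at the same x positions gives the discrete counterpart of the histogram overlay.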
Statistical Distribution Fitting
For non-normal continuous distributions and discrete distributions, either the Kolmogorov-Smirnov test or the Chi-Square test is applied to assess how well a distribution fits. Both test whether a set of data is likely to have come from a particular distribution: the Kolmogorov-Smirnov test is applied to non-normal continuous distributions, and the Chi-Square test to discrete distributions. These tests can be applied to several columns, with the results being returned in a table that can be displayed in Spotfire. You can read more about the Kolmogorov-Smirnov test here, and the Chi-Square test here.
Code Examples
import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
# will return output dataframe to create histograms, parameters, and premade plots
df, params = dfit.visual_distribution_fitting(data = input_data, distribution = "poisson", plot_type = "histogram", return_params = True, return_plots = True)
# will return output dataframe with results
# input for dist_params matches the output of estimate_population_parameters
df = dfit.statistical_distribution_fitting(data = input_data, dist_params = estimated_params, alpha = 0.05)
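The two tests described above can also be run directly with scipy.stats, which may help when interpreting the results table. The following is a sketch under stated assumptions, not the toolkit's code: the data are synthetic, the parameters are fitted from the same sample, and the Chi-Square bins are simply the integers from 0 to the observed maximum with the upper tail folded into the last bin.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha = 0.05

# Continuous case: Kolmogorov-Smirnov test against a fitted gamma distribution.
continuous = rng.gamma(shape=2.0, scale=3.0, size=500)
gamma_params = stats.gamma.fit(continuous)
ks = stats.kstest(continuous, "gamma", args=gamma_params)
print(f"KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")

# Discrete case: Chi-Square test against a fitted Poisson distribution.
discrete = rng.poisson(lam=4.0, size=500)
mu = discrete.mean()
k = np.arange(discrete.max() + 1)
observed = np.bincount(discrete, minlength=k.size)
probs = stats.poisson.pmf(k, mu)
probs[-1] += stats.poisson.sf(k[-1], mu)  # fold the upper tail into the last bin
expected = probs / probs.sum() * discrete.size
# ddof=1 because one parameter (mu) was estimated from the data.
chi2 = stats.chisquare(observed, expected, ddof=1)
print(f"Chi-Square statistic={chi2.statistic:.3f}, p-value={chi2.pvalue:.3f}")

# In both tests, a p-value below alpha is evidence against the candidate distribution.
```

Two caveats apply to this sketch: Kolmogorov-Smirnov p-values tend to be too large when the parameters are estimated from the same data being tested, and the Chi-Square approximation is less reliable when some bins have very small expected counts.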
Distribution Prediction
If you don’t have a particular distribution in mind, you can use the DSML toolkit to predict which distribution best fits your data, along with its estimated parameters. This function uses the residual sum of squares (RSS) as the goodness-of-fit measure for comparing the input data to a variety of theoretical distributions. You can read more about the residual sum of squares here.
Please note that this function should only be used with continuous variables, as only the continuous distributions above are compared in this function.
Code Example
import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
# will output dataframe with each column from input_data, the best fit distribution, RSS value, and parameters
df = dfit.predict_best_distribution(data = input_data)
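To show how an RSS comparison can rank candidate distributions, here is a minimal sketch: each candidate is fitted, its pdf is evaluated at the histogram bin centers, and the squared differences from the empirical density are summed. The helper names, the 50 bins, and the short candidate list are assumptions made for illustration; the toolkit's own procedure may differ in its details.

```python
import numpy as np
from scipy import stats


def rss_for_distribution(values, dist_name, bins=50):
    """Fit a scipy.stats distribution and return (RSS, fitted parameters)."""
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    dist = getattr(stats, dist_name)
    params = dist.fit(values)
    fitted = dist.pdf(centers, *params)
    return float(np.sum((density - fitted) ** 2)), params


def best_distribution(values, candidates):
    """Return (name, RSS, parameters) for the candidate with the lowest RSS."""
    scored = [(name, *rss_for_distribution(values, name)) for name in candidates]
    return min(scored, key=lambda item: item[1])


rng = np.random.default_rng(5)
sample = rng.exponential(scale=2.0, size=2000)  # synthetic continuous column
name, rss, params = best_distribution(sample, ["norm", "expon", "uniform"])
print(name, rss, params)
```

A lower RSS means the fitted pdf tracks the empirical density more closely. Because the value depends on the number of bins, it is best used to compare candidates on the same data rather than read as an absolute score.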
Probability Prediction
Once you have a well-fitting distribution and parameters for your data, you can continue to use that information by predicting the probability of new data occurring within that distribution. Given a column of new data, along with your distribution and parameters, this function will return a table of the new data and each value’s predicted probability, which can be displayed directly in Spotfire. For continuous distributions, the probability density function is used to calculate the probability, while the probability mass function is used for discrete distributions.
Code Example
import spotfire_dsml.distribution_fitting.distribution_fitting as dfit
# will output dataframe with original column values and predicted probabilities
preds = dfit.predict_proba(data = input_data["feature_1"], distribution = "norm", params = "12, 2")
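The pdf and pmf evaluations behind this step can be sketched with scipy.stats. Reading the "12, 2" string as a location of 12 and a scale of 2 is an assumption made here for illustration, and the Poisson mean of 3 is arbitrary; neither is confirmed by the source.

```python
import numpy as np
from scipy import stats

# Continuous case: density of new values under a normal distribution.
# Assumption: loc=12 and scale=2 mirror the "12, 2" parameters in the call above.
new_values = np.array([8.0, 12.0, 16.0])
densities = stats.norm.pdf(new_values, loc=12, scale=2)

# Discrete case: probability mass of new counts under a Poisson distribution.
# mu=3 is an arbitrary illustrative value.
new_counts = np.array([0, 3, 6])
masses = stats.poisson.pmf(new_counts, mu=3)

print(densities)
print(masses)
```

Keep in mind that a pdf value is a density rather than a probability in the strict sense: it can exceed 1 for narrow distributions, and it is most useful for comparing how plausible different values are relative to one another.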
Example Workflow
Below is an example workflow demonstrating when to utilize each of the functions mentioned in this article.
Additional Resources
For more details on the DSML Toolkit for Python, check out this community article. Example Spotfire applications, including the example application mentioned earlier that uses different functions in the toolkit, can be downloaded from the Exchange page here.