Performing Correlation Analysis using Python data functions in Spotfire

Assessing the correlation between variables in your data set(s) is an important step in the data science process when building machine learning and AI models. It lets you identify variables/features that can be removed because they are highly correlated with other variables/features, and see which variables/features are related to the values you wish to predict. This can be done easily and interactively within Spotfire using Python data functions. In this wiki example, we will build a Spotfire dashboard that lets a user run 3 types of correlation analysis interactively and visually assess the results.

To see the whole process, please watch the following Dr Spotfire session:

The Python Code

Below is the code and the configuration for generating correlation scores using the pandas package. Pandas has 3 inbuilt correlation scoring methods: Spearman, Kendall, and Pearson. You will need to register a new data function, select Python as the type and then paste in this code. See this video for a guide on how to set up data functions in Spotfire (particularly around 10:30 minutes in):

This code compares every column with every other column in the data table passed from Spotfire into the data function:

"""
[Exploratory] Column Correlation
October 2021
Version: 1.2.0
datascience@tibco.com

Calculates the correlation coefficients between columns of data. Multiple 
correlation methods are available which are: Pearson, Spearman and Kendall.

Inputs
----------
df : data table
    Data table containing the columns to be tested for correlation
correlation_method: String
    Correlation method to be used. Must be one of: Spearman, Kendall or Pearson
encode_strings: Boolean
    (Optional) Whether columns of strings should be encoded to allow for correlation calculation. If False, or omitted, then string columns are ignored
selected_features : column (Optional)
    Column containing the names of columns in your data table to be tested for 
    correlation i.e. to select a subset of columns instead of testing all columns
    in the data

Outputs
-------
output_corr_df : data table
    Data table containing all correlation scores in the format of {Column 1, 
    Column 2, Correlation Score, Correlation Method}

Packages Required
-------
pandas
scikit-learn

"""

import pandas as pd
from sklearn.preprocessing import LabelEncoder

## check correlation method is valid
valid_correlation_methods = ['pearson','kendall','spearman']
if str(correlation_method).lower() not in valid_correlation_methods:
    raise ValueError("Invalid correlation method specified. Please use one of: Pearson, Kendall or Spearman")


## Filter data table passed from Spotfire to only columns of interest based
## upon another column input i.e. from feature selection marking

## If nothing is passed in default to performing correlation on all columns
if 'selected_features' in globals() and selected_features is not None:
    if len(selected_features.unique()) >= 2:
        feature_names = selected_features.unique()
        ## filter data frame down to only columns of interest
        features_df = df.loc[:,feature_names]
    elif len(selected_features.unique()) == 0:
        features_df = df.copy()
        feature_names =  features_df.columns
    else:
        raise ValueError("Not enough features selected (" + str(len(selected_features.unique())) + " were supplied). The minimum is 2.")    
else:
    features_df = df.copy()
    feature_names =  features_df.columns
    
## Handle string columns if required
if 'encode_strings' in globals() and encode_strings is not None:
    if encode_strings == True:
        label_encoder = LabelEncoder()
        features_df = features_df.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x)

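## Check we have some columns to compare - if not, a placeholder data frame is returned
if len(feature_names) >= 2:
    ## run correlation on numeric columns only (pandas 2.0 and later raises an error on string columns)
    output_corr_df = features_df.select_dtypes(include=['number', 'bool']).corr(method=correlation_method.lower())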
    output_corr_df['Column1'] = output_corr_df.index
    output_corr_df = output_corr_df.melt(id_vars=['Column1'], var_name='Column2', value_name='Correlation Score')
    ## add in correlation method column for reference
    output_corr_df["Correlation Method"] = correlation_method
    ## remove same column comparison
    output_corr_df = output_corr_df[output_corr_df.Column1 != output_corr_df.Column2]
else: ## return a placeholder data frame (output_corr_df does not exist yet, so build it directly)
    output_corr_df = pd.DataFrame({"Column1": [""], 
                                   "Column2": [""],
                                   "Correlation Score": [0.0], 
                                   "Correlation Method": [correlation_method]
                                   })

# Copyright (c) 2021. TIBCO Software inc.

Data Function Inputs

For this data function we have specified 4 inputs, which are described below:

df - the table containing the columns to be analyzed for correlation
encode_strings - (Optional input) true/false on whether to encode categorical and string columns to include these columns in the correlation analysis
selected_features - (Optional input) a column from a table that has the names of the columns we want to include in the correlation analysis.
correlation_method - a String value that states what correlation method to use. Must be one of: Spearman, Kendall, or Pearson

Data Function Outputs

There is a single output, which is our table of correlation scores that a user can then use to decide which columns to keep or remove:

output_corr_df - a table listing each pair of compared columns, their correlation score (from -1 to 1), and the correlation method used
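To make the shape of this output concrete, here is a minimal sketch that reshapes a correlation matrix into the same long format outside Spotfire. The toy data and the names toy_df and long_df are made up for illustration, and the reshaping uses rename_axis/reset_index rather than the exact steps in the script above.

```python
import pandas as pd

# Hypothetical toy data standing in for the table passed from Spotfire
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

# Square correlation matrix: one row and one column per input column
corr_matrix = toy_df.corr(method="pearson")

# Reshape to long format: one row per (Column1, Column2) pair
long_df = (
    corr_matrix.rename_axis("Column1")
    .reset_index()
    .melt(id_vars="Column1", var_name="Column2", value_name="Correlation Score")
)

# Drop the comparisons of a column with itself
long_df = long_df[long_df["Column1"] != long_df["Column2"]].reset_index(drop=True)

# Record which method produced the scores
long_df["Correlation Method"] = "pearson"

print(long_df)
```

Each pair appears twice (a vs b and b vs a), which is convenient for a heatmap that expects both halves of the matrix.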

Optional Inputs in Python Data Functions

Note the line of code in the above script:

if 'selected_features' in globals() and selected_features is not None:

This allows us to make inputs optional, as it checks for the name of your input variable in Python's globals. If a user doesn't pass this optional input, the variable is either missing or set to None, so an if statement like the one above is a way to check whether it was passed or not. See more tips and tricks in Python like this in this article.
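The same check can be wrapped in a small helper so that each optional input gets a default value. This is only a sketch: resolve_optional is a hypothetical helper, not part of Spotfire or the script above, and the dictionaries below simulate the global namespace of two data function runs. Inside a real data function you would pass globals() as the namespace.

```python
def resolve_optional(name, namespace, default=None):
    """Return namespace[name] if it is present and not None, otherwise default."""
    value = namespace.get(name)
    return default if value is None else value


# Simulated namespaces: one run where the optional input was omitted,
# one where it was supplied (hypothetical values for illustration)
namespace_without_input = {"correlation_method": "pearson"}
namespace_with_input = {"correlation_method": "pearson", "encode_strings": True}

encode_missing = resolve_optional("encode_strings", namespace_without_input, default=False)
encode_given = resolve_optional("encode_strings", namespace_with_input, default=False)

print(encode_missing, encode_given)
```

In a data function the call would look like resolve_optional("encode_strings", globals(), default=False), which covers both the case where the variable was never created and the case where it was created as None.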

Making the Correlation Assessment Dynamic

We can make the correlation assessment react to the selection of columns in another chart by simply altering the input settings for the selected_features input so that it reacts to filtering or marking. In the example below, the input is limited by the marking on the feature selection bar charts, as shown in the previous article on performing feature selection in Spotfire using Python and in the Dr Spotfire session covering this topic:

Now the columns passed into the Python function are limited to those marked in another visualization. In this way, we can control which columns we assess for correlation before proceeding to use the data further. The final display in the Dr Spotfire session is shown below:

In this example, we are using a drop-down control in a text area to set the correlation method to pass to the Python data function, and then an action control button to trigger the data function running. The correlation results are shown using a table, and a heatmap to make the assessment and identification of hotspots easy and intuitive.

Additional Notes

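The corr function in pandas only computes correlations between numerical columns. Older pandas releases silently dropped non-numerical columns from the comparison; from pandas 2.0 onwards the numeric_only argument defaults to False, so non-numerical columns must be excluded beforehand (or numeric_only=True passed) to avoid an error. Either way, categorical and string columns can be included by first encoding them with some form of encoder function. In the code example above, a label encoder is used. However, this is just a simple example, and other encoders such as count, binary or hash encoders are likely to provide better encoding.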
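As an illustration of one of the alternatives mentioned, below is a minimal sketch of count encoding using only pandas. The toy data and the count_encode helper are made up for this example and are not part of the data function above; dedicated encoder packages offer more robust implementations.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one string column and one numeric column
toy_df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "red", "blue"],
    "size": [10, 20, 11, 30, 9, 21],
})


def count_encode(series):
    """Replace each category with the number of times it occurs in the column."""
    return series.map(series.value_counts())


# Encode every non-numeric column, leaving numeric columns untouched
encoded_df = toy_df.copy()
for column_name in encoded_df.columns:
    if not is_numeric_dtype(encoded_df[column_name]):
        encoded_df[column_name] = count_encode(encoded_df[column_name])

# All columns are now numeric, so they all take part in the correlation
corr_matrix = encoded_df.corr(method="pearson")
print(corr_matrix)
```

Count encoding maps categories with the same frequency to the same number, so it loses information when several categories are equally common. Whether that is acceptable depends on the data.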
