  • Performing Correlation Analysis using Python data functions in Spotfire


    Assessing the correlation between variables in your data set(s) is an important step in the data science process when building machine learning and AI models. It lets you identify variables/features that can be removed because they are highly correlated with other variables/features, and see which variables/features are related to the values you wish to predict. This can be done easily and interactively within Spotfire using Python data functions. In this wiki example, we will build a Spotfire dashboard that lets a user run 3 types of correlation analysis interactively and assess the results visually.

    To see the whole process, please watch the following Dr Spotfire session: 

     

    The Python Code

    Below is the code and the configuration for generating correlation scores using the pandas package. Pandas has 3 inbuilt correlation scoring methods: Spearman, Kendall, and Pearson. You will need to register a new data function, select Python as the type and then paste in this code. See this video for a guide on how to set up data functions in Spotfire (particularly around 10:30 minutes in):

    This code compares every column with every other column in the data table passed from Spotfire into the data function:

    """
    [Exploratory] Column Correlation
    October 2021
    Version: 1.2.0
    datascience@tibco.com
    
    Calculates the correlation coefficients between columns of data. Multiple 
    correlation methods are available which are: Pearson, Spearman and Kendall.
    
    Inputs
    ----------
    df : data table
        Data table containing the columns to be tested for correlation
    correlation_method: String
        Correlation method to be used. Must be one of: Spearman, Kendall or Pearson
    encode_strings: Boolean
        (Optional) Whether columns of strings should be encoded to allow for correlation calculation. If False, or omitted, then string columns are ignored
    selected_features : column (Optional)
        Column containing the names of columns in your data table to be tested for 
        correlation i.e. to select a subset of columns instead of testing all columns
        in the data
    
    Outputs
    -------
    output_corr_df : data table
        Data table containing all correlation scores in the format of {Column 1, 
        Column 2, Correlation Score, Correlation Method}
    
    Packages Required
    -------
    pandas
    scikit-learn
    
    """
    
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    
    ## check correlation method is valid
    valid_correlation_methods = ['pearson','kendall','spearman']
    if str(correlation_method).lower() not in valid_correlation_methods:
        raise ValueError("Invalid correlation method specified. Please use one of: Pearson, Kendall or Spearman")
    
    
    ## Filter data table passed from Spotfire to only columns of interest based
    ## upon another column input i.e. from feature selection marking
    
    ## If nothing is passed in default to performing correlation on all columns
    if 'selected_features' in globals() and selected_features is not None:
        if len(selected_features.unique()) >= 2:
            feature_names = selected_features.unique()
            ## filter data frame down to only columns of interest
            features_df = df.loc[:,feature_names]
        elif len(selected_features.unique()) == 0:
            features_df = df.copy()
            feature_names =  features_df.columns
        else:
            raise ValueError("Not enough features selected (" + str(len(selected_features.unique())) + " were supplied). The minimum is 2.")    
    else:
        features_df = df.copy()
        feature_names =  features_df.columns
        
    ## Handle string columns if required
    if 'encode_strings' in globals() and encode_strings is not None:
        if encode_strings == True:
            label_encoder = LabelEncoder()
            features_df = features_df.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x)
    
    ## Check we have some columns to compare - if not, a placeholder data frame is returned
    if len(feature_names) >= 2:
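        ## run correlation on numeric columns only (pandas 2.0+ no longer
        ## drops non-numeric columns automatically)
        output_corr_df = features_df.select_dtypes(include=['number', 'bool']).corr(method=correlation_method.lower())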
        output_corr_df['Column1'] = output_corr_df.index
        output_corr_df = output_corr_df.melt(id_vars=['Column1'], var_name='Column2', value_name='Correlation Score')
        ## add in correlation method column for reference
        output_corr_df["Correlation Method"] = correlation_method
        ## remove same column comparison
        output_corr_df = output_corr_df[output_corr_df.Column1 != output_corr_df.Column2]
    else: ## return a placeholder data frame with a single blank row
        output_corr_df = pd.DataFrame({"Column1": [""],
                                       "Column2": [""],
                                       "Correlation Score": [0.0],
                                       "Correlation Method": [correlation_method]
                                       })
    
    # Copyright (c) 2021. TIBCO Software inc.
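To see what the core of the script produces outside Spotfire, here is a minimal sketch that runs the same corr, melt and filter steps on a small, made-up data frame. The column names and values are assumptions for illustration only; in Spotfire, `df` and `correlation_method` are supplied by the data function inputs.

```python
import pandas as pd

# Hypothetical sample data standing in for the Spotfire input table `df`
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 95],
    "shoe_size": [36, 38, 41, 43, 46],
})
correlation_method = "Spearman"

# Square correlation matrix: one row and one column per input column
corr_matrix = df.corr(method=correlation_method.lower())

# Reshape to the long format {Column1, Column2, Correlation Score}
corr_matrix["Column1"] = corr_matrix.index
long_df = corr_matrix.melt(
    id_vars=["Column1"], var_name="Column2", value_name="Correlation Score"
)
long_df["Correlation Method"] = correlation_method

# Drop the comparison of each column with itself
long_df = long_df[long_df["Column1"] != long_df["Column2"]]
print(long_df)
```

With three input columns the long table has six rows, because each pair appears once in each direction.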

     

    Data Function Inputs

    For this data function we have specified 4 inputs, two of which are optional. They are described below:

    • df - the table containing the columns to be analyzed for correlation
    • encode_strings - (Optional input) true/false on whether to encode categorical and string columns to include these columns in the correlation analysis
    • selected_features - (Optional input) a column from a table that has the names of the columns we want to include in the correlation analysis.
    • correlation_method - a String value that states what correlation method to use. Must be one of: Spearman, Kendall, or Pearson
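To make the shape of these inputs concrete, the sketch below builds stand-ins for all four and applies the same column filter the script uses. It assumes a column input reaches Python as a pandas Series (which is what the script's use of `.unique()` implies); the table contents and names are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the four data function inputs
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30000, 42000, 61000, 58000],
    "city": ["Leeds", "York", "Leeds", "Bath"],
})
correlation_method = "Pearson"   # String value
encode_strings = False           # Boolean value (optional)

# Assumed to arrive as a Series; marked rows may repeat a column name
selected_features = pd.Series(["age", "income", "age"])

# Same filtering step as in the script: de-duplicate, then subset the table
feature_names = selected_features.unique()
features_df = df.loc[:, feature_names]
print(list(features_df.columns))
```

Only the named columns survive the filter, so `city` is left out of the analysis here regardless of the `encode_strings` setting.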

    Data Function Outputs

    There is a single output, which is our table of correlation scores that a user can then use to decide which features to keep:

    • output_corr_df - a table listing each pair of compared column names, their correlation score (from -1 to 1) and the correlation method used
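As one possible way to use this output, the sketch below picks out strongly correlated pairs from a table in the same layout. The scores and the 0.9 threshold are assumptions for illustration, not values from the article.

```python
import pandas as pd

# Hypothetical output table in the same layout as output_corr_df
output_corr_df = pd.DataFrame({
    "Column1": ["a", "a", "b", "b", "c", "c"],
    "Column2": ["b", "c", "a", "c", "a", "b"],
    "Correlation Score": [0.95, 0.10, 0.95, -0.20, 0.10, -0.20],
    "Correlation Method": ["Pearson"] * 6,
})

threshold = 0.9  # assumed cut-off; choose one that suits your data

# Keep strongly positive or strongly negative correlations
strong = output_corr_df[output_corr_df["Correlation Score"].abs() >= threshold]

# Each pair appears twice (a-b and b-a); keep one row per unordered pair
strong = strong[strong["Column1"] < strong["Column2"]]
print(strong[["Column1", "Column2", "Correlation Score"]])
```

Columns that appear in the resulting pairs are candidates for removal, since they carry much of the same information as their partner.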

    Optional Inputs in Python Data Functions

    Note the line of code in the above script:

    if 'selected_features' in globals() and selected_features is not None:

     

    This allows us to make inputs optional, as it checks whether the name of the input variable exists in Python's globals. Using an if statement like the one above is a way to check whether the user passed the optional input or not. See more Python tips and tricks like this in this article.
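The same check can be wrapped in a small helper when a script has several optional inputs. This is only a sketch: `read_optional_input` is a hypothetical name, not part of Spotfire or the script above, and a plain dictionary stands in for `globals()` so the example runs anywhere.

```python
def read_optional_input(name, namespace, default=None):
    """Return namespace[name] if it exists and is not None, otherwise default."""
    value = namespace.get(name)
    return default if value is None else value

# Simulate a run where only encode_strings was supplied by the user
script_namespace = {"encode_strings": True}

encode = read_optional_input("encode_strings", script_namespace, default=False)
features = read_optional_input("selected_features", script_namespace)
print(encode, features)
```

Inside a data function you would pass `globals()` as the namespace, for example `read_optional_input("selected_features", globals())`.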

    Making the Correlation Assessment Dynamic

    We can make the correlation assessment react to the selection of columns in another chart by simply altering the input settings for the selected_features input so that it is limited by filtering or marking. In the example below, the input is limited by the marking on the feature selection bar charts, as shown in the previous article on performing feature selection in Spotfire using Python and in the video covering this topic:

    [Image: correlation_analysis1.png]

    Now the columns passed into the Python function are limited to those marked in another visualization. In this way, we can control which columns we assess for correlation before proceeding to use the data further. The final display in the Dr Spotfire session is shown below:

    [Image: correlation_analysis2.png]

    In this example, we are using a drop-down control in a text area to set the correlation method passed to the Python data function, and an action control button to run the data function. The correlation results are shown in a table and a heatmap, which makes it easy and intuitive to assess the results and identify hotspots.

    Additional Notes

    In pandas versions before 2.0, the corr function automatically removes non-numerical columns from the comparison; from pandas 2.0 onwards, such columns must be dropped beforehand or excluded with numeric_only=True. Either way, categorical and string columns can be included in the analysis by encoding them with some form of encoder function. In the code example above, a label encoder is used. However, this is just a simple example, and other encoders such as count, binary or hash encoders are likely to provide better encoding.
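To illustrate the difference between encoders, the sketch below compares a label-style encoding with a simple count encoding using plain pandas. The data is made up, and pandas category codes stand in for scikit-learn's LabelEncoder here (both number the categories in sorted order); this is an illustration, not the encoding used by the script above.

```python
import pandas as pd

# Hypothetical data with one string column and one numeric column
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "red", "blue"],
    "price": [10.0, 7.5, 11.0, 3.0, 9.5, 8.0],
})

# Label-style encoding: categories numbered in sorted order
# (blue=0, green=1, red=2), which imposes an arbitrary ordering
label_encoded = df["colour"].astype("category").cat.codes

# Count encoding: replace each category with how often it occurs
count_encoded = df["colour"].map(df["colour"].value_counts())

encoded_df = df.assign(colour_label=label_encoded, colour_count=count_encoded)
print(encoded_df.drop(columns="colour").corr(method="pearson"))
```

The two encodings can give different correlation scores against the same numeric column, so it is worth checking that the chosen encoding means something for your data before reading too much into the result.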

