Performing Feature Selection using Python data functions in Spotfire


    An important step in the data science life cycle is to assess the variables or features in your data and determine which are the most important and appropriate for building your machine learning and/or AI models. This can be achieved easily and interactively within Spotfire using Python data functions. In this wiki example, we will build a Spotfire analytics application that allows interactive configuration of the feature selection in terms of the data being passed, as well as how the feature selection is performed. This will utilize Python via Spotfire's Python data function integration.

    Introduction

    To see the whole process, please watch the following Dr. Spotfire session: 

    Prerequisites

    • Spotfire 10.7 or later is required for the code shown below. However, users of earlier Spotfire versions can also follow this example if they have the community Python Data Function Extension (for Spotfire 7.13 to 10.6), which can be installed on your Spotfire instance or server for 10.6 and below.
    • Install the scikit-learn Python package (imported in code as sklearn) into your Python instance. This can be done through the Spotfire menu by going to Tools->Python Tools->Package Management. Search for scikit-learn and click install. A quick check to verify the installation is shown below.
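
    Note: to confirm the package installed correctly, a quick check you can run (for example in a Python console using the same interpreter that Spotfire is configured with) is:

    ## quick check that scikit-learn is available and report which version is installed
    import sklearn
    print(sklearn.__version__)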

    The Python Code

    Below is the code and the configuration for doing a simple feature selection using the sklearn package and the DecisionTreeRegressor. You will need to register a new data function, select Python as the type, and then paste in this code. See this video for a guide on how to set up data functions in Spotfire (particularly around 10:30 minutes in):

    The code below performs feature selection using a single method:

    ## import packages
    import pandas as pd
    
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.preprocessing import LabelEncoder
    
    
    ########
    ## In this section we are filtering the dataset to only contain the training data, i.e. excluding the test data
    ## You can comment this out if you just want to use all data in the feature importance
    
    ## filter to only training data
    model_data = model_data[model_data[training_column] == 'Train']
    ## remove the training column as it's no longer needed
    model_data.drop(training_column, axis=1, inplace=True)
    
    ########
    
    ## encode text columns
    ## this could be changed to other encoders such as one hot but for simplicity we will use label
    label_encoder = LabelEncoder()
    model_data = model_data.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x)
    
    
    ## Get the variables / features out by removing the target column
    X = model_data.drop(target_column, axis=1)
    ## Now separate out the target column we are wanting to predict
    Y = model_data.pop(target_column)
    
    # define the DecisionTreeRegressor model
    model_dt = DecisionTreeRegressor()
    
    # fit the model
    model_dt.fit(X, Y)
    
    # return importance results 
    importance_results_dt = pd.DataFrame({'Feature': X.columns, 'Importance': model_dt.feature_importances_})
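
    The comments above mention one-hot encoding as an alternative to the LabelEncoder. If you prefer that approach, a minimal sketch of the replacement encoding step (assuming your target column is numeric, as it should be for a regressor) could look like this:

    ## sketch of a one-hot alternative to the LabelEncoder step above
    ## assumes the target column is already numeric (it should be, for a regressor)
    text_columns = model_data.select_dtypes(include='object').columns
    model_data = pd.get_dummies(model_data, columns=text_columns)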
     

    Data Function Inputs

    For this data function we have specified 3 inputs which are described below:

    • model_data - the data that contains the column we want to predict and all the columns we want to test for feature selection against. This input is of type Table.
    • target_column - this is a string input which holds the name of your target column, i.e. the values you want to predict. By passing this as a string which we can refer to in the Python code, we can make it dynamic and react to controls in a Text Area in Spotfire, for example.
    • training_column - Optional - this is a string input which holds the name of the column used to split your data into training data and testing data. In the script above, the value 'Train' is hardcoded as the value marking a row to be used for training. You can change this to suit your data, or even make it dynamic by adding another data function input. If you do not want to split your data and wish to use all of it, comment that section out of the script above and do not add this as an input (or guard it, as shown in the sketch after this list).
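
    If you would rather keep the training filter in the script but make it tolerant of the training_column input not being mapped, one possible sketch (not part of the original script) is to guard that section instead of commenting it out:

    ## sketch: only filter to training rows if the optional training_column input was supplied
    try:
        model_data = model_data[model_data[training_column] == 'Train']
        model_data.drop(training_column, axis=1, inplace=True)
    except NameError:
        ## training_column was not provided as an input, so use all rows
        pass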

    Data Function Outputs

    There is a single output: the table of feature importance scores that a user can then use to select features.

    • importance_results_dt - the table of feature importance scores taken from the fitted model object. This is returned to Spotfire as a table.
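
    If you would like the table to arrive in Spotfire already ordered by importance (optional; a bar chart can also sort it for you), you could add a final line to the script:

    ## optional: sort the features from most to least important before returning the table
    importance_results_dt = importance_results_dt.sort_values('Importance', ascending=False)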

    Making the Data Science Feature Selection Dynamic

    The strength of this approach is not only the interactive visuals that can be generated from the feature selection but the ability to make the feature selection a dynamic process.

    By adding a Text Area in Spotfire we can create controls that allow the user to control how the feature selection is done. Here is an example of doing this:

    [Image: python_feature_selection_1.png]

    In the example above, a user can choose their target column (which is passed to the data function), and the predictor columns which are used to filter the data sent to the data function. 

    To create the target drop-down, we just edit the Text Area and add a property control of the type drop-down. Here is how this is configured from the example above:

    [Image: python_feature_selection_2.png]

    Notice here we set a Document Property to hold whatever column is selected by the user, and we pass this to our data function (see later). We then populate the drop-down with the column selection from our data table. You can add expressions in the selectable columns area to hide certain column types or names for example.

    See these guides on creating property controls in Text Areas:

    Using a similar technique, we can create a listbox (multi-select) that allows users to select the predictor columns used to filter the data sent for our feature selection.

    Finally, the Determine Importance button is an action control that triggers our Python Data Function:

    [Image: python_feature_selection_3.png]

    Configuring the Python Data Function

    In the Inputs and Outputs sections above we stated what inputs and outputs the script expects. Below is how the data function is then parameterized to use the example text area and tool displayed above:

    The key parameter to define is the data that is passed to the Python data function. Below is the configuration used for this. Here we use an expression to define which columns of data will be passed:

    [Image: python_feature_selection_4.png]

    If you click Edit on the right and double-click the Document Property you created earlier to hold the selection from the multi-select column listbox in the Text Area, it will add the mapping code to your expression for you. For example:

     $map("[Melbourne Laps].[${selectedPredictors}]", ",")
     

    In the example above Melbourne Laps is the name of the data table we are applying this selection expression to.

    You can then append extra expressions based upon other inputs and document properties; here we add the document property which holds the name of the target column, plus the training column. This results in the final expression below:

     $map("[Melbourne Laps].[${selectedPredictors}]", ","),[Melbourne Laps].[${selectedTarget}],[Melbourne Laps].[DataUsage]
     

    This produces a list of data table column names that the expression will select from the data passed to our (Python) data function. An alternative is to send all the data to the Python data function and simply filter the columns inside the Python code. This works as well, but it means sending unnecessary data, which is less efficient (and may be slower as your data gets larger); a sketch of that approach is shown below.
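
    For reference, if you did take that alternative route, the filtering inside the Python code might look like the sketch below, where selected_predictors is a hypothetical extra string input holding a comma-separated list of column names (it is not one of the inputs defined earlier):

    ## sketch: keep only the user-selected predictor columns plus the target and training columns
    ## selected_predictors is a hypothetical string input, e.g. "Speed,Sector 1,Sector 2"
    keep_columns = [c.strip() for c in selected_predictors.split(',')] + [target_column, training_column]
    model_data = model_data[keep_columns]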

    For the other two parameters needed by this script, we simply specify values based upon Document Properties, Column Properties, or hard-coded values:

    [Image: python_feature_selection_5.png]

    For the outputs, we just need to define the table name(s) to be created/updated by the Python code:

    [Image: python_feature_selection_6.png]

    Once configured, we can build visuals from this data table returned by Python as shown earlier in this article.

    Adding More Feature Selection Methods

    One advantage of this approach is that a Python data function can return multiple tables, which means we can run multiple feature selection methods and compare the results. Below is a script which adds a second method (a random forest) to the original Python script.

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import LabelEncoder
    
    ## filter to only training data
    model_data = model_data[model_data[training_column] == 'Train']
    ## remove the training column as it's no longer needed
    model_data.drop(training_column, axis=1, inplace=True)
    
    ## encode text columns
    ## this could be changed to other encoders such as one hot but for simplicity we will use label
    label_encoder = LabelEncoder()
    model_data = model_data.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x)
    
    ## input data
    X = model_data.drop(target_column, axis=1)
    Y = model_data.pop(target_column)
    
    # define the models
    model_dt = DecisionTreeRegressor()
    model_rf = RandomForestRegressor(n_estimators=number_of_trees, max_depth=tree_max_depth)
    #model = XGBClassifier()
    
    # fit the model
    model_dt.fit(X, Y)
    model_rf.fit(X, Y)
    
    # get importance
    importance_results_dt = pd.DataFrame({'Feature': X.columns, 'Importance': model_dt.feature_importances_})
    importance_results_rf = pd.DataFrame({'Feature': X.columns, 'Importance': model_rf.feature_importances_})
     

    Notice there are two new inputs and one new output:

    • Inputs: number_of_trees and tree_max_depth
    • Outputs: importance_results_rf
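
    If you also want to compare both methods on a single visual, one option (an extra, hypothetical output not used in this example) is to return a combined table with a Method column:

    ## sketch: stack both result tables so the two methods can be plotted together
    importance_results_combined = pd.concat([
        importance_results_dt.assign(Method='Decision Tree'),
        importance_results_rf.assign(Method='Random Forest')
    ], ignore_index=True)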

    Using the same techniques shown above, you can alter your data function to add these inputs/outputs and use Text Area controls to let a user adjust these options if required. This can then be used to produce an analysis such as this:

    [Image: python_feature_selection_7.png]

    Comparing the results from two different feature selection methods

