Performing Feature Selection using Python data functions in Spotfire


    An important step in the data science life cycle is to assess the variables or features in your data and determine which are the most important and appropriate for building your machine learning and/or AI models. This can be achieved easily and interactively within Spotfire using Python data functions. In this wiki example, we will build a Spotfire analytics application that allows interactive configuration of the feature selection in terms of the data being passed, as well as how the feature selection is performed. This will utilize Python via Spotfire's Python data function integration.

    Introduction

    To see the whole process, please watch the following Dr. Spotfire session: 

    Prerequisites

    • Spotfire 10.7 or later is required for the code shown below. However, users of earlier Spotfire versions can also follow this example if they have the community Python Data Function Extension (for Spotfire 7.13 to 10.6), which can be installed on your Spotfire instance or server for 10.6 and below.
    • Install the scikit-learn Python package (imported in code as sklearn) into your Python instance. This can be done through the Spotfire menu by going to Tools->Python Tools->Package Management. Search for scikit-learn and click install. A quick check to verify the installation is shown below.
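
    Note: to confirm the package installed correctly, a quick check you can run (for example in a Python console using the same interpreter that Spotfire is configured with) is:

    ## quick check that scikit-learn is available and report which version is installed
    import sklearn
    print(sklearn.__version__)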

    The Python Code

    Below is the code and the configuration for doing a simple feature selection using the sklearn package and the DecisionTreeRegressor. You will need to register a new data function, select Python as the type, and then paste in this code. See this video for a guide on how to set up data functions in Spotfire (particularly around 10:30 minutes in):

    The code below performs feature selection using a single method:

    ## import packages
    import pandas as pd
    
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.preprocessing import LabelEncoder
    
    
    ########
    ## In this section we are filtering the dataset to only contain the training data, i.e. excluding the test data
    ## You can comment this out if you just want to use all data in the feature importance
    
    ## filter to only training data
    model_data = model_data[model_data[training_column] == 'Train']
    ## remove the training column as it's no longer needed
    model_data.drop(training_column, axis=1, inplace=True)
    
    ########
    
    ## encode text columns
    ## this could be changed to other encoders such as one hot but for simplicity we will use label
    label_encoder = LabelEncoder()
    model_data = model_data.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x)
    
    
    ## Get the variables / features out by removing the target column
    X = model_data.drop(target_column, axis=1)
    ## Now separate out the target column we are wanting to predict
    Y = model_data.pop(target_column)
    
    # define the DecisionTreeRegressor model
    model_dt = DecisionTreeRegressor()
    
    # fit the model
    model_dt.fit(X, Y)
    
    # return importance results 
    importance_results_dt = pd.DataFrame({'Feature': X.columns, 'Importance': model_dt.feature_importances_})
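
    The comments above mention one-hot encoding as an alternative to the LabelEncoder. If you prefer that approach, a minimal sketch of the replacement encoding step (assuming your target column is numeric, as it should be for a regressor) could look like this:

    ## sketch of a one-hot alternative to the LabelEncoder step above
    ## assumes the target column is already numeric (it should be, for a regressor)
    text_columns = model_data.select_dtypes(include='object').columns
    model_data = pd.get_dummies(model_data, columns=text_columns)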
     

    Data Function Inputs

    For this data function we have specified 3 inputs which are described below:

    • model_data - the data that contains the column we want to predict and all the columns we want to test for feature selection against. This input is of type Table.
    • target_column - this is a string input which holds the name of your target column, i.e. the values you want to predict. By passing this as a string which we can refer to in the Python code, we can make it dynamic and react to controls in a Text Area in Spotfire, for example.
    • training_column - Optional - this is a string input which holds the name of the column used to split your data into training data and testing data. In the script above, the value 'Train' is hardcoded as the value marking a row to be used for training. You can change this to suit your data, or even make it dynamic by adding another data function input. If you do not want to split your data and wish to use all of it, comment that section out of the script above and do not add this as an input (or guard it, as shown in the sketch after this list).
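
    If you would rather keep the training filter in the script but make it tolerant of the training_column input not being mapped, one possible sketch (not part of the original script) is to guard that section instead of commenting it out:

    ## sketch: only filter to training rows if the optional training_column input was supplied
    try:
        model_data = model_data[model_data[training_column] == 'Train']
        model_data.drop(training_column, axis=1, inplace=True)
    except NameError:
        ## training_column was not provided as an input, so use all rows
        pass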

    Data Function Outputs

    There is a single output: the table of feature importance scores that a user can then use to select features.

    • importance_results_dt - the table of feature importance scores taken from the fitted model object. This is returned to Spotfire as a table.
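
    If you would like the table to arrive in Spotfire already ordered by importance (optional; a bar chart can also sort it for you), you could add a final line to the script:

    ## optional: sort the features from most to least important before returning the table
    importance_results_dt = importance_results_dt.sort_values('Importance', ascending=False)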

    Making the Data Science Feature Selection Dynamic

    The strength of this approach is not only the interactive visuals that can be generated from the feature selection but the ability to make the feature selection a dynamic process.

    By adding a Text Area in Spotfire we can create controls that allow the user to control how the feature selection is done. Here is an example of doing this:

    [Image: python_feature_selection_1.png]

    In the example above, a user can choose their target column (which is passed to the data function), and the predictor columns which are used to filter the data sent to the data function. 

    To create the target drop-down, we just edit the Text Area and add a property control of the type drop-down. Here is how this is configured from the example above:

    [Image: python_feature_selection_2.png]

    Notice here we set a Document Property to hold whatever column is selected by the user, and we pass this to our data function (see later). We then populate the drop-down with the column selection from our data table. You can add expressions in the selectable columns area to hide certain column types or names for example.

    See these guides on creating property controls in Text Areas:

    Using a similar technique, we can create a listbox (multi-select) that allows users to select the predictor columns used to filter the data sent for our feature selection.

    Finally, the Determine Importance button is an action control that triggers our Python Data Function:

    [Image: python_feature_selection_3.png]

    Configuring the Python Data Function

    In the Inputs and Outputs sections above we stated what inputs and outputs the script expects. Below is how the data function is then parameterized to use the example text area and tool displayed above:

    The key parameter to define is the data that is passed to the Python data function. Below is the configuration used for this. Here we use an expression to define which columns of data will be passed:

    [Image: python_feature_selection_4.png]

    If you click Edit on the right and double-click the Document Property you created earlier to hold the selection from the multi-select column listbox in the Text Area, it will add the mapping code to your expression for you. For example:

     $map("[Melbourne Laps].[${selectedPredictors}]", ",")
     

    In the example above Melbourne Laps is the name of the data table we are applying this selection expression to.

    You can then append extra expressions based upon other inputs and document properties; here we add the document property which holds the name of the target column, plus the training column. This results in the final expression below:

     $map("[Melbourne Laps].[${selectedPredictors}]", ","),[Melbourne Laps].[${selectedTarget}],[Melbourne Laps].[DataUsage]
     

    This produces a list of data table column names that the expression will select from the data passed to our (Python) data function. An alternative is to send all the data to the Python data function and simply filter the columns inside the Python code. This works as well, but it means sending unnecessary data, which is less efficient (and may be slower as your data gets larger); a sketch of that approach is shown below.
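
    For reference, if you did take that alternative route, the filtering inside the Python code might look like the sketch below, where selected_predictors is a hypothetical extra string input holding a comma-separated list of column names (it is not one of the inputs defined earlier):

    ## sketch: keep only the user-selected predictor columns plus the target and training columns
    ## selected_predictors is a hypothetical string input, e.g. "Speed,Sector 1,Sector 2"
    keep_columns = [c.strip() for c in selected_predictors.split(',')] + [target_column, training_column]
    model_data = model_data[keep_columns]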

    For the other two parameters needed by this script, we simply specify values based upon Document Properties, Column Properties, or hard-coded values:

    [Image: python_feature_selection_5.png]

    For the outputs, we just need to define the table name(s) to be created/updated by the Python code:

    [Image: python_feature_selection_6.png]

    Once configured, we can build visuals from this data table returned by Python as shown earlier in this article.

    Adding More Feature Selection Methods

    One advantage of this approach is that a Python data function can return multiple tables, which means we can run multiple feature selection methods and compare the results. Below is a script which adds a second method (a random forest) to the original Python script.

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import LabelEncoder
    
    ## filter to only training data
    model_data = model_data[model_data[training_column] == 'Train']
    ## remove the training column as it's no longer needed
    model_data.drop(training_column, axis=1, inplace=True)
    
    ## encode text columns
    ## this could be changed to other encoders such as one hot but for simplicity we will use label
    label_encoder = LabelEncoder()
    model_data = model_data.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x)
    
    ## input data
    X = model_data.drop(target_column, axis=1)
    Y = model_data.pop(target_column)
    
    # define the models
    model_dt = DecisionTreeRegressor()
    model_rf = RandomForestRegressor(n_estimators=number_of_trees, max_depth=tree_max_depth)
    #model = XGBClassifier()
    
    # fit the model
    model_dt.fit(X, Y)
    model_rf.fit(X, Y)
    
    # get importance
    importance_results_dt = pd.DataFrame({'Feature': X.columns, 'Importance': model_dt.feature_importances_})
    importance_results_rf = pd.DataFrame({'Feature': X.columns, 'Importance': model_rf.feature_importances_})
     

    Notice there are two new inputs and one new output:

    • Inputs: number_of_trees and tree_max_depth
    • Outputs: importance_results_rf
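
    If you also want to compare both methods on a single visual, one option (an extra, hypothetical output not used in this example) is to return a combined table with a Method column:

    ## sketch: stack both result tables so the two methods can be plotted together
    importance_results_combined = pd.concat([
        importance_results_dt.assign(Method='Decision Tree'),
        importance_results_rf.assign(Method='Random Forest')
    ], ignore_index=True)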

    Using the same techniques shown above, you can alter your data function to add these inputs/outputs and use Text Area controls to let a user adjust these options if required. This can then be used to produce an analysis such as this:

    [Image: python_feature_selection_7.png]

    Comparing the results from two different feature selection methods

