Tips and Tricks for Working with Python in Spotfire


This article contains tips and tricks for working with Python in Spotfire.

    Introduction

When you create an analysis using Spotfire®, you can enhance its capabilities through the use of data functions. Data functions are a method of calling languages such as Python, TERR®, open-source R, or even MATLAB. This allows Spotfire users to interactively call Python through the Spotfire client or the web player to greatly enhance their analytics. This guide is a summary of tips and tricks to aid in the development of Python data functions, and shows how they can be implemented in your analytics to enhance the insights and data science in Spotfire.

To learn more about data functions in Spotfire, please visit the Spotfire Enablement Hub.

Looking for tips on using R/TERR® data functions? See the attached guide, top_10_tricks_for_terr_data_functions_0.pdf.

    Setup and Background

For Spotfire 7.13 through 10.6, install the Python Data Function Extension. Spotfire 10.7 and later has native support for Python data functions.

    1. Spotfire and Python Data Type Mapping

    When data in Spotfire is passed to a Python data function, the Spotfire data types are converted to Python data types. The reverse also happens when data is returned from Python to Spotfire. View the data type conversions table in the official Spotfire Python documentation.
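As a quick illustrative sketch (assuming a data function with a Table input named input_df), you can inspect the converted types from inside the data function:

    # a minimal sketch: inspect how Spotfire data arrives in Python
    # input_df is an assumed Table input parameter
    print(type(input_df))    # Table inputs arrive as a pandas DataFrame
    print(input_df.dtypes)   # per-column pandas dtypes after conversion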

    2. Other Setup Tips

For tips on how to use a different Python interpreter than the Spotfire default, how to use a different package repository, or how to install packages when behind a firewall or proxy, please read the official Spotfire Python documentation.

    Debugging

    1. Debugging Python data functions using an external IDE

When developing Python data functions, it is often very useful, especially as a script grows larger, to run and debug the script in a popular IDE such as PyCharm or Visual Studio Code. To do this for a Python data function, follow this process:

Add the following code to your data function after the library imports. This code exports a pickle file containing all the inputs to the data function. Note the file location, which you may want to alter:

import pickle
import types

# Note: place this code after your library imports
# It writes all data function input parameters to a pickle file
def is_input_parameter(key, value):
	# skip Python internals and anything that is a module, function, or class
	if key.startswith('__'):
		return False
	if isinstance(value, (types.ModuleType, types.FunctionType, types.BuiltinFunctionType, type)):
		return False
	return True

input_param = {k: v for k, v in globals().items() if is_input_parameter(k, v)}

# You may want to alter the file path
with open('C:/Temp/data_function_inputs.pkl', 'wb') as file:
	pickle.dump(input_param, file)
     

Now run your data function once (even if it fails); this writes all the inputs sent from Spotfire to Python into a pickle file. Then copy your data function code from Spotfire into a Python file open in Visual Studio Code or PyCharm (or equivalent), and add the following code to the top of the script, after the library imports:

import pickle

# load input parameters in the Python IDE
# remember to alter the file path if you changed it in the first script
with open('C:/Temp/data_function_inputs.pkl', 'rb') as file:
	input_param = pickle.load(file)
globals().update(input_param)
     

    After running this code in your IDE, all the objects that your Spotfire Python data function takes as inputs will be loaded into your Python instance. This means you can run your Python code using the same objects that Spotfire passes to Python, making it easy to develop and debug your script. Once your script is ready, you can copy and paste it back into your data function script in Spotfire.

    2. Configure your external Python IDE to use Spotfire's Python interpreter

In IDEs such as Visual Studio Code, you can often choose which Python interpreter runs your code. This means you can configure your IDE to point to Spotfire's own Python instance, so that running the code in an external IDE replicates how Spotfire will run your data function more closely. It also means that the libraries will match those installed in your Spotfire Python instance. To get the path to Spotfire's own Python interpreter, go to the Tools menu in Spotfire and select 'Python Tools'. From there you will see the path to the Python interpreter Spotfire uses and a button to copy this path. You can then use this path to point your external IDE to Spotfire's own Python instance:

[Screenshot: the Python Tools dialog showing the path to Spotfire's Python interpreter and a button to copy it]

    Examples:

1. Add Python interpreter to Visual Studio Code
2. Add Python interpreter to PyCharm

    3. Using print statements in debugging

The print statement is one of the most important quick tools for tracking and debugging your code. With print commands you can, for instance:

• Check the content of a variable, table, etc.
• Check the data type of a variable.
• Check the data frame shape (useful in ML applications).

Below is an example of how to perform the tasks above:

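A minimal sketch, assuming a Table input named input_df ('pressure' is a hypothetical column):

    # check the content of a variable or table
    print(input_df.head())
    # check the data type of a variable
    print(type(input_df['pressure']))
    # check the data frame shape (rows, columns), useful in ML applications
    print(input_df.shape)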

To make sure function debugging is on, enable it via Tools -> Options -> Data Functions -> check 'Enable Data Function Debugging'.

4. Using Python classes for better code debugging and reusability

Classes are a big help when it comes to making our code reusable and maintainable. They also make the debugging process easier. Let's imagine a use case where you have several CSV files, each containing pressure and temperature readings for a specific pump in a pump station. You want to create a Pump() object that has the following attributes/methods:

    Attributes:

    • Pump name
    • Pump location (which is included in the csv file name)
    • Pump size (included in the csv file name)
    • Source file location

    Methods:

    • Find the max pressure value
    • Find the max temp value
    • Find the correlation between pressure and temp
    • Find summary stats
import os
import pandas as pd
from scipy.stats import pearsonr

class Pump():
    def __init__(self, path):
        self.file = path
        # each attribute is computed once by calling the method of the same name
        self.location = self.location()
        self.size = self.size()
        self.name = self.name()
        self.data = self.data()
        self.max_press = self.max_press()
        self.max_temp = self.max_temp()
        self.corr = self.corr()
        self.sumstat = self.sumstat()

    def location(self):
        # file names follow the pattern <name>_<location>_<size>.csv
        return os.path.basename(self.file).split('_')[1]

    def size(self):
        return os.path.basename(self.file).split('_')[2].replace('.csv', '')

    def name(self):
        return os.path.basename(self.file).split('_')[0]

    def data(self):
        return pd.read_csv(self.file)

    def max_press(self):
        return self.data['press'].max()

    def max_temp(self):
        return self.data['temp'].max()

    def corr(self):
        # Pearson correlation coefficient between pressure and temperature
        return pearsonr(self.data['press'], self.data['temp'])[0]

    def sumstat(self):
        return self.data.describe()


file = r'C:\Users\aalattar\Downloads\pump01_houston_130kpa.csv'
p = Pump(file)
print(p.file)
print(p.location)
print(p.size)
print(p.name)
print(p.data)
print(p.max_press)
print(p.max_temp)
print(p.corr)
print(p.sumstat)
     

[Screenshot: the output of the print statements above]

    5. Understanding the Notification Details to identify error location

The notification details show not only debug information but also the actual error reported by the Python code. They include a traceback, which helps pinpoint the specific line in your data function code that is reporting the error. Look for the line that mentions <module>, as this is the origin of the error. If this line is a function call, then look for the line that references the actual function name. See the example below:

In this example, the <module> error is on line 233. In the data function, line 233 is actually a call to a function called select_best_model, and the line causing the actual error is line 203:

[Screenshot: notification details showing a traceback where the <module> entry at line 233 calls select_best_model and the failing line within it is 203]
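As a generic illustration (a made-up example, not the data function from the screenshot), Python reports both the failing line inside a function and the <module> line that called it:

    # hypothetical sketch: the traceback will contain two entries,
    # one under select_best_model at the line raising the IndexError,
    # and one under <module> at the line calling the function
    def select_best_model(models):
        return models[0]      # fails here when models is empty

    select_best_model([])     # the traceback's <module> entry points here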

    Enhanced f(x) Flyout and End User Experience

Spotfire 11.0 introduced the f(x) flyout to facilitate access to analytics tools and data functions. It provides a seamless interface for registering and running library data functions.

    1. Clear Display Names and Documentation

Proper documentation and instructions are crucial for others to understand and use data functions. The flyout displays the data function name, the data function description, the display names of each input and output parameter, and the descriptions of each input and output parameter. To facilitate their use, ensure that the display names are clear and concise and follow the terminology of the targeted industry or field. For example, while you might use 'input_df' in your code, it's better to rename it to 'Input Data'.

    2. Optional Parameters

Some data functions, such as machine learning models, will have many input parameters so that the user has full flexibility to configure or tweak them. To avoid data functions that are cumbersome to configure, make only the most essential parameters required. It helps to ask: what are the foundational inputs needed from the user so that the function can run and return meaningful results? You might not get the best model without tuning various parameters, but you'll have a starting point. For example, in this general TensorFlow regression and classification data function, only the input data, target column name, and task are required (these appear as 'not configured' in the flyout); the remaining parameters can be left blank or overridden.

[Screenshot: the f(x) flyout showing the required parameters as 'not configured']

To make parameters optional, you need to mark them as not required and, usually, provide a default in code. Each input parameter in a data function is designated as "Required" or not. To make a parameter not required, leave the "Required Parameter" box unchecked.

[Screenshot: the input parameter dialog with the "Required Parameter" box unchecked]

There are standard code checks to determine whether these optional input variables have been specified by the user and, if not, to provide a default value. All non-required parameters appear as global variables and are set to null if no input is given.

If the parameter is a Python primitive type (this covers most input parameters of type 'Value'), the following code will use the user's input if it is provided and otherwise fall back to a default:

#save model to the provided file path, otherwise save to a default file path
#file_path is a string Value
if not ("file_path" in globals() and file_path is not None and len(file_path) > 0):
    file_path = 'best_model.h5'
     

If the input parameter is of type 'Column' or 'Table':

    #check for optional Column
    #if user does not provide a subset of columns to use, use all the columns in the input data
    if not ("selected_columns" in globals() and isinstance(selected_columns, pd.Series)):
    	selected_columns = pd.Series(list(input_data.columns))
    
    #check for optional Table
    #if user does not provide a new dataset for scoring, model is scored on training data
    if not ("new_data" in globals() and isinstance(new_data, pd.DataFrame)):
    	new_data = training_data
     

NOTE: as of Spotfire 11.8, optional parameters can be folded away/hidden and automatically moved to the bottom of the list of parameters using the 'Always show required inputs first' checkbox, as shown below:

[Screenshot: the 'Always show required inputs first' checkbox]

    3. Design of Input and Output Parameters

Taking the design of input and output parameters into account can make work easier for both Python data function developers and end users building dashboards. While there are many possible designs depending on the organization of your data and your use case, here are some common scenarios.

If your features or columns of interest are few or fixed in number, or you are working with columns from different tables, consider passing each column as its own input parameter of type 'Column'. If you have many columns of interest, or a variable number of them, pass them all under a single input data table.

Also, consider how the output parameters will be used in the data canvas and subsequent visualizations. If model predictions are output as a column, they can be added as a calculated column to any data table. To plot model predictions versus observed values in a line chart, scatter plot, etc., it helps to have a long data table with the input data and model prediction rows concatenated and labelled ('actuals' versus 'predictions'). This can be done in either the data canvas or the data function (although there might be data duplication if you do it within the data function).
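As a hedged sketch of that long-format table (observed_df, pred_values, and the column names are assumptions for illustration):

    import pandas as pd

    # label the observed rows
    actuals = observed_df[['time', 'value']].copy()
    actuals['series'] = 'actuals'

    # build matching prediction rows from hypothetical model output
    preds = observed_df[['time']].copy()
    preds['value'] = pred_values
    preds['series'] = 'predictions'

    # concatenate into one long table; colouring a line chart by 'series'
    # then plots actuals versus predictions directly in Spotfire
    long_df = pd.concat([actuals, preds], ignore_index=True)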

    4. Suppressing warnings and error messages

It is sometimes desirable to hide warnings and error messages output by a Python data function. This can be achieved with the following additions to your data function script:

    • Hide warnings

     # at the top of the data function before other imports
     import warnings
     warnings.filterwarnings('ignore')
     

More specific warning settings can also be configured. See the warnings Python documentation: https://docs.python.org/3/library/warnings.html
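For example, to suppress only a single category of warning (a sketch using the standard library API):

    import warnings

    # silence only FutureWarning; other warnings remain visible
    warnings.filterwarnings('ignore', category=FutureWarning)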

    • Hide errors

Hiding all errors may not be possible, depending on which Python package produces the error. However, errors can be reduced and/or suppressed using the following code:

     # add to the top of your data function
     import sys
     sys.tracebacklimit = 0 
     

    Advanced Uses

    1. Using a Model Produced in A Previous Data Function

If you create an object such as a machine learning model and you want to use that object in a second data function, you can use the pickle package to store the object in a Spotfire document property. This way, the model can be set as an input to other Python data functions. The following example demonstrates using the pickle package to manage this process. In the data function that creates the model object, the model needs to be pickled and then set as an output of the data function:

     #From data function number 1
     import pickle
     modelObject = pickle.dumps(myObject) 
     
    In your data function configuration, set the modelObject to be an output of type Value:
[Screenshot: the data function configuration with modelObject set as an output of type Value]

The binary data (called modelObject in the example above) can be stored in a Spotfire document property when the data function parameters are configured in Spotfire, and then passed from that document property to the second data function as an input parameter:

[Screenshot: the second data function's input parameter bound to the document property]

    Then in the second Python data function script, this model can be easily restored:

     #To data function number 2
     import pickle
     myObject = pickle.loads(modelObject)
     

    This technique should work generically for almost any object that you might want to retain between data function calls, not just models.

    2. Formatting with an ID Column

Adding an optional ID column helps relate each row of a data function's results table to the observation it came from. This is necessary if the resulting table is one-to-many, or if the function manipulates or needs to know the order of observations in its inputs. Below is a way of checking for, and otherwise providing, a valid ID column as an input parameter (the validity check is that all values in the column are unique).

If the ID column (id_column) is a 'String Value' naming a column in the main data table (model_data):

#check if the ID column (id_column) is provided and valid in the data table model_data, otherwise provide a default one
#default ID starts from 0 and increments with subsequent rows
if not ("id_column" in globals() and (id_column is not None) and (id_column in model_data.columns) and (model_data[id_column].nunique() == len(model_data))):
	id_column = "ID"
	model_data[id_column] = range(len(model_data))
assert(model_data[id_column].nunique() == len(model_data))
assert(id_column in model_data.columns)
     
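Once a valid ID column exists, relating a one-to-many results table back to its originating rows might look like this (results_df and the 'pressure' column are hypothetical names for illustration):

    # merge results back onto the input data using the ID column,
    # so each result row is paired with the observation it came from
    merged = results_df.merge(model_data[[id_column, 'pressure']],
                              on=id_column, how='left')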

    Common Issues and Errors

Below is a list of messages and errors that may be returned when you try to run your Python data function. Common issues are often related to the data passed to the data function, such as passing in an empty data table that the Python code is not set up to handle. The other common issues are trying to pass an empty data frame back from Python, or including invalid types, such as lists, in a data frame column.

    1. Cannot determine type / all values are missing

This message indicates that a column in your data frame is completely empty, so Spotfire cannot determine which type of column it is (String, Numeric, etc.). If Spotfire can't determine a type, an error is returned. Resolving this requires debugging your code to see why one of your columns is empty. There are two main methods: using simple print statements, or debugging your code in an external IDE like Visual Studio Code. Also, see this StackOverflow question and answer. For example, add a print head statement before the last occurrence of your data frame to see what is in each column:

     ## if my output data frame in my Python data function is called output_df
     print(output_df.head()) 
     
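You can also list any completely empty columns directly (a small sketch):

    # columns that are entirely missing cannot be typed by Spotfire
    empty_cols = [c for c in output_df.columns if output_df[c].isna().all()]
    print(empty_cols)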

    2. ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

    This commonly occurs when you have a column in a data frame of an array or list type. For example:

     ## A list of lists - assuming we already have a data frame called df
     df['MyColumn'] = [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]] 
     

In this case, Spotfire can't convert a column where each item or row is a list rather than a distinct value. The solution is to debug the outputs of the data function using simple print statements, such as printing a head() of your output data frame(s), or printing the types of each column and checking that none are of type list or array:

     ## Check the first 5 rows to ensure no columns have lists or arrays
     print(df.head())
     
     ## And/or check dtypes
     print(df.dtypes)
     
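If you do find a list-valued column, two possible fixes (sketches, assuming the lists are in df['MyColumn']) are to flatten each list to a delimited string, or to expand the lists onto separate rows:

    # 1) join each list into a single delimited string
    df['MyColumn'] = df['MyColumn'].apply(lambda v: ', '.join(map(str, v)))

    # 2) or give each list element its own row instead
    # df = df.explode('MyColumn')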

See the guides above on using simple print statements or debugging your code in an external IDE like Visual Studio Code to help diagnose the issue.

    3. Passing in empty data to Python: IndexError: single positional indexer is out-of-bounds 

Errors of this type are commonly due to an input data table being empty. This can easily occur if your data function's input table reacts to a filtering scheme or marking, making it possible for the user to mark no data or filter out all data. Your Python script can react in different ways, and the error may not match this one exactly, so it is important to handle the situation where an empty data table is passed into your Python data function. A simple way to handle this is to check the length of your input data frame; if it is 0, you can raise an exception or return a dummy data frame for the output table (assuming your data function outputs a data table). For example:

import pandas as pd

## if our data function input data table is called input_df
## and our expected output data table is called output_df

## check we have some rows of data
if len(input_df) > 0:
    pass  ## run your code as expected, since we have data
else:
    ## an empty data frame has been passed in
    ## to prevent an error we still need to return output_df,
    ## so we create it with a single dummy row of data
    output_df = pd.DataFrame({'Column A': ['No Data Passed In'], 'Column B': [0.0]})
     

     

In the above example, we return a dummy data frame with a single row if no data was passed to the Python data function. This single row is added (rather than returning a completely empty data frame) so that Spotfire knows what data type each column should be in the output. So, if the expected output of the Python data function is a data frame with two columns called Column A and Column B, where Column A is a String column and Column B is a Real column, then in our dummy data frame we add these two columns with a single string value and a single numeric value: 'No Data Passed In' and 0.0 respectively.

Attachment: top_10_tricks_for_terr_data_functions_0.pdf

