Introduction
When you create an analysis in Spotfire®, you can extend its capabilities through data functions. Data functions are a way of calling languages such as Python, TERR®, open-source R, or even MATLAB. They allow Spotfire users to call Python interactively through the Spotfire client or the web player to greatly enhance their analytics. This guide is a collection of tips and tricks to aid the development of Python data functions, and shows how they can be used in your analyses to enhance the insights and data science in Spotfire.
To learn more about data functions in Spotfire, please visit the Spotfire Enablement Hub.
Looking for tips on using R/TERR® data functions?
- Check out this equivalent wiki article: Tips on Debugging Spotfire® Enterprise Runtime for R Scripts in Spotfire® Data/Expression Functions
- Top 10 tricks for TERR data functions (please see resources at the end of the article)
Setup and Background
For Spotfire 7.13 to 10.6, install the Python Data Function Extension. Spotfire 10.7+ has native support for Python Data Functions.
1. Spotfire and Python Data Type Mapping
When data in Spotfire is passed to a Python data function, the Spotfire data types are converted to Python data types. The reverse also happens when data is returned from Python to Spotfire. View the data type conversions table in the official Spotfire Python documentation.
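For instance, a 'Table' input arrives in Python as a pandas DataFrame, a 'Column' as a pandas Series, and a 'Value' as a plain Python scalar. A minimal sketch for checking this inside a data function (the input names input_table, input_column and input_value are illustrative):

# A quick type check, assuming inputs named input_table (Table),
# input_column (Column) and input_value (Value)
print(type(input_table))   # expect: pandas.core.frame.DataFrame
print(type(input_column))  # expect: pandas.core.series.Series
print(type(input_value))   # expect: a scalar such as int, float, str or bool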
2. Other Setup Tips
For tips on how to use a different Python interpreter than the Spotfire default, how to use a different package repository, or how to install packages behind a firewall or proxy, please read the official Spotfire Python documentation found here.
Debugging
1. Debugging Python data functions using an external IDE
For developing Python data functions it is often very useful, especially as a script grows larger, to be able to run and debug it in popular IDEs such as PyCharm and Visual Studio Code. To do this for a Python data function, follow this process:
Add the following code to your data function after the library imports. This code exports a pickle file containing all the inputs to the data function. Note the file location, which you may want to alter:
import pickle
import inspect

# Note: place this code after the library imports in your data function.
# It writes all of the data function's input parameters to a pickle file.
def is_input_parameter(key, value):
    # Exclude builtins and other dunder globals
    if key.startswith('__'):
        return False
    # Exclude modules, classes and functions; keep tables (DataFrames),
    # columns (Series) and plain values
    if inspect.ismodule(value) or inspect.isclass(value) or inspect.isfunction(value):
        return False
    return True

input_param = {k: v for k, v in globals().items() if is_input_parameter(k, v)}

# You may want to alter the file path
with open('C:/Temp/data_function_inputs.pkl', 'wb') as file:
    pickle.dump(input_param, file)
Now run your data function once (even if it fails); this will save all the inputs sent from Spotfire to Python as a pickle file. Then copy your data function code from Spotfire into a Python file open in Visual Studio Code or PyCharm (or equivalent). Add the following code to the top of this Python script, after the library imports:
import pickle

# Load the input parameters in your Python IDE.
# Remember to alter the file path if you changed it in the first script.
with open('C:/Temp/data_function_inputs.pkl', 'rb') as file:
    input_param = pickle.load(file)
globals().update(input_param)
After running this code in your IDE, all the objects that your Spotfire Python data function takes as inputs will be loaded into your Python instance. This means you can run your Python code using the same objects that Spotfire passes to Python, making it easy to develop and debug your script. Once your script is ready, you can copy and paste it back into your data function script in Spotfire.
2. Configure your external Python IDE to use Spotfire's Python interpreter
In IDEs such as Visual Studio Code, you can often choose which Python interpreter runs your code. This means you can point your IDE at Spotfire's own Python instance, so you can more closely replicate how Spotfire will run your data function while developing in an external IDE. It also means that the installed libraries will match those in your Spotfire Python instance. To get the path to Spotfire's own Python interpreter, go to the Tools menu in Spotfire and select 'Python Tools'. From there you will see the path to the Python interpreter Spotfire uses and a button to copy this path. You can then use this path to point your external IDE to Spotfire's own Python instance.
3. Using print statements in debugging
The print statement is one of the most useful quick tools for tracing and debugging your code. With print statements you can, for instance:
- Check the content of a variable, table, etc.
- Check the data type of a variable.
- Check the shape of a data frame (useful in ML applications).
Below is an example of how to perform the tasks above:
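A minimal sketch, assuming the data function has a table input named input_df (a pandas DataFrame):

import pandas as pd

# Check the content of the table (first 5 rows)
print(input_df.head())

# Check the data type of a variable
print(type(input_df))

# Check the data type of each column
print(input_df.dtypes)

# Check the data frame shape as (rows, columns) - useful in ML applications
print(input_df.shape)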
To see the printed output, make sure data function debugging is enabled via Tools -> Options -> Data Functions -> check 'Enable Data Function Debugging'.
4. Using Python classes for better code debugging and re-usability
Classes are a big help when it comes to making code reusable and maintainable. They also make the debugging process easier. Imagine a use case where you have several CSV files, each containing pressure and temperature readings for a specific pump in a pump station. You want to create a Pump() object with the following attributes and methods:
Attributes:
- Pump name
- Pump location (included in the CSV file name)
- Pump size (included in the CSV file name)
- Source file location
Methods:
- Find the max pressure value
- Find the max temp value
- Find the correlation between pressure and temp
- Find summary stats
import pandas as pd
from scipy.stats import pearsonr

class Pump():
    def __init__(self, path):
        # File names are expected to look like pump01_houston_130kpa.csv
        self.file = path
        self.location = self.location()
        self.size = self.size()
        self.name = self.name()
        self.data = self.data()
        self.max_press = self.max_press()
        self.max_temp = self.max_temp()
        self.corr = self.corr()
        self.sumstat = self.sumstat()

    def location(self):
        # Second part of the file name
        return self.file.split('_')[1]

    def size(self):
        # Third part of the file name, with the extension removed
        return self.file.split('_')[2].replace('.csv', '')

    def name(self):
        # First part of the file name
        return self.file.split('_')[0]

    def data(self):
        return pd.read_csv(self.file)

    def max_press(self):
        return max(self.data['press'])

    def max_temp(self):
        return max(self.data['temp'])

    def corr(self):
        return pearsonr(self.data['press'], self.data['temp'])[0]

    def sumstat(self):
        return self.data.describe()

file = r'C:\Users\aalattar\Downloads\pump01_houston_130kpa.csv'
p = Pump(file)
print(p.file)
print(p.location)
print(p.size)
print(p.name)
print(p.data)
print(p.max_press)
print(p.corr)
print(p.sumstat)
5. Understanding the Notification Details to identify error location
The notification details show not only debug information but also the actual error reported by the Python code. They also include a traceback, which helps identify the specific line in your data function code that is reporting the error. Look for the line that mentions <module>, as this is the origin of the error. If this line is a function call, then look for the line that references the actual function name. See the example below:
In this example, the <module> error is on line 233. In this data function, line 233 is actually a call to a function called select_best_model, and the line causing the actual error is line 203:
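As an illustration, a traceback of the shape described might look like the following (the file name, call and error type here are hypothetical):

Traceback (most recent call last):
  File "data_function.py", line 233, in <module>
    model = select_best_model(input_df)
  File "data_function.py", line 203, in select_best_model
    best = models.iloc[best_index]
IndexError: single positional indexer is out-of-bounds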
Enhanced f(x) Flyout and End User Experience
Spotfire 11.0 introduced the f(x) flyout to facilitate access to analytics tools and data functions. It provides a seamless interface for registering and running data functions from the library.
1. Clear Display Names and Documentation
Proper documentation and instructions are crucial for others to understand and use data functions. The flyout displays the data function name, its description, and the display names and descriptions of each input and output parameter. To facilitate their use, ensure that the display names are clear and concise and follow the terminology of the targeted industry or field. For example, while you might use 'input_df' in your code, it's better to rename it to 'Input Data'.
2. Optional Parameters
Some data functions, such as machine learning models, have many input parameters so that the user has full flexibility to configure or tweak them. To avoid data functions that are cumbersome to configure, make only the most essential parameters required. It helps to ask: what are the foundational inputs needed from the user for the function to run and return meaningful results? You might not get the best model without testing various parameters, but you will have a starting point. For example, in this general TensorFlow regression and classification data function, only the input data, target column name, and task are required (these appear as 'not configured' in the flyout); the remaining parameters can be left blank or overridden.
To make parameters optional, you will need to mark them as not required and often provide some default. Each input parameter in a data function is designated as "Required" or not. To make a parameter not required, leave the "Required Parameter" box unchecked.
There are standard code checks to test whether these optional input variables have been specified by the user and, if not, to provide a default value. All non-required parameters appear as global variables and are set to None if no input is given.
If the parameter is a Python primitive type (this covers most input parameters of type 'Value'), the following code takes the user's input if provided and otherwise falls back to a default:
# Save the model to the provided file path, otherwise save to a default file path
# file_path is a string Value
if not ("file_path" in globals() and file_path is not None and len(file_path) > 0):
    file_path = 'best_model.h5'
If the input parameter is of type 'Column' or of type 'Table':
# Check for an optional Column:
# if the user does not provide a subset of columns to use, use all the columns in the input data
if not ("selected_columns" in globals() and isinstance(selected_columns, pd.Series)):
    selected_columns = pd.Series(list(input_data.columns))

# Check for an optional Table:
# if the user does not provide a new dataset for scoring, the model is scored on the training data
if not ("new_data" in globals() and isinstance(new_data, pd.DataFrame)):
    new_data = training_data
NOTE: as of Spotfire 11.8, optional parameters can now be folded away/hidden and automatically moved to the bottom of the list of parameters using the 'Always show required inputs first' checkbox as shown below:
3. Design of Input and Output Parameters
Taking into account the design of input and output parameters can facilitate work for both Python data function developers and end users building dashboards. While there is a multitude of designs depending on the organization of your data and use case, here are some common scenarios.
If your features or columns of interest are few or fixed in number, or you are working with columns from different tables, consider passing each column as its own input parameter of type 'Column'. If you have many columns of interest, or a variable number of them, pass them all in a single input of type 'Table'.
Also consider how the output parameters will be used in the data canvas and subsequent visualizations. If model predictions are output as a column, they can be added as a calculated column to any data table. To plot model predictions versus observed values in a line chart, scatter plot, etc., it helps to output a long data table in which the input data rows and the model prediction rows are concatenated and labelled ('actuals' versus 'predictions'). This can be done in either the data canvas or the data function (although doing it within the data function may duplicate data).
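A minimal sketch of building such a long table, assuming an input table input_data with an observed column named 'value' and a shared axis column named 'time', plus a pandas Series of model predictions named predictions (all of these names are illustrative):

import pandas as pd

# Label the observed rows
actuals = input_data[['time', 'value']].copy()
actuals['series'] = 'actuals'

# Label the prediction rows, aligned to the same 'time' column
preds = input_data[['time']].copy()
preds['value'] = predictions.values
preds['series'] = 'predictions'

# Concatenate into one long table; the 'series' column can then be used
# to color a line chart or scatter plot in Spotfire
output_df = pd.concat([actuals, preds], ignore_index=True)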
4. Suppressing warnings and error messages
It is sometimes desirable to hide warnings and error messages that are output from a Python data function. This can be achieved with the following additions to your data function script:
- Hide warnings
# At the top of the data function, before other imports
import warnings
warnings.filterwarnings('ignore')
More specific warning settings can also be configured. See the warnings Python documentation: https://docs.python.org/3/library/warnings.html
- Hide errors
Hiding all errors may not be possible depending on which Python package is producing the error. However, errors can be reduced and/or suppressed using the following code:
# Add to the top of your data function
import sys
sys.tracebacklimit = 0
Advanced Uses
1. Using a Model Produced in a Previous Data Function
If you create an object such as a machine learning model and want to use it in a second data function, you can use the pickle package to store the object in a Spotfire document property. This way, the model can be set as an input to other Python data functions. The following example demonstrates using the pickle package to manage this process. In the data function that creates the model, the model object needs to be pickled and then set as an output of the data function:
# From data function number 1
import pickle
modelObject = pickle.dumps(myObject)
The binary data (called modelObject in the above example) can be stored in a Spotfire document property when the data function parameters are configured in Spotfire, and then loaded from that document property into the second data function as an input parameter.
Then in the second Python data function script, this model can be easily restored:
# In data function number 2
import pickle
myObject = pickle.loads(modelObject)
This technique should work generically for almost any object that you might want to retain between data function calls, not just models.
2. Formatting with an ID Column
Adding an optional ID column helps relate each row of a data function's results table back to the observation it came from. This is necessary if the resulting table is one-to-many, or if the function manipulates or needs to know the order of the observations in its inputs. Here are two ways of checking for a valid ID column as an input parameter and providing one otherwise (the validity check is that all values in the column are unique).
If the ID column (id_column) is a 'String Value' naming a column in the main data table (model_data):
# Check whether the ID column (id_column) is provided and valid in the data table
# model_data; otherwise provide a default one.
# The default ID starts from 0 and increments with subsequent rows.
# The user's column is kept only if all of its values are unique.
if not ("id_column" in globals() and (id_column is not None)
        and (id_column in model_data.columns)
        and model_data[id_column].nunique() == len(model_data)):
    id_column = "ID"
    model_data[id_column] = range(0, len(model_data))

assert(model_data[id_column].nunique() == len(model_data))
assert(id_column in model_data.columns)
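The second way applies when the ID column is passed directly as an input of type 'Column'. A minimal sketch, assuming id_column arrives as a pandas Series alongside the model_data table (mirroring the optional-Column check shown earlier):

import pandas as pd

# Keep the user's ID column only if it is provided and all of its values are unique
if not ("id_column" in globals() and isinstance(id_column, pd.Series)
        and id_column.nunique() == len(model_data)):
    # Default ID starts from 0 and increments with subsequent rows
    id_column = pd.Series(range(0, len(model_data)), name="ID")

assert(id_column.nunique() == len(model_data))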
Common Issues and Errors
Below is a list of messages and errors that may be returned when you try to run your Python data function. Common issues often relate to the data passed to the data function, such as passing in an empty data table that the Python code is not set up to handle. The other common issue is trying to pass an empty data frame back from Python, or including invalid types, such as lists, in a data frame column.
1. Cannot determine type / all values are missing
This message indicates that a column in your data frame is completely empty, so Spotfire cannot determine which type of column it is, i.e. String, Numeric, etc. If Spotfire can't determine a type, an error is returned. Resolving this requires debugging your code to see why one of your columns is empty. There are two main methods: using simple print statements, or debugging your code in an external IDE such as Visual Studio Code. Also see this StackOverflow question and answer. For example, add a head() print statement after the last change to your output data frame to see what is in each column:
## if the output data frame in my Python data function is called output_df
print(output_df.head())
2. ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This commonly occurs when you have a column in a data frame of an array or list type. For example:
## A list of lists - assuming we already have a data frame called df
df['MyColumn'] = [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]]
In this case, Spotfire can't convert a column where each item or row is a list rather than a single value. The solution is to debug the outputs of your data function using simple print statements, such as printing head() for your output data frame(s), or printing the types of each column and checking that none are of type list or array:
## Check the first 5 rows to ensure no columns have lists or arrays
print(df.head())

## And/or check dtypes
print(df.dtypes)
See the guides above on using simple print statements or debugging your code in an external IDE such as Visual Studio Code to help diagnose the issue.
3. Passing in empty data to Python: IndexError: single positional indexer is out-of-bounds
Errors of this type are commonly due to an input data table being empty. This can easily happen if your data function's input table reacts to a filtering scheme or marking, where it is possible to mark no data or to filter out all data. Your Python script may react in different ways, so the error may not match this one exactly; regardless, it is important to handle the situation where an empty data table is passed into your Python data function. A simple approach is to check the length of your input data frame, and if it is 0, either raise an exception or return a dummy data frame for the output table (assuming your data function outputs a data table). For example:
## if our data function input data table is called input_df
## and our expected output data table is called output_df
import pandas as pd

## check we have some rows of data
if len(input_df) > 0:
    ## Run your code as expected, as we have data
    ...
else:
    ## An empty data frame has been passed in.
    ## If we want to prevent an error from occurring we need to return output_df as expected.
    ## To do this we create output_df with a single row of data
    output_df = pd.DataFrame(data = {'Column A': ['No Data Passed In'], 'Column B': [0.0]})
In the above example, we return a dummy data frame with a single row if no data was passed to the Python data function. This single row is added (rather than returning a completely empty data frame) so that Spotfire knows what data type each column should be on output. So, if the expected output from the Python data function is a data frame with two columns called Column A and Column B, where Column A is a String column and Column B is a Real column, then in our dummy data frame we add these two columns with a single string value and a single numeric value: 'No Data Passed In' and 0.0 respectively.
Resources
- top_10_tricks_for_terr_data_functions_0.pdf