Prerequisites
The following requirements must be met to use the Data Function for Spotfire® Data Science - Team Studio in Spotfire® (the "Team Studio Data Function"):
- Spotfire® 7.13 (or later) client and server
- The latest copy of TSDF_*.sdn, a Spotfire® distribution file that bundles three .spk files: TeamStudioCore*.spk, TeamStudioForms*.spk, and TeamStudioWeb*.spk. For the Team Studio Data Function version 1.3 release, the *4.8.sdn file is intended for the Spotfire® installed client and Web Player on Windows (Windows framework), and the *6.0.sdn file is intended for the Spotfire® Web Player on Linux. The *6.0.sdn file does not contain a TeamStudioForms*.spk. Previous releases contained only the equivalent of the *4.8.sdn file. The distribution is available for download from the Spotfire® Exchange.
- Spotfire® Data Science - Team Studio ("Team Studio") version 6.4 or later.
- Data source set up in Team Studio. Any compatible data source will do, including TIBCO® Data Virtualization ("TIBCO DV").
Download
Data Function for Spotfire® Data Science - Team Studio in Spotfire® is available from the Spotfire® Exchange.
Installation and configuration
To add the Team Studio Data Function to the client software, upload the Spotfire® .sdn package described in the "Prerequisites" section to a deployment area on the Spotfire® Server and install it there. The .sdn file must be added to the deployment area used by the Spotfire® clients that will run the data function; this updates that deployment area and the client configuration. After installation, each client must restart and connect to this deployment area to receive the correct packages. Web Player services that will use the data function must also be updated from the updated deployment area.
Click here for details on how to upload the .sdn file to the desired deployment area on the Spotfire® Server.
You will also need to have access to (i.e. be a member of) the Team Studio workspace referenced by the Team Studio Data Function.
Running the Team Studio Data Function
The Team Studio Data Function allows Spotfire® users to execute a workflow in the Team Studio platform and bring back results in the form of data tables. These tables originate directly from workflow operators, or from SBDF (Spotfire® Binary Data File) files stored in the workspace. In addition, the Team Studio Data Function can trigger the reloading of tables when workflow results are stored in a database. This happens through Spotfire® data connections upon successful execution. Since in this case the Team Studio workflow and Spotfire® share a connection to the same data source (for example TIBCO DV), there is minimal data movement during the execution process.
The Team Studio Data Function differs from a typical Spotfire® data function (e.g. TERR or Python) in that it is split into two parts, so that reading results from long-running, asynchronous jobs can be resumed.
The first data function (Starter) initiates the job, and the second data function (Result) monitors it until completion. This is implemented automatically. If the Spotfire® analysis file is saved after the Starter data function has finished executing, the Result data function will automatically resume polling for the data function results, even if Spotfire® is shut down and restarted.
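The Starter/Result split described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the data function's actual implementation: `FakeTeamStudio`, `starter`, and `result` are hypothetical names, and a plain dictionary stands in for the Spotfire® document properties. The point it shows is that once the process ID has been saved, polling can resume from the saved value alone.

```python
import itertools


class FakeTeamStudio:
    """Stand-in for a Team Studio server (hypothetical, not a real client API)."""

    def __init__(self, polls_until_done):
        self._polls_until_done = polls_until_done
        self._ids = itertools.count(1)
        self._remaining = {}

    def start(self, variables):
        # Pretend to launch a workflow run and hand back its process ID.
        process_id = "proc-%d" % next(self._ids)
        self._remaining[process_id] = self._polls_until_done
        return process_id

    def is_finished(self, process_id):
        # Each poll brings the fake job one step closer to completion.
        self._remaining[process_id] -= 1
        return self._remaining[process_id] <= 0


def starter(server, document_properties, variables):
    """Start the job and persist its process ID."""
    document_properties["ProcessId"] = server.start(variables)


def result(server, document_properties, max_polls=100):
    """Poll the saved process ID; return the number of polls, or None on timeout."""
    process_id = document_properties["ProcessId"]
    for attempt in range(1, max_polls + 1):
        if server.is_finished(process_id):
            return attempt
    return None
```

For example, after `starter(server, props, {"ntree": 40})`, a copy of `props` made at "save" time is enough for `result(server, saved_props)` to pick the job up again later.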
Example: Predicting Adult income class
Dataset
The input dataset is based on the UCI Adult Income dataset (Dua, D., and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: the University of California, School of Information and Computer Science). It contains 15 columns: 14 potential predictors (both numeric and categorical) and a target (income, with possible values <=50K and >50K).
The predictors describe the socio-economic metrics of the adult US population. The target variable indicates whether their income falls above or below $50K per year.
Use Case
This example builds a very simple data science process that generates a binary classification model to predict the income class.
An initial exploration of the dataset is performed (Summary Statistics operator). The data is then split into a Train and Test dataset. The Train dataset is used as input to a machine learning model (Random Forest Classification operator). The generated model and the Test dataset are then fed to a model assessment operator (Goodness of Fit) to test the model's quality.
There is one input parameter (a workflow variable called @ntree) and there are three output tables returned to Spotfire® (the results of Summary Statistics, the Variable Importance from Random Forest Classification, and the Goodness of Fit). In the following sections we will guide you through defining the inputs and outputs, and connecting them between Team Studio and Spotfire®.
Figure 1: The Team Studio example workflow
Team Studio Steps
The following steps will help you reproduce the Team Studio example workflow. It is assumed that you are familiar with creating and running workflows in Team Studio.
- Go to Actions > Workflow Variables and create a new variable called @ntree. Set it to 10.
- Use a Dataset reader operator to import the Adult income dataset, which you should have previously uploaded to the TIBCO DV data source of your Team Studio instance. This example is based on a TIBCO DV data source.
- Attach to it a Summary Statistics operator. Select all columns and leave other defaults unchanged. In this example we set the Number of Most Common Values to Display to 1, to reduce output, but this is not strictly necessary.
- Attach an Export to SBDF operator to Summary Statistics. Set the Output File Name to SummaryStats.sbdf.
- Attach a Random Sampling operator to generate two samples containing 70% and 30% of the data, respectively.
- Extract the first sample (Train) using a Sample Selector operator.
- Feed the results into a Random Forest Classification operator.
  - Set the Dependent Variable to income.
  - Set Use all available columns as Predictors to No.
  - As Continuous Predictors, select all apart from fnlwgt.
  - As Categorical Predictors, select all apart from fnlwgt, education, and native_country. [Note: since fnlwgt is an integer, it could be interpreted as either a continuous or a categorical variable, which is why it appears in both selections.]
  - Set the Number of Trees to @ntree, the workflow variable you just created.
  - Leave the remaining parameters set to their default values.
- Extract the second sample (Test) using a Sample Selector operator.
- Feed the Test sample and the Random Forest Classification operator into a Goodness of Fit operator.
Your workflow is now ready to run. Press Run to check that it completes successfully.
At the end of the run:
- The Summary Statistics operator should have results similar to Figure 2 below.
Figure 2: Detail of Summary Statistics result table
- The Export to SBDF operator should have written SummaryStats.sbdf to the workspace.
- The Random Forest Classification operator should have results including a Variable Importance tab such as the one in Figure 3 below. The actual results may differ slightly depending on whether the random seed in the Random Sampling operator was set to a specific value.
Figure 3: Variable Importance result tab
- The Goodness of Fit operator should have Results showing a metrics table such as the one in Figure 4 below.
Figure 4: Goodness of Fit results
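The remark about the random seed can be illustrated with a small, self-contained sketch of a 70%/30% split. This is not how the Random Sampling operator is implemented; `split_rows` is a hypothetical helper that only shows why a fixed seed gives the same Train and Test samples on every run, while an unset seed generally does not.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=None):
    """Shuffle rows and cut them into (train, test) samples.

    With a fixed seed the shuffle, and therefore the split, is repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Two runs with the same seed produce identical samples.
train_a, test_a = split_rows(range(100), seed=42)
train_b, test_b = split_rows(range(100), seed=42)
```

Because the model is trained on whichever rows land in the Train sample, a different split can shift metrics such as variable importance slightly.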
Spotfire® Steps
Setting up the Team Studio Data Function
Open a Spotfire® DXP with a data table in a Spotfire® Analyst client. The Spotfire® client should be connected to a Spotfire® Server that has the required packages installed.
Go to the Tools menu and click Team Studio Data Function > Create New Team Studio Data Function to open a new window. This will appear as shown in Figure 5 below.
Figure 5: Initial dialog box
Enter the URL of your Team Studio instance into the Team Studio location, and your Team Studio credentials into Login and Password. Press the Login button. Once logged in, select the Workspace and Workflow you want to connect to.
Note: It may take a few seconds for the Workspace/Workflow choices to populate.
Go to the Initiating Function Parameters tab, which will initially appear as in Figure 6.
Figure 6: Initiating Function Parameters dialog box
This tab is pre-populated with the Process ID parameter processid, a special variable that will contain the process ID of the Team Studio workflow execution. You don't need to change anything here.
If the connected Team Studio workflow has Workflow Variables that need to be associated with input parameters in Spotfire®, add them with the Add... button. The Name of each input parameter must be the same as the name of the corresponding Workflow Variable in your Team Studio workflow, excluding the "@" prefix.
- In our example, we will add an input parameter called ntree.
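The naming rule for input parameters is simple enough to express as a one-line helper. The function below is a hypothetical illustration, not part of Spotfire® or Team Studio: it only shows the "drop the @ prefix" convention described above.

```python
def input_parameter_name(workflow_variable):
    """Return the Spotfire input parameter name for a Team Studio workflow variable.

    Only a single leading "@" is removed; the rest of the name is kept as-is.
    """
    if workflow_variable.startswith("@"):
        return workflow_variable[1:]
    return workflow_variable
```

For the example workflow, `input_parameter_name("@ntree")` gives `ntree`, the name to enter in the Add... dialog.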
Once done, proceed to the Result Function Parameters tab, which will initially appear as in Figure 7.
Figure 7: Result Function Parameters dialog box
This tab is pre-populated with the success parameter, a special variable that will contain the timestamp of the successful Team Studio workflow execution. Its main purpose is to signal completion of the execution of the Team Studio workflow. You don't need to change anything here, unless you want to use the success parameter to signal the refresh of a data table not directly returned by the Team Studio Data Function (this will be described in Section "Reading of Results outside of the Data Function"). Note that the processid parameter also appears here. It is used to automatically connect the Starter and Result data functions.
Click the Add... button, and add as many output parameters as there are data tables to be returned by the Team Studio workflow. These need to be defined as Type: Table. In order to map your Team Studio workflow output tables, you have three possible choices:
- Connect directly to an operator's results. The Name of the parameter will need to reflect the exact label of the operator as it appears on the Team Studio workflow canvas. If the data table is taken directly from the Results, the operator's label will be sufficient. If it is taken from a specific tab within the Results, the name of the output parameter will need to be the operator label plus a pipe (|) separator followed by the exact name of the tab.
- In our example, the output parameter from the Variable Importance tab of the Random Forest Classification operator will be called Random Forest Classification|Variable Importance. This is because there are multiple tabs in the Results, as shown in Figure 8.
- Similarly, the output parameter from Goodness of Fit will be called Goodness of Fit|Output, as there is a named tab called Output in the results.
Figure 8: The three tabs from the Random Forest Classification Results
To access results of this type, Spotfire® must be given the operator name and the results tab name in the format: "<Operator name>|<Results name>".
Note: If extracted through this method, the table is limited to 999 rows (i.e. the maximum row display limit set in Team Studio). This is the recommended option for small tables.
- Connect to SBDF files. Workflow operator results that are exported as ".sbdf" into the workspace can be returned to Spotfire® as tables. The Name of the output parameter will need to be the exact name of the generated file, including the .sbdf extension.
- In our example, the file generated by exporting the Summary Statistics results is called SummaryStats.sbdf.
- External Table Refresh. The Team Studio workflow may read or write from/to databases - including TIBCO DV. Using the external table refresh mechanism, it is possible to refresh already preloaded external data locations such as Hadoop tables at the end of successful workflow execution. See Section "Reading of Results outside of the Data Function" for details.
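The naming conventions for output parameters listed above can be summarized in a short sketch. The helper names are hypothetical and the classification logic is an assumption made for illustration; how the data function itself resolves a name is not documented here.

```python
def operator_output_name(operator_label, tab_name=None):
    """Build an output parameter name from an operator label and optional tab."""
    if tab_name is None:
        return operator_label
    return operator_label + "|" + tab_name


def describe_output_name(name):
    """Classify an output parameter name as (kind, source, tab).

    Assumption for this sketch: a name ending in .sbdf refers to a workspace
    file, a name containing "|" refers to a tab of an operator's results, and
    anything else refers to the operator's results directly.
    """
    if name.lower().endswith(".sbdf"):
        return ("sbdf_file", name, None)
    label, separator, tab = name.partition("|")
    if separator:
        return ("operator_tab", label, tab)
    return ("operator", name, None)
```

With the example workflow, `operator_output_name("Random Forest Classification", "Variable Importance")` yields the parameter name used for the variable importance table, and `describe_output_name("SummaryStats.sbdf")` is classified as a workspace file.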
After all the inputs and outputs have been defined, press OK and proceed to map these to the appropriate objects within Spotfire®. This part of the process is similar to that for traditional Spotfire® data functions, but you will need to take the processid and success parameters into account.
Starter function: Input Mapping
- Map the specific input parameters you added, e.g. ntree in the example.
Starter function: Output Mapping
- Map processid to the predefined document property ProcessId.
Result function: Input Mapping
- Map processid again to the predefined document property ProcessId.
Result function: Output Mapping
- Map success to the predefined document property Success.
- Map the specific output parameters you added, e.g. Random Forest Classification|Variable Importance, Goodness of Fit|Output and SummaryStats.sbdf in the example, to the desired output table names.
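The four mappings can be pictured as a small data structure. The sketch below is purely illustrative: Spotfire® stores these mappings through its own dialogs, the output table names (VariableImportance, GoodnessOfFit, SummaryStats) are hypothetical, and the output parameter names follow the definitions given earlier for the example. The check highlights the one rule specific to this data function: processid must be written and read through the same document property so that the Result function can find the job started by the Starter function.

```python
# Hypothetical description of the example mappings (parameter -> Spotfire object).
EXAMPLE_MAPPINGS = {
    "starter": {
        "inputs": {"ntree": "num_trees"},
        "outputs": {"processid": "ProcessId"},
    },
    "result": {
        "inputs": {"processid": "ProcessId"},
        "outputs": {
            "success": "Success",
            "Random Forest Classification|Variable Importance": "VariableImportance",
            "Goodness of Fit|Output": "GoodnessOfFit",
            "SummaryStats.sbdf": "SummaryStats",
        },
    },
}


def mapping_problems(mappings):
    """Return a list of problems found; an empty list means the mappings agree."""
    problems = []
    started = mappings["starter"]["outputs"].get("processid")
    resumed = mappings["result"]["inputs"].get("processid")
    if started is None:
        problems.append("Starter does not output processid")
    if resumed is None:
        problems.append("Result does not read processid")
    if started is not None and resumed is not None and started != resumed:
        problems.append("processid is mapped to two different properties")
    if "success" not in mappings["result"]["outputs"]:
        problems.append("Result does not output success")
    return problems
```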
Visual Setup
To make it easier for a user to run the Team Studio Data Function, it is a good idea to let the user set the input parameter and trigger the data function refresh interactively. To this end, you might set up a Text Area configured as in Figure 9 below.
- The input field (here shown set to 40) writes into a Document Property called num_trees (the name is arbitrary; the property will need to be associated with ntree in the Starter Function - Input mapping tab).
- The Execute Data Function button is set to trigger the Starter function.
Figure 9: Invoking the Team Studio Data Function in a Spotfire® Text Area
The three resulting data tables could be displayed in Spotfire® as in Figure 10 below.
Figure 10: Display of results in Spotfire®
Additional Notes and Use Cases
Using the Web Player
Spotfire® Web Player can execute Team Studio Data Functions. In order to accomplish this, you will need to author a data function using Spotfire® Client Analyst and save the DXP file to either the Spotfire® localhost server or to a team server that has Spotfire® Web Player installed.
Once saved to the server location, you can then open the DXP file from the server location in a web browser, where you will be prompted to enter your Team Studio credentials and initiate the execution of the data function call. See Figure 11 below.
Figure 11: Enter credentials for Web Player
Note: Users cannot create or edit the Team Studio Data Function from within the Spotfire® Web Player interface. You can author the Team Studio Data Function from within the Spotfire® Client Analyst.
Multiple data functions
It is possible to create more than one Team Studio Data Function, each pointing to a different Team Studio workflow. The only requirement is that each data function has its own Document Properties mapped to its processid and success parameters.
Also, when opening an analysis containing multiple data functions, Spotfire® may ask you to log in multiple times. This is because the data functions execute in parallel, so a queue of login prompts is created before Spotfire® has a chance to cache the login credentials. Once every data function has been executed once, the credentials will be cached for subsequent runs.
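The requirement to keep separate Document Properties per data function can be checked mechanically. The sketch below is a hypothetical helper working on a plain dictionary (data function name mapped to the properties used for processid and success); it is not a Spotfire® API, and the names in the usage example are invented.

```python
def shared_properties(data_functions):
    """Return (property, first_owner, second_owner) for every reused property."""
    owners = {}
    clashes = []
    for name, mapping in data_functions.items():
        for parameter in ("processid", "success"):
            prop = mapping[parameter]
            if prop in owners and owners[prop] != name:
                # Two different data functions write to the same property.
                clashes.append((prop, owners[prop], name))
            else:
                owners[prop] = name
    return clashes
```

For instance, two data functions that both map processid to a property named ProcessId would be reported as a clash, whereas ProcessId1/Success1 and ProcessId2/Success2 would not.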
Reading of Results outside of the Data Function
One limitation with the Team Studio Data Function framework is that it can only return data into the Spotfire® in-memory data engine: if the Team Studio workflow writes results to an external data source, Spotfire® will not automatically be aware that this data has changed.
If the Spotfire® analysis already contains a data table that points (or links) to an external table that is not directly populated by the data function, i.e. added separately through a data connection, Spotfire® will not automatically know if data has been refreshed at the source location (the Team Studio workflow).
If this is the case, you can add a mapping between the Team Studio Data Function's Success document property and a refresh trigger. This trigger, when fired at the change of the Success value, will result in a reload of a specific data table in Spotfire® as mapped in the Manage External Table section of the Result Function Parameters tab.
Figure 12: The Manage External Table dialog
The Signal property will be mapped to the Success document property (the actual value of the success parameter) and the External data table is the specific data table we are interested in, as shown in Figure 13. In this image the data table is not selected yet, so the OK button is greyed out.
Figure 13: Setting up the External Table
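The signal mechanism described in this section amounts to "reload the external table whenever the Success value changes". The following sketch models that idea in plain Python. `SignalProperty` is a hypothetical class, and the assumption that writing an identical value does not fire the trigger is ours, based on the trigger being described as firing on a change of the Success value.

```python
class SignalProperty:
    """A value holder that calls back only when the stored value changes."""

    def __init__(self, on_change):
        self._value = None
        self._on_change = on_change

    def set(self, value):
        # An unchanged value is assumed not to fire the refresh trigger.
        if value == self._value:
            return False
        self._value = value
        self._on_change(value)
        return True


# Record each reload of a (hypothetical) external table named "ExternalTable".
reloads = []
success = SignalProperty(lambda stamp: reloads.append(("ExternalTable", stamp)))
success.set("2020-01-01T10:00:00")   # first successful run: reload
success.set("2020-01-01T10:00:00")   # same timestamp: no reload
success.set("2020-01-01T10:05:00")   # later successful run: reload again
```

Since the success parameter holds the timestamp of the successful execution, each new successful run would produce a new value and therefore a new reload.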
Known Issues
- The document property used to store the processid parameter contains the cached process ID of the last run at the time the Spotfire® analysis file was saved. When the Spotfire® analysis is opened, if this property contains a value, the Result data function will try to use it to run automatically and might throw an error if the process ID is no longer present on the Team Studio server. Try re-executing the data function to resolve this error. Another solution is to remove the values from these document properties before saving the Spotfire® DXP. The same symptom (empty results because of a stale process ID) may also occur when a user does not have permission to execute the workflow. Please ensure the user is a member of the Team Studio workspace.
- If a data function is linked to an Action button in a Text Area, and there are no inputs for the data function, then it may not re-execute when the button is pressed. The workaround is to use IronPython to directly execute the data function.
- If a data function appears to not run, this may be because the results of the workflow were empty. Please ensure the Team Studio workflow executes without errors by executing a test run directly in Team Studio.
- If Team Studio does not execute the data function, it may be due to a login timeout. The workaround is to trigger a re-execution of the data function, for instance by toggling an input parameter.
- A data function is linked to a Team Studio workflow by its unique workflow ID. If a copy of this workflow is created, even if it is renamed to the same original name, the data function will still be pointing to the old workflow and will need to be edited to point to the updated workflow.
- If a workflow has been copied between Team Studio instances, any existing related data function instances must be edited and re-pointed to the new workflow location, even if the names of the workspace and workflows are the same across Team Studio installations. An HTTP 422 error in the log files when running a copied DXP analysis file may be an indication of this.
- Execution of a workflow from Spotfire® while the same workflow is opened in the Team Studio UI may cause the workflow variables to display the values passed in from the Team Studio Data Function temporarily. Re-opening the workflow should correct this.
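For the Action button issue in the list above, the IronPython workaround boils down to finding the data function by name and executing it. The sketch below shows that logic against any collection of objects exposing a `Name` attribute and an `Execute()` method. It is an assumption, to be checked against the Spotfire® API documentation, that inside a Spotfire® IronPython script the collection to pass would be `Document.Data.DataFunctions`, and the name to use for the Starter function depends on your analysis. `FakeDataFunction` exists only so the sketch can run outside Spotfire®.

```python
def execute_data_function(data_functions, name):
    """Execute the first data function whose Name matches; return True if one ran."""
    for data_function in data_functions:
        if data_function.Name == name:
            data_function.Execute()
            return True
    return False


class FakeDataFunction:
    """Test double standing in for a Spotfire data function object."""

    def __init__(self, name):
        self.Name = name
        self.executions = 0

    def Execute(self):
        self.executions += 1
```

Returning False when no name matches makes it easy for a script to report a misspelled data function name instead of silently doing nothing.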