Overview of Integration Mods for Spotfire® Data Science - Team Studio and TIBCO® Data Virtualization


    The Integration Mods for TIBCO DV in Spotfire® Data Science - Team Studio version 6.6 connect to a diverse set of data sources, all accessed via TIBCO DV. To combine the power of TIBCO DV to seamlessly handle many data sources with the machine learning capabilities of Spark 2.4, we provide 20 new operators designed to work with TIBCO DV data sources, so users can build a complete data science workflow while still using the standard database operators of Spotfire® Data Science - Team Studio. SQL queries are pushed down to TIBCO DV, and the machine learning executes in Spark. The operators are provided in the Exchange offering, each individually documented, and together form a scalable set capable of processing large volumes of data. In this article, we describe the new operators, how to use them, and how to combine them in a data science workflow.

    Pre-requisites

    • Spotfire® Data Science - Team Studio version 6.6
    • TIBCO® Data Virtualization (TIBCO® DV) version 8.4 or 8.5
    • Spark version 2.4 

    Overview


    Figure 1. Overall view

    Figure 1 shows how the Integration Mods (custom operators) for TIBCO DV in Spotfire® Data Science - Team Studio version 6.6 are able to connect to a diverse set of data sources. The underlying data sources are accessed via TIBCO Data Virtualization. To combine the power of TIBCO DV to seamlessly handle many data sources, including cloud data sources such as AWS S3, with the machine learning capabilities of Spark 2.4, we provide 20 new operators designed to work with TIBCO DV data sources. This allows users to build a complete data science workflow while still being able to use the standard database operators provided by Spotfire® Data Science - Team Studio. The SQL queries are pushed down to TIBCO DV, and the machine learning executes in Spark. The new operators are provided in the Exchange offering, and each operator is individually documented.

    The result is a set of scalable operators capable of processing large volumes of data.

    In the following sections, we describe the new operators, how to use them and how to combine them in a data science workflow.

    Using the TIBCO DV/Spark Mods

    How to add them to Spotfire® Data Science - Team Studio

    • The .jar files for the 20 Mods, along with an integration pack to set up the connectivity between TIBCO DV and Spotfire® Data Science - Team Studio, are available on the Exchange. Follow the Knowledge Base article on Mods installation for guidelines.
    • Follow the provided installation instructions to configure Spotfire® Data Science - Team Studio and TIBCO DV. 

    Where to find them in a workflow

    Within an open workflow, look for the TDV category on the DATA palette to the left of the canvas (note that a TDV data source must be connected to the workflow for the TDV operators to appear in the dropdown):


    Figure 2. The workflow TIBCO DV palette

    There are 20 operators created specifically to work with TIBCO DV; they are listed in Figure 3 below:


    Figure 3. The 20 integration Mods

    Categories

    The 20 new operators include ten modeling operators: three Classification and three Regression modeling operators, based on

    • Elastic-Net 
    • Random Forest
    • Gradient-Boosted Tree

    three model assessment operators, namely

    • Confusion Matrix
    • Goodness of Fit 
    • Regression Evaluator

    and a Predictor.

    Other operators that produce models to be used in conjunction with the Predictor operator are 

    • K-Means Clustering
    • PCA
    • Isolation Forest.

    Random Sampling and Sample Selector can be used to split the data into Train and Test datasets. Summary Statistics performs the exploratory analysis.

    Export Model and Load Model are available for re-using the generated model within the workspace. Export to TIBCO ModelOps and Export to File Storage provide the ability to export a model or a dataset, respectively. More details are available in the subsequent sections.

    Modeling Operators

    The modeling operators are similar to the equivalent native product operators. However, these new operators execute on Spark 2.4 and have additional features, such as the ability to carry out a grid hyperparameter optimization that returns the best performing model. An example of how this is expressed in the input parameters is shown below, where the parameters of a Random Forest Regression model are selected.

    figure4.png.9302f27836a0422661f545562f6aec0a.png

    Figure 4. Calling Random Forest Regression

    In Figure 5 below, the input parameters related to hyperparameter exploration are shown for the Random Forest Regression operator of Figure 4 above. For most hyperparameters, a choice of values can be supplied as a comma-separated list. The number of times each combination is evaluated is set by the Number of Cross Validation Folds parameter at the bottom of Figure 5.

    figure5.png.f95e13a28111b0f2726bdefdf5fc8505.png

    Figure 5. Subset of Random Forest Regression input parameters

    At the end of the run, the parameter combinations tested are displayed in the Results tab, together with the chosen metric (RMSE in this example; the choice of metric is itself an input parameter), and the best model is flagged. Figure 6 below shows a subset of the results: in this workflow, a total of 64 parameter combinations were each evaluated 3 times (once per cross-validation fold).


    Figure 6. Subset of grid optimization results

    A small exception applies to the Elastic Net operators (both Linear and Logistic Regression), which offer the option to specify the values of one of their hyperparameters in a slightly different way. The Penalizing Parameter (λ) can be specified as usual by a comma-separated list, like the one in Figure 7.


    Figure 7. Specifying λ with commas

    This will generate the values 0.0, 0.5, and 1.0. An alternative is the form shown in Figure 8.


    Figure 8. Specifying λ with colons

    This will generate 4 values, starting at 0.001 and ending at 0.1, but exponentially spaced. In this example, the values generated are 0.001, 0.004645, 0.02155, and 0.1. This can lead to better coverage of the parameter's space.


    Figure 9. Sample grid optimization results with colon-style λ

    Finally, the model assessment operators (Goodness of Fit and Confusion Matrix for classification tasks and Regression Evaluator for regression tasks), and the Predictor operator (for classification and regression, as well as clustering, anomaly detection and PCA) work very similarly to the native version 6.5 operators.

    PCA, K-Means Clustering and Isolation Forest

    Two of these three operators replace equivalent operators from Spotfire® Data Science - Team Studio version 6.5.

    PCA (Principal Components Analysis) has a simplified user interface. All the principal components (up to the number of selected columns in the input dataset) are generated, and the user can filter down to the required components, based on the desired amount of explained variance.

    K-Means Clustering has the additional option of selecting the initialization method and, importantly, trying a range of possible K (the number of clusters to generate). The optimal K is calculated based on the Silhouette metric.


    Figure 10. K-Means results with K-optimization metric

    Isolation Forest is an entirely new operator that performs anomaly/outlier detection. A contamination level (the expected proportion of anomalies) can be chosen as an input. The output, once fed to a Predictor operator, contains columns indicating the anomaly score and the resulting class (anomaly/not anomaly).

    All three operators generate models that need to be applied to a dataset via the Predictor operator.


    Figure 11. Example of generating and applying models

    Random Sampling and Sample Selector

    These operators are used to split a dataset in two: a Train and a Test sample. It is important to note that Random Sampling/Sample Selector need to be step-run before running any downstream operators, as they do not inherit the schema (the column structure of the dataset that flows through). This is evidenced by the subsequent operators appearing in red.

    Summary Statistics

    Summary Statistics calculates descriptive statistics for the selected columns in a dataset. In addition to the capabilities of the native product operator, it can handle datetime columns, and outputs the coefficient of variation (the ratio between the standard deviation and the absolute value of the mean) of numeric variables, as shown in Figure 12 below.


    Figure 12. Example Summary Statistics output

    Import/export operators

    There are four operators in this category. Two of them, Export Model and Load Model, enable exporting/loading a model into/from the workspace. These operators store key information on the model in the form of a text file. The actual model persistence is provided by the ModelStore, a partitioned Parquet file located within TIBCO DV or, optionally, a table within a database accessed via TIBCO DV. For details on the ModelStore, please refer to the TIBCO Data Virtualization to Spotfire® Data Science - Team Studio 6.6 Integration Documentation. The third operator, Export to TIBCO ModelOps, exports a generated model to TIBCO ModelOps.

    The fourth operator, Export to File Storage, acts on a dataset (for instance, the output of a preceding operator) and writes it to the specified file storage, for instance AWS S3. This operator exists to circumvent a current limitation of the native product operators when used with TIBCO DV, which prevents writing a Table directly to certain file-based data sources such as AWS S3, ADLS, and HDFS. When using regular operators against file storage, the View option for the Output Type parameter can be used, as shown in Figure 13. In this way, results are not persisted to the file storage but are instead recalculated each time the operator is executed.


    Figure 13. Storing a temporary result

    If it is necessary to write the results back to actual File Storage (for example for processing by other tools, or for performance reasons) the Export to File Storage Operator should be used.

    The Export to File Storage operator allows the result set to be written (persisted) as a file to the file storage. Files can be written in either the csv or the parquet file format. The file storage can either be a file system associated with the Spark cluster (local) or outside of the Spark cluster (remote). In either case, a data source for the file storage must have been created within the TIBCO DV instance being used by Team Studio.

    NOTE: To write to Remote File Storage, the TIBCO DV instance must be at version 8.5 or greater.

    The Export to File Storage operator has the input parameters shown in Figure 14. The filename must have either a .csv or a .parquet extension.


    Figure 14. Storing a csv file

    Compatibility with existing operators

    Most Explore and Transform database operators in the product will work with TIBCO DV sources - please see the product documentation for details. Since a TIBCO DV data source appears as a database source, the database flavor of the operators will be used. As mentioned in the previous section, for these operators the value of the Output Type input parameter needs to be specified as View if the underlying data source is file-based (such as S3, ADLS, or HDFS).

    The new operators can connect to native product operators. For instance, in a scenario where we want to do some exploratory data analysis, we might build a workflow like the one in Figure 15:


    Figure 15. Example of connecting to native operators

    In this example we have used a dataset stored in AWS S3 and attached it to the TIBCO DV/Spark Summary Statistics operator, as well as to the native Box Plot and Aggregation operators.

    Putting it all together

    Let's now look at an example of how these new operators fit together in an end-to-end data science process. Figure 16 shows a screenshot of a workflow that generates a classification model for the UCI Adult income dataset:


    Figure 16. Sample data science workflow

    The input dataset is stored in AWS S3. Above the input dataset we notice two Explore operators: the new Summary Statistics and the Box Plot. The first one gives us an idea of the distribution of the columns in the dataset.


    Figure 17. Summary Statistics output detail

    There are 23 columns in the dataset. Two of these, education and education_num, look suspiciously similar. We can test the idea that there is a one-to-one relationship between these two variables using Box Plot. The result, shown below in Figure 18, confirms this hypothesis:


    Figure 18. Box Plot of education vs education_num

    We can therefore safely exclude one of these columns from the modeling. We choose to keep education_num as it is numeric and ordered, so it does not need substantial pre-processing.

    Among the other columns, we also choose to ignore fnlwgt (Final Weight, a weighting factor that is usually discarded in analyzing this dataset) and native_country. The latter could be included at the price of increasing the memory allocation of the modeling operators, as any categorical column needs to be turned into a set of dummy variables within the modeling operators and this one has many more distinct values than the other categorical variables.

    In order to simplify the setting of input parameters for the modeling operators, we use a Column Filter to remove these columns from the data stream.

    At this point we are ready to split the dataset into Train and Test subsets, using the Random Sampling and Sample Selector TIBCO DV operators. We then apply both an Elastic-Net Logistic Regression and a Random Forest Classification TIBCO DV operator. As mentioned, these modeling operators have a built-in customizable grid search for the best model. They also have built-in encoding of categorical variables and, in the case of Elastic-Net, normalization of predictors. The modeling operators store a Spark model object, which records all the transformations applied within the modeling operators. Therefore, any pre-processing calculated by the modeling operator on the Train set will be automatically applied to the Test set.

    The Goodness of Fit TIBCO DV operator gives us an idea of how the two models performed:


    Figure 19. Goodness of Fit result

    The Random Forest appears to have performed slightly better overall, and we therefore choose to export it to the workspace, using the Export Model TIBCO DV operator. For demo purposes, we re-import it into this workflow (normally one would use a separate workflow) using the Load Model TIBCO DV operator, and re-apply it to the Adult dataset as a whole. As a final test of the results, we use a Frequency operator to compare the distribution of the actual versus the predicted values. We see from Figure 20 below that the model does a good job overall, although, because of the marked imbalance in the training dataset, it performs slightly less well at predicting high income.


    Figure 20. Frequency distribution of actual and predicted income band

    New in release 1.1.0

    • Added support for the following Spark configurations:
      • Kerberos for Spark with Yarn cluster manager
      • Spark Standalone cluster manager
      • Spark running locally on the Team Studio server.
    • Added ability to export models to TIBCO ModelOps.
    • Added support for TIBCO DV 8.5.
    • ModelStore is now managed by TIBCO DV on the local file system or designated database.
    • Bug fixes for normalization in Elastic-Net Logistic Regression, Elastic-Net Linear Regression, and K-Means Clustering.
    • Simplified ModelStore setup in the installation scripts.
    • New option in Export to File Storage to choose an external file system, for example, an S3 bucket not associated with the Spark cluster.

    Known Issues and Limitations

    • When using a Kerberized Spark environment, another database will be required to store the temporary tables created by Team Studio workflows.
    • When the scripts .jar file is loaded, the following warnings may appear - they can be safely ignored:

    fig23.png.7c6464a994a89a9a1a6264f876b727ed.png

    Figure 21. Warning messages

    • Because of a Spark requirement, a dataset with column names containing any of the characters " ,;{}()\n\t=" (space, comma, semicolon, curly braces, parentheses, newline, tab, or equals sign) cannot be written to data sources. Please replace those characters with characters acceptable to Spark, either by modifying the schema before uploading data to the database source, or by replacing the illegal column names with a Variable operator before further processing (a minimal renaming sketch follows this list).
    • For database operators, the value of the Output Type input parameter needs to be specified as View if the underlying data source is file-based (such as S3, ADLS, or HDFS).
    • Random Sampling/Sample Selector operators need to be step-run before running any downstream operators, as they do not inherit the schema.
    • You need to make sure that the @default_schema workflow variable used in the Output Schema parameter of the integration Mods points to the desired database schema. This can also be overridden when setting up the input parameters for an individual operator.
    • Certain operators create tables or views in the selected schema as a result of their operation. These tables and views will not be removed automatically when the nodes or workflows are deleted; they can accumulate over time, so a housekeeping process should be run periodically to review and remove them.
    • For further details, please see the Product Release Notes and the TIBCO Data Virtualization to Spotfire® Data Science - Team Studio 6.6 Integration Documentation.
