Prerequisites
AutoML for Spotfire® Data Science - Team Studio is supported on Team Studio versions 6.5 and 6.6, and on version 7.0 when using the version 6.6 workflow option. Users need a working installation of a Hadoop Distributed File System data source, with read/write privileges to that data source. Spotfire 10.3 or later is optional, required only for the model-explainability visualizations.
- TIBCO Data Science - Team Studio 6.5, 6.6, or 7.0
- Spotfire 10.3 or later: optional, required for the Model Explainability visuals
- Data Function for Spotfire® Data Science - Team Studio in TIBCO Spotfire® version 1.1 or later
- R packages:
  - data.table (1.12.0 with TERR 5.0, 1.12.8 with TERR 5.1)
  - Rtsne (0.15)
- Tested with LTS Spotfire versions 10.3 and 10.10.
Overview
Machine Learning (ML) is an increasingly popular branch of Artificial Intelligence, aimed at generating predictions from a dataset via an arsenal of computational algorithms and statistical methodologies. ML is a complex process that involves several steps: data exploration and cleaning, data preparation and feature engineering, model training, and finally model scoring and selection. Depending on our goals and on the nature of the dataset, at each step we are faced with a wealth of choices and decisions. Not only are such decisions increasingly hard, owing to the growing number and complexity of the available algorithms; they also involve time-consuming testing of different options and combinations.
The objective of Automated Machine Learning (AutoML) is to make this process more manageable by automating the most complex and lengthy decisions. In this implementation, the AutoML generation, on the one hand, enables analysts to quickly set up a meaningful process; on the other, because the generated system is transparent, it allows expert data scientists to see which decisions were made and why, and to fine-tune the process if desired.
AutoML for TIBCO® Data Science - Team Studio is a set of Team Studio custom operators (hereafter called MODs) that generate workflows within Team Studio, plus a Spotfire model-explainability template used to gain insights into the model. The AutoML workflows are generated into a Team Studio workspace and run in sequence, to cover the end-to-end ML process. Individual MODs are included to perform data preparation, feature engineering, stability selection, and automated modeling, along with a high-level orchestration MOD (the AutoML Orchestrator) that uses built-in logic to assemble these operators into Team Studio workflows, run the analysis, and display all results. A Spotfire template is provided that integrates with TIBCO Data Science Team Studio to visualize and explore the resulting predictions.
From V1.2, AutoML can handle text variables (i.e. variables that contain unstructured sentences, rather than simple character strings) as predictors in binary classification models.
We have also added model explainability to AutoML: using model explainability together with AutoML can jumpstart the process while still providing transparency. In this video from the most recent TIBCO Analytics Forum, we provide a tour of our latest AutoML techniques, including the use of text variables, explanations of how they affect your models, and the use of Spotfire to visualize and interact with the results.
In addition to the 'getting started' information below, the following short video and blog also explain AutoML:
- Short video by Neil Kanungo
- The Real Power of AutoML blog by Steven Hillion
Using AutoML for Team Studio
To get started, first create a new Team Studio workflow and read your dataset into it, as you would for any workflow. AutoML is designed to work with Hadoop, so the data needs to be in, or copied to, Hadoop. You will need to create a single input data table, so any joins and merges to other data tables need to happen at this stage. AutoML expects data to be in wide mode, with column headers describing the content of each column. One of these columns needs to be the dependent variable (also referred to as the response, or target) for the predictive modelling. For instance, if your task is to predict fraud, the target could be the label assigning each row of the dataset to fraudulent or non-fraudulent behaviour. Currently, AutoML handles binary classification tasks, so the target column needs to have exactly two distinct values. Missing data are generally allowed, but rows where the target value itself is missing will be filtered out.
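For illustration only, here is a minimal PySpark sketch of assembling that single wide input table; the paths, table names, and the is_fraud target column are all hypothetical:

```python
# Sketch of the single wide input table AutoML expects; all names are
# hypothetical. Joins happen before AutoML, one row per observation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactions = spark.read.parquet("hdfs:///data/transactions")
customers = spark.read.parquet("hdfs:///data/customers")

# All joins and merges happen at this stage: one wide table.
wide = transactions.join(customers, on="customer_id", how="left")

# The target must have exactly two distinct values; rows with a missing
# target are filtered out (AutoML also does this in Data Preparation).
wide = wide.filter(wide["is_fraud"].isNotNull())
assert wide.select("is_fraud").distinct().count() == 2
```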
After reading your data in, the next step is to channel it into the AutoML Orchestrator, as in Figure 1 below (in this example, three input datasets are first joined together and then sent into AutoML).
Figure 1. Example of AutoML-generating workflow
The AutoML Orchestrator is a MOD that contains built-in logic to generate a number of workflows. It needs some generic directions, such as the hostname and port of the Team Studio installation (i.e. the URL of the site you are running Team Studio on), login credentials, the name of the target (dependent) column, and some information about the format of the input dataset (this needs to be checked carefully, as it depends on the output format of the last operator before the Orchestrator). You will also need to specify the output Workspace ID, that is, the workspace into which all the workflows will be generated. Ideally, this will be a clean workspace, and it must be on the same Team Studio installation. It can be the same workspace where the generating workflow lives, as long as there are no pre-existing workflows with the same names as the ones that will be generated (see later for a complete list of these). The Workspace ID itself is a short number: the integer that comes after the hostname and port, and the #workspaces keyword, in the Team Studio URL of the output workspace (which you will need to have created manually beforehand). Please refer to the AutoML Orchestrator documentation for further details.
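As a concrete illustration (the URL below is hypothetical), here is where the Workspace ID sits inside a workspace URL, and a one-off way to extract it:

```python
# Illustration only: the Workspace ID is the integer after #workspaces/
# in a (hypothetical) Team Studio workspace URL.
import re

url = "https://teamstudio.example.com:8080/#workspaces/42"
workspace_id = int(re.search(r"#workspaces/(\d+)", url).group(1))
print(workspace_id)  # -> 42, the value to pass to the Orchestrator
```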
There are new input options in version 1.2, related to the new capability of handling text variables. The first step in handling text is recognizing which variables contain text. By default, AutoML applies an algorithm to detect text variables: the input parameter Automatic Text Column Detection is set to True. The user can disable this automatic detection by setting it to False and then explicitly marking the columns to be recognized as text in User-defined Text columns. If automatic detection is set to False and no columns are marked, no categorical variable in the input dataset will be used as text.
Figure 2. Text columns Orchestrator parameters
Up to two columns can be used as text. It is important to note that each additional text variable (as we will explain later) will generate a potentially high number of extra columns when parsed, and consequently increase the requirements for memory and processing power when running AutoML.
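The detection algorithm itself is internal to the Orchestrator and is not documented here. Purely as a plausible stand-in, a heuristic along these lines would flag free-text columns (long, multi-word, mostly distinct values) while leaving categorical codes alone:

```python
# Hypothetical stand-in for automatic text-column detection; the actual
# algorithm used by the Orchestrator may differ.
import pandas as pd

def looks_like_text(series: pd.Series,
                    min_avg_tokens: float = 4.0,
                    min_unique_ratio: float = 0.5) -> bool:
    """Flag a string column as free text if its values are long,
    multi-word, and mostly distinct (unlike codes such as 'NY', 'CA')."""
    s = series.dropna().astype(str)
    if s.empty:
        return False
    avg_tokens = s.str.split().str.len().mean()
    unique_ratio = s.nunique() / len(s)
    return avg_tokens >= min_avg_tokens and unique_ratio >= min_unique_ratio

df = pd.DataFrame({
    "state": ["NY", "CA", "NY"],
    "description": ["ruby red with hints of plum and spice",
                    "crisp and floral with a touch of citrus",
                    "earthy nose, soft tannins, long finish"],
})
print([c for c in df.columns if looks_like_text(df[c])])  # ['description']
```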
The AutoML Orchestrator has options to generate a 'Shallow' or a 'Deep' AutoML. This choice controls the extent of the hyper-parameter search during the predictive modelling phase. A shallow search is quicker and gives a good initial idea of the most suitable model. A deep search will normally provide more accurate models, at the cost of being slower. All generated default values are visible as input parameters of the specific operators.
When the AutoML Orchestrator runs, visual workflows are generated and written into the output workspace. Workflows for model training are generated and run on the fly by the Orchestrator. They run in sequence, as the output of one workflow becomes the input of the next. Their names reflect the different phases of the process and are, in order: Target Learning, Data Preparation (optionally followed by Text Feature Engineering when handling text variables), Feature Engineering, Feature Selection, and Modeling. In addition, two workflows are created but not run by the Orchestrator: Scoring and Explainability. The former can be used to apply the winning model to new datasets; the latter provides input to the Spotfire model-explainability template that visualizes the results. Other artefacts that are not visual workflows are also generated into the workspace: shown in Figure 3 are the workfiles generated during orchestration and the top exported models, in Team Studio's Analytics Model (.am) format. Additional workfiles are generated by the Explainability workflow when it is run. See the following sections for details. Note that in Figure 3 the first workfile (AutoML Generate) is actually the generating workflow, containing the Orchestrator.
Figure 3. Typical workfiles generated into the output workspace
The Target Learning workflow checks that the variable declared as target contains the expected number of unique values, and stops the orchestration if more than two distinct values are found. It expects the input dataset in the declared format (see Orchestrator input parameters Upstream File Type and Delimiter) and will output an error if the actual format is not consistent with those values. From version 1.2, if automatic text detection was selected, it also calculates statistics on every categorical variable.
Inside the Data Preparation workflow (see Figure 4), the original target variable is mapped to a 0/1 integer to ensure consistent binary classification. The new target is renamed AutoML_Mapped_Target and is used as the dependent variable throughout the generated workflows. The mapping of the original target variable is performed within the Target Labeling operator. Some data cleaning is also applied: notably, spaces in categorical (non-text) variables are replaced by dots (.).
The dataset is then split into Training and Testing sets (using an 80/20 row split) and summary statistics are calculated on the Training dataset. The AutoML Orchestrator also classifies categorical variables into groups according to their cardinality (the number of distinct values, or levels) and their imbalance (the ratio between the maximum and minimum frequency of these levels).
Figure 4. Example of generated Data Preparation workflow
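A minimal PySpark sketch of these two Data Preparation steps; the input path and the quality_label target name are hypothetical:

```python
# Sketch of the Data Preparation steps: map the raw binary target to 0/1
# (as the Target Labeling operator does) and split the rows 80/20.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/automl_input")  # hypothetical path

# Which level maps to 1 versus 0 is arbitrary in this sketch.
levels = [r[0] for r in df.select("quality_label").distinct().collect()]
df = df.withColumn(
    "AutoML_Mapped_Target",
    F.when(F.col("quality_label") == levels[0], 0).otherwise(1),
)

# 80/20 Training/Testing row split; a fixed seed keeps it reproducible.
train, test = df.randomSplit([0.8, 0.2], seed=42)
```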
In addition, date-time variables are handled by a custom operator, the Date Time Transformer. This operator extracts features such as year, month, day, etc. from the input columns, as long as these columns are parsed as dates, datetimes, or times according to a supported format. This is implemented within the Data Preparation workflow as shown in Figure 5. Please refer to the operator's documentation for details.
Figure 5. Example of Data Preparation workflow segment with DateTime handling
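As a sketch of the idea (not the MOD's exact output columns or names; order_ts is a hypothetical column), extracting date-time features in PySpark looks like this:

```python
# Sketch of date-time feature extraction in the spirit of the
# Date Time Transformer MOD; column and feature names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/automl_input")

parsed = F.to_timestamp("order_ts", "yyyy-MM-dd HH:mm:ss")  # a supported format
df = (df.withColumn("order_ts_year", F.year(parsed))
        .withColumn("order_ts_month", F.month(parsed))
        .withColumn("order_ts_day", F.dayofmonth(parsed))
        .withColumn("order_ts_hour", F.hour(parsed)))
```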
If text variables are present, whether automatically detected or manually specified, a Text Feature Engineering workflow is generated (see Figure 6). The task of this workflow is to encode every text variable into a set of numbers (vectors) using the word2vec neural-network algorithm, so that the predictive-modelling algorithms can process those variables. The number of vectors to use is automatically selected by the system, according to the structure of each text variable. The text vectors will later be fed into the modelling algorithms alongside all the other predictors.
Figure 6. Example of generated Text Feature Engineering workflow
In the example shown in Figure 6, a text variable called description goes through word2vec encoding and is transformed into a number of numeric columns (vectors). The encoding generated with the Training dataset is then applied to the Testing dataset. Please refer to the Word2Vec MOD documentation for further details.
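A minimal Spark MLlib sketch of this fit-on-Training, apply-everywhere pattern; the Word2Vec MOD's exact parameters and output layout may differ:

```python
# Sketch of word2vec text encoding with Spark MLlib; the MOD's internals
# and its automatic choice of vector size are not shown and may differ.
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [("ruby red with hints of plum and spice",),
     ("crisp and floral with a touch of citrus",)],
    ["description"],
)

tokens = Tokenizer(inputCol="description", outputCol="words").transform(train)

# Fit the encoding on the Training data only...
model = Word2Vec(vectorSize=8, minCount=0,
                 inputCol="words", outputCol="description_vec").fit(tokens)

# ...then apply the same fitted encoding to Training and Testing alike.
model.transform(tokens).select("description_vec").show(truncate=False)
```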
Within the Feature Engineering workflow (see Figure 7), additional transformations are applied to the dataset. All transformations are implemented taking into account the statistical properties that were computed on the Training dataset. Currently, AutoML supports missing-data imputation, normalization using mean/standard deviation, impact (target-mean) encoding, weight-of-evidence encoding, and frequency encoding. The last three are categorical encoding transformations, applied to turn categorical variables into numbers suitable for input into predictive-modeling operators. Please refer to the individual AutoML MOD documentation for further details.
Figure 7. Example of generated Feature Engineering workflow
Each data transformation that uses information from more than a single row (such as, for example, imputation using the mean) is performed after the Training/Testing split, using the parameters computed from the Training dataset (in this example, the mean), and then automatically applied to the Testing dataset using those same parameters. This ensures consistency of the ML process and helps minimize data leakage (the unintentional use of Testing data during the Training phase) and over-fitting. All data transformations, including Text Feature Engineering, follow this paradigm. The more complex categorical-encoding operators achieve this via a separate 'applicator' operator, called the Categorical Feature Encoder, which is conceptually similar to the predictor or classifier operator of an ML algorithm: it takes a model (in this case, the encoding map) and applies it to a new dataset (the Testing dataset, or any new data flowing in). Depending on which cardinality/imbalance group variables fall in, there can be different options for encoding them. The result is potentially multiple branches, or alternative strategies, of feature engineering. A feature engineering strategy is therefore simply the specific sequence of transformations applied to the variables. Depending on the number of categorical groups in a dataset, there can be different numbers and compositions of such strategies. In Figure 7 we see two alternative strategies (WoE and Impact Encoding). The corresponding encoded datasets are assigned names consistent with the strategy branch they belong to. Feature Selection and Modeling will then be applied to each output dataset.
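As a minimal illustration of the fit-then-apply paradigm for one of these encodings, here is impact (target-mean) encoding sketched in pandas; the column names are hypothetical:

```python
# Sketch of impact (target-mean) encoding: the encoding map is computed on
# the Training set only, then re-applied unchanged to the Testing set,
# which is the role played by the Categorical Feature Encoder.
import pandas as pd

train = pd.DataFrame({"region": ["N", "N", "S", "S", "E"],
                      "AutoML_Mapped_Target": [1, 0, 1, 1, 0]})
test = pd.DataFrame({"region": ["S", "E", "W"]})  # 'W' is an unseen level

global_mean = train["AutoML_Mapped_Target"].mean()
encoding_map = train.groupby("region")["AutoML_Mapped_Target"].mean()

train["region_impact"] = train["region"].map(encoding_map)
# Unseen levels fall back to the Training global mean.
test["region_impact"] = test["region"].map(encoding_map).fillna(global_mean)
print(test)
```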
The Feature Selection workflow (see Figure 8) applies stability-selection techniques to the output of Feature Engineering, using either Random Forest variable importance or randomised Lasso probability of inclusion. The AutoML Orchestrator naturally pairs Random Forest stability selection with subsequent tree-based ML algorithms, and randomised Lasso with subsequent elastic-net Logistic Regression algorithms. In Figure 8, each of the two strategy branches output by Feature Engineering is further split into two branches, one per feature selection method.
Figure 8. Example of generated Feature Selection workflow
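As a rough sketch of the randomised-Lasso idea (using scikit-learn as a stand-in for the Team Studio operator, with illustrative thresholds): an L1-penalised model is fitted repeatedly on row subsamples with randomly rescaled features, and each feature's probability of inclusion is the fraction of runs in which its coefficient survives the penalty.

```python
# Sketch of stability selection via randomised L1 (Lasso-type) fits;
# thresholds and penalty values are illustrative, not the MOD's defaults.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
rng = np.random.default_rng(0)
n_runs, selected = 100, np.zeros(X.shape[1])

for _ in range(n_runs):
    idx = rng.choice(len(y), size=len(y) // 2, replace=False)  # row subsample
    scale = rng.uniform(0.5, 1.0, X.shape[1])                  # random penalty rescaling
    clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    clf.fit(X[idx] * scale, y[idx])
    selected += (clf.coef_.ravel() != 0)

inclusion_prob = selected / n_runs          # probability of inclusion
print(np.where(inclusion_prob > 0.6)[0])    # stable features to keep
```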
The Modeling workflow (see Figure 9) handles the predictive-modeling phase. The modeling operators (elastic-net regularised Logistic Regression, Random Forest, and Gradient Boosted Trees) each perform an internal hyper-parameter optimization implemented with open-source Spark MLlib. The resulting models are exported to the output workspace in the Team Studio Analytics Model (.am) format (see examples in Figure 3). There is one output .am workfile per feature engineering strategy and ML algorithm type, representing the best model of each kind after hyper-parameter optimization.
Figure 9. Example of generated Modeling workflow
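For orientation, here is the kind of Spark MLlib grid search the modeling operators run internally, sketched for the elastic-net Logistic Regression; the grid values are illustrative, not the generated defaults:

```python
# Sketch of an internal MLlib hyper-parameter search; a 'Deep' AutoML
# corresponds to a larger grid than a 'Shallow' one.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="AutoML_Mapped_Target")

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])        # overall penalty strength
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # L2 vs L1 mix
        .build())

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="AutoML_Mapped_Target"),
    numFolds=3,
)
# best_model = cv.fit(train).bestModel  # 'train' from the Data Preparation split
```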
At the end of the Modeling flow, all the different models that were generated are scored against the Testing dataset, and a model leaderboard is produced, sorted by the resulting accuracy (but displaying a number of other metrics as well). Once the AutoML Orchestrator has finished running, and the generating workflow has completed, the user can click on the Orchestrator icon to see a complete report of the AutoML, with details on all phases and models, and links to the individual generated workflows, as shown in Figure 10. Note that the additional Text Feature Engineering Flow Summary tab will only be present when the dataset contains text variables.
Figure 10. Example of AutoML Orchestrator results
Figures 11-14 show examples of the details that can be seen when clicking on Text Feature Engineering Flow Summary, Feature Engineering Variable Category, Feature Engineering Summary, and Model Leaderboard respectively.
Figure 11. Example of Text Feature Engineering Flow Summary
Figure 12. Example of Feature Engineering Variable Category
Figure 13. Example of Feature Engineering Summary
Figure 14. Example of Model Leaderboard
Finally, it is important to note that AutoML is designed to simplify and shorten the task of building an ML process, but a second appraisal of the workflows and the results is nevertheless encouraged. The success of Machine Learning projects depends as much on business knowledge and ingenuity as it does on sophisticated methods and algorithms. This is why all the workflows and results generated by the AutoML Orchestrator are transparent and editable so that they can be inspected, assessed, and fine-tuned where desired.
The Scoring workflow
The AutoML Orchestrator selects a winning model according to the top row of the Model Leaderboard (see Figure 14), along with the corresponding set of feature transformations (the feature engineering strategy). In order for new data to be scored, all transformations plus the model need to be applied to the new data exactly as they were in the AutoML testing phase. This is the task of the Scoring workflow, an example of which is displayed in Figure 15. In this example, a text variable was detected and is processed as well.
Figure 15. Example of generated Scoring workflow
The new data goes through all the data preparation phases, including feature extraction from datetimes where applicable, then the text feature encoding; next, the appropriate categorical variables are transformed (in the example, Impact and Frequency encoding are applied to separate sets of variables). Note that the Scoring Result node initially appears red. In order to activate it (turn it black), the workflow branch up to that node needs to be executed. This can be done by right-clicking the operator immediately before the Scoring Result and selecting Step Run. The Scoring Result node can then be activated by double-clicking it and pressing OK. After scoring, the resulting dataset contains the additional variables normally added by the model: for instance, with Gradient Boosting, the additional variables would be PRED_AGB, CONF_AGB, and INFO_AGB.
The Explainability workflow and Spotfire model-explainability template
The concept of model explainability is not new: it is the foundation upon which scientific advancement is based. The type of models involved, however, has changed greatly through the centuries. Nowadays, machine learning provides us with very sophisticated models, far more complex than the traditional scientific formula. The price to pay has been an ever-decreasing understanding of the rationale behind the automated decisions. Model explainability is an open and active research field; AutoML for Spotfire Data Science provides a window into the model generated by the Orchestrator, by exploiting the integration between Team Studio and Spotfire, and the visual dynamism of Spotfire powered by TERR and IronPython. The components of this feature are an additional (automatically generated) Team Studio workflow (Explainability), an extension to Spotfire for running Team Studio data functions (Data Function for Spotfire® Data Science - Team Studio in Spotfire®), and a Spotfire model-explainability template with embedded TERR data functions and IronPython automation.
The Explainability workflow is designed to prepare the data for usage in the Spotfire template. It is automatically generated by the Orchestrator, though not run during the orchestration phase. The associated Spotfire template contains a pre-defined Team Studio data function that connects directly to this workflow.
The Explainability workflow takes as input the training dataset, pre-transformed to prepare it for the winning model. It uses a MOD called the Data Grid Builder to reduce the size of the dataset by mapping numeric variables onto a grid, the granularity of which can be controlled from Spotfire. Another branch generates a representative sample of the same input dataset and uses the Data Reshuffling MOD to create and collate a number of copies of the dataset in which each predictor column in turn has been randomized. Further details on these two operators can be found in their documentation. Both branches score the data by applying the winning model and then export the result as SBDF (Spotfire Binary Data Format) files ready to be consumed by Spotfire.
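The gist of the reshuffling branch can be sketched in a few lines of pandas; the score callable stands in for applying the winning model, and all names are hypothetical:

```python
# Sketch of the Data Reshuffling idea: collate one scored copy of the
# dataset per predictor, with that predictor's column randomized, so
# Spotfire can compare each copy's scores against the originals.
import numpy as np
import pandas as pd

def reshuffle_and_score(df: pd.DataFrame, predictors, score):
    rng = np.random.default_rng(0)
    copies = []
    for col in predictors:
        shuffled = df.copy()
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        shuffled["shuffled_column"] = col
        shuffled["score"] = score(shuffled)  # apply the winning model
        copies.append(shuffled)
    return pd.concat(copies, ignore_index=True)
```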
When text variables are present, additional branches are generated to parse the highest-scoring records into their component words, and apply algorithms to find the most influential words. These are handled by the WordMapper MOD.
Figure 16. Data processing branches of the Explainability workflow
Figure 17. Text processing branch of the Explainability workflow
Another part of the Explainability workflow collects and exports all the other transformations that were applied to the input dataset. These too are turned into SBDF files and will be used by Spotfire to reverse (decode) the transformations and present the insights in human-readable form.
Figure 18. Data decoding branches of the Explainability workflow
The Spotfire model-explainability template will connect to the Explainability workflow and run it, to extract its output datasets. It will then perform calculations on the data to inspect the model's behaviour.
When the Spotfire template is first opened, a login box appears (Figure 19). Since no Team Studio instance is connected to Spotfire yet, some initial configuration is needed:
1. Click Cancel to exit the Login box.
2. Go to Notifications and click Dismiss All.
3. Go to File | Manage Trust and Script and click Trust All, then OK, then Close.
4. Go to Tools | TERR Tools | Package Management and install the package data.table from CRAN, then Close.
5. Go to Tools | Team Studio data function | Edit Team Studio data function.
6. Select the available data function, then OK.
7. Fill in the form with your Team Studio instance URL, username, and password, then Login.
8. Choose a Workspace and select the Explainability workflow within it.
9. Press OK twice, then Yes.
The Team Studio instance containing your generated AutoML workspace and Explainability workflow is now connected to Spotfire. For further documentation on the Team Studio data function setup, see Data Function for Spotfire® Data Science - Team Studio in Spotfire®.
Figure 19. Team Studio data function login box
The Team Studio data function within the Spotfire model-explainability template needs to be connected to the Explainability workflow generated by the Orchestrator. The data function is predefined to take as input the number of bins (used by the Data Grid Builder custom operator to decide how to bin numeric predictors) and to return as output up to seven SBDF files, as generated by the Explainability workflow. Additional I/O parameters, processid and success, are used to guide the running of the data function and do not need to be modified by the user (see the documentation for details). The Team Studio data function can be redirected to any generated Explainability workflow by following steps 5 to 9 above, and it can be run again with different bin sizes by clicking the Generate Explanations button at the top left of the EXPLORE page (see Figure 20). The results of this data function are automatically used to generate a variable importance chart via an embedded TERR data function. Only the top predictors are displayed; this behavior can be overridden by configuring the bar chart and clearing the Limit data using expression in the Properties | Data tab.
Figure 20. Example of Spotfire template EXPLORE page at start
The bars are color-coded according to data type. In this example we have three numeric variables, one categorical and one text variable.
The EXPLORE page is designed for interactivity and exploration. A number of TERR data functions and IronPython scripts (embedded in the Spotfire template) perform calculations on the data and react to marking and selections. The top portion of the layout remains essentially the same. The left panel is for reloading data from Team Studio. In the middle, we have information on the dataset, the response variable and the current selections. The text appearing on the right will change depending on the predictor selections. At the extreme right, a set of utility buttons controls zooming, filtering and the display of help text. The top button takes you back to the start view of Figure 20, where all selections are reset. The rest of the layout will change according to which, and how many, predictors are selected.
The Top Predictors bar chart on the left responds to clicking; selecting a variable in this chart (for instance, alcohol in Figure 20) updates the page layout to that shown in Figure 21.
The Correlations vs Importance plot now appears in the bottom-left corner. It shows the interplay between the correlation of the selected predictor with the other predictors, and those predictors' relative effect on the model. This information can be used to gauge the balance between variable association and importance. In this example, price is important to the model's prediction, but not strongly correlated with alcohol. Here too, the colors denote data type.
The bar charts in the centre and right of the page show the distribution of record counts (Volumes) and probability scores (Scores) for the different values of alcohol. The charts display the data after it has been mapped onto a grid (binned), so the labels of alcohol show the average bin values, and the scores shown in the right chart are weighted by the cell occupancy on the grid. The error bars incorporate the effect of the other predictors on the model's predictions for alcohol. This effect can be analyzed by opening the filter panel (click the Show Filters button, top right) and exploring how sliding the values of the other predictors affects both the available data and the response.
Figure 21. Example with a numeric predictor selected
The Correlations vs Importance plot in turn reacts to clicking: by selecting a second variable, for instance volatile_acidity, the layout changes again, and the two variables are shown in a scatter plot. The size of the markers is proportional to the volume of data, and the coloring reflects how predictions can be 'sliced' using the button panel that appears in the centre of the page: in this example the response is the quality of wines, which can be 'poor' (orange) or 'good' (purple). Grey denotes areas where the model gave a less reliable prediction, because of poor score probabilities, low volumes of training data, or high variability for the selected values.
Figure 22. Example with two numeric predictors selected
The different layouts are generated according to the data type and cardinality of the selected predictor(s), and whether one or two predictors are selected.
Text predictors contain more complex structures. After encoding (see Text Feature Engineering), a text predictor is turned into a number of numeric variables (vectors): that is, an individual text variable is represented by more than one number. The model sees these vectors as independent predictors, and each vector in turn can be more or less important as a predictor to the model. The importance chart summarizes this information by listing a single entry per text variable: this entry corresponds to the most important vector associated with that variable. This 'top' vector is used throughout as a representative of the text variable. For some plots, the second most important vector is also used. In Figure 23 the user selected description from the importance chart, and a scatter plot involving vectors #12 and #41 associated with the description text variable is displayed. This layout is very similar to the one in Figure 22, although in this case it relates to two vectors of a single text variable, rather than to two independent variables.
Figure 23. Example with a text predictor selected (Vectors Plot view)
A powerful insight into how the model uses text predictors is available by selecting View As Word Plot in the drop-down list in the middle section. This view (see Figure 24) is based on the words contained in the selected text variable. The words displayed are the ones that most influenced the model, sized by their frequency of occurrence, and colored by the predicted response. Their positions on the plot highlight the similarity or distance between concepts. When the two groups appear well separated, it can indicate that the text variable did help the model to discriminate the response. A number of controls are available in the middle panel to make it easy to interact with the display, such as choosing different ways of representing the words in two dimensions (t-distributed stochastic neighbor embedding, or t-SNE, vs. multidimensional scaling, or MDS), hiding words that only contain numbers, or hiding shorter words. For t-SNE plotting, additional parameters governing perplexity, theta, and the number of iterations are available to tune manually as inputs to the 2d Projection TERR data function.
Figure 24. Example with a text predictor selected (Word Plot view)
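For reference, the two projections can be sketched with scikit-learn; word_vectors is a hypothetical word-embedding matrix, and in scikit-learn's Barnes-Hut t-SNE the angle argument plays the role of theta:

```python
# Sketch of the two 2-D projections behind the Word Plot: t-SNE (tunable
# perplexity and theta/angle) versus MDS. word_vectors is hypothetical.
import numpy as np
from sklearn.manifold import MDS, TSNE

word_vectors = np.random.default_rng(0).normal(size=(50, 16))

xy_tsne = TSNE(n_components=2, perplexity=10.0, angle=0.5,
               init="random", random_state=0).fit_transform(word_vectors)
xy_mds = MDS(n_components=2, random_state=0).fit_transform(word_vectors)
```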
Manual Changes and Re-execution
In some cases, a user may want to tweak some of the generated workflows, for instance changing parameters or re-adding filtered-out columns. Many of the restrictions in the previous release of AutoML have now been lifted, provided that users follow a handful of guidelines, which are detailed in the AutoML Orchestrator MOD documentation.
Installation
AutoML comes with .jar files that need to be manually installed on an existing Team Studio environment. Once downloaded and unzipped, there should be nine files: AutoML_Models-1.20.jar, AutoML_Orchestrator-1.20.jar, Feature Engineering-1.20.jar, StabilitySelection-1.20.jar, DateTimeOperator-1.20.jar, DataGridding-1.20.jar, Reshuffling-1.20.jar, Word2Vec-1.10.jar, and WordMapper-1.10.jar.
These files can be installed on a Team Studio environment by navigating to, and opening, any workflow located on the instance that is to be used for AutoML. Once the workflow is open, users can upload the Custom Operators (the AutoML .jar files) by selecting 'Actions' -> 'Manage Custom Operators' (see Figure 25 below).
Figure 25. Manage Custom Operators Action
Once there, select the 'Upload' button and find the .jar files on the local machine. When all of the files are uploaded, AutoML is ready to be run on the environment.
In order to enable the Team Studio data function, please download Data Function for Spotfire® Data Science - Team Studio in Spotfire® and follow the installation instructions.
Release Notes
- The AutoML Orchestrator only supports Hadoop data sources.
- The target column should be a binary column of any data type. The operator will halt if there are more than two levels in the target column.
- In the rare cases where integer variables have NaN values, such as values generated by division by zero, the corresponding rows may be silently removed after the data has been read in. To prevent this, users should declare all numeric variables as double rather than long in the Hadoop File Structure section when importing the dataset.
- There should not be a column named 'label' in the data source, as this has a special meaning in the Spark pipeline.
- Rows with null target values are filtered out. This is done in the FilterTarget operator of the Data Preparation flow.
- Non-word characters (including spaces) from categorical variables are turned to dots (.) during data preparation. This is done in the Data Cleaning operator of the Data Preparation flow.
- Categorical variables with zero variance (only one value), or with a number of unique values equal to or greater than 60% of the number of rows in the training dataset, are removed from further processing. This is done in the Column Filter operator of the Feature Engineering flow.
- Variables with over 40% missing values are removed.
- No correction for target class imbalance is applied.
- AutoML no longer expects schemas to be always preserved. However some restrictions still apply (see section on Manual Changes and Re-Execution, and AutoML Orchestrator documentation for details).
- When opening the Model Explainability DXP for the first time, you will encounter errors because it has not been configured to work with your instance of Team Studio. Please follow all setup instructions listed above.
- Memory requirements: for Spotfire full stack (server+client installation) a machine with 16GB of memory is recommended.