    Most important nodes in Statistica Workspaces



    Introduction

    In the Spotfire® Data Science - Workbench (also known as Spotfire Statistica®) tool itself, we have a large number of functionalities and options. However, during your data analysis work, you will realize that you typically use only a subset of these functionalities. Let us mention in this post the functionalities that are, in our experience, the most commonly used ones.

    We will describe the functionalities in the form of nodes inside a workspace (the visual workflow user interface). Each node represents one functionality inside this workspace. Before jumping to particular nodes, please make sure you have read Statistica Workspace - a graphical UI, an article about the workspace user interface.

    We will also try to describe types of usage, or tips and tricks, connected with each node. This article does not cover a complete list of functionalities for each topic, but it can help with getting started with Statistica.


    Data Import

    All the nodes that perform data operations can be found in the Data tab of the ribbon bar.

    Any data analysis process will start with Data Access and Data Wrangling with Spotfire Statistica®. We will not describe Import Excel (for importing data from Excel), Import Text (for importing data from txt and csv file formats), or Import Spotfire (sbdf file format): these are straightforward and you can find them in the Data->Import Data tab. Another option to import data is the Enterprise Data Configuration node, which is more important in larger-scale implementations and will be your most frequent node for data retrieval (this is the reason it is listed in our top nodes article).

    Enterprise Data Configuration

    Functionality: This node uses the very simple Enterprise Data Configuration method to import any data source previously defined inside the shared Enterprise meta-repository (this metadata store is used for sharing data connections, data configurations, workspaces, models, and other Statistica objects; more info here). No matter what database is under the hood, you can use Enterprise Data Configuration to load the data into the tool.

    [Screenshot: Enterprise Data Configuration node]

    Typical usage: Importing any data from any database in a simple way. 

    Tips/Tricks: In order to use Enterprise Data Configuration, a Database Connection (defining the connection to the data source using OLEDB, ODBC, or ADO.NET drivers) and a Data Configuration (a query to the database specified via the Database Connection) must already be configured inside the Enterprise meta-repository.


    Data Preparation

    Subset

    Functionality: This node can filter data, i.e. create a subset of the data. You can filter rows as well as columns. Row filtering can be based on conditions.

    Typical usage: Analyzing only particular rows, or disregarding unimportant or temporary variables.

    Tips/tricks:

    • During any variable selection in any node, you can select variables in two ways: by clicking variables in the variable list (the Ctrl or Shift keys may help) or by writing variable names or variable numbers in the text field below.
    • You can use this node to reorder variables (when highlighting variables in the list, the order of selection is taken into account; for example, selections 1 2 3 and 1 3 2 will both produce an output with 3 variables, but in a different order).
    • Do not forget that variables can be referenced by their actual names, not only by a list of variable numbers (as a consequence, you can copy and paste (Ctrl+C, Ctrl+V) a variable list from other parts/results, which saves you from choosing variables manually). Also, do not forget that you can insert a variable list in the form of a wildcard selection; see the sketch after this list.
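    As an illustration, the selection text field could contain entries like the following (a sketch only; the variable names are hypothetical, and the exact wildcard syntax may differ in your version):

        1 2 3               select variables 1, 2, and 3
        3 1 2               the same three variables, reordered
        Distance Speed      selection by variable name
        Sensor*             wildcard selection (all variables whose names start with Sensor)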

    Transforms

    Functionality: Compute new variables or recompute values of existing variables using a wide range of transformation functions (the available functions can be found under the Functions... button on the right-hand side of the dialog where variable formulas are inserted). Please note that a formula changes values in all rows, not only in some cells. One line in the Formulas section is one computed variable; see the example of usage for date operations below.

    [Screenshot: Transforms node formulas example with date operations]
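    Such date formulas might look roughly like this, one formula per line (a sketch only; the function names and the "OrderDate" variable are illustrative assumptions, not copied from the screenshot):

        year = Year("OrderDate")
        month = Month("OrderDate")
        dayofweek = DayOfWeek("OrderDate")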

    Typical usage: Creating additional features that can be used for further modeling. Extracting year, month, and day-of-week information from a DateTime variable. Adding temporary variables needed in subsequent nodes. Computing variables with several categories based on simple decision conditions.

    Tips/Tricks:

    • If you want to have the row header (casename) inside a variable rather than in the header, you can use the casename function.
    • The function v0 inserts the actual row number into the variable with that formula; this can be used as a special row ID, or as a trick when you need to sort rows by some variable(s) later but still want the original order stored within the dataset.
    • You can create group definitions for subsequent nodes. For example, suppose we have a measured distance and want a variable with grouping per 50 meters (0-50 meters is group 50, 51-100 is group 100, and so on). You can do this, for example, with the formula distance50=ceiling("distance"/50)*50 (see also the sketch after this list).
    • More details and more examples are in this article.
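    A minimal sketch combining the tips above ("distance" comes from the grouping example; the other variable names are arbitrary, and the parenthetical annotations are explanatory only, not part of the formulas):

        rowlabel = casename                        (row header copied into a variable)
        origorder = v0                             (original row number, usable after sorting)
        distance50 = ceiling("distance"/50)*50     (grouping per 50 meters)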

    Variable Format/Rename

    Functionality: This node enables the user to change most of the variable specifications, such as name, type, format, and measurement type. Its two main functionalities are renaming variables without creating new ones and changing the data type.

    Typical usage: Renaming variables coming from a previous analysis to alter their original names. Repairing the variable type in case it is loaded or outputted in the wrong format for some reason, e.g. integer instead of double. Changing a text date input to date format.

    Tips/Tricks:

    • Changing the data type from Double to Text or from Text to Double before merging data using the Query Spreadsheets node (this node expects the merge key variables to be of the same type).
    • When you update settings in this node and reopen it, everything that was modified is shown in red.

    [Screenshot: Variable Format/Rename specifications table]

    • The Variable Specification table above (the place where you change specifications; it can be found under Edit Variable Specifications->All Specs...) works like an Excel table: you can copy and paste (Ctrl+C, Ctrl+V) a list of variable names, and if you insert e.g. Name1 and Name2 into two cells, select those two cells, hover over the lower-right corner, and drag down, the specification is copied, giving you Name3, Name4, and so on.
    • You can also change the Measurement type, which can help with variable selection in subsequent nodes. Typically in variable selection there is a checkbox Show appropriate variables only, and this checkbox works exactly according to your definition of the Measurement type.

    Data Health Check Summary

    Functionality: This is a diagnostics node. It runs various health-check tests and summaries. With one click, it tells you the situation concerning missing data, outliers, invariant variables, and redundancy of variables, and it returns many tables supporting these checks. It can also clean the data automatically.

    Typical usage: It provides great information about the state of the data; after reviewing the report from this node, you can decide which data-cleaning steps should be done, which variables should not be used for modeling, etc.

    [Screenshot: Data Health Check Summary results]

    Tips/Tricks:

    • Always use this functionality as part of your data understanding process.
    • This node can also apply data cleansing to your data directly and output the cleaned data rather than simply reporting on it.

    Query Spreadsheets

    Functionality: There are several merging functionalities within the tool (mainly the Merge functionality), but most of them combine only two files (more details about merging nodes can be found in this article: Merging Tables in Statistica). Query Spreadsheets is an elegant way to merge various tables, and it does so in an SQL-query manner. Each file connected to the Query Spreadsheets node appears in the Advanced Query Builder and forms a kind of database of inputs that can be used for constructing SQL queries. It is possible to work in a drag-and-drop (Access-style) way without knowing SQL, or you can write the SQL code for the merging operations yourself.

    Typical usage: Merging data from various data sources, and merging data with computed results.

    [Screenshot: Query Spreadsheets example in the Advanced Query Builder]
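    For illustration, a query merging two inputs might look roughly like this in the SQL pane (a sketch only; the input names mirror the names of the upstream nodes, the column names are hypothetical, and the exact quoting conventions of the builder may differ):

        SELECT a.CustomerID, a.Sales, b.Region
        FROM Subset1 a
        INNER JOIN Subset2 b
            ON a.CustomerID = b.CustomerID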

    Tips/Tricks:

    • An important point: the code of this node takes into account the names of the nodes feeding into the Query Spreadsheets node. Therefore, if the name of an input node changes (typically because another node was inserted between the old node and Query Spreadsheets), an error will occur because the Query Spreadsheets node will not find the defined input. To avoid rewriting/redefining the query, we propose always using a node such as Subset without any filters as the input to Query Spreadsheets; if additional functionality is needed, new nodes can then be placed before this Subset node without affecting what, in the end, feeds into Query Spreadsheets.
    • If you insert an SQL query statement into the bottom code section, the graphical representation of the join will react to it, and vice versa.

    If the data are not clean (outliers, missing data, etc.), you will need to use data-cleaning techniques. The corresponding nodes can be found in the menu Data->Recode/Filter. However, we will not describe them one by one; their usage should be clear and straightforward. We will mention only one data-cleaning functionality in this article:

    Filter Sparse Data

    Functionality: One of the typical data problems is the presence of missing data. You can insert something to replace the missing values, or simply remove the rows/columns with missing values from the analysis. The Filter Sparse Data node deletes rows/columns with missing information.

    Typical usage: Identification and deletion of rows/columns where a specified amount of information is missing due to empty cells (missing data). The idea is simple: when a significant fraction of values in a row/column is missing, that row/column does not bring enough information for subsequent analysis.

    Tips/tricks: You can use this node for casewise missing-data deletion: any row in the data with any missing value will be deleted. This is a slightly different usage than identifying sparse columns or rows. Below is the setting to do this:

    [Screenshot: settings for casewise missing-data deletion]


    Analysis

    Breakdown Non-Factorial Tables

    Functionality: This node computes descriptive statistics for each group defined by some categorical variable(s). It has the same functionality as the classical Descriptive Statistics node but with grouping/breakdown involved. 

    Typical usage: Maybe it does not look this way, but this node is widely used. You can compute aggregations per category, which can then be used in subsequent analysis or merged with other data (e.g. computing customer summary characteristics).

    Tips/Tricks: 

    • You can use the node to compute frequencies of cross-categories, for example. Define the categories in Grouping variables, choose as a dependent any variable with no missing values, and in Descriptives choose only Valid N for the output.
    • To use the output summary in subsequent analysis, do not forget to choose the downstream document. Also, if the total over all groups is not needed, you can uncheck this output in Specifications and the output will be produced without the total.
    • You can also have various percentiles computed at once in the output (in the tab Results->Descriptives->Percentile Boundaries).

    Select Variables

    Functionality: This node supports SVB nodes (a short description of SVB nodes is in this article: Statistica Workspace - a graphical UI). It is not possible to choose variables inside an SVB node, so the Select Variables node does it.

    Typical usage: You would typically chain the Select Variables node with some SVB node and choose the variables for the SVB node inside the Select Variables node.

    Tips/Tricks: Some nodes have the checkbox Use previous input description. In that case, Select Variables can serve as this input description.


    Modeling

    Define Training Testing Sample

    Functionality: Good practice in modeling is to calculate the model parameters on one data set (often called the training set) and test the final procedures on a different set (testing data). This node can define these subsets randomly (split the data into these subsets/samples).

    Typical usage: The node produces and adds new variables containing categories (e.g. "Training" and "Testing", or more) assigned randomly with a specified percentage representation. A modeling workflow using these samples might look, for example, like this:

    [Screenshot: usage of Define Training Testing Sample in a modeling workflow]

    Tips/Tricks:

    • Inside the node, there is a seed value to be specified. The particular value you set is not important, but the seed ensures that the random generator produces the same random numbers, so each run of the node yields the same sample assignment into the groups (thanks to this, you are able to reproduce results).

    Feature Selection

    Functionality: Typically you would like to find out which variables/regressors are the most important for the task at hand. This node is one of the feature selection options; it returns a collection of the most important variables sorted by importance.

    Typical usage: The Feature Selection node is used to understand the dependencies between the response variable and the regressors, and also to reduce the number of potential regressors for subsequent modeling.

    [Screenshot: Feature Selection output]

    Tips/Tricks:

    • If you choose the downstream option Feature Selected Input Descriptor, the output spreadsheet will look like the original input file, but it carries information about the chosen best predictors. If you connect a modeling node to the Feature Selection node and check the option Use previous input description inside it, the variable selection will be defined automatically according to the most important variables.

    [Screenshot: modeling node with Use previous input description checked]

    PMML Model

    Functionality: This node contains the PMML code of a model. It is produced automatically when a modeling node is run. It can also be inserted into the workspace separately; in that case, it can use a model from disk or from the Enterprise meta-repository.
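    PMML itself is an XML standard for describing trained models, so the content carried by this node has roughly the following shape (a minimal schematic fragment for orientation, not a complete or runnable model):

        <PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
          <DataDictionary> <!-- field names and types --> </DataDictionary>
          <TreeModel functionName="classification"> <!-- model structure and parameters --> </TreeModel>
        </PMML>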

    Typical usage: A PMML Model node appears in two scenarios, and the usage is slightly different in each. The first case is when the PMML Model is the output of a modeling technique. In that case, every time the workspace is run, the model is recomputed. Below is a classical modeling workspace in which several models are created and compared on new data.

    [Screenshot: modeling workspace]

    The second scenario is when you already have the best model created and want to use it for scoring new cases, i.e. predicting the output based on the model and the regressors. In that case, the PMML Model node is the starting node, and the model inside is taken from disk or from the Enterprise shared meta-repository.

    [Screenshot: scoring workspace]

    Tips/Tricks:

    • If you go to the PMML Model node's PMML tab, you will see the button Deploy to Enterprise; this button saves the model into the Enterprise meta-repository so you can use it, for example, in a scoring workspace.
    • If you are using the PMML Model node to apply an already existing model (stored in the Enterprise meta-repository), you can use the option Link (below the definition of the path to the file), which means you will get the current model (as created by the modeling workflow) every time you run the scoring workflow.
    • If PMML 4.2 model output is present in the PMML tab of the PMML Model node, you can connect this node with the Publish Models node, which sends the model to the Artifact Management Server for later usage in TIBCO® Streaming.

    Rapid Deployment

    Functionality: The node needs two inputs: a data set and one or more PMML Model nodes. Rapid Deployment applies the models to the connected data input.

    Typical usage: It helps with evaluating which model is best (creating lift charts, for example; see this article: Gains vs ROC curves. Do you understand the difference?). It also applies models to new data and computes the predictions. Please see the screenshots of the modeling and scoring workflows above.

    Tips/Tricks:

    • For a classification task, you typically need the probabilities for each category as an output. This is not the default setting of the node: you need to go to Specifications, check Include prediction probabilities in output, and choose Input data, predictions, and residuals in the Downstream tab of the node.
    • Lift charts and other model quality measures are produced only if the input data contains the actual response. If no response values are available, there is no data for creating lift charts.

    Rules Builder

    Functionality: The user can build or redefine variables by inserting if-else conditions/rules, which are evaluated casewise. Several other actions can be performed in addition to simply assigning a value to a variable according to a condition; you can, for example, run a model from the Enterprise meta-repository if some condition is met.

    Typical usage: Very useful when you need to insert values into variables according to more complex logic. In the example below, actions and a risk category are assigned based on the predictions of a model and expert rules.

    [Screenshot: Rules Builder example assigning actions and a risk category]
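    Schematically, such a rule set follows a casewise if-else pattern like this (pseudocode only; the actual Rules Builder syntax can be seen via the Edit Text option mentioned below, and the variable names and threshold here are hypothetical):

        If "PredictedProbability" > 0.8 Then
            "RiskCategory" = "High"
            "Action" = "Send to manual review"
        Else
            "RiskCategory" = "Low"
            "Action" = "Approve automatically"
        End If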

    However, the most important usage is running different models based on expert rules; for example, you might want to run different models for different countries, or run some models only in special situations. This is the biggest advantage of this node.

    Tips/Tricks:

    • The Rules Builder node can also be used as a Transforms node, thanks to the possibility of assigning a formula to a variable. Thanks to the features mentioned above, it can perform scoring (like Rapid Deployment) or recode continuous variables into categories (like the Recode node).
    • You can create a manual decision tree with the Rules Builder node.
    • Some features within the tool produce code that can be loaded into the Rules node (e.g. Weight of Evidence, CART).
    • You can save the rules code definition into Enterprise and also reference it from Enterprise (so you can have the same rules from the same source document applied in multiple workspaces).
    • You can also edit the rules inside the node as text (option Edit Text).

     


    Remarks

    More detailed information about the node functionalities can be found in the help menu of the tool. Pressing F1 while a node's dialog is open will bring up the help topic connected with the functionality under investigation. An example of pressing F1 while investigating the Results tab of Descriptive Statistics is below:

    [Screenshot: F1 help opened from the Results tab of Descriptive Statistics]

