Getting Started with Spotfire Data Science - Team Studio - Spotfire Data Science - Team Studio

Exercises to get started with Spotfire Data Science - Team Studio.

Introduction

Before beginning these exercises, users should have Spotfire® Data Science - Team Studio installed and connected to at least one data source - a supported database or Hadoop cluster. The dataset, election92, is used in these exercises. Either have it uploaded to your data source before beginning the exercise, or have access to a local copy that you can upload to your data source using Spotfire® Data Science. Copies of the data in xlsx format can be found attached to this article. Download and save the file as a .CSV for this exercise.

A Team Studio Application Administrator should create an Analytics Developer account for each user who will be participating in these exercises (subject to your license restrictions). To create an account, go to the People page, click Add Person and then fill in the required information. Please see the online documentation for more information on the variety of roles available in Team Studio.

Predicting Voter Turnout: Exploratory analysis and data transformation

This section is a step-by-step introduction to creating a predictive analytics workflow in Team Studio. We'll be using the election92 dataset which contains county-level demographic information from the 1992 United States presidential election. First we'll do some exploratory analysis of the data and then use demographics to predict voter behavior.

From the home page, in the upper-left corner, click the three-line menu, and then go to the main workspaces page. From there, click Create New Workspace and provide the required information.

Public workspaces allow all users in your organization to view and enter the workspace. By making a workspace private, only you and the other members of the workspace have access. Team Studio users that are Application Administrators can also enter the workspace.

After creating the workspace, the main page for the new workspace is displayed. You can access this page later by using the three-line menu button, the search bar, or the widgets available on the Home Page.

Add other users as members of your workspace

You can add team members and assign roles. Within a workspace, from the right sidebar, click the Overview tab, and then click Add or Edit Members.

Create a New Workflow

While in your workspace, at the top of the window, click the Work Files tab.
From the right sidebar, click Create New Workflow.
Provide a name, such as "Turnout Prediction", and connect it to an available data source.

You can change the data source selection later by clicking the workflow name under the Work Files tab.

Find the documentation

After you create the workflow, the workflow canvas is displayed. You can access the general documentation page from the lower left corner of the screen. Click the blue circle question mark.

To go directly to detailed documentation about any operator, click the operator to highlight it, and then click the question mark.

Add the file election92.csv dataset onto your canvas

To the left of the canvas, review the list of available operators.
To access the data source, from the top of the list, click Data.
If your administrator has already uploaded the file to your data source, you can navigate to its location and drag it to the canvas. Or, if needed, you can upload it by using the data tab.

Using Row Filter, select only the counties with above 40% voter turnout

If you are working on a Hadoop data source, notice that the election92 operator text is red. Red text always indicates that the operator has parameters that need configuration.

Double-click the operator to display its properties dialog box. Yellow highlights in the box indicate mandatory configuration settings.
In this case, click Hadoop File Structure.
Team Studio attempts to infer the data schema from a small subsample of the data.
If there are no issues, then click OK to close the window, and then click OK again to save your selections.
For database data sources, Team Studio reads the schema information directly from the database.
Select the Operators tab, and then type 'row filter' into the search bar under the tab.
Drag the Row Filter operator to the canvas.
Click the tabs (or 'ears') of the data set and drag the arrow to the Row Filter operator to connect the data set to the operator.
Double-click the operator, and then click Define Filter.
The dialog box has many filter operations, but more complicated commands might need to be scripted using SQL or Pig, depending on the data source. For our example, set column name as "turnout", condition as ">", and value as "40".
Either run the entire workflow or step-run the operator.

Within this filtered dataset, can you determine the minimum population? The median population? The average population?

You can find the answer to these questions with the Summary Statistics operator.

Connect the operator to your Row Filter operator.
Open the Summary Statistics properties dialog box.
Under Select Columns, configure Columns, and then, in the top right, click All to highlight every variable.
Click either Run or Step Run in your workflow.
To display the summary results at the bottom of the screen, single-click the Summary Statistics operator. You can also see that some columns have a lot of null values.

Filter out the columns with missing values

Use the Column Filter operator to filter out columns with null or empty values ("msa" and "pmsa"). The Column Filter operator passes through only the columns you select.

Determine which attribute is most positively correlated with voter turnout

Use the Correlation operator to calculate the correlation matrix for the columns turnout, crime, college, democrat, and republican. Alternatively, try the Scatter Plot Matrix operator for a view of similar information.

Sample the original data into an 80% training set and a 20% test set

From the Operators list, click the drop-down arrow, and then select Sample to see only the sampling category operators.
Use the Random Sampling operator to configure how many different sample sets you want in your analysis, and then specify their characteristics.
Use the Sample Selector operator for each set you created.
Double-click an operator name to rename it (for example, "Train Set" and "Test Set").

Build a linear regression model from the training set to predict voter turnout based on the variables that you think are most relevant

Select the Linear Regression operator and join it to the Train Set.
Configure the regression by setting turnout as the Dependent Column.
Click Select Columns, and in the Columns dialog box, select All columns, and then clear the check box for the county.
The county column has over 2000 categorical values, which means we cannot build a valid model using it. If you leave it selected, Team Studio gives an error.
Run the workflow.
Click the Linear Regression operator to see output statistics and fit coefficients.

Predict the turnout values for the observations in the test dataset and calculate some regression statistics

We have trained a model, but we need to make predictions.

Use the Predictor operator. It requires two inputs: a model output and a dataset with matching columns.
The output of the Predictor operator adds a new column to the dataset: PRED_LIR, the predicted turnout.
Use the Regression Evaluator (Dataset) operator to evaluate the predictions using several different metrics.
Set turnout as the Actual Value and PRED_LIR as the Predicted Values.

Determine which fraction of these predictions were within 3 percentage points of the actual value

Use a Variable operator to create a new column. It can take either SQL or Pig expressions, as appropriate for your data source, to perform data transformations.

To configure the Variable operator, click Define Variable
Provide a variable name (for example, close_pred.).
Set the data type as int.
Under Pig Expression, click the drop-down button and provide the appropriate code. (See the following example.)
Use the Frequency operator to find the answer.

Example

 Database (SQL):    case when abs(turnout - "P(turnout)") < 3 then TRUE else FALSE end

Hadoop (Pig):
ABS(turnout-PRED_LIR) < 3 ? 1 : 0)

Attachments

Download the file election92.xlsx from Resources below.

election92.xlsx

Sign In

Getting Started with Spotfire Data Science - Team Studio