Introduction
Data Functions are a great way to encapsulate complex but repeatable calculations within a Spotfire analysis. Spotfire data (tables, columns, document properties) can be sent to a data function and the same kinds of data objects can be sent back to Spotfire.
There are some restrictions; for example, the columns of a Spotfire data table can be modified by only one data function, and complex objects such as R models cannot be exported to a Spotfire table.
Situations can arise when more functionality than this is needed, for example:
- A data function creates an output data table, but you want to modify this table using a second data function
- A data function builds a model which you would like to re-use in a second data function - but the model cannot be neatly expressed as a data table
- You want to create a number of intermediate data structures, but don't want to clutter up the Spotfire environment with tables and document properties - you just want these to be available for the downstream data functions.
A way to accomplish all of these things is to use a "binary object" also known as a "Blob".
The essential idea is that any R object can be compressed into a binary object using the utility function SObjectToBlob(). For example, if "mydata" is an R data frame, it can be converted into a binary object using:
mydataBlob = SObjectToBlob(mydata)
The R blob object ("mydataBlob" in the example) can be returned to Spotfire as a Document Property. This Document Property can then be used as input to a second data function, which can update "mydataBlob" and modify the Spotfire Document Property, in effect giving multiple data functions the ability to modify any data objects stored in the binary object, such as data tables.
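For readers who want to experiment outside Spotfire, the same round trip can be sketched with base R's serialize() and unserialize(); these stand in here for SObjectToBlob() and BlobToSObject(), which are only available through the SpotfireUtils package:

```r
# Minimal sketch of the blob round trip using base R serialization.
# Inside a Spotfire data function, SObjectToBlob()/BlobToSObject()
# from SpotfireUtils play the analogous roles.
mydata <- data.frame(x = 1:3, y = c("a", "b", "c"))

blob <- serialize(mydata, connection = NULL)  # raw vector: the "blob"
restored <- unserialize(blob)                 # recover the original object

identical(mydata, restored)                   # TRUE
```

Any R object, including fitted models, survives this round trip unchanged, which is what makes the pattern useful for passing models between data functions.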
For example, with this approach,
- Data function #1 can be developed to carefully fit a model and save the results into a blob (as a Spotfire Document Property)
- Data function #2 can then apply this fitted model to new data points, making predictions
As another possibility,
- Data function #1 might automatically fit a large number of curves en masse (e.g. production curves, population curves, etc)
- A user might inspect each fit and make custom adjustments which are then stored alongside the original fitted values
Example: Logistic Regression Model fitted and passed to second data function
As an example, we use data describing student applications to graduate school and logistic regression (example from UCLA Statistical Consulting Group). In Spotfire, the data table contains information on 200 students: their GPA (grade point average), GRE (Graduate Record Exam) score, prestige ranking of their undergraduate university, and the outcome of their admission status (the name of the data table is "binary").
We send this data to a TERR data function whose purpose is simply to fit a logistic (glm) model and return the resulting fitted model as a binary object:
# [TERR] Fit Logistic Model to Blob
# Logistic regression on student data (example)
# Input
#   StudentData (columns: admit, gre, gpa, rank)
# Output:
#   ModelBlob
#

TimeStamp = paste(date(), Sys.timezone())
if (file_test("-d", "C:/Temp")) suppressWarnings(try(save(list=ls(), file="C:/Temp/abc.in.RData", RFormat=T)))
# remove(list=ls()); load(file='C:/Temp/abc.in.RData'); print(TimeStamp) # use in development

StudentData$rank = factor(StudentData$rank, levels=1:4)
admitModel = glm(admit ~ gre + gpa + rank, data=StudentData, family="binomial")

# (Run the following in interactive mode)
# library(SpotfireUtils)
ModelBlob = SObjectToBlob(admitModel)
Let's look at the code piece by piece.
The first section of the code:
# [TERR] Fit Logistic Model to Blob
# Logistic regression on student data (example)
# Input
#   StudentData (columns: admit, gre, gpa, rank)
# Output:
#   ModelBlob
#
This is simply a set of comments describing what the data function does, along with its inputs and output. One input is needed: a table that appears in the code as "StudentData", with expected columns admit, gre, gpa, and rank. The output of this data function is a single object named "ModelBlob".
Next comes a handy utility that I use all the time for developing code. It creates a temporary file named "C:/Temp/abc.in.RData" which stores all of the data as it appears coming into the data function. This can be extremely valuable when developing code; for example, the columns of tables may not be in the expected order, numbers might appear as integers instead of real values, and so on. Corresponding to "abc.in.RData", I often use a similar command at the very end of a data function, creating "abc.out.RData", in case I want to check the expected outputs against what comes back into Spotfire.
After running this code once, I can then turn to an Interactive Development Environment (IDE) for R, such as RStudio, and execute the commented-out line to load the data and continue developing the code.
TimeStamp = paste(date(), Sys.timezone())
if (file_test("-d", "C:/Temp")) suppressWarnings(try(save(list=ls(), file="C:/Temp/abc.in.RData", RFormat=T)))
# remove(list=ls()); load(file='C:/Temp/abc.in.RData'); print(TimeStamp) # use in development
This next snippet actually fits the model, after converting the "rank" column to a factor variable:
StudentData$rank = factor(StudentData$rank, levels=1:4)
admitModel = glm(admit ~ gre + gpa + rank, data=StudentData, family="binomial")

# (Run the following in interactive mode)
# library(SpotfireUtils)
ModelBlob = SObjectToBlob(admitModel)
The "SpotfireUtils" library is needed for the SObjectToBlob() function. This library is loaded automatically at runtime, but it must be loaded explicitly in an interactive session.
The result, ModelBlob, is now suitable to be returned as a Spotfire Document Property.
Second data function
We now create a second data function whose only job is to apply the fitted model to new data points ("scoring").
We have a small data set of two hypothetical students with varying gpa, gre, and school rank values. We want to apply the previous model, without re-fitting, to this new data table.
Our second data function has two inputs: the binary object from before and the new data table.
# [TERR] Apply model to new data
# Input
#   ModelBlob (Spotfire binary document property)
#   NewStudentData (columns: admit, gre, gpa, rank)
# Output:
#   probability (column)

# Unpack the model previously stored in the binary object:
admitModel = BlobToSObject(ModelBlob)

NewStudentData$rank = factor(NewStudentData$rank, levels=1:4)
admitProbability = predict.glm(object=admitModel, newdata=NewStudentData, type="response")
prediction = data.frame(probability=admitProbability)
The first step in the code is to apply the BlobToSObject() function to the binary object that was stored in the Spotfire document property. The result is exactly what we originally packed inside, the fitted model, "admitModel". Following this, we convert the rank column to a factor variable as we did before.
The main activity here is the "predict.glm" function call which applies the model "admitModel" to the new data "NewStudentData" to generate a vector of predicted probability of admission for each hypothetical student.
Finally, we construct a data frame that contains this probability and output this as a 1-column table.
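The two-step fit-then-score flow can also be tried end to end outside Spotfire in plain R. In the sketch below the student data is invented for illustration, and base R's serialize()/unserialize() stand in for the SpotfireUtils blob functions:

```r
# Toy end-to-end sketch: "data function 1" fits, "data function 2" scores.
set.seed(1)
StudentData <- data.frame(
  admit = rbinom(200, 1, 0.4),
  gre   = round(rnorm(200, 580, 100)),
  gpa   = round(runif(200, 2.5, 4.0), 2),
  rank  = sample(1:4, 200, replace = TRUE)
)
StudentData$rank <- factor(StudentData$rank, levels = 1:4)
admitModel <- glm(admit ~ gre + gpa + rank, data = StudentData, family = "binomial")

# serialize()/unserialize() stand in for SObjectToBlob()/BlobToSObject()
blob <- serialize(admitModel, connection = NULL)
restoredModel <- unserialize(blob)

# Score two hypothetical new students without re-fitting
NewStudentData <- data.frame(gre = c(620, 800), gpa = c(3.1, 3.9), rank = c(3, 1))
NewStudentData$rank <- factor(NewStudentData$rank, levels = 1:4)
prediction <- data.frame(
  probability = predict.glm(restoredModel, newdata = NewStudentData, type = "response")
)
# prediction$probability holds one admission probability (0..1) per student
```

Note that predict.glm() only needs the predictor columns in the new data; the response column "admit" is not required for scoring.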
Running this in the Spotfire environment, the output is sent as a new column to the new student data table. The benefit of using a table as output is that the name we gave the column ("probability") is carried through to the new column.
In Spotfire, we find the new column of probability appended to our test table of new students.
Summary
In summary, whenever a calculation is shared among two or more TERR data functions, or a result from one data function is used by other data functions, the use of binary objects has a number of advantages:
- The Spotfire environment does not get cluttered with "work" data tables that the user does not need to see
- Complex objects such as R fitted models can be saved as-is and re-used or modified by other data functions
- The binary object is stored in a Spotfire document property, which, unlike a table, can be modified by a second or third data function
- If it is useful to display a resulting data table, a dedicated data function can be written to ingest the blob, extract the data table and return to Spotfire for rendering in an appropriate visualization
- The blob can also be saved directly to a disk file from R and archived for re-use or sharing.
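As a sketch of the last point, and assuming the blob is an R raw vector (which is true for base R serialization, used here in place of the Spotfire blob functions; the file name is arbitrary), the archive-to-disk round trip can be done with writeBin() and readBin():

```r
# Archive a blob to disk and read it back (base R serialization sketch)
model <- lm(dist ~ speed, data = cars)        # any R object, e.g. a fitted model
blob  <- serialize(model, connection = NULL)  # raw vector

path <- file.path(tempdir(), "model.blob")    # arbitrary archive location
writeBin(blob, path)                          # write raw bytes to disk

blob2  <- readBin(path, what = "raw", n = file.info(path)$size)
model2 <- unserialize(blob2)
identical(coef(model), coef(model2))          # TRUE
```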
References
UCLA Statistical Consulting Group. Logit Regression | R Data Analysis Examples. https://stats.idre.ucla.edu/r/dae/logit-regression/ (accessed July 11, 2019).
Data: binary.csv, https://stats.idre.ucla.edu/stat/data/binary.csv