Flexible options are provided to allow comparisons between variables (e.g., treating the data in each column of the input spreadsheet as a separate sample) and coded groups (e.g., if the data includes a categorical variable such as Gender to identify group membership for each case). For the t-test for independent samples, options are provided to compute t-tests with separate variance estimates, Levene and Brown-Forsythe tests for homogeneity of variance, various box-and-whisker plots, categorized histograms and probability plots, categorized scatterplots, etc.
Other, more specialized tests of group differences are part of many modules (e.g., Spotfire Statistica® Nonparametric Statistics, Spotfire Statistica® Survival & Failure Time Analysis, and Spotfire Statistica® Reliability and Item Analysis).
]]>Major applications of structural equation modeling include:
Many different kinds of models fall into each of the above categories, so structural modeling as an enterprise is difficult to characterize. Most structural equation models can be expressed as path diagrams. This program uses a command language (PATH1) that looks very much like a path diagram. Consequently, even beginners in structural modeling can perform complicated analyses with a minimum of training. Although it is not absolutely necessary, understanding factor analysis before attempting structural modeling is recommended.
]]>Credit scoring is perhaps one of the most "classic" applications for predictive modeling, to predict whether or not credit extended to an applicant will likely result in profit or losses for the lending institution. There are many variations and complexities regarding how exactly credit is extended to individuals, businesses, and other organizations for various purposes (purchasing equipment, real estate, consumer items, and so on), and using various methods of credit (credit card, loan, delayed payment plan). But in all cases, a lender provides money to an individual or institution, and expects to be paid back in time with interest commensurate with the risk of default.
Credit scoring is the set of decision models and their underlying techniques that aid lenders in the granting of consumer credit. These techniques determine who will get credit, how much credit they should get, and what operational strategies will enhance the profitability of borrowers to lenders. Further, they help to assess the risk in lending. Credit scoring is a dependable assessment of a person's creditworthiness, since it is based on actual data.
A lender commonly makes two types of decisions: first, whether to grant credit to a new applicant, and second, how to deal with existing applicants, including whether to increase their credit limits. In both cases, whatever the techniques used, it is critical that there is a large sample of previous customers with their application details, behavioral patterns, and subsequent credit history available. Most of the techniques use this sample to identify the connection between the characteristics of the consumers (annual income, age, number of years in employment with their current employer, etc.) and their subsequent history.
Typical application areas in the consumer market include: credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.
The overall objective of credit scoring is not only to determine whether the applicant is creditworthy, but also to attract quality credit applicants who can subsequently be retained and controlled while maintaining an overall profitable portfolio.
The classic and still widely used (and useful) approach for evaluating creditworthiness and risk is based on the building of "scorecards"; a typical scorecard may look like this:
For each predictor variable, specific data ranges or categories are provided (e.g., Duration of Credit), and for each specific category (e.g., Duration of Credit between 9 and 15 years), a Score is provided in the last column. For each applicant for credit, the scores can be added over all the predictor variables and categories, and based on the resulting total credit score, a decision can be made whether or not to extend credit.
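The summing of scores described above can be sketched as follows. This is a minimal illustration only: the predictor names, attribute ranges, and point values below are hypothetical, not taken from any real scorecard.

```python
# Minimal sketch of applying a scorecard. All ranges and point values
# below are hypothetical, for illustration only.
SCORECARD = {
    "duration_of_credit": [((0, 9), 40), ((9, 15), 25), ((15, 99), 10)],
    "age": [((20, 30), 15), ((30, 50), 25), ((50, 120), 35)],
}

def score_applicant(applicant):
    """Sum the points for the range each applicant value falls into."""
    total = 0
    for variable, bands in SCORECARD.items():
        value = applicant[variable]
        for (low, high), points in bands:
            if low <= value < high:
                total += points
                break
    return total

print(score_applicant({"duration_of_credit": 10, "age": 42}))  # 25 + 25 = 50
```

The resulting total would then be compared against a cutoff to make the credit decision.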
There are several aspects of the particular modeling workflow for producing a scorecard, and for using it effectively.
First, in order to be effective, a scorecard needs to be easy to use. Often, the decision to extend or deny credit must be made very quickly so as not to jeopardize a "deal" (e.g., selling a car). If a decision to extend credit takes too long, the applicant might look elsewhere for financing. Therefore, in the absence of automated scoring solutions accessible, for example, via a web page, a scorecard needs to make it easy for its user to determine the individual components contributing to the overall score and credit decision. To achieve that, it is useful to divide the values of each continuous or categorical predictor variable into a relatively small number of categories so that an applicant can be scored quickly. For example, a variable Age of Applicant could be quickly coded into 4 categories (20-30, 30-40, 40-50, 50+), and the appropriate scores associated with each category typed into a spreadsheet to compute the final score.
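This kind of coarse coding amounts to a simple lookup of the class a value falls into. A minimal sketch, using illustrative age bins (the edges and labels here are assumptions, not recommendations):

```python
import bisect

# Hedged sketch of coarse-coding a continuous predictor (here, age)
# into a few classes. Bin edges and labels are illustrative only.
edges = [30, 40, 50]                      # boundaries between classes
labels = ["20-30", "30-40", "40-50", "50+"]

def coarse_code(age):
    """Return the label of the class the age falls into ([low, high))."""
    return labels[bisect.bisect(edges, age)]

print(coarse_code(29), coarse_code(30), coarse_code(65))
```

In practice, the choice of bin edges would be driven by the training data, as described next.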
There are a number of methods and considerations that enter into the decision of how to re-code variable values into a smaller number of classes. In short, it is desirable (again, from the perspective of making the scorecard simple to use) that the credit score and credit risk across the coded classes for a predictor be a monotone increasing or decreasing function. For example, the more debt an applicant currently carries, the greater the risk of default when additional credit is extended.
Typically, during the scorecard building process, the coarse-coding of predictors is a manual procedure in which predictors are considered one-by-one based on a training data set of previous applicants with known quality characteristics (e.g., whether or not the credit was paid back). The result of this process is a set of recoded (coarse-coded) predictors that enter into subsequent predictive modeling.
When the training data set on which the modeling is based contains a binary indicator variable of "Paid back" vs. "Default", or "Good Credit" vs. "Bad Credit", then logistic regression models are well suited for subsequent predictive modeling. Logistic regression yields prediction probabilities for whether or not a particular outcome (e.g., Bad Credit) will occur. Furthermore, logistic regression models are linear models, in that the logit-transformed prediction probability is a linear function of the predictor variable values. Thus, a final scorecard model derived in this manner has the desirable quality that the final credit score (credit risk) is a linear function of the predictors and, with some additional transformations applied to the model parameters, a simple linear function of scores that can be associated with each predictor class value after coarse coding. The final credit score is then a simple sum of individual score values that can be taken from the scorecard (as shown earlier).
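One common way this "additional transformation" of the model parameters is done (not the only one) is the "points to double the odds" (PDO) scaling, which maps the model's log-odds onto a convenient score range. A minimal sketch; the base score, base odds, and the log-odds value are all hypothetical:

```python
import math

# Sketch of mapping logistic-regression log-odds to scorecard points
# via PDO scaling. All constants below are hypothetical choices.
PDO, BASE_SCORE, BASE_ODDS = 20, 600, 50   # 600 points at 50:1 good:bad odds
factor = PDO / math.log(2)
offset = BASE_SCORE - factor * math.log(BASE_ODDS)

def score_from_log_odds(log_odds_good):
    """Map the model's log-odds of 'Good Credit' to a scorecard score."""
    return offset + factor * log_odds_good

# An applicant whose coarse-coded attributes yield log-odds of 4.0:
print(round(score_from_log_odds(4.0)))
```

Because the transformation is affine, each PDO points added to the score corresponds to a doubling of the good:bad odds, and the total score remains a simple sum of per-attribute points.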
The term Reject Inference describes the issue of how to deal with the inherent bias that arises when modeling is based on a training dataset consisting only of those previous applicants for whom the actual performance (Good Credit vs. Bad Credit) was observed; there are likely a significant number of previous applicants who were rejected and for whom final "credit performance" was never observed. The question is how to include those previous applicants in the modeling, in order to make the predictive model more accurate, robust, and less biased, and applicable to those individuals as well.
This is of particular importance when the criteria for deciding whether or not to extend credit need to be loosened, in order to attract and extend credit to more applicants. This can happen, for example, during a severe economic downturn that affects many people and places their overall financial well-being into a condition that would not qualify them as acceptable credit risks using older criteria. In short, if nobody were to qualify for credit any more, the institutions extending credit would be out of business. So it is often critically important to make predictions about observations with specific predictor values that were essentially outside the range of what would previously have been considered, and that consequently have not been observed in the training data where the actual outcomes are recorded.
There are a number of approaches that have been suggested on how to include previously rejected applicants for credit in the model building step, in order to make the model more broadly applicable (to those applicants as well). In short, these methods come down to systematically extrapolating from the actual observed data, often by deliberately introducing biases and assumptions about the expected loan outcome, had the (in actuality not observed) applicant been accepted for credit.
Once a (logistic regression) model has been built based on a training data set, next the validity of the model needs to be assessed in an independent holdout or testing sample, for exactly the same reasons and using the same methods as is typically done in most predictive modeling. All of these methods, graphs, and statistics that are typically computed for this purpose evaluate the improved odds for differentiating the Good Credit applicants from the Bad Credit applicants in the holdout sample, compared to simply guessing or some other methods for making the decision to extend or deny credit.
Useful graphs include the lift chart, the Kolmogorov-Smirnov chart, and other ways to assess the predictive power of the model. For example, the following graph shows the Kolmogorov-Smirnov (KS) graph for a credit scorecard model.
In this graph, the X axis shows the credit score values (sums), and the Y axis denotes the cumulative proportion of observations in each outcome class (Good Credit vs. Bad Credit) in the hold-out sample. The further apart the two lines are, the greater the degree of differentiation between the Good Credit and Bad Credit cases in the hold-out sample, and thus the better (more accurate) the model.
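The maximum vertical gap between the two cumulative curves is the KS statistic itself. A minimal sketch, using made-up hold-out scores for illustration:

```python
# Sketch of the Kolmogorov-Smirnov separation statistic for a scorecard.
# Hold-out scores below are made up for illustration.
good_scores = [610, 640, 655, 670, 700, 720]   # observed Good Credit cases
bad_scores = [540, 560, 585, 600, 615, 630]    # observed Bad Credit cases

def ks_statistic(good, bad):
    """Max vertical gap between the two cumulative score distributions."""
    thresholds = sorted(set(good) | set(bad))
    best = 0.0
    for t in thresholds:
        cum_good = sum(s <= t for s in good) / len(good)
        cum_bad = sum(s <= t for s in bad) / len(bad)
        best = max(best, abs(cum_bad - cum_good))
    return best

print(ks_statistic(good_scores, bad_scores))
```

A larger KS value corresponds to the two lines in the graph being further apart, i.e., better separation between the outcome classes.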
Once a good (logistic regression) model has been finalized and evaluated, a decision must be made about where to put the cutoff values for extending or denying credit (or where more information should be requested from the applicant to support the application). The most straightforward approach is to take as the cutoff the point at which the greatest separation between Good Credit and Bad Credit cases is observed in the hold-out sample, and thus can be expected in new data. However, many other considerations typically enter into this decision.
First, default on a large amount of credit is worse than default on a small amount of credit. Generally, the loss or profit associated with the 4 possible outcomes (correctly predicting Good Credit, correctly predicting Bad Credit, incorrectly predicting Good Credit, incorrectly predicting Bad Credit) needs to be taken into consideration, and the cutoff should be selected to maximize the profit based on the model's predictions of risk. There are a number of methods and specific graphs that are typically prepared and consulted to decide on final score cutoffs, all of which deal with assessing the expected gains and losses at different cutoff values.
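The profit-based cutoff selection described above can be sketched as a simple search over candidate cutoffs. The hold-out scores, outcomes, and the profit/loss figures below are hypothetical:

```python
# Sketch: pick the score cutoff that maximizes expected profit on a
# hold-out sample. Scores, outcomes, and profit/loss figures are
# hypothetical.
holdout = [(540, "bad"), (560, "bad"), (585, "bad"), (600, "good"),
           (615, "bad"), (630, "good"), (655, "good"), (700, "good")]
PROFIT_PER_GOOD = 100    # profit when credit is extended and repaid
LOSS_PER_BAD = -400      # loss when credit is extended and defaults

def best_cutoff(sample):
    """Return (cutoff, profit) maximizing total profit if credit is
    extended only to applicants scoring at or above the cutoff."""
    best = None
    for cutoff in sorted({score for score, _ in sample}):
        profit = sum(PROFIT_PER_GOOD if outcome == "good" else LOSS_PER_BAD
                     for score, outcome in sample if score >= cutoff)
        if best is None or profit > best[1]:
            best = (cutoff, profit)
    return best

print(best_cutoff(holdout))  # (630, 300)
```

Note that because a single default is far more costly than a single repaid loan here, the profit-maximizing cutoff is stricter than the one that merely maximizes classification accuracy.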
Finally, once a scorecard has been finalized and is being used to extend credit, it obviously needs to be monitored carefully to verify the expected performance. Fundamentally, three things can change:
First, the population of applicants may change with respect to the important predictors (those used in the scorecard). For example, the applicant pool may become younger, or may show fewer assets than the applicant pool described in the training data from which the scorecard was built. This will obviously change the proportion of applicants who will be acceptable (given the current scorecard), and it may well change where the best score cutoff should be set. So-called population stability reports are used to capture and track changes in the population of applicants (the composition of the applicant pool with respect to the predictors).
Second, the predictions from the scorecard may become increasingly inaccurate. Thus, the accuracy of the predictions from the model must be tracked, to determine when a model should be updated or discarded (and when a new model should be built).
Third, the actual observed rate of default (Bad Credit) may change over time (e.g., due to economic conditions). Such changes will necessitate adjustments to the cutoff values, and perhaps to the scorecard model itself. The methods and reports typically used to track the rate of delinquent loans, and to compare it to the expected delinquency, are called Vintage Analysis or Delinquency Reports/Analysis.
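Population stability reports of the kind mentioned above commonly summarize applicant-pool drift with a Population Stability Index (PSI). A minimal sketch; the class proportions and the rule-of-thumb thresholds in the comment are illustrative, not a standard:

```python
import math

# Sketch of a Population Stability Index (PSI) computation, comparing
# the training-time distribution of a coarse-coded predictor with the
# current applicant pool. Proportions below are hypothetical.
expected = {"20-30": 0.25, "30-40": 0.35, "40-50": 0.25, "50+": 0.15}
actual   = {"20-30": 0.40, "30-40": 0.33, "40-50": 0.17, "50+": 0.10}

def psi(expected, actual):
    """Sum of (actual - expected) * ln(actual / expected) over classes."""
    return sum((actual[c] - expected[c]) * math.log(actual[c] / expected[c])
               for c in expected)

# A commonly quoted (informal) reading: < 0.1 stable, 0.1-0.25 moderate
# shift, > 0.25 major shift warranting model review.
print(round(psi(expected, actual), 3))
```

Tracking this index per predictor over time flags the kind of population drift that would prompt revisiting the cutoff or rebuilding the scorecard.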
The traditional method for building scorecards briefly outlined above is still widely in use, because it has a number of advantages with respect to the interpretability of models (and thus the ease with which decisions about whether or not to extend credit can be explained to applicants or regulators). Also, such scorecards often provide sufficient predictive accuracy, making it unnecessary and too costly to develop more complex alternatives (i.e., there is insufficient ROI to justify more complex methods).
However, in recent years, general predictive modeling methods have become increasingly common, replacing the traditional logistic-regression based linear sum-of-scores scorecards.
First, a popular modification of the traditional approach replaces the logistic regression model in the modeling step with the Cox Proportional Hazards Model. In short, the Cox model predicts the probability of failure, default, or "termination" of an outcome within a specific time interval. Details regarding the Cox model (the proportionality-of-hazards assumption, and how to test it) can be found in Survival Analysis. Effectively, this method can be considered an alternative and refinement to logistic regression, in particular when "life-times" for credit performance (until default, early pay-off, etc.) are available in the training data. The Cox model is still a linear model (of the relative hazard rate), i.e., it is linear in the predictors, and the predictions are linear combinations of predictor values. Hence, the predictor pre-processing described above (e.g., coarse coding of predictors) is still useful and applicable, as are the subsequent steps for model evaluation, cutoff selection, and so on.
If the accuracy of the prediction of risk is the most important consideration of a scorecard building project (and is associated with most of the expected ROI resulting from the project), then predictive modeling methods and general approximators such as Stochastic Gradient Boosting provide better performance than linear models. The development of advanced data mining predictive modeling algorithms has basically been driven by the desire to detect complex high-order interactions, nonlinearities, discontinuities, and so on among the predictors and their relationships to the outcome of interest, in order to boost predictive accuracy.
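As an illustration of this approach, the following sketch fits a stochastic gradient boosting classifier to synthetic data with a deliberately nonlinear, interaction-driven outcome. This uses scikit-learn for convenience (the article describes stochastic gradient boosting generically; this is not the Statistica implementation), and the data and features are made up:

```python
# Illustrative sketch: stochastic gradient boosting on synthetic credit-
# like data with a nonlinear interaction. Uses scikit-learn, not the
# Statistica implementation described in the article.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                  # e.g. scaled income, debt, age
# Synthetic rule: "bad credit" depends on an interaction and a threshold,
# exactly the kind of structure a linear scorecard cannot capture directly.
y = ((X[:, 0] * X[:, 1] > 0.2) | (X[:, 2] > 1.0)).astype(int)

model = GradientBoostingClassifier(n_estimators=100, subsample=0.7,
                                   random_state=0)  # subsample => "stochastic"
model.fit(X, y)
bad_credit_prob = model.predict_proba(X)[:, 1]  # used in place of sum scores
print(bad_credit_prob.shape)  # (500,)
```

The prediction probabilities take the role that sum scores play in the traditional workflow; the cutoff-selection and monitoring steps described earlier then operate on these probabilities instead.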
Note that automated (computer-based) scoring engines can deliver near-instant feedback to credit applicants, thus negating the speed advantage of traditional scorecards (as described above). Also, it is still possible to automatically perform analyses subsequent to the credit decision to determine which predictor variable(s) and value(s) most influenced the prediction of risk and the subsequent denial of credit (although those methods are less straightforward), and to provide that feedback to applicants (which is typically required by the laws governing the credit business).
The actual process of building scorecard models using data mining algorithms such as stochastic gradient boosting usually turns out to be simpler than traditional techniques. Since most algorithms are general approximators capable of representing any relationship between predictors and outcomes, and are also relatively robust to outliers, it is not necessary to perform many of the predictor preparation steps such as coarse-coding, etc. All steps subsequent to model building still apply, except that instead of evaluating models and identifying cutoff values based on (sum) scores, the graphs and tables that are typically made to support those analyses can be created based on prediction probabilities from the respective data mining predictive model (or ensemble of models).
Likewise, most of the typical steps after implementation (into "production") of the scorecard also still apply and are necessary to evaluate the performance of the scoring system (as well as population stability, delinquency rates, and accuracy).
The application of scoring models in today's business environment covers a wide range of objectives. The original task of estimating the risk of default has been augmented to include other aspects of credit risk management: at the pre-application stage (identification of potential applicants), at the application stage (identification of acceptable applicants), and at the performance stage (identification of possible behavior of current customers). Scoring models with different objectives have been developed; they can be generalized into the four categories listed below.
Crook, J., & Banasik, J. (2004). Does reject inference really improve the performance of application scoring models? Journal of Banking & Finance, 28, 857-874.
Thomas, L. C., Edelman, D. B., Crook, J. N. (2002). Credit scoring and its applications. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Siddiqi, N. (2005). Credit risk scorecards: Developing and implementing intelligent credit scoring. New York: Wiley.
]]>See also Spotfire Statistica® Cox Proportional Hazards Models module.
]]>Nonparametric methods were developed to be used in cases when the data scientist knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes called parameter-free methods or distribution-free methods.
Nonparametric methods are most appropriate when sample sizes are small. When the data set is large, it may not make sense to use these statistics, because of the central limit theorem: when samples become large, the sample means will follow the normal distribution even if the respective variable is not normally distributed in the population, or is not measured very well. Thus, parametric methods, which are usually much more sensitive, are in most cases appropriate for large samples. Note, however, that the tests of significance of many nonparametric statistics are based on asymptotic (large sample) theory; therefore, meaningful tests often cannot be performed if the sample sizes become too small.
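A classic example of the rank-based statistics this family of methods relies on is the Mann-Whitney U, which compares two small samples using only the ordering of the values. A minimal pure-Python sketch of the statistic itself (significance testing against the U distribution is omitted):

```python
# Pure-Python sketch of the Mann-Whitney U statistic, a classic
# nonparametric two-sample comparison based only on ranks.
def mann_whitney_u(a, b):
    """U for sample a: count of pairs (x, y) with x > y; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1
            elif x == y:
                u += 0.5
    return u

# Two completely separated samples give the extreme values 0 and n1*n2:
print(mann_whitney_u([1, 2, 3], [4, 5, 6]), mann_whitney_u([4, 5, 6], [1, 2, 3]))
```

Because the statistic depends only on ranks, it is unaffected by any monotone transformation of the data and makes no assumption about the underlying distribution.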
Note that specialized nonparametric tests and statistics are also part of other modules, e.g., Survival Analysis, Process Analysis, and others. All rank order tests can handle tied ranks and apply corrections for small n or tied ranks. The program can handle extremely large analysis designs. All tests are integrated with graphs that include various scatterplots, specialized box-and-whisker plots, line plots, histograms and many other 2D and 3D displays.
Some say that microservices are SOA done right, and that they are more of an integration-focused design choice. However, the microservices philosophy of breaking up monolithic applications into more business-focused processes means that your analytical models must be integrated into these microservices as well.
Enterprises and data scientists use R for predictive modeling and data mining quite extensively. However, it is difficult to integrate R into modern web development tools and microservice architectures aimed at a uniform customer experience across digital channels.
This demo showcases Spotfire®'s ability to publish a statistical model, developed in Spotfire® Enterprise Runtime for R (TERR), as a microservice. We also show you how your Integration applications can consume your statistical models deployed in TERR.
]]>
Original release date: November 26, 2018
Last revised:
CVE-2018-18807
Source: TIBCO Software Inc.
Systems Affected
The following component is affected:
Description
The component listed above contains vulnerabilities which may allow an authenticated user to perform cross-site scripting (XSS) attacks.
Impact
The impact of this vulnerability includes the theoretical possibility that an authenticated user could escalate privileges to gain administrative access to the web interface of the affected component.
CVSS v3 Base Score: 7.6 (CVSS:3.0/AV:N/AC:L/PR:L/UI:R/S:U/C:H/I:L/A:H)
Solution
TIBCO has released updated versions of the affected components which address these issues.
For each affected system, update to the corresponding software versions:
The information on this page is being provided to you on an "AS IS" and "AS-AVAILABLE" basis. The issues described on this page may or may not impact your system(s). Spotfire makes no representations, warranties, or guarantees as to the information contained herein. ANY AND ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE ARE HEREBY DISCLAIMED. BY ACCESSING THIS DOCUMENT YOU ACKNOWLEDGE THAT SPOTFIRE SHALL IN NO EVENT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES THAT ARISE OUT OF YOUR USE OR FAILURE TO USE THE INFORMATION CONTAINED HEREIN. The information on this page is being provided to you under the terms of your license and/or services agreement with Spotfire, and may be used only for the purposes contemplated by the agreement. If you do not have such an agreement with Spotfire, this information is provided under the Spotfire.com Terms of Use, and may be used only for the purposes contemplated by such Terms of Use.
]]>
Original release date: August 16, 2022
Last revised: ---
CVE-2022-30576
Source: TIBCO Software Inc.
Products Affected
The following component is affected:
Description
The component listed above contains an easily exploitable vulnerability that allows a low privileged attacker with network access to execute Stored Cross Site Scripting (XSS) on the affected system. A successful attack using this vulnerability requires human interaction from a person other than the attacker.
Impact
Successful exploitation of these vulnerabilities results in an attacker being able to execute commands with the privileges of the affected user.
CVSS v3.1 Base Score: 8.7 (CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:C/C:H/I:H/A:N)
Solution
TIBCO has released updated versions of the affected systems which address this issue:
]]>
Original release date: August 16, 2022
Last revised: ---
CVE-2022-30575
Source: TIBCO Software Inc.
Products Affected
The following component is affected:
Description
The component listed above contains easily exploitable Reflected Cross Site Scripting (XSS) vulnerabilities that allow a low privileged attacker with network access to execute scripts targeting the affected system or the victim's local system.
Impact
Successful exploitation of these vulnerabilities results in an attacker being able to execute commands with the privileges of the affected user.
CVSS v3.1 Base Score: 7.3 (CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:U/C:H/I:H/A:N)
Solution
TIBCO has released updated versions of the affected systems which address this issue:
]]>]]>
You might find yourself working in a situation where you have Python programmers writing Python scripts and R programmers writing R scripts, but you need to share results from data across the organization. Using TIBCO Enterprise Runtime for R, Python, and a set of available packages, you can span the chasm of programming languages for meaningful results. Optionally, you can create a data function in Spotfire to call this code, and then use the results returned from Python to create a visualization.
In this article, we install the tools and packages in TIBCO Enterprise Runtime for R that are required to pass data to Python, send a script to run in Python, and then get the fitted model back from Python. We compare the results with a model fitted using the TIBCO Enterprise Runtime for R function lm.
For this solution, we work in Windows, because Spotfire Analyst is a Windows desktop application. We need to make sure our system meets the requirements, and we have the software applications and packages to run the code.
We are running Windows 10, which is a 64-bit system, and we have a modern system with adequate hard disk space, CPU power, and memory.
The software for our solution includes the following.
Spotfire Analyst 7.7, which includes TIBCO Enterprise Runtime for R version 4.2.
Optionally, we can run the TIBCO Enterprise Runtime for R version 4.2 Developer Edition from our installation of RStudio.
Anaconda Python 4.1.1 or later, which includes Python 3.5 (64 bit). The installation requires 631 MB.
Download from https://www.continuum.io/downloads.
We recommend using the Anaconda installation because it includes many packages for data science, including numpy, scipy, pandas, and statsmodels.
Note: You must put Python in your path because the code you need to run this example looks only in the directories listed in the environment variable PATH to find the Python executable and DLL files. If you see the following message, check to see that Python is in your PATH.
INFO: Could not find files for the given pattern(s).
Note: To see system requirements for installing the software, see their individual Help topics or Support information.
Spotfire system requirements: <docs.tibco.com/pub/spotfire/general/sr/sr/topics/tibco_cloud_spotfire.html>
Anaconda system requirements: <https://docs.continuum.io/navigator/>
Both TIBCO Enterprise Runtime for R and Python use packages that contain specialized functions to solve specific programming and industry problems. In this case, the packages enable the two systems to connect and to communicate, exchanging data frames.
TIBCO Enterprise Runtime for R uses the following packages (plus all their dependencies) from the Comprehensive R Archive Network (CRAN).
Anaconda manages finding, installing, and building binary Windows packages from available Python package resource sites. Python uses the following packages (plus all their dependencies).
Our solution demonstrates calling Python from TIBCO Enterprise Runtime for R to fit a linear model in Python using the ols function from the statsmodels package. Fitted values from Python are passed back to TIBCO Enterprise Runtime for R and compared with the fitted values from the lm function in TIBCO Enterprise Runtime for R.
For analysis in TIBCO Enterprise Runtime for R, statisticians use the data.frame object to contain the data. For analysis in Python, programmers use pandas, a powerful Python data analysis toolkit, which contains the data structure DataFrame.
These two object types are not compatible. We can use the CRAN package feather and the Python package feather-format to provide the means to translate the data between the two programming languages while maintaining the structure and integrity of the data.
We use the CRAN package feather to send the data.frame object from TIBCO Enterprise Runtime for R to Python, and the feather-format package on the Python side. Python reads in the data as a DataFrame, adding a column needed by that data structure. After running the script to process the data (fitting the model, in our example), we perform the reverse process, using feather-format in Python to send the data back to TIBCO Enterprise Runtime for R, which reads in the data, with the help of the feather package, as a new data.frame (with an additional column).
Download the attached .zip archive, feather_format.zip, included with this article. This zip archive contains the feather-format package. Copy the zip archive to the site library for your Anaconda Python installation, and then extract the .whl file it contains. For this example, we provide the .whl archive feather_format-0.3.0-cp35-cp35m-win_amd64.whl.
From a Windows command prompt, install the feather-format package.
pip install feather-format
From the Spotfire menu, click Tools > TERR Tools, and then open the TIBCO Enterprise Runtime for R console.
Note Optionally, you can use RStudio, specifying TIBCO Enterprise Runtime for R as the engine.
install.packages("feather")
install.packages("PythonInR")
library("feather")
library("PythonInR")
Connect to Python by calling the pyConnect function from the PythonInR package. You should not need to specify a path.
PythonInR::pyConnect() # only needed on Windows
Assign the sample data set (fuel.frame) to the name ff.
ff <- Sdatasets::fuel.frame
Create a temporary file, and then write the data.frame to a feather file, passing in the data set and the temporary path.
tempff <- tempfile("ff")
write_feather(ff, path=tempff)
Pass the path of the feather file to Python. Note the r before the file name tempff; this specifies creating a raw string.
PythonInR::pyExecp(paste0("fthrfile = r'", tempff, "'"))
PythonInR::pyExec('
import feather
from statsmodels.formula.api import ols
df = feather.read_dataframe(fthrfile)
linmod = ols(formula="Fuel ~ Weight + Type", data=df).fit()
pred = linmod.predict(df)
df["Fitted"] = pred
feather.write_dataframe(df, fthrfile)
')
The script reads the feather file into a pandas DataFrame, fits a linear model of Fuel on Weight and Type, adds the fitted values to the DataFrame as a new Fitted column, and writes the result back to the same feather file. Read the result back into TIBCO Enterprise Runtime for R:
ff2 <- read_feather(tempff)
Fit the same model in TIBCO Enterprise Runtime for R using lm, and extract the fitted values with the predict function.
m1 <- lm(Fuel ~ Weight + Type, data = ff)
p1 <- predict(m1, ff)
all.equal(unname(p1), ff2$Fitted)
The returned value should be TRUE, which indicates that the fitted values returned by TIBCO Enterprise Runtime for R and those returned by Python are identical, assuring us that the code ran correctly in both environments.
]]>
While caching delivers significant performance improvements, there is an unavoidable overhead in HDFS associated with retaining these checkpoints. In this KB article, I discuss simple configuration options that significantly reduce the HDFS footprint of these checkpoints, allowing users to benefit from Chorus caching while being judicious with their HDFS resources.
The Chorus cache can be viewed as being composed of two components:
For each workflow, the user-visible checkpoints can be viewed via the visual workflow editor's action menu. From this dropdown select "Clear Temporary Data". The resulting popup window displays the current checkpoints for the workflow and allows the user to selectively delete unwanted checkpoints.
There are a couple of simple ways to reduce the HDFS overhead associated with maintaining the Chorus cache, which can be easily configured on a per data source, or even per workflow level:
It's also possible to create explicit checkpoints using the Chorus convert operator, which allows the explicit generation of a compressed Parquet checkpoint that can be used by downstream operators and can be maintained independently of the Alpine runtime checkpoints. This provides a cost-effective way to support checkpoints at strategic points in the DAG, e.g., after feature engineering is completed and before model training. A simple example of this approach is illustrated below.
]]>auto.arima
from the forecast package.
TIBCO is honored to sponsor an international women's pro cycling team, "Team TIBCO Silicon Valley Bank".
At major TIBCO events, we showcase one of the team's bikes and invite visitors to ride for 30 seconds and do their best. This is the "Team TIBCO Bike Challenge". The TIBCO Data Science stack is used to collect data from a series of sensors (heart rate, crank revolutions, power generated, and so on), to display performance metrics in real time, and to maintain a leader board. Riders are usually encouraged by one of the team members herself!
We have decided to "augment" this experience.
Now, a spectator can wear Microsoft HoloLens glasses and watch the Team TIBCO Bike Challenge while someone is trying to beat a record. The app displays different information depending on the direction you are looking:
This data comes from the TIBCO Cloud Messaging channel and is received by the Unity application on the HoloLens glasses.
The application also shows a virtual board with photos of all team members.
When the user clicks a photo, an information card moves close to the user, rotates, and reveals information about the selected rider.
All information on the team members comes from TIBCO Cloud LiveApps, our case management solution.
Sensor data from the physical bike is sent by a lightweight Flogo implementation running on a Raspberry Pi to TIBCO Cloud Messaging, and from there to Stream Data, Spotfire, and finally the Microsoft HoloLens.
Do you have a business use case to prototype for your company? Contact us at tibcolabs@tibco.com.
]]>As COVID-19 continues to impact people's lives, we are interested in predicting case trends of the near future. Trying to predict an epidemic is certainly no easy task. While challenging, we explore a variety of modeling approaches and compare their relative performance in predicting case trends. In our methodology, we focus on using data of the past few weeks to predict the data of the next week. In this blog, we first talk about the data, how it is formatted and managed, and then describe the various models that we investigated.
The data we use records the number of new cases reported in each county of the U.S. every day. Even though the dataset contains much more information, such as the numbers of recovered patients and deaths, the columns that we focus on are "Cases", "Date", "State", and "County". We combine the "State" and "County" columns into a single column named "geo". After that, we decided to use the 2 weeks from 05/21/2020 to 06/03/2020 as training data, to try to predict the median number of cases from 06/04/2020 to 06/10/2020.
We obtain the following table for the training set:
To trim down the data, we remove all counties that have fewer than 10 cases in the 2 training weeks. The final dataset has 1,521 counties in total, which is around half of the 3,141 counties in the US.
The first method that we look into is Friedman's Supersmoother. This is a nonparametric estimator based on local linear regression. Using a series of these regressions, the Projection method generates a smoothed line for our time series data. Below is an example of the smoother on COVID case data from King County in Washington State:
As part of our methods for prediction, we use the last 2 points fitted by the smoother to compute a slope, and then use this slope to predict the number of cases for the next week. We find that Friedman's Supersmoother is consistent and easy to use because it does not require any parameters. However, we have found that outliers can sometimes cause the method to behave erratically.
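The projection step described above can be sketched as follows. This is a minimal illustration, not the supersmoother itself: it assumes you already have the smoother's fitted values (the numbers here are made up) and simply extrapolates from the slope of the last two of them, clamping negative predictions to zero.

```python
def project_next_week(fitted, days_ahead=7):
    """Extrapolate daily values from the slope of the last two smoothed points."""
    slope = fitted[-1] - fitted[-2]   # one-day slope at the end of the smoother fit
    last = fitted[-1]
    # clamp at zero: a steep final decrease can otherwise predict negative cases
    return [max(0.0, last + slope * d) for d in range(1, days_ahead + 1)]

# hypothetical smoother-fitted values for the last days of the training window
print(project_next_week([12.0, 13.5, 15.2, 16.0, 17.1]))
```

Because the method uses only the last two fitted points, a sharp drop at the end of the training window drives every projected day to the zero clamp, which is the erratic behavior noted above.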
In this approach, we use R's built-in generalized linear model function, glm. GLMs generalize the linear model paradigm by introducing a link function to accommodate data that cannot be fit with a normal distribution. The link function transforms a linear predictor to enable the fit. The type of link function used is specified by the "family" parameter in R's glm function. As is usual with count data, we use family="poisson". A good introduction can be found in The General Linear Model (GLM): A gentle introduction. One drawback of this approach is that our model could be too sensitive to outliers. To combat this, we experiment with two approaches: Cook's Distance and Forward Search.
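To make the Poisson GLM concrete, here is a minimal sketch of fitting the log-linear model cases ~ Poisson(exp(b0 + b1*t)), with t the day index, by Newton's method on the log-likelihood. This illustrates what glm(family="poisson") amounts to for a single predictor; it is not the R function, and the counts are synthetic.

```python
import math

def fit_poisson(y, iters=25):
    """Fit cases_t ~ Poisson(exp(b0 + b1*t)) by undamped Newton-Raphson."""
    b0 = math.log(max(sum(y) / len(y), 1e-9))   # start at the mean rate
    b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, yt in enumerate(y):
            mu = math.exp(b0 + b1 * t)          # Poisson mean under the log link
            g0 += yt - mu                       # score w.r.t. intercept
            g1 += (yt - mu) * t                 # score w.r.t. slope
            h00 += mu                           # entries of X' diag(mu) X
            h01 += mu * t
            h11 += mu * t * t
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det       # Newton update: H^-1 g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def predict_cases(b0, b1, t):
    return math.exp(b0 + b1 * t)

# two weeks of synthetic, exponentially growing counts
y = [round(math.exp(1.0 + 0.1 * t), 2) for t in range(14)]
b0, b1 = fit_poisson(y)
```

Note the exponential form of the mean: the fit is monotone in t, which is exactly why, as discussed later, a long training window spanning an outbreak peak defeats this model.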
This method is quite straightforward and can be summarized in 3 steps:
Fit a GLM model.
Calculate Cook's distance, which measures the influence of each data point, for all 14 points. Remove high-influence points, where the data is far from the fitted line.
Fit a GLM model again based on the remaining data.
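The three steps can be sketched as follows. To keep the example self-contained, ordinary least squares stands in for the Poisson GLM, and the case counts and the 4/n cutoff (a common rule of thumb for Cook's distance) are illustrative assumptions, not the blog's actual settings.

```python
def ols(x, y):
    """Simple least-squares line fit, returning (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def cooks_distance(x, y):
    """Cook's D_i = (r_i^2 / (p*s^2)) * h_ii / (1 - h_ii)^2 for a line fit."""
    n, p = len(x), 2                              # two parameters: intercept, slope
    b0, b1 = ols(x, y)                            # step 1: fit the model
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - p)      # residual variance estimate
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]   # leverages h_ii
    return [(r * r / (p * s2)) * (h / (1 - h) ** 2) for r, h in zip(resid, lev)]

def refit_without_outliers(x, y, threshold=None):
    d = cooks_distance(x, y)                      # step 2: flag influential points
    cut = threshold if threshold is not None else 4.0 / len(x)
    keep = [i for i in range(len(x)) if d[i] < cut]
    return ols([x[i] for i in keep], [y[i] for i in keep])  # step 3: refit

# 14 days of made-up case counts with one injected outlier on day 7
x = list(range(14))
y = [10, 12, 13, 15, 16, 18, 19, 80, 22, 24, 25, 27, 28, 30]
print(refit_without_outliers(x, y))
```

With the spike on day 7 removed, the refit recovers a slope close to the underlying trend of the other 13 days.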
One caveat of this method is that the model might not converge in the first step. Such cases are rare, though, if we use only 2 weeks of training data. With a longer training period, the linear predictor structure may prove too limited and require other methods.
The Forward Search method is adapted from the second chapter of the text "Robust Diagnostic Regression Analysis" by Anthony Atkinson and Marco Riani. In Forward Search, we start with a model fit to a subset of the data. The goal is to start with a model that is very unlikely to be built on data that includes outliers. In our case, there are few enough points that we can build a set of models based on every pair of points, or select a random sample to speed up the process. Out of these, we choose the model that best fits the data. Then, the method iteratively and greedily selects data points to add to the model. In each step, the deviance of the data points from the fitted line is recorded. A steep jump in deviance implies that the newly added data point is an outlier. Let's look at this method in further detail:
1. Find the initial model using the following steps:
a. Build a model on every pair of 2 data points. Since we have 14 data points in total, we have (14 choose 2) = 91 candidate models.
b. Compute the trimmed sum of squared errors of the 14 data points under each fitted model. (Trimmed here means that we use only the 11 data points with the least squared error; the intention is to ignore outliers when fitting.)
c. The model with the least trimmed sum of squared errors is selected as the initial model.
For explanation, let's assume that the red line below was chosen as the initial model. This means that, out of all the pairings of two points, this model fit the data the best.
2. Next, we walk through and add the data points to our initial model. The process is as follows:
a. Record the deviance of all 14 data points from the existing model.
b. Using the points with the lowest deviances from the current model, select the subset with one additional point for fitting the next model in the sequence.
c. Using the newly fit model, repeat this process iteratively on the rest of the data.
3. We evaluate the results of step 2 by looking at the recorded deviance from each substep. A steep jump in the recorded deviance (more than 1.5 standard deviations) indicates that we've reached an outlier: compared to the previous model, which does not include the outlier, the newly created model shifts the fit and the recorded deviance significantly, suggesting that this data point is unlike the rest of the data. Additionally, we can presume that the remaining points added after the steep jump are also aligned with the skewed data and can be treated as outliers.
4. Ignoring the outliers identified in step 3, use the remaining data set as training data for the GLM and fit the final model.
Using this method, we will always be able to get a converged model. However, the first step of selecting the best initial model can be very time consuming and the time complexity is O(N^2), where N is the number of data points in the training set. One way to reduce the runtime is to use a sample of possible combinations. In our example, we may try 10 combinations out of the potential 91 combinations.
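The whole procedure can be sketched as follows. As simplifications, a straight-line fit and squared error stand in for the GLM and its deviance, the data are made up, and a small numerical tolerance guards the 1.5-SD jump rule from step 3:

```python
import itertools
import statistics

def line_through(p, q):
    """Exact line through two points (a step-1 candidate model)."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return y1 - slope * x1, slope              # (intercept, slope)

def sq_errors(b0, b1, pts):
    return [(y - (b0 + b1 * x)) ** 2 for x, y in pts]

def ols_fit(pts):
    """Least-squares line, used to refit as the subset grows."""
    n = len(pts)
    xbar = sum(x for x, _ in pts) / n
    ybar = sum(y for _, y in pts) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pts)
    b1 = sum((x - xbar) * (y - ybar) for x, y in pts) / sxx
    return ybar - b1 * xbar, b1

def forward_search(pts, trim=11):
    n = len(pts)
    # Step 1: initial model = two-point line with least trimmed squared error.
    model = min(
        (line_through(p, q) for p, q in itertools.combinations(pts, 2)),
        key=lambda m: sum(sorted(sq_errors(m[0], m[1], pts))[:trim]),
    )
    # Step 2: grow the subset one point at a time, recording each entrant's error.
    entry_idx, entry_err = [], []
    for size in range(2, n):
        errs = sq_errors(model[0], model[1], pts)
        order = sorted(range(n), key=errs.__getitem__)
        entry_idx.append(order[size])          # the point joining the subset
        entry_err.append(errs[order[size]])    # its deviance when it joins
        model = ols_fit([pts[i] for i in order[:size + 1]])
    return entry_idx, entry_err

def fit_without_outliers(pts, jump_sds=1.5, tol=1e-8):
    entry_idx, entry_err = forward_search(pts)
    # Step 3: the first entrant whose error jumps more than jump_sds SDs above
    # the earlier entrants marks the start of the outliers.
    cutoff = len(entry_err)
    for k in range(2, len(entry_err)):
        mu = statistics.mean(entry_err[:k])
        sd = statistics.stdev(entry_err[:k])
        if entry_err[k] > mu + jump_sds * sd and entry_err[k] > tol:
            cutoff = k
            break
    outliers = sorted(set(entry_idx[cutoff:]))
    # Step 4: final fit ignoring the outliers.
    clean = [p for i, p in enumerate(pts) if i not in set(outliers)]
    return ols_fit(clean), outliers
```

The `min` over `itertools.combinations(pts, 2)` is the O(N^2) step discussed above; sampling a handful of pairs instead of all 91 would go exactly where that generator is.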
Our next approach is a simplified version of a Moving Average. For this, we first compute the average of the first training week, and then the average of the second training week. Here, we assume that the number of cases reported each day changes linearly over time. While simple, a moving average can obtain decent results with strong performance. Below is a visual representation of this method. The first red point represents the average of the first week and the second represents the average of the second week. The slope between the two points is then used to project the following week.
To evaluate these approaches, we used each method to project the median number of cases for the next week based on the case data from the previous two weeks. In addition, we analyzed each model in terms of a classification problem, checking whether it correctly identified whether the case trend was increasing or decreasing. Doing this over all of the counties in our dataset gives each method a list of 1,521 projected medians. Comparing the projections to the actual data, we can calculate the observed median error for each county across the methods. The table below displays the percentiles of each method's list of errors.
Note that it is quite common for the Moving Average and Projection methods to predict a negative number of cases; in those situations, we force them to predict 0. It is also common for both GLM models to produce an extremely large number of cases.
Overall, the GLM model utilizing Cook's Distance to find outliers seems to perform best. This method rarely makes negative predictions and predicts reasonably in most cases. The Moving Average method produced the lowest 100th percentile, or in other terms, achieved the lowest maximum error. The model-based Cook's Distance method improves on the simple Moving Average approach in most cases. All methods, however, suffer from a number of very unrealistic estimates in some cases. Although the Forward Search method is interesting for its innovative approach, in practice it underperforms and is more costly in terms of compute time.
Now, let's take a look at the results of our classification problem:
Interestingly, the GLM models did not perform as well when the problem is framed as correctly classifying increasing or decreasing trends across all counties. There are two metrics in the table above. The "ROC AUC (>5)" metric is calculated over counties whose previous week's median case count is above 5, whereas "ROC AUC (>25)" refers to counts above 25. (ROC AUC, which you can read more about here, is a metric for measuring the success of a binary classification model; values closer to 1 indicate better performance.) What you can infer from this is that the simpler Moving Average and Projection methods can do better than the GLMs as a blanket approach. However, when looking at counties with more cases, and likely more significant trends, the GLMs prove better. This supports the finding that GLMs can produce erroneous results on insufficient data, but good results on datasets with enough quality data. Additionally, this is a good example of how one size does not fit all when it comes to modeling. Each method has its benefits, and it is important to explore those pros and cons when deciding which model to use and when to use it.
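For reference, ROC AUC can be computed without any libraries using a rank-based formula (equivalent to the Mann-Whitney U statistic). In this sketch the labels mark whether a county's cases actually rose, and the scores are a model's predicted change; all values are made up:

```python
def roc_auc(labels, scores):
    """ROC AUC via average ranks; tied scores share the mean rank of the block."""
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                            # block of tied scores
        avg_rank = (i + 1 + j) / 2.0          # 1-based average rank of the block
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# counties whose cases rose (1) or fell (0), with one model's predicted changes
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]
print(roc_auc(labels, scores))   # 8/9, about 0.89
```

A value of 0.5 corresponds to random guessing, which is why the table's values between 0.5 and 1 are the interesting range.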
For a more visual look at the results, we can examine some specific cases. Here, we plot results of the methods on three different scenarios: where the number of cases is less than 50, between 50 and 150, and greater than 150.
In general, it can be seen that the more cases there are in the training set, the more accurate and reasonable the GLM methods are. They perform particularly well when there is a clear increasing or decreasing trend. However, the GLM does a poor job when cases are reported on an inconsistent basis (data on some days, but 0s on others). In such cases, the fitted curve is "dragged" by the few days of reported data. An example of this is illustrated by the Texas Pecos data in the second figure above.
The Projection method seems to be too sensitive to the case counts of the last few days. When there is a sharp decrease on those days, the supersmoother may make negative predictions.
The Moving Average method can be interpreted as a simplified version of the supersmoother. The main difference is that it weights the data of the first and second week equally when making predictions. Therefore, it actually does a slightly better job than the supersmoother.
To further evaluate these approaches, we can extend the length of the training period to see how that affects the performance of each model. The metric used here is the same as in the table from the "Results" section: the median error of the model prediction from the observed data. The results across different training lengths are below:
It is interesting to see that the performance of the GLM-CD model first increases as the length of training data increases (deviances decrease), but later the performance deteriorates once the length of training data is too large.
The following examples illustrate why the performance may deteriorate when the length of training data is too long:
We can see that the GLM model assumes that the trend must be monotone. Once it assumes that the number of cases is increasing, it fails to detect the decreasing number of cases after the outbreak. Therefore, the GLM model is particularly useful when making predictions based solely on the most recent trend. In contrast, the Projection method is much better at automatically emphasizing the most recent trend, without having to worry about whether the data is monotonic, and increasing the length of its training data generally improves its performance.
The GLM approach could also be improved by taking into account the presence of a maximum and only using the monotonic portion of the data. For example, the gamlss package and function have a feature that can detect a changepoint and fit a piecewise linear function appropriately. (See Flexible Regression and Smoothing using GAMLSS in R pp 250-253). This would enable us to use a longer time frame when possible in an automated way.
Overall, if we want to use the most recent data for nearcasting based on a GLM model, a 6 week training set seems to be the optimal length. If we were to use a longer period of training data, we might prefer using the Projection method.
While each model has its advantages and disadvantages, using these approaches can help establish reasonable predictions about future trends in COVID data. Not only can these methods be applied in this specific case, but they can also be used for a number of different use cases involving time series data.
The methodologies used in this analysis were created in R and Spotfire. To run these yourself, simply utilize Spotfire's data function, which allows you to run R (or python) scripts within the application. For more information on data functions, check out our community, and if you are interested in learning more about our COVID work and what happens under the hood in Spotfire, read here.
A special thanks to Zongyuan Chen, David Katz, and the rest of the team for their contributions.
Aswani, A. IEOR 265, Lecture 6: Local Linear Regression. 2015.
Brown, C. Generalized Linear Models: understanding the link function. 2018.
Carey, G. The General Linear Model (GLM): A gentle introduction. 2013.
Glen, S. Cook's Distance / Cook's D: Definition, Interpretation. 2016.
Stasinopoulos et al. Flexible Regression and Smoothing using GAMLSS in R. 2015.
Wikipedia. Least trimmed squares
Wikipedia. Moving average
Wikipedia. Receiver operating characteristic