The ANOVA/MANOVA module includes a subset of the functionality of the General Linear Models module. It can perform univariate and multivariate analyses of variance for factorial designs, with or without one repeated measures variable. For more complicated linear models with categorical and continuous predictor variables, random effects, and multiple repeated measures factors, you need the General Linear Models module.
In the ANOVA/MANOVA module, you can specify all designs in straightforward, functional terms of actual variables and levels (not in technical terms, e.g., by specifying matrices of dummy codes), so even less-experienced ANOVA users can analyze very complex designs. Like the General Linear Models module, ANOVA/MANOVA provides three alternative user interfaces for specifying designs: (1) a Design Wizard that takes you step by step through the process of specifying a design, (2) a simple dialog-based user interface that allows you to specify designs by selecting variables, codes, levels, and design options from well-organized dialogs, and (3) a Syntax Editor for specifying designs and design options using keywords and a common design syntax.
Computational methods: By default, the program uses the sigma-restricted parameterization for factorial designs and applies the effective hypothesis approach (see Hocking, 1980) when the design is unbalanced or incomplete. Type I, II, III, and IV hypotheses can also be computed, as can Type V and Type VI hypotheses, which produce tests consistent with the typical analyses of fractional factorial designs in industrial and quality-improvement applications.
Historical note:
The ANOVA/MANOVA module is not limited in any of its computational routines for reporting results, so the full suite of detailed analytic tools available in the General Linear Models module is also available here. Results include summary ANOVA tables; univariate and multivariate results for repeated measures factors with more than two levels; the Greenhouse-Geisser and Huynh-Feldt adjustments; plots of interactions; detailed descriptive statistics; detailed residual statistics; planned and post-hoc comparisons; tests of custom hypotheses with custom error terms; and detailed diagnostic statistics and plots (e.g., histograms of within-cell residuals, homogeneity of variance tests, plots of means versus standard deviations, etc.).
In one line of the literature, the analysis of multi-factor ANOVA designs is generally discussed in terms of the sigma-restricted model, in which the ANOVA parameters are constrained to sum to zero. In this manner, given k levels of a factor, the k-1 parameters (corresponding to the k-1 degrees of freedom) can readily be estimated (e.g., Lindeman, 1974; Snedecor and Cochran, 1989, p. 322). Another tradition discusses ANOVA in the context of the unconstrained, and thus over-parameterized, general linear model (e.g., Kirk, 1968). For mixed random and fixed effect models, the two approaches can produce different results.
This module uses, by default, the means model approach and constructs F-tests for mixed models that are consistent with the sigma-restricted model. This is the ANOVA "tradition" most commonly discussed in statistics textbooks in the biological and social sciences.
Note: The Variance Components & Mixed Model ANOVA/ANCOVA module uses the over-parameterized model.
When the "factorial degree" of important association rules is not known ahead of time, pivot tables and cross-tabulations are too cumbersome to use, or may not be applicable at all. For example, a three-way association would not be visible in a cross-tabulation.
The a-priori (Apriori) algorithm implemented in Spotfire Statistica® Association Rules automatically detects the relationships ("crosstabulation tables") that are important (i.e., crosstabulation tables that are not sparse and do not consist mostly of zeros), and also determines the factorial degree of the tables that contain the important association rules.
The Association Rules module can find rules of the kind "If X then (likely) Y", where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm determines association rules without requiring the user to specify the number of distinct categories present in the data or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm constructs crosstabulation tables without the need to specify the number of dimensions for the tables or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
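The "If X then (likely) Y" rules described above are conventionally ranked by their support and confidence. The following is a minimal sketch of those two metrics over a list of transactions; the transactions and item codes are hypothetical illustrations, not Statistica output.

```python
# Evaluate an "If X then Y" association rule by its support and confidence.

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

transactions = [
    {"Car=Porsche", "Gender=Male", "Risk=High"},
    {"Car=Porsche", "Gender=Male", "Risk=High"},
    {"Car=Porsche", "Gender=Female", "Risk=Low"},
    {"Car=Volvo", "Gender=Male", "Risk=Low"},
]
support, confidence = rule_metrics(
    transactions, {"Car=Porsche", "Gender=Male"}, {"Risk=High"})
print(support, confidence)  # 0.5 1.0
```

The Apriori family of algorithms prunes the search over conjunctions by keeping only itemsets whose support exceeds a threshold, which is what makes searching high "factorial degrees" tractable.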
For additional information see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; Witten and Frank, 2000.
Detailed discussions of the methods can be found in Anderson (1976), Box and Jenkins (1976), Kendall (1984), Kendall and Ord (1990), Montgomery, Johnson, and Gardiner (1990), Pankratz (1983), Shumway (1988), Vandaele (1983), Walker (1991), and Wei (1989).
Common smoothing options are available to "bring out" the major patterns: weighted and n-point moving averages, simple exponential smoothing, 4253 filters, etc.
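As a minimal sketch of one of these options, simple exponential smoothing replaces each observation with a weighted blend of the current value and the previous smoothed value. The series and smoothing constant alpha below are hypothetical; real analyses would tune alpha or use the module's defaults.

```python
# Simple exponential smoothing: S_t = alpha*x_t + (1 - alpha)*S_{t-1}.

def exponential_smooth(series, alpha=0.3):
    """Return the exponentially smoothed version of `series`."""
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 11.0, 15.0, 14.0]
smoothed = exponential_smooth(series)
print([round(s, 3) for s in smoothed])
```

Smaller alpha values produce heavier smoothing; alpha near 1 tracks the raw series closely.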
Types of analyses include single series ARIMA, exponential smoothing, interrupted ARIMA, Fourier (spectral) analysis, Census I (seasonal decomposition), Census II (X-11 seasonal adjustment), and distributed lag analysis.
Time series problems can also be analyzed with neural networks. Neural networks are not included in Spotfire Statistica® Desktop or Spotfire Statistica® Analyst; to use them, you must purchase Spotfire Statistica® Modeler, Spotfire Statistica® Data Scientist, or Spotfire Statistica® Comprehensive.
A wide variety of options are offered to control the layout and format of the tables. For example, for tables involving multiple response variables or multiple dichotomies, marginal counts and percentages can be based on the total number of respondents or responses, multiple response variables can be processed in pairs, and various options are available for counting (or ignoring) missing data. Frequency tables can also be computed based on user-defined logical selection conditions (of any complexity, referencing any relationships between variables in the dataset) that assign cases to categories in the table.
The program can display cumulative and relative frequencies, logit- and probit-transformed frequencies, normal expected frequencies (and the Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilk tests), expected and residual frequencies in crosstabulations, etc. Available statistical tests for crosstabulation tables include the Pearson, maximum-likelihood, and Yates-corrected Chi-squares; McNemar's Chi-square; the Fisher exact test (one- and two-tailed); Phi; and the tetrachoric r. Additional statistics include Kendall's tau (a, b), Gamma, Spearman's r, Somers' D, uncertainty coefficients, etc.
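To make the workhorse of these tests concrete, here is a minimal sketch of the Pearson Chi-square statistic for a crosstabulation table. The observed counts are hypothetical; the module also offers the Yates-corrected and maximum-likelihood variants mentioned above.

```python
# Pearson Chi-square for a 2D contingency table (list of rows).

def pearson_chi_square(table):
    """Return the Chi-square statistic: sum of (O - E)^2 / E over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

observed = [[30, 10],
            [20, 40]]
chi2 = pearson_chi_square(observed)
print(round(chi2, 3))  # 16.667
```

The statistic is compared against a Chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom, here 1.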
Reporting tables can calculate valid N, the sum of weights, sum, mean, median, mode, standard deviation, min, max, coefficient of variation, distinct count, geometric mean, Grubbs' test, harmonic mean, percentiles, skewness, trimmed means, etc.
Graphical options include simple, categorized (multiple), and 3D histograms, cross-section histograms (for any "slices" of the one-, two-, or multi-way tables), and many other graphs including a unique "interaction plot of frequencies" that summarizes the frequencies for complex crosstabulation tables (similar to plots of means in ANOVA). Cascades of even complex (e.g., multiple categorized, or interaction) graphs can be interactively reviewed.
An additional method to analyze crosstabulation tables is provided in the Spotfire Statistica® Generalized Linear Nonlinear Models module.
Fraud detection is a topic applicable to many industries, including the banking and financial sectors, insurance, government agencies and law enforcement, and more. Fraud attempts have increased drastically in recent years, making fraud detection more important than ever. Despite efforts on the part of the affected institutions, hundreds of millions of dollars are lost to fraud every year. Because fraudulent cases are relatively rare within a large population, finding them can be difficult.
In banking, fraud can involve using stolen credit cards, forging checks, misleading accounting practices, etc. In insurance, 25% of claims contain some form of fraud, accounting for approximately 10% of insurance payout dollars. Fraud can range from exaggerated losses to deliberately causing an accident for the payout. With all the different methods of fraud, finding it becomes harder still.
Data mining and statistics help to anticipate and quickly detect fraud and take immediate action to minimize costs. Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions.
An important early step in fraud detection is to identify factors that can lead to fraud. What specific phenomena typically occur before, during, or after a fraudulent incident? What other characteristics are generally seen with fraud? When these phenomena and characteristics are pinpointed, predicting and detecting fraud becomes a much more manageable task.
Using sophisticated data mining tools such as decision trees (Boosting trees, Classification trees, CHAID and Random Forests), machine learning, association rules, cluster analysis and neural networks, predictive models can be generated to estimate things such as the probability of fraudulent behavior or the dollar amount of fraud. These predictive models help to focus resources in the most efficient manner to prevent or recuperate fraud losses.
The notion of "fraud" implies intent on the part of some party or individual. From the perspective of the target of that attempt, however, it usually matters less whether intentional fraud occurred or whether erroneous information was simply introduced into the credit system, the process evaluating insurance claims, etc. From the perspective of the credit, retail, insurance, or similar business, the issue is rather whether a transaction associated with loss has occurred or is about to occur, and whether a claim can be subrogated or rejected, funds recovered somehow, and so on.
While the techniques briefly outlined here are often discussed under the topic of "fraud detection," other terms are also frequently used to describe this class of data mining (or predictive modeling; see below) applications, such as "opportunities for recovery," "anomaly detection," or similar terminology.
From the (predictive) modeling or data mining perspective, the distinction between "intentional fraud" and "opportunities for recovery" or "reducing loss" is also mostly irrelevant, other than that the specific perspective on how losses occur may guide the search for relevant predictors (and for the databases in which to find relevant information). For example, intentional fraud may be associated with unusually "normal" data patterns, since intentional fraud usually aims to stay undetected and thus to look like an average, common transaction; other opportunities for recovery of loss (other than intentional fraud) may simply involve the detection of duplicate claims or transactions, the identification of typical opportunities for subrogation of insurance claims, correctly predicting when consumers are accumulating too much debt, and so on.
In the following paragraphs, the term "fraud" is used as shorthand for the types of issues briefly outlined above.
One way to approach fraud detection is to treat it as a predictive modeling problem: correctly anticipating a (hopefully) rare event. If historical data are available in which fraud or opportunities for preventing loss have been identified and verified, then the typical predictive modeling workflow can be directed at increasing the chances of capturing those opportunities.
In practice, for example, many insurance companies maintain investigative units to evaluate opportunities for saving money on submitted claims. The goal is to identify a screening mechanism so that the expensive detailed investigation of claims (requiring highly experienced personnel) is selectively applied to claims where the overall probability of recovery (detecting fraud, opportunities to save money, etc.; see the introductory paragraphs) is generally high. Thus, with an accurate predictive model for detecting likely fraud, the subsequent "manual" resources required to investigate a claim in detail are more likely to reduce loss.
The approach to predicting the likelihood of fraud as described above essentially comes down to a standard predictive modeling problem. The goal is to identify the best predictors and a validated model providing the greatest Lift to maximize the likelihood that the observations predicted to be fraudulent will indeed be associated with fraud (loss). That knowledge can then be used to reject applications for credit or to initiate a more detailed investigation into an insurance claim, credit application, purchase via credit card, etc.
Because most types of fraud are sporadic events (less than 30% of cases are fraud), the stratified sampling technique can be used to oversample from the fraudulent group. This technique aids in model building: with more cases from the group of interest, data mining models are better able to find the patterns and relationships that detect fraud.
Depending on the base rate of fraudulent events in the training data, it may be necessary to apply appropriate stratified sampling strategies to create a good data set for model building, i.e., a data file in which fraudulent vs. non-fraudulent observations are represented with approximately equal probability. As described under stratified random sampling, model building is usually easiest and most successful when the data presented to the learning algorithms include exemplars of all relevant classes in about equal proportions.
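The oversampling idea above can be sketched in a few lines. The records, labels, target ratio, and random seed below are hypothetical; real workflows would oversample within a training partition only, to avoid leaking duplicated cases into a validation sample.

```python
# Stratified oversampling of a rare class, with replacement, to reach an
# approximately balanced training set.
import random

def oversample_minority(records, labels, target_ratio=1.0, seed=42):
    """Resample the minority class (label 1) with replacement until
    minority/majority is approximately target_ratio."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(records, labels) if y == 1]
    majority = [r for r, y in zip(records, labels) if y == 0]
    n_needed = int(target_ratio * len(majority))
    resampled = [rng.choice(minority) for _ in range(n_needed)]
    data = [(r, 0) for r in majority] + [(r, 1) for r in resampled]
    rng.shuffle(data)
    return data

records = list(range(100))
labels = [1 if i < 5 else 0 for i in records]  # 5% "fraud" base rate
balanced = oversample_minority(records, labels)
frauds = sum(y for _, y in balanced)
print(frauds, len(balanced))  # 95 190
```

An alternative with the same effect is to keep all cases and weight the minority class more heavily, which many learning algorithms support directly.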
Another use case and problem definition of "fraud detection" presents itself rather as an "intrusion" or anomaly detection problem. Such cases arise when there is no good training (historical) data set that can be unambiguously assembled where known fraudulent and non-fraudulent observations are clearly identified.
For example, consider again the simple insurance use case. A claim is filed against a policy, which, given existing procedures (and rules engines; see below), triggered a further investigation that resulted in some recovery for the insurance company in a small proportion of cases. If one were to assemble a training dataset of all claims, some of which were further investigated and resulted in some recovery or perhaps uncovered fraud, then any modeling of that dataset would likely capture, to a large extent, the rules and procedures that led to the investigation in the first place. (A more useful training dataset could perhaps be constructed only from those claims referred to the investigative unit for further evaluation.) In other common cases, there is no "investigative unit" in the first place, and the data available for analysis do not contain a useful indicator of fraud, potentially recoverable loss, or potential savings.
In such cases, the available information simply consists of a large and often complex data set of claims, applications, purchases, etc. with no clear outcome "indicator variable" that would be useful for predictive modeling (and supervised learning). In those cases, another approach is to perform unsupervised learning to identify in the data set (or data stream) "unusual observations" that are likely associated with fraud, unusual conditions, etc.
For example, consider the typical health insurance case. A large number of very (in fact extremely) diverse claims are filed, usually encoded via a complex and rich coding scheme to capture various health issues and common and "approved" or "accepted" therapies. Also, with each claim there can be the expectation of obvious subsequent claims (e.g., a hip replacement requires subsequent rehabilitation), and so on.
The field of anomaly detection has many applications in industrial process monitoring, to identify "outliers" in multivariate space that may indicate a process problem. A good example of such an application for monitoring multivariate batch processes is discussed in the chapter on Multivariate Process Monitoring for batch processes, using Partial Least Squares methods. The same logic and approach can fundamentally be applied for fraud detection in other (non-industrial-process) data streams.
To return to the health care example, assume that a large number of claims are filed and entered into a database every day. The goal is to identify all claims where reduced payments (less than the claimed amount) are due, including outright fraudulent claims. How can that be achieved?
First, there is obviously a set of complex rules that should be applied to identify inappropriately filed claims, duplicate claims, and so on. Typically, complex rules engines are in place that filter all claims to verify that they are formally correct, i.e., consistent with the applicable policies and contracts. Duplicate claims also have to be checked.
What remains are formally legitimate claims which nonetheless could (and probably do) contain fraudulent claims. To find those it is necessary to identify any configurations of data fields associated with the claims that would allow us to separate the legitimate claims from those that are not. Of course, if no such patterns exist in the data, then nothing can be done; however, if such patterns do exist then the task becomes to find those "unusual" claims.
There are many ways to define what might constitute an "unusual" claim. But basically, there are two ways to look at this problem: Either by identifying outliers in the multivariate space, i.e., unusual combinations of data fields that are unlike typical claims, or by identifying "in-liers", that is, claims that are "too typical", and hence suspect of having been "made up".
This task is one of unsupervised learning. The basic data analysis (data mining) approach is to use some form of clustering (e.g., k-means clustering), and then use those clusters to score (assign) new claims. If a new claim cannot be assigned with high confidence to a particular cluster of points in the multivariate space made up of the numerous parameters (the information available with each claim), then the new claim is "unusual", an outlier of sorts, and should be considered for further evaluation. If a new claim can be assigned to a particular cluster with very high confidence, and perhaps a large number of claims from a particular source all share that characteristic (i.e., are "in-liers"), then again these claims might warrant further evaluation, since they are uncharacteristically "normal".
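The cluster-then-score idea can be sketched as follows: fit k-means on historical claims (reduced here to two hypothetical numeric features), then score a new claim by its distance to the nearest cluster center. All data, the number of clusters, the initial centers, and any flagging threshold are illustrative assumptions.

```python
# Minimal k-means (Lloyd's algorithm) plus a distance-based anomaly score.
import math

def kmeans(points, centers, n_iter=20):
    """Plain Lloyd's algorithm; `centers` provides the initial guesses."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

# Two tight groups of "typical" claims (e.g., rescaled amount and days-to-file).
claims = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
          (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
centers = kmeans(claims, centers=[(0.0, 0.0), (6.0, 6.0)])

def anomaly_score(claim):
    """Distance from the claim to the nearest cluster center."""
    return min(math.dist(claim, c) for c in centers)

print(anomaly_score((1.0, 1.0)))   # near a cluster: small score
print(anomaly_score((3.0, 3.0)))   # between clusters: large score
```

A symmetric "in-lier" check could flag claims whose scores are suspiciously close to zero far more often than chance would predict for a given source.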
It should be noted that similar techniques are useful in all applications where the task is to identify atypical patterns in data or patterns that are suspiciously too typical. Such use cases exist in the area of intrusion (to networks) detection, as well as many industrial multivariate process monitoring applications where complex manufacturing processes involving a large number of critical parameters must be monitored continuously to ensure overall quality and system health.
The previous paragraphs briefly mentioned rule engines as one component in fraud detection systems. In fact, they typically are the first and most critical component: Usually, the expertise and experience of domain experts can be translated into formal rules (that can be implemented in an automated scoring system) for pre-screening data for fraud or the possibility of reduced loss. Thus, in practice, the fraud detection analyses and systems based on data mining and predictive modeling techniques serve as the method for further improving the fraud detection system in place, and their effectiveness will be judged against the default rules created by experts. This also means that the final deployment method of the fraud detection system, e.g., in an automated scoring solution, needs to accommodate both sophisticated rules and possibly complex data mining models.
Text mining methods are used in conjunction with all available numeric data to improve fraud detection systems (e.g., predictive models). The motivation is simply to align all information that can be associated with a record of interest (insurance claim, purchase, credit application) and to use that information to improve the predictive accuracy of the fraud detection system. The approaches described here apply in the same way when used in conjunction with text mining methods, except that the respective unstructured text sources must first be pre-processed and "numericized" so that they can be included in the data analysis (predictive modeling) activities.
The default missing data code is assigned to the new (transformed) variables for a case if:
Fitting higher-order polynomials of an independent variable with a nonzero mean can create difficult numerical problems: the polynomial terms will be highly correlated due to the mean of the primary independent variable. With large numbers (e.g., Julian dates), this problem is very serious, and if proper protections are not put in place, it can produce incorrect results. The solution is to "center" the independent variable (the procedure is sometimes referred to as using "centered polynomials"), i.e., to subtract the mean and then compute the polynomials.
See the classic text by Neter, Wasserman, & Kutner (1985, Chapter 9), for a detailed discussion of this issue and analyses of polynomial models in general. Note that Statistica automatically checks for very large numbers created in the process of computing the polynomials and issues a warning message to alert you of potential multicollinearity problems.
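A quick illustration of why centering helps, using hypothetical Julian-date-like values: for x values far from zero, x and x**2 are almost perfectly correlated, while (x - mean) and (x - mean)**2 are essentially uncorrelated.

```python
# Compare the correlation between a predictor and its square, before and
# after centering the predictor at its mean.
import statistics

def pearson_r(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

x = [2450000 + i for i in range(20)]        # large, Julian-date-like values
raw_r = pearson_r(x, [v ** 2 for v in x])
mean_x = statistics.fmean(x)
centered = [v - mean_x for v in x]
centered_r = pearson_r(centered, [v ** 2 for v in centered])
print(raw_r, centered_r)  # ~1.0 vs. ~0.0
```

With the raw values, the two regressors are numerically indistinguishable, which is exactly the multicollinearity problem the warning message described above is designed to catch.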
The Quality Control module is used to monitor on-going production processes for quality characteristics. Flexible implementations of Pareto charts, X-bar charts, R charts, S charts, S-squared (variance) charts, C charts, Np charts (binomial counts), P charts (binomial proportions), U charts, CuSum (cumulative sum) charts, moving range charts, runs charts (for individual observations), regression control charts, MA charts (moving average), and EWMA charts (exponentially-weighted moving average) are provided. These charts may be based on user-specified values or on parameters (e.g., means, ranges, proportions, etc.) computed from the data.
Most of the variable control charts can be constructed from single observations (e.g., moving range chart) as well as from samples of multiple observations. Control limits can be specified in terms of multiples of sigma (e.g., 3 * sigma), in terms of normal or non-normal (Johnson-curves) probabilities (e.g., p=.01, .99), or as constant values. For unequal sample sizes, control charts can be computed with variable control limits or based on standardized values. For most charts, multiple sets of specifications can be used in the same chart (e.g., control limits for all new samples can be computed based on a subset of previous samples, etc.). Runs tests, such as the Western Electric Run Rules, are easily integrated into the QC chart.
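As a minimal sketch of the "multiples of sigma" limits described above, the following computes 3-sigma X-bar chart control limits from subgroup means and ranges, using the standard A2 factor for subgroups of size 5 (A2 = 0.577). The sample data are hypothetical.

```python
# X-bar chart center line and 3-sigma control limits via the R-bar method.

A2 = 0.577  # control chart constant for subgroup size n = 5

subgroups = [
    [10.1, 9.8, 10.0, 10.2, 9.9],
    [10.0, 10.1, 9.7, 10.3, 10.0],
    [9.9, 10.2, 10.1, 9.8, 10.0],
]
means = [sum(s) / len(s) for s in subgroups]
ranges = [max(s) - min(s) for s in subgroups]
x_bar_bar = sum(means) / len(means)   # grand mean (center line)
r_bar = sum(ranges) / len(ranges)     # average range
ucl = x_bar_bar + A2 * r_bar          # upper control limit
lcl = x_bar_bar - A2 * r_bar          # lower control limit
print(round(x_bar_bar, 3), round(lcl, 3), round(ucl, 3))
```

In practice, many more than three subgroups would be used to establish limits, and the module can instead derive limits from specified probabilities or constant values as noted above.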
Analytics and visualizations are available for use with the "Define, Measure, Analyze, Improve, Control" (DMAIC) methodology.
Note: For detailed descriptions of quality control charts and extensive annotated examples, see Buffa (1972), Duncan (1974), Grant and Leavenworth (1980), Juran (1962), Juran and Gryna (1970), Montgomery (1996), Shirland (1993), or Vaughn (1974). Two excellent introductory texts with a "how-to" approach are Hart & Hart (1989) and Pyzdek (1989). There are also two German-language texts on this subject: Rinne and Mittag (1995) and Mittag (1993).
This module provides analytics for:
Spotfire Statistica® Server needs to be added to this product to provide the following benefits:
Statistica Extract, Transform, & Load (ETL) combines the Statistica system's efficient processing of data from standard databases (Microsoft SQL Server, Oracle) and from specialized process databases via the PI Connector (e.g., OSIsoft PI) with Statistica's capabilities for data filtering, aggregation, and analysis. As mentioned above, Statistica ETL can be combined with the capabilities of Statistica Server for an advanced statistical process monitoring solution. This solution can support highly specialized data warehouses that integrate time-stamped parameter data for multiple process steps with quality, rework, and outcome data.
The Statistica ETL module provides unique capabilities for processing and merging data, in particular process data that are difficult to manage using standard database tools.
In order to monitor ongoing continuous processes, such as chemical or pharmaceutical manufacturing, power generation, refining, and so on, critical process parameters must be recorded into a process "historian" at regular time intervals. Dedicated high-performance databases, such as OSIsoft's PI database, are typically deployed to provide efficient high-frequency data recording capabilities. However, to make such data available for useful data analyses, e.g., for root-cause analyses or process monitoring, the data must be aggregated and aligned, for example, with outcome data.
Statistica ETL provides simple tools to automate the process of aligning time-stamped process data with other data sources, such as process data collected at different time intervals, or only collected once per part, ID, batch, etc.
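The core of such alignment is an "as-of" join: match each sparse outcome record to the most recent process reading at or before its timestamp. The sketch below uses plain integer timestamps (e.g., seconds) and hypothetical readings; it is an illustration of the idea, not of Statistica ETL's internal implementation.

```python
# Align sparse outcome data with high-frequency process data via an
# as-of (most-recent-prior) join using binary search.
import bisect

process = [(0, 71.2), (60, 71.9), (120, 72.4), (180, 71.8)]  # (time, temp)
outcomes = [(65, "pass"), (185, "fail")]                      # (time, result)

proc_times = [t for t, _ in process]  # must be sorted ascending

def align(outcome_time):
    """Return the process reading taken at or just before outcome_time."""
    i = bisect.bisect_right(proc_times, outcome_time) - 1
    return process[i] if i >= 0 else None

aligned = [(t, result, align(t)) for t, result in outcomes]
print(aligned)
```

Real alignments would typically also enforce a maximum allowable gap between the outcome and the matched reading, and join on batch or part IDs as well as time.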
The manufacture of pharmaceuticals and chemicals often involves processing batches of materials through multiple steps, where in each step some maturation of the batch is recorded. The resulting data, recorded into some laboratory information management system (LIMS), consist of time-stamped process data organized by batch ID. To make such data available for useful data analyses, the time-stamps must be transformed into elapsed-within-process-step times, and the data must be normalized so that a comparable number of elapsed-time recordings are available for each batch.
Statistica ETL provides efficient tools for processing batch-time data, achieving equal batch "lengths," and unstacking such data to make them available for subsequent analyses and process monitoring of the maturation process (see also Statistica MSPC for details).
The aggregation of real process data (e.g., time-stamped one-minute-interval data to align with hourly data) usually requires the application of aggregation methods that go far beyond the capabilities of standard database tools. For example, time-stamped data may include outliers, or may be very "noisy," thus hiding important trends or changes in trends.
Statistica ETL provides numerous tools and methods for aggregating and/or smoothing data so that meaningful subsequent process monitoring methods (e.g., for change-point or trend detection) can be applied to robust or smoothed estimates of process averages within aggregated time intervals.
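The value of robust aggregation is easy to demonstrate: collapsing noisy one-minute readings into hourly medians leaves the summary almost untouched by a gross outlier that would badly distort an hourly mean. The readings below are hypothetical.

```python
# Robust aggregation of minute-level readings into hourly summaries.
import statistics

# (minute-of-day, reading); minute 50 holds a gross outlier
readings = [(m, 100.0 + (m % 7) * 0.1) for m in range(120)]
readings[50] = (50, 900.0)

hourly = {}
for minute, value in readings:
    hourly.setdefault(minute // 60, []).append(value)

medians = {h: statistics.median(vs) for h, vs in hourly.items()}
means = {h: statistics.fmean(vs) for h, vs in hourly.items()}
print(medians[0], round(means[0], 2))  # median barely moves; mean is dragged up
```

Trimmed means or outlier-screened means are common middle grounds between the two summaries shown here.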
Complex processes, such as the manufacture of semiconductors, pharmaceutical manufacturing, etc. require complex data storage, suited to the specific nature of the process that is to be recorded and monitored. Therefore, it is common that multiple separate databases or data sources, such as automatically created (from gages) CSV files, data from OSI PI, assay data from a LIMS system, etc., must be aggregated and aligned, to enable meaningful root cause analyses of problems, or comprehensive process monitoring.
Statistica ETL provides tools for configuring complex data alignment tasks of multiple diverse data sources into a single ETL object, which can be deployed into Statistica Enterprise, to be applied ad-hoc or as scheduled ETL tasks, to support a dedicated data warehouse that maintains validated and aligned data for comprehensive process monitoring and optimization.
The transformation capabilities of Statistica ETL go far beyond those available in standard database or querying tools, and allow you to build dedicated, specialized data warehouses to optimize your processes without the need to program custom applications in-house. Statistica ETL is the one-stop solution for creating data warehouses with automated simple and sophisticated analytic capabilities that allow you to derive the full value from the data you are collecting.
The Statistica ETL solution will automate the process of validating and aligning multiple diverse data sources into data tables suitable for ad-hoc or automated analyses. When deployed inside the Statistica Enterprise framework, data can be written back to dedicated database tables, or to Statistica data tables, to provide analysts or process engineers convenient access to real-time performance data, without the need to perform tedious data preprocessing or cleaning before any actionable information can be extracted.
PCA is a dimensionality reduction and data diagnostics tool. It helps with outlier detection and provides insight into how the variables contribute to the observations and correlate with one another. PCA is particularly useful for process monitoring and quality control, since it provides effective and convenient analytic and graphic tools for detecting abnormalities that may arise during the development phase of a product. PCA data diagnostics also play an important role in batch processing, where the quality of the end product can only be ensured through constant monitoring during its production phase.
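The outlier-detection use of PCA can be sketched on two correlated variables, where the 2x2 covariance matrix has a closed-form eigendecomposition. A Hotelling-T2-style score (squared component scores scaled by their eigenvalues) flags points that break the dominant correlation pattern. All data are hypothetical and the two-variable closed form stands in for the general multivariate case.

```python
# PCA-based outlier scoring via the closed-form eigendecomposition of a
# 2x2 covariance matrix.
import math

data = [(1.0, 1.1), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2),
        (5.0, 4.8), (6.0, 6.1), (3.5, -2.0)]  # last point breaks the trend

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Covariance matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# Eigenvalues and eigenvectors of a symmetric 2x2 matrix, in closed form
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam1, lam2 = (a + c + disc) / 2, (a + c - disc) / 2
v1 = (b, lam1 - a)
norm = math.hypot(*v1)
v1 = (v1[0] / norm, v1[1] / norm)   # first principal direction
v2 = (-v1[1], v1[0])                # second (orthogonal) direction

def t2_score(point):
    """Sum of squared component scores, each scaled by its eigenvalue."""
    px, py = point[0] - mx, point[1] - my
    s1 = px * v1[0] + py * v1[1]
    s2 = px * v2[0] + py * v2[1]
    return s1 * s1 / lam1 + s2 * s2 / lam2

scores = [t2_score(p) for p in data]
worst = max(range(n), key=lambda i: scores[i])
print(worst)  # 6: the point that breaks the correlation pattern
```

In real process-monitoring applications the same score is tracked over time against a control limit, alongside a residual (squared prediction error) statistic for variation outside the retained components.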
PLS is a popular method for modeling industrial applications. It was developed by Wold in the 1960s as an econometric technique and quickly expanded into chemical engineering, scientific research, and manufacturing. Although the PLS technique is primarily designed for handling regression problems, this module also enables you to handle classification tasks. You will find this dual capability useful in many applications of regression or classification, especially when the number of predictor variables is large.
Models can be built with a pre-set number of principal components, or you can use the automated cross-validation technique to determine the complexity of your model, i.e., the optimal number of components. You can also add or remove components from an existing model to compare the performance of models with different degrees of complexity on the same data set, all in one analysis.
Available data preprocessing options include scaling, mean centering, time-wise batch unfolding, and batch-wise unfolding.
MSPC was developed to monitor multiple variables simultaneously for a production process (biochemicals, cement, fertilizers, food, paint, perfume, pharmaceuticals, petroleum products, polymers, pulp, semiconductors, etc.). Common goals for users are:
Use cases for this functionality are:
Use the product to:
DES is designed for regulated industries such as manufacturing, pharmaceutical, insurance, financial services, and healthcare organizations. Build complex rules on forms to verify and validate data. Use role-based security to control access. Configure the workflow to manage data entry, validate data, send email notifications, and handle approvals. Collect electronic signatures that comply with FDA 21 CFR Part 11. Analyze the collected data and take action.
]]>
The R Consortium, of which Spotfire® is a proud member, recently posted a summary of "Best Practices for Using R Securely".
We encourage anyone using open source R (whether with Spotfire® products or not) to review those Best Practices, which essentially recommend that users download R and R packages from a secure server using an encrypted HTTPS connection.
Spotfire® Enterprise Runtime for R is a commercial product, and is downloaded either from our secure Spotfire® Product Download site (for customers who purchase Spotfire® Enterprise Runtime for R) or from the TIBCO Access Point (TAP) site (for members of the Community who are using the free Spotfire® Enterprise Runtime for R Developer's Edition).
Both sites use HTTPS.
Customers downloading Spotfire® Enterprise Runtime for R from the Spotfire® Product Download site should confirm the MD5 checksums following the same process as detailed in the Best Practices.
By default, Spotfire® Enterprise Runtime for R will use HTTPS for secure file download if a secure mirror is specified; no special configuration of Spotfire® Enterprise Runtime for R is needed.
We recommend that Spotfire® Enterprise Runtime for R users always download CRAN packages from a secure mirror. The Best Practices post includes a list of CRAN sites that use HTTPS.
Performing power analysis and sample size estimation is an important aspect of experimental design because, without these calculations, the sample size may be too small or too large. If the sample size is too small, the experiment will lack the precision to provide reliable answers to the questions it investigates. If the sample size is too large, time and resources will be wasted, often for minimal gain.
Suppose you are planning a 1-Way ANOVA to study the effect of a drug. While planning, you find that a similar study has been conducted previously. That study had 4 groups, with N = 50 subjects per group, and obtained an F-statistic of 15.4. From this information, you can (a) gauge the population effect size with an exact confidence interval, and (b) use this information to set a lower bound on the appropriate sample size for your study.
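As a rough illustration, the point estimate of the population effect size implied by such a prior study can be computed by hand (a minimal sketch only; the exact confidence interval requires inverting the noncentral F distribution, which the module performs for you):

```python
import math

def eta_squared_from_f(F, df1, df2):
    """Point estimate of eta-squared (proportion of variance
    explained) implied by a reported F-statistic."""
    return df1 * F / (df1 * F + df2)

def cohens_f(eta_sq):
    """Convert eta-squared to Cohen's f effect size."""
    return math.sqrt(eta_sq / (1.0 - eta_sq))

# Prior study: k = 4 groups, n = 50 subjects per group, F = 15.4
k, n, F = 4, 50, 15.4
df1, df2 = k - 1, k * n - k          # 3 and 196
eta_sq = eta_squared_from_f(F, df1, df2)
f = cohens_f(eta_sq)
print(f"eta^2 = {eta_sq:.3f}, Cohen's f = {f:.3f}")
```

The resulting Cohen's f can then be fed into any standard power routine to bound the sample size needed to detect an effect of at least that magnitude.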
Other features available with this module:
For additional information on noncentrality interval estimation see Steiger and Fouladi (1997).
]]>Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA).
The module performs forward stepwise analysis, in which variables are evaluated at each step and added to the model if they contribute most to the discrimination between groups. Backward stepwise analysis is also available: all variables are first included in the model, and at each step the variable that contributes least to the discrimination between groups is removed. Alternatively, the user can enter user-specified blocks of variables.
In the case of a single variable, the significance test of whether a variable discriminates between groups is the F-test. F is essentially computed as the ratio of the between-groups variance in the data to the pooled average within-group variance. If the between-group variance is significantly larger, then there must be significant differences between group means.
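That ratio can be sketched in a few lines (a toy example with made-up data, not the module's implementation):

```python
import statistics

def one_way_f(groups):
    """F = mean square between groups / pooled mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Three toy groups; the group means clearly differ, so F is large
groups = [[4.1, 3.9, 4.3], [5.0, 5.2, 4.8], [6.1, 5.9, 6.0]]
f_stat = one_way_f(groups)
```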
Usually, you include several variables in a study in order to see which one(s) contribute to the discrimination between groups. In this case, a matrix of total variances/covariances and a matrix of pooled within-group variances/covariances are formed, and the user can compare those two matrices via multivariate F-tests in order to determine whether there are any significant differences (with regard to all variables) between groups.
Output includes the Wilks' lambdas, partial lambdas, F to enter (or remove), the p levels, the tolerance values, and the R-square.
Canonical Analysis can also be performed to report the raw and cumulative eigenvalues for all roots, and their p levels, the raw and standardized discriminant (canonical) function coefficients, the structure coefficient matrix (of factor loadings), the means for the discriminant functions, and the discriminant scores for each case.
In the two-group case, the user fits a linear equation:
Group = a + b_{1}*x_{1} + b_{2}*x_{2} + ... + b_{m}*x_{m}
Here a is a constant and b_{1} through b_{m} are regression coefficients. The interpretation of the results of a two-group problem is straightforward and closely follows the logic of multiple regression: the variables with the largest regression coefficients are the ones that contribute most to the prediction of group membership.
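Once fitted, the equation is used as a simple scoring rule. A minimal sketch with hypothetical coefficient values (the actual coefficients come from the module's estimation):

```python
# Hypothetical fitted values for the constant a and coefficients b1, b2
a, b1, b2 = -1.5, 0.8, 0.3

def discriminant_score(x1, x2):
    """Group = a + b1*x1 + b2*x2 (two-group linear discriminant function)."""
    return a + b1 * x1 + b2 * x2

def classify(x1, x2, cutoff=0.0):
    """Assign a case to group 2 when its score exceeds the cutoff, else group 1."""
    return 2 if discriminant_score(x1, x2) > cutoff else 1
```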
See Jennrich, 1977, for another description of the computations involved.
]]>Available statistics include eigenvalues, canonical correlation coefficients, significance tests of the canonical correlations, canonical weights, canonical scores, factor structure, variance extracted, redundancy, and practical significance. In equation form, redundancy is:
Redundancy_{left} = [s(loadings_{left}^{2})/p]*R_{c}^{2}
Redundancy_{right} = [s(loadings_{right}^{2})/q]*R_{c}^{2}
In these equations, p denotes the number of variables in the first (left) set of variables, and q denotes the number of variables in the second (right) set of variables; R_{c}^{2} is the respective squared canonical correlation.
Note that you can compute the redundancy of the first (left) set of variables given the second (right) set, and the redundancy of the second (right) set of variables, given the first (left) set. Because successively extracted canonical roots are uncorrelated, you could sum up the redundancies across all or only the first significant roots. This provides a single index of redundancy as proposed by Stewart and Love, 1968.
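Given a vector of canonical factor loadings and a squared canonical correlation, the redundancy formulas above reduce to a one-liner (hypothetical numbers, for illustration only):

```python
def redundancy(loadings, r_c_squared):
    """Mean squared canonical factor loading times the squared
    canonical correlation, per the formula above."""
    return sum(l ** 2 for l in loadings) / len(loadings) * r_c_squared

# Hypothetical loadings for the left set and a squared canonical correlation
left_loadings = [0.8, 0.6, 0.7]
r_c_sq = 0.36
red_left = redundancy(left_loadings, r_c_sq)
```

Summing such values across the significant roots yields the single Stewart and Love redundancy index described above.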
]]>Experimentation is sometimes mistakenly thought to involve only the manipulation of levels of the independent variables and the observation of subsequent responses on the dependent variables. Independent variables whose levels are determined or set by the experimenter are said to have fixed effects.
There is a second class of effects, however, which is often of great interest to the researcher. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. Many independent variables of research interest are not fully amenable to experimental manipulation, but nevertheless can be studied by considering them to have random effects.
Here are examples of why these techniques are helpful.
For more information see Milliken and Johnson (1992).
]]>But there are other ways to think about operationalizing an analytic project. For example:
Workspace is a flexible graphical user interface for storing and automating the steps of an entire data analysis process. It takes the form of a blank canvas onto which you can drag and drop nodes and logically interconnect them. Each node represents one piece of the software's functionality, so by inserting nodes and connections you build your data analysis process. In the workspace you can load data, clean it, merge it, prepare it for analysis, apply statistical methods to gain insight from it, build predictive models, and finally save the results.
Statistica® Workbook is the default way of managing output. Each output document is stored as a tab in the workbook. Output can be organized into hierarchies of folders or document nodes using a tree view. Technically speaking, workbooks are ActiveX document containers. This provides compatibility with a variety of file formats (text files, Microsoft Office documents), which can be easily inserted into workbooks and in-place edited.
Workbooks can be saved and shared as HTML or an Excel workbook (.xlsx).
All the graphs within a workbook can be saved into a Word document or PowerPoint presentation.
Statistica® Report offers a more traditional way of handling output where each object (spreadsheet, graph) is displayed sequentially in a word processor-style document. A Statistica® report is saved as a .str file. A report can also be saved as .html, .rtf, .pdf, .txt, .xml.
When sending spreadsheet analytical results to Word, Statistica® will take advantage of Word's table editing facility and convert the spreadsheet to a table. For multi-page spreadsheets, you can control where to break the rows and columns: spreadsheets are broken into sets of columns, each set containing as many columns as will fit without exceeding the page width, and all rows for a given set of columns are rendered before the next set of spreadsheet columns is rendered in the Word document. This solution enables the presentation of spreadsheets in Word that are natively editable in Word, displays the entire contents of the spreadsheet, and prints and paginates correctly.
]]>MDS includes a full implementation of nonmetric multidimensional scaling. Matrices of similarities, dissimilarities, or correlations between variables (i.e., "objects" or cases) can be analyzed. The starting configuration can be computed via principal components analysis or specified by the user. An iterative procedure is used to minimize the stress value and the coefficient of alienation. The output includes the values for the raw stress (raw F), Kruskal stress coefficient S, and the coefficient of alienation. The goodness of fit can be evaluated via Shepard diagrams with d-hats and d-stars.
Comprehensive introductions to the computational approach used in this module can be found in Borg and Lingoes (1987), Borg and Shye (1995), Guttman (1968), Schiffman, Reynolds, and Young (1981), Young and Hamer (1987), and in Kruskal and Wish (1978). D-stars are calculated via a procedure known as the rank-image permutation procedure (see Guttman, 1968; or Schiffman, Reynolds, & Young, 1981, pp. 368-369). D-hats are calculated via a procedure referred to as the monotone regression transformation procedure (see Kruskal, 1964; Schiffman, Reynolds, & Young, 1981, pp. 367-368).
The following simple example may demonstrate the logic of Multidimensional Scaling analysis. Suppose we take a matrix of distances between major US cities from a map. We then analyze this matrix, specifying that we want to reproduce the distances based on two dimensions. As a result of the MDS analysis, we would most likely obtain a two-dimensional representation of the locations of the cities, that is, we would basically obtain a two-dimensional map.
In general, MDS attempts to arrange "objects" (major cities in this example) in a space with a particular number of dimensions (two-dimensional in this example) so as to reproduce the observed distances. As a result, we can "explain" the distances in terms of underlying dimensions. In our example, we could explain the distances in terms of the two geographical dimensions: north/south and east/west.
As in factor analysis, the actual orientation of axes in the final solution is arbitrary. We could rotate the map in any way we want. The distances between cities remain the same. Thus, the final orientation of axes in the plane or space is mostly the result of a subjective decision by the data scientists, who will choose an orientation that can be most easily explained.
Note: Although MDS and factor analysis seem similar, they are fundamentally different. Factor analysis requires that the underlying data be distributed as multivariate normal and that the relationships be linear; MDS imposes no such restrictions.
As long as the rank-ordering of distances (or similarities) in the matrix is meaningful, MDS can be used. In terms of resultant differences, factor analysis tends to extract more factors (dimensions) than MDS, so MDS often yields more readily interpretable solutions. Most importantly, however, MDS can be applied to any kind of distances or similarities, while factor analysis requires us to first compute a correlation matrix. MDS can be based on subjects' direct assessment of similarities between stimuli, while factor analysis requires subjects to rate those stimuli on some list of attributes (on which the factor analysis is then performed).
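To build intuition for the city-map example, here is a sketch of classical (metric) MDS via Torgerson's double-centering. Note this is an assumption-laden simplification: the module itself implements nonmetric MDS, which iteratively minimizes stress rather than using this closed-form approach.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) metric MDS: recover a k-dimensional
    configuration whose pairwise distances reproduce D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]             # keep the k largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Toy "map": pairwise distances between three hypothetical cities
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
X = classical_mds(D)                             # 2-D coordinates (up to rotation)
recon = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

The recovered coordinates reproduce the original distance matrix exactly, illustrating the arbitrariness of orientation: any rotation or reflection of `X` fits equally well.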
]]>In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable Y, and a set of predictor variables, the X's, so that
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + ... + b_{k}X_{k}
In this equation, b_{0} is the regression coefficient for the intercept, and b_{1} through b_{k} are the regression coefficients (for variables 1 through k) computed from the data.
For example, one could estimate (i.e., predict) a person's weight as a function of the person's height and gender. You could use linear regression to estimate the respective regression coefficients from a sample of data, measuring height, weight, and observing the subjects' gender. For many data analysis problems, estimates of the linear relationships between variables are adequate to describe the observed data, and to make reasonable predictions for new observations.
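The height/weight/gender example could be fit by ordinary least squares along these lines (fabricated illustrative data; in practice you would use the module's regression facilities):

```python
import numpy as np

# Fabricated sample: height (cm), gender (0 = female, 1 = male), weight (kg)
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
gender = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
weight = np.array([55.0, 60.0, 72.0, 63.0, 80.0, 85.0])

# Design matrix with an intercept column: weight = b0 + b1*height + b2*gender
X = np.column_stack([np.ones_like(height), height, gender])
(b0, b1, b2), *_ = np.linalg.lstsq(X, weight, rcond=None)

# Predict the weight of a hypothetical 172 cm male
predicted = b0 + b1 * 172.0 + b2 * 1.0
```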
However, there are relationships that cannot adequately be summarized by a simple linear equation for two major reasons:
Different methods for automatic model building are available. Specifically, forward entry, backward removal, forward stepwise, and backward stepwise procedures can be performed, as well as best-subset search procedures. In forward methods of selection of effects to include in the model (i.e., forward entry and forward stepwise methods), score statistics are compared to select new (significant) effects. The Wald statistic can be used for backward removal methods (i.e., backward removal and backward stepwise, when effects are selected for removal from the model).
The best subsets search method can be based on three different test statistics: the score statistic, the model likelihood, and the AIC (Akaike Information Criterion). Note that, since the score statistic does not require iterative computations, best subset selection based on the score statistic is computationally fastest, while selection based on the other two statistics usually provides more accurate results.
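The AIC mentioned above trades off fit against model complexity. A quick illustration with hypothetical log-likelihoods (not values produced by the module):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fitted models: the small gain in likelihood for the larger
# model does not justify its two extra parameters
aic_small = aic(-120.5, 4)
aic_large = aic(-119.8, 6)
```

Here the smaller model wins despite its slightly worse likelihood, which is exactly the kind of trade-off the best subsets search evaluates.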
For additional information about generalized linear models, see Dobson (1990), Green and Silverman (1994), or McCullagh and Nelder (1989).
For additional information about the AIC, see Akaike (1973).
For additional information on the test statistics used by the best subset search method, see McCullagh and Nelder (1989).
]]>To test specific hypotheses about the relationships between sets of items or different tests (e.g., whether two sets of items measure the same construct, or to analyze multitrait-multimethod matrices), use the Spotfire Statistica® Structural Equation Modeling and Path Analysis module.
This module focuses on reliability of measurement as used in the social sciences. Unreliable measurements of people's beliefs, biases, or intentions hamper efforts to predict their behavior. Precision of measurement is also a problem when variables are difficult to observe. For example, reliable measurements of employee performance can be difficult to collect but are required for a performance-based compensation system. As another example, the quality of an educational test could be assessed.
The quality of a process derives from the quality of the items within the process. This module can be used to construct reliable measurement scales, to improve existing scales, and to evaluate the reliability of scales already in use. Specifically, it aids in the design and evaluation of sum scales, that is, scales made up of multiple individual measurements (e.g., different items, repeated measurements, different measurement devices, etc.). The module builds and evaluates scales following the "classical testing theory model". The classical testing theory model of scale construction has a long history, and there are many textbooks available on the subject.
Note: For additional detailed discussions, see Carmines and Zeller (1980), De Gruijter and Van Der Kamp (1976), Kline (1979, 1986), or Thorndyke and Hagen (1977). The standard formulas from classical testing theory are used to compute Cronbach's Alpha, and for the attenuation correction (see Nunnally, 1970).
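Cronbach's alpha, the central reliability statistic here, follows directly from the classical model's variance decomposition. A minimal sketch with fabricated item responses (the module computes this, along with the attenuation correction, for you):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha from the classical formula:
    alpha = k/(k-1) * (1 - sum of item variances / variance of the sum scale)."""
    k = len(items)
    sum_scale = [sum(vals) for vals in zip(*items)]           # sum scale per respondent
    item_var = sum(statistics.pvariance(col) for col in items)
    total_var = statistics.pvariance(sum_scale)
    return k / (k - 1) * (1 - item_var / total_var)

# Fabricated responses: 3 items rated by 5 respondents
items = [[4, 3, 5, 2, 4],
         [4, 2, 5, 3, 4],
         [3, 3, 4, 2, 5]]
alpha = cronbach_alpha(items)
```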
]]>Statistica provides the ability to compute process capability indices for grouped and ungrouped data (e.g., Cp, Cr, Cpk, Cpl, Cpu, K, Cpm, Pp, Pr, Ppk, Ppl, Ppu), normal/distribution-free tolerance limits, and corresponding process capability plots (histogram with process ranges, specification limits, normal curve). In addition, instead of these normal distribution indices and statistics, you can choose estimates (e.g., Cpk, Cpl, Cpu based on the percentile method) based on general non-normal distributions (Johnson and Pearson curve fitting by moments), as well as all other common continuous distributions including the Beta, Exponential, Extreme Value (Type I, Gumbel), Gamma, Log-Normal, Rayleigh, and Weibull distributions.
This module computes maximum-likelihood parameter estimates for those distributions, and it provides numerous options for evaluating the fit of the respective distribution to the data, including the frequency distribution with observed and expected frequencies, the Kolmogorov-Smirnov d statistic, histograms, Probability-Probability (P-P) plots, and Quantile-Quantile (Q-Q) plots. Options are also available for automatically fitting all distributions and choosing the distribution that best fits the data.
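Under the normality assumption described above, the basic capability indices reduce to simple formulas. A hand-rolled sketch with made-up measurements (the module additionally supports percentile-based indices for non-normal distributions):

```python
import statistics

def cp_cpk(data, lsl, usl):
    """Cp and Cpk under a normality assumption, with sigma
    estimated by the sample standard deviation."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    cp = (usl - lsl) / (6 * sigma)          # potential capability
    cpu = (usl - mu) / (3 * sigma)          # upper one-sided index
    cpl = (mu - lsl) / (3 * sigma)          # lower one-sided index
    return cp, min(cpu, cpl)                # Cpk accounts for off-center processes

# Made-up measurements against specification limits 9.4 .. 10.6
measurements = [10.1, 9.9, 10.0, 10.2, 9.8, 10.05, 9.95, 10.1]
cp, cpk = cp_cpk(measurements, lsl=9.4, usl=10.6)
```

Because Cpk penalizes an off-center mean, it can never exceed Cp; the gap between the two indicates how far the process has drifted from the target.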
Process capability indices consistent and in compliance with DIN (Deutsche Industrie Norm) 55319 and ISO 21747 are available.
Note: Sampling plans are discussed in detail in Duncan (1974) and Montgomery (1985). Most process capability procedures (and indices) were introduced to the US from Japan (Kane, 1986). However, they are discussed in three excellent hands-on books by Bhote (1988), Hart and Hart (1989), and Pyzdek (1989). Detailed discussions of these methods can also be found in Montgomery (1991).
Step-by-step instructions for the computation and interpretation of capability indices are also provided in the Fundamental Statistical Process Control Reference Manual published by the ASQC (American Society for Quality Control) and AIAG (Automotive Industry Action Group, 1991 (referenced as ASQC/AIAG, 1991). Repeatability and reproducibility (R & R) methods are discussed in Grant and Leavenworth (1980), Pyzdek (1989) and Montgomery (1991). A more detailed discussion of the subject (of variance estimation) is also provided in Duncan (1974).
Step-by-step instructions on how to conduct and analyze R & R experiments are presented in the Measurement Systems Analysis Reference Manual published by ASQC/AIAG (1990).
Standard references and textbooks describing Weibull Analysis techniques include Lawless (1982), Nelson (1990), Lee (1980, 1992), and Dodson (1994). Note that very similar statistical procedures are used in the analysis of survival data; Lee's book (1992) primarily addresses biomedical research applications. An excellent overview, with many examples of engineering applications, is provided by Dodson (1994).
]]>PLS implements partial least squares regression using the NIPALS (Rannar, Lindgren, Geladi, and Wold, 1994) and the SIMPLS (de Jong, 1993) algorithms for extracting partial least squares regression components. It is an extension of the multiple linear regression model that does not impose the restrictions employed by discriminant analysis, principal components regression, and canonical correlation. In partial least squares regression, prediction functions are represented by factors extracted from the Y'XX'Y matrix. The number of such prediction functions that can be extracted typically will exceed the maximum of the number of Y and X variables.
Therefore, partial least squares regression is probably the least restrictive of the various multivariate extensions of the multiple linear regression model. This flexibility allows it to be used in situations where the use of traditional multivariate methods is severely limited, such as when there are fewer observations than predictor variables. Furthermore, partial least squares regression can be used as an exploratory analysis tool to select suitable predictor variables and to identify outliers before applying classical linear regression.
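As a rough sketch of the NIPALS idea for a single response variable (PLS1) — not the module's implementation, which together with SIMPLS also handles multiple Y variables — consider:

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """Sketch of NIPALS for a single response (PLS1).
    Returns coefficients b such that (centered y) ~ (centered X) @ b."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                         # component scores
        p = Xk.T @ t / (t @ t)             # X loadings
        q = (yk @ t) / (t @ t)             # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - q * t                    # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)

# With as many components as predictors, PLS1 reproduces the OLS solution
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, 2.0, 3.0])
coef = nipals_pls1(X, y, n_components=3)
```

In practice one extracts far fewer components than predictors, which is what makes PLS usable when predictors are many, collinear, or outnumber the observations.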
Types of analyses available are Analysis of Covariance, Factorial ANOVA, Factorial Regression, General MANOVA/MANCOVA, Homogeneity-of-Slopes Model, Huge Balanced ANOVA, Main Effects ANOVA, Mixture Surface Regression, Multiple Regression, Nested Design ANOVA, One-Way ANOVA, Polynomial Regression, Repeated Measures ANOVA, Response Surface Regression, Separate Slopes Model, and Simple Regression.
]]>GRM offers most of the analysis options of GLM and provides model-building methods for finding the "best" model from a number of possible models.
]]>This module calculates a comprehensive set of statistics and extended diagnostics, including the complete regression table. Output includes:
The extensive residual and outlier analysis features a large selection of plots, including a variety of scatterplots, histograms, normal and half-normal probability plots, detrended plots, partial correlation plots, various casewise residual and outlier plots and diagrams, and others. The scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results spreadsheets. Residual and predicted scores can be appended to the current data file. A forecasting routine allows the user to perform what-if analyses and to interactively compute predicted scores based on user-defined values of the predictors.
Large regression designs can be analyzed. An option is also included to perform multiple regression analyses broken down by one or more categorical variables (multiple regression analysis by group). Additional add-on procedures include a regression engine that supports models with thousands of variables, Two-stage Least Squares regression, as well as Box-Cox and Box-Tidwell transformations with graphs.
Nonlinear Estimation, Spotfire Statistica® Generalized Linear Nonlinear Models, and Spotfire Statistica® General Partial Least Squares Models modules can estimate practically any user-defined nonlinear model, including Logit, Probit, and others.
SEPATH, the general Structural Equation Modeling and Path Analysis module, allows the user to analyze large correlation, covariance, and moment matrices for intercept models.
Additional advanced methods are provided in the Spotfire Statistica® General Regression Models module. This module includes best subset regression; multivariate stepwise regression for multiple dependent variables, for models that may include categorical factor effects; statistical summaries for validation and prediction samples; custom hypotheses; etc.
]]>The Factor Analysis module contains a wide range of statistics and options. It provides factor and hierarchical factor analytic techniques with extended diagnostics. It will perform:
Confirmatory factor analysis, as well as path analysis, can also be performed via the Structural Equation Modeling and Path Analysis (SEPATH) module.
Note: A hands-on, how-to approach to factor analysis can be found in Stevens (1986). More detailed technical descriptions are provided in Cooley and Lohnes (1971), Harman (1976), Kim and Mueller (1978a, 1978b), Lawley and Maxwell (1971), Lindeman, Merenda, and Gold (1980), Morrison (1967), or Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984). Fabrigar (1999) addresses the controversy over differences between principal components analysis (PCA) and factor analysis.
Historical note: The term factor analysis was first introduced by Thurstone, 1931, although similar techniques were used by Spearman as early as 1904 in his classic research on the nature of intelligence.
]]>