  • Data Access and Data Wrangling with Spotfire Statistica®


    This is your guide to understanding important Statistica functionality. This includes actions like accessing, cleaning, joining, subsetting, summarizing and transforming data.

    Data Type

    If a data source defines the data type, Statistica will use that data type.

    Statistica analyzes the data itself to determine the data type when importing from the following file types. Text files do not contain data type information, and Excel does not require a consistent data type for an entire variable, so in both cases the data must be examined.

    • .csv
    • .txt
    • .xls and .xlsx 

    Data type is broken into two different concepts:

    1. storage type: byte, double, integer, text label, text fixed length, text variable length

    • Byte supports whole numbers from 0 through 255, inclusive.
    • Double is the default format for storing numeric values in Statistica. The values are stored as 64-bit floating point real numbers (unless a 32-bit operating system is being used), with 15-digit precision (1 bit for the sign, 11 for the exponent, and 52 for the mantissa). The range of values supported by this data type is approximately ±1.7×10^308 (see the sketch after this list for a quick check of these limits).
    • Integer supports whole numbers from -2,147,483,648 through 2,147,483,647, inclusive.
    • Text label is special. It is useful for analyzing ordinal or categorical data. A variable's value has two identities: the label the user sees and the numeric code that the analyses use; the label itself can be text or numeric. The user sees "Male", but Statistica analyses see 1; the user sees the customer number, but the analyses see 5. Text labels are case insensitive; Smith is equivalent to smith.
    • Text can have a fixed length or a variable length. Statistica analyzes the data and determines which option is more efficient for analytic performance or storage.
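    The byte, integer, and double limits above follow the usual fixed-width storage conventions. As a point of reference only (plain Python, not Statistica), the same limits can be checked like this:

```python
import struct
import sys

# Byte: unsigned 8-bit integer, 0 through 255
byte_min, byte_max = 0, 2**8 - 1

# Integer: signed 32-bit integer, -2,147,483,648 through 2,147,483,647
int_min, int_max = -(2**31), 2**31 - 1

# Double: IEEE 754 64-bit float (1 sign bit, 11 exponent bits, 52 mantissa bits)
print(sys.float_info.max)               # about 1.7976931348623157e+308
print(sys.float_info.dig)               # 15 decimal digits of precision
print(len(struct.pack("<d", 1.0)) * 8)  # 64 bits of storage per value
```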

    2. measurement type: auto, continuous, categorical, ordinal

    • Auto means that Statistica automatically analyzes the data and decides between continuous and categorical.
    • Continuous measurement is numeric with a scale like weight in pounds.
    • Categorical is a method for classifying a limited number of measurements; you may also see this called a discrete measure. The values can be numeric or text. For example, the day of the week can be Monday, Tuesday, Wednesday, ... or 1, 2, 3, ...
    • Categorical variables can be further sub-categorized as nominal, ordinal or dichotomous. 
    • The ordinal scale of measurement represents the ranks of a variable's values. Values measured on an ordinal scale contain information about their relationship to other values only in terms of whether they are "greater than" or "less than" other values, but not in terms of "how much greater" or "how much smaller." Ordinal values can be numeric or text (see the sketch after this list for how these measurement types look in practice).
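    Outside of Statistica, the same continuous/categorical/ordinal distinctions map onto familiar constructs in most analytics tools. The following is only an illustrative pandas sketch with made-up column names, not a Statistica interface:

```python
import pandas as pd

df = pd.DataFrame({
    "weight_lb": [152.4, 180.0, 135.7],      # continuous: numeric scale
    "day": ["Monday", "Tuesday", "Monday"],  # categorical: limited set of values
    "size": ["small", "large", "medium"],    # ordinal: ordered categories
})

# Categorical: values act as labels; the codes give each label a numeric identity
df["day"] = df["day"].astype("category")
print(df["day"].cat.codes.tolist())          # [0, 1, 0]

# Ordinal: declare an explicit order so "greater than"/"less than" is meaningful
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_type)
print((df["size"] > "small").tolist())       # [False, True, True]
```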

    Data Sources

    Corporations around the world depend on Statistica for their analytic needs. Statistica answers these needs by providing a single tool to mash up data from a wide variety of sources. Once configured, verified users have access to data extraction, data cleaning and advanced analytics. The data sources supported by Statistica include the following:

    • TIBCO® Data Virtualization; catalog, discover, virtualize
    • Spotfire® to access data sources via a data connection
    • Any ADO.NET or ODBC relational database, such as Oracle or Microsoft SQL Server
    • API - if the data source has an API, then the customer can write R, C#, Python, or Spark code to access the data (see the sketch after this list)
    • Hierarchical process cubes from Enterprise Resource Planning (ERP) systems such as SAP HANA (SQL or MDX)
    • Teradata
    • Data historian repositories such as the PI Data Historian from OSISoft, Inc. and the newer PI AF / EF
    • Files such as CSV, TXT, Microsoft Excel, Spotfire SBDF, Minitab, JMP, SAS, SPSS, HDFS text file, fixed position format text (mainframe data file)
    • Statistica spreadsheets (an ODBC driver is available)
    • Amazon Redshift
    • Amazon S3
    • Amazon Athena
    • H2O
    • Image files, which Statistica digitizes from RGB color or grayscale into numbers
    • Unstructured text or websites
    • Manual data entry via TIBCO Statistica® Data Entry Server (double-blind data entry with analytics)
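    For the API route in the list above, the exact code depends entirely on the data source. The following is only a generic Python sketch with a placeholder URL and token; it is not a Statistica-specific interface:

```python
import requests

# Hypothetical REST endpoint and token -- replace with the real data source's API details
API_URL = "https://example.com/api/v1/measurements"
headers = {"Authorization": "Bearer <your-token>"}

response = requests.get(API_URL, headers=headers, params={"limit": 100}, timeout=30)
response.raise_for_status()

rows = response.json()  # assumes the API returns a JSON array of records
print(len(rows), "records retrieved")
```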

    Technically Statistica supports the following data file extensions:

    • dBASE; .dbf
    • Excel; .xls, .xlsx, .xlsm, .xlsb
    • HTML; .htm, .html
    • Image; .bmp, .jpg, .jpeg, .png, .gif, .tif, .tiff
    • JMP; .jmp
    • Minitab; .mtw
    • Rich text format; .rtf
    • SAS; .sd, .ssd, .sas7*, .tpt, .xpt
    • SPSS; .por, .sav
    • Statistica spreadsheet; .sta
    • Text; .csv, .txt
    • Word; .doc, .docx
    • XML; .xml

    Get Data

    After starting Statistica, select the Home menu and then the pull-down arrow on the New menu. Select the Workspace menu. A list of templates is displayed; select the Get Data template.

    This workspace contains almost all of the nodes that can retrieve data for analysis. Some of the nodes may be disabled if the user does not own the associated product. If data needs to be retrieved from an OSISoft PI database, look for a template named PI Asset Framework or PI Asset Framework and Event Frames.

    Amazon S3 Data

    This node requires the URL to the Amazon S3 public data and a field delimiter (typically comma).

    If the first row of data does not contain the variable names, then clear that check box.

    If you are uncertain about the size of the file, set the maximum number of cases (rows) to a small number. Otherwise, leave this option blank and the entire dataset will be used.
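    As a rough analogy to what this node does, a public S3 object containing delimited text can be read directly by URL. The bucket path below is a placeholder, not a real dataset:

```python
import pandas as pd

# Placeholder URL for a public S3 object containing comma-delimited text
url = "https://my-bucket.s3.amazonaws.com/example.csv"

df = pd.read_csv(
    url,
    sep=",",    # field delimiter (typically comma)
    header=0,   # first row holds the variable names; use header=None if it does not
    nrows=100,  # cap the number of cases while the file size is still unknown
)
print(df.shape)
```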

    Automated Generation of SQL (Query Spreadsheets node)

    All products contain this node. Statistica has a graphical editor that creates complicated SQL statements automatically from a drag-and-drop environment. Why is this of interest?

    The Statistica workspace is a visual, no-code analytic workflow tool. The user can merge multiple datasets via a node named Query Spreadsheets without writing SQL by hand. Another way to think about this node: it can blend datasets from a hybrid environment (database, cloud source, local file) into one dataset. It is recommended to use the Advanced Query Builder button.
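    Conceptually, the node performs the same kind of blend that could be written by hand outside of Statistica. The following Python sketch uses hypothetical file, database, and column names for illustration:

```python
import sqlite3
import pandas as pd

# Hypothetical sources standing in for the datasets the node would blend
local = pd.read_csv("sales.csv")        # local flat file
conn = sqlite3.connect("warehouse.db")  # any relational source
customers = pd.read_sql_query(
    "SELECT customer_id, region FROM customers", conn
)

# Equivalent of the inner join the Query Spreadsheets node would generate as SQL
blended = local.merge(customers, on="customer_id", how="inner")
print(blended.head())
```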

    If the data source requires multidimensional expressions (MDX), then use the Statistica Query button. This opens a separate application. After selecting or creating a database connection, select the View menu and make sure that the Cube/MDX option is selected.

    [Screenshot: Statistica Query]

    Database Connector

    This feature was formerly known as in-place database processing (IDP). If you are working with an analytic workflow (workspace), look for a node named Database Connector. Starting with Statistica 13.4, all products include this node for the workspace.

    [Screenshot: Database Connector node]

    The Database Connector accesses databases using a one-step process and does not require creating local copies of the data. The connector uses the processing resources (multiple CPUs) of the database server to execute the query, extract the requested data records, and send them to the Statistica workstation. Analytic performance can therefore be good.

    However, there are situations when using the Database Connector is not recommended for performance reasons. For example:

    • multiple analyses are to be performed on the same data set (for example, calculate the mean and then run a logistic regression)
    • a specific analytic method requires multiple passes through the raw data set
    • the data is already available on the local device

    There are exceptions to the recommendations above. If the data's size exceeds the capacity of the local storage device, use the Database Connector. Likewise, when Statistica workspaces are run as automated jobs, it may be more efficient to avoid repeatedly importing (and saving) files.
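    The trade-off can be pictured outside of Statistica as well. In this hypothetical Python sketch, the first query pushes the work to the database and returns only a summary, while the second pulls a local copy once so that several analyses can reuse it:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("plant_history.db")  # hypothetical database

# In-database style: the server does the work and returns only the result
mean_temp = pd.read_sql_query(
    "SELECT AVG(temperature) AS mean_temp FROM readings", conn
)

# Local-copy style: pull the raw records once, then run several analyses on them
readings = pd.read_sql_query("SELECT temperature, pressure FROM readings", conn)
print(readings["temperature"].mean())  # first analysis
print(readings.corr())                 # second analysis reuses the same local copy
```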

    Enterprise Ad Hoc Analysis Data

    This node requires the purchase of a Statistica Server.  This node calls an analysis configuration that is managed with the Statistica Enterprise Manager application. 

    Enterprise Ad Hoc Analysis Data with Input Data

    This node requires the purchase of Statistica Server.  This node calls an analysis configuration that is managed with the Statistica Enterprise Manager application. 

    Enterprise Data Configuration

    This node requires the purchase of Statistica Server.  This node calls a data configuration that is managed with the Statistica Enterprise Manager application. 

    H2O Data

    Statistica integrates with H2O Sparkling Water. This node requires the H2O URL plus the location of the data source. The data source can be located in Amazon S3, Dropbox, HDFS, or a local file on the H2O server.

    Import Excel

    Import data from an Excel file. 

    Import Fixed Data

    Import a text file that has a fixed-position format, such as a mainframe data file.
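    As an illustration of the format (not of the node itself), a fixed-position file can be described by the character offsets of each field. The file name, offsets, and field names below are hypothetical:

```python
import pandas as pd

# Hypothetical record layout: each field occupies fixed character positions
colspecs = [(0, 8), (8, 20), (20, 28)]  # (start, end) offsets of each field
names = ["plant_id", "timestamp", "reading"]

df = pd.read_fwf("history.dat", colspecs=colspecs, names=names)
print(df.head())
```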

    Import HDFS Text

    This node requires the purchase of Statistica Server.  The connection to the Hadoop server is defined within the Statistica Enterprise Manager application. 

    Import Spotfire

    Import data from a Spotfire .SBDF data file.

    Import Text

    Import data from a text file (.TXT, .CSV) with delimiters between variables. 

    In-Database Enterprise Data Configuration

    This node requires the purchase of Statistica Server.  

    PI Assets Query

    PI Event Frames Query

    Spotfire Data Connection

    This node is available with all products. It allows users to retrieve data from the Spotfire Library for:

    • Data Connections (or Connector connection)
    • Information Links
    • Data files (or SBDF files)

    Writing Data into Database (Write Spreadsheet to Database node)

    Cleaned data and/or analytic results can be inserted or appended into Access, SQL Server, Oracle, MySQL, Teradata, SQL Server PDW, PostgreSQL, or DB2.

    [Screenshot: Write Spreadsheet to Database node]
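    Outside of the workspace node, the same idea can be sketched with pandas and SQLAlchemy. The connection string, table, and column names below are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- any supported database would work here
engine = create_engine("postgresql://user:password@dbhost:5432/analytics")

results = pd.DataFrame({"batch": [101, 102], "yield_pct": [97.2, 95.8]})

# if_exists="append" adds rows to an existing table; "replace" recreates the table
results.to_sql("cleaned_results", engine, if_exists="append", index=False)
```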

    Data Wrangling (Cleaning & Transformation)

    Rank

    This transformation replaces the values of the selected variable with their ranks. If you want to keep the original values, add a new variable first; this can be done via the Transformation of Variables node.

    [Screenshot: Rank transformation options]

    This functionality can be used for numeric and character data. However, the more common uses are related to numeric data. 

    The most common reason for ranking a variable is data reduction: the data is binned, grouped, and so on. This technique is related to feature selection and dimensionality reduction because it shares the same goal: focus the data on its most meaningful parts, eliminate noise, and analyze the data faster because it has been reduced. Ranked data is typically analyzed with nonparametric statistics rather than descriptive statistics.

    Another common reason relates to the presentation of results to the end user. For example, the statistical results may be grouped or broken down by ranks 1, 2, 3 to provide the most meaning to the end user.

    Rank 1 can start with the smallest or largest value in the variable. If a value is tied (for example, the same value appears in row 2 and row 30), there are four different ways to break the tie. Regular integer ranks, fractional ranks, and percentages of fractional ranks can be created.
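    The same ideas (direction of rank 1, tie handling, fractional and percent ranks) can be tried quickly in Python. Note that pandas has its own set of tie-breaking rules, which only roughly correspond to the options Statistica offers:

```python
import pandas as pd

values = pd.Series([3.2, 1.5, 3.2, 7.8, 0.9])

# Rank 1 can start at the smallest (ascending=True) or largest (ascending=False) value
print(values.rank(ascending=True))

# Tie handling: "min", "max", "first", "dense", or "average" (fractional ranks)
print(values.rank(method="min"))      # tied values share the lowest rank
print(values.rank(method="average"))  # tied values get the average (fractional) rank

# Percent ranks: each rank expressed as a fraction of the number of cases
print(values.rank(pct=True))
```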

