  Spotfire® Data Access FAQ


    How do I connect to MongoDB?

    Please see this article:

    https://community.spotfire.com/s/article/analyzing-mongodb-data-tibco-spotfire

    How do I connect to Amazon S3 or ADLS?

    Please see this article:

    https://community.spotfire.com/s/article/amazon-s3-and-azure-adls-data-tibco-spotfirer

    How do I connect to web services?

    The most lightweight solution is to use a Spotfire Data Function written in Python, leveraging the many Python libraries available for accessing and preparing web services data.
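    As a minimal sketch, the body of such a data function could look like the following, assuming a hypothetical JSON endpoint (the URL and column names are made up) and using the requests and pandas libraries; in Spotfire, the resulting data frame would be mapped to an output data table.

        import requests
        import pandas as pd

        # Hypothetical REST endpoint returning a JSON array of records.
        URL = "https://api.example.com/v1/sales"

        # Fetch the payload and fail fast on HTTP errors.
        response = requests.get(URL, timeout=30)
        response.raise_for_status()

        # Flatten the JSON records into a tabular structure that Spotfire
        # can consume as a data table.
        output_table = pd.json_normalize(response.json())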

    An alternative is to configure web services access in TIBCO Data Virtualization (TDV) and access it using the Spotfire Connector for TDV.

    Spotfire loads data in parallel. Is there anything I should be aware of or plan for before running 11.4 or later?

    This article answers the most common questions:

    community.spotfire.com/s/article/tibco-spotfirer-parallel-data-loading-faq

    How do I connect to and analyze data in SAP BW?

    Please see the recommendations described in this article:

    community.spotfire.com/s/article/recommendations-connecting-spotfire-sap-bw

    How do I analyze data in SharePoint Online?

    Please see community.spotfire.com/s/article/overview-how-access-microsoft-sharepoint-data-tibco-spotfire

    What is your recommendation for when to load data up front and when to use on demand?

    Spotfire® can extract data from a data source up front, or load slices of data on demand based on your interactions with the Spotfire clients. Loading "static" data up front has many benefits: you have all the data at hand during your analysis session, with no loading times while analyzing your data, and you can analyze your data in the Spotfire data engine, independently of the data source, choosing when to refresh the data. Data on demand, in contrast, is primarily a big data feature: it enables more data to be analyzed by providing slices of data from the data source based on your interactions with the analysis. Data on demand is frequently used for importing data into Spotfire's in-memory data engine, but it is also available for in-database data tables. On demand in combination with in-database reduces the number of rows processed by in-database queries, enabling analytics across larger data sets and speeding up the rendering of visualizations. Below is a summary of some of the pros and cons of each method.

    Data size
    Loading data up front/static: The size of data that can be analyzed depends on how much memory the client machine has and how long it's feasible to wait for data to load. For larger data sizes, data loading is often scheduled to be performed during nighttime. If this option is feasible, it's the recommended data loading option.
    Loading data on demand: Enables more data to be analyzed by only providing the data needed at a given time. Works both in-memory and in-database.

    Loading times
    Loading data up front/static: All loading time happens up front, before any analytics can be done. Data from the previous session is loaded immediately, as it's cached in the analysis file.
    Loading data on demand: New data is loaded during the analytic session as needed, which means there are waiting times every now and then during the session.

    We are replacing our SQL databases with Hadoop. What considerations are there from a Spotfire perspective when moving from SQL to Hadoop?

    Just like with SQL databases, Spotfire connects to Hadoop's SQL interfaces. It's important to select a Hadoop query engine carefully to avoid surprises later on; Hive, Impala, Spark SQL and Drill all offer different aggregation functions. Below are a few things to consider.

    Connection method

    Spotfire lets you analyze data from data tables. These data tables can be copied into Spotfire's in-memory engine or be kept in the external data source. The latter data table type, the in-database data table, is only represented by a metadata view in Spotfire.

    In-memory: Queries are executed in Spotfire's in-memory data engine, which provides extremely fast calculations on any data set and enables all of Spotfire's features. On-demand data tables are supported.

    Live queries/in-database: Queries are executed in the external data source's data engine. Expect to be able to work with the largest amounts of data, but also longer waiting times for visualizations and filters to refresh compared to in-memory data tables. On-demand data tables are supported.

    In-memory + live queries: The key to success is often to combine data loading methods in the same analysis. For example, load as much as possible into memory up front, then use live queries for the MASTER view of big data aggregates. When users drill down into the DETAILS, load row-level transactions into memory using on demand.

    Query speed

    With higher query speed, more data and more complex calculations are feasible to analyze using in-database data tables and live queries. Higher query speeds also mean that (scheduled) data extracts take less time to complete.

    SQL database: Analytical databases like Teradata, HPE Vertica, IBM Netezza, SAP HANA and IBM DB2 BLU are in general very fast. This means they are often capable of serving many concurrent Spotfire users and analytic files using in-database data tables and their live queries.

    Hadoop: Query speeds are often not on par with a SQL database, but on the other hand Hadoop scales across virtually unlimited amounts of data. Hive on MapReduce is not recommended for in-database data tables and interactive live querying; expect better performance from the SQL engines Hive on Tez with LLAP, Impala, and Spark SQL.

    There are also databases tightly integrated with Hadoop, for example Apache HAWQ and IBM Big SQL. These are represented to Spotfire as a Greenplum and an IBM DB2 database respectively, with similar capabilities.

    In addition, there are Microsoft SQL Server Analysis Services compatible cubes available on Hadoop, for example Apache Kylin, Kyvos and AtScale.

    The indexing engine Jethro also provides a very fast SQL query interface for Hadoop. Presto, the distributed SQL query engine originating at Facebook, is also worth mentioning.

    Function support

    SQL Database: Wide function support, often including functions like binning.

    Hadoop: The number of supported functions is in general not on par with most SQL databases, but is growing over time. Data connectors hold a mapping of Spotfire functions to functions in the connector's data source. If you believe a certain function should be mapped but isn't, please share your idea with us on Spotfire's Idea portal here.

    For more details on function support, please see the Spotfire web help here and browse to Connectors.

    We use Spotfire's Unpivot data transformation a lot. Is it possible to do the Unpivot before end users load the data, by pushing the Unpivot operation down to the database as part of an Information Link or a Data Connection?

    Spotfire's Unpivot data transformation is available when adding a data source, for example an Information Link, to your analysis. It's also possible to add it later as a separate step. To perform this operation in the data source, custom SQL needs to be used, either as part of an Information Link or as part of a Custom Query in a Data Connection.
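    As a minimal sketch, a hand-written unpivot in the data source is typically expressed with UNION ALL. The table and column names below are hypothetical; the string is the kind of SQL you would paste into an Information Link's SQL or a Data Connection's Custom Query.

        # Hypothetical wide table "sales_wide" with one amount column per
        # quarter, unpivoted in the database into (region, quarter, amount).
        unpivot_sql = """
        SELECT region, 'Q1' AS quarter, q1_amount AS amount FROM sales_wide
        UNION ALL
        SELECT region, 'Q2' AS quarter, q2_amount AS amount FROM sales_wide
        UNION ALL
        SELECT region, 'Q3' AS quarter, q3_amount AS amount FROM sales_wide
        UNION ALL
        SELECT region, 'Q4' AS quarter, q4_amount AS amount FROM sales_wide
        """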

    How can I include or exclude data (rows) from a data set before loading it into Spotfire, for example based on values in a true/false column?

    When connecting to a data table, you save loading time, memory, and the time needed to set filters if inclusion/exclusion criteria are defined in the data connection. Spotfire provides a few different ways of doing this.

    Prompts

    Information Links and Data Connectors both support prompts. These make it easy to create selection dialogs from which business users can pick unique values.

    Parameterized Custom Queries

    Data connectors support parameterized Custom Queries that let you pass values, based on interactivity in your analysis, into your SQL query. This can be used to limit data based on, for example, document properties or Spotfire's DOMAIN, GROUP and USER variables. The latter is useful for filtering out personal data from a generic data set.
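    As a minimal sketch, a parameterized custom query could look like the following. The table, column and parameter names are hypothetical; in Spotfire connectors, custom query parameters are referenced with a leading question mark and mapped to, for example, document properties or the user variables.

        # Hypothetical custom query limiting rows to a region selected in
        # the analysis and to rows owned by the current Spotfire user.
        custom_query_sql = """
        SELECT order_id, region, owner, amount
        FROM orders
        WHERE region = ?selected_region
          AND owner = ?current_user
        """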

    On demand

    On demand is a popular way to limit data, both before initial loading and during an analytic session, based on user interactions such as marking, filtering, and actions that change document properties.

    What's your recommendation for implementing Row Level Security in combination with very large embedded data?

    Because of the time it takes to load an analysis file with gigabytes of embedded data, scheduled updates are used. Scheduled updates load the analysis file into the Web Player's memory so that it's ready for business users when they, for example, come into the office in the morning. However, since data is loaded by the scheduled updates user, which is a service account, all data is loaded, while in many cases users are only allowed to see their personal slice of the complete data set.

    To combine preloading of large embedded data sets using scheduled updates with personal data, a lookup table and Spotfire user identities are used in combination with the complete data set. This is referred to as Personalized Information Links, which is described in more detail here.

    The idea is that a join is much quicker to perform than data loading. When a user logs in and opens the analysis file, the first thing that happens is that the complete data table is joined with the user's rows in the lookup table. The result is a much smaller data set, containing only the rows that remain after the join with the lookup table: only the user's personal data.
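    As a minimal sketch, the lookup join could be expressed as below. The table and column names are hypothetical; %CURRENT_USER% is the substitution Spotfire performs in personalized information links.

        # Hypothetical lookup table mapping Spotfire user names to the
        # accounts each user may see; %CURRENT_USER% is replaced by
        # Spotfire with the name of the logged-in user.
        personalized_sql = """
        SELECT s.account, s.order_date, s.amount
        FROM sales s
        JOIN user_account_lookup l ON l.account = s.account
        WHERE l.spotfire_user = '%CURRENT_USER%'
        """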

    Note: This is a solution for the thousands of consumers using the Spotfire Consumer client on the Web Player. Users with authoring licenses using the Spotfire Analyst or Business Author clients can edit the analysis file's configuration and gain access to the full data set.

    What's the recommended way of connecting to star schemas?

    Even though Information Links can be used to connect to star schemas, it's much easier to connect using data connectors. The summary below highlights some of the differences.

    Relations
    Information Links: Relations are (re)created manually in Information Designer.
    Data Connectors: Users configuring data connections can load relations from the data source with a couple of mouse clicks. The relations automatically become part of the (in-database) view to be analyzed. If data is imported into memory, the selected relations define a join that is performed in the data source; the resulting data table is then loaded into memory. Since the result is a single data table, visual analytics becomes very easy, with no need for additional add columns operations or relations.

    Drivers
    Information Links: JDBC drivers only. Drivers are needed on the Spotfire Server only.
    Data Connectors: No drivers, ODBC drivers or ADO.NET drivers, depending on the connector used. Drivers are needed on the Node Manager/Web Player server and on the Windows clients running Spotfire Analyst.

    What can I do to improve Spotfire's data loading speed?

    First of all, do you need to extract all that data? Spotfire works with both live queries against external data/data engines and data extracts imported into Spotfire's in-memory engine. By combining these data loading types and visualizations in the same analysis file you can often reduce the number of rows you actually need to extract into Spotfire.

    You are probably aware of the general factors impacting your data loading speed. Examples are hardware resources, database/cluster size, query engine performance, query complexity, network bandwidth, the number of concurrent users and the importance of using the latest drivers recommended by the database vendors.

    But there are capabilities and settings in the Spotfire platform that are good to be aware of, since they impact the time it takes for data to be loaded and ready for visual data discovery. Some of them are highlighted below, divided into loading data using live queries and in-database data tables, and loading data as extracts into Spotfire's in-memory data engine.

    Live queries/in-database

    Big data analytics using live queries and in-database data tables is directly dependent on how fast the SQL or MDX query engine can process Spotfire's queries.

    Pages in analysis files typically use multiple visualizations per page, filters on multiple columns, and brush linking for joint marking across visualizations, all of which generate queries against the data source.

    If the use case is a dashboard, user interaction with the analysis file is often lower, with modest load on the data source's query engine. In this case, Spotfire's live query cache is extremely useful, as a dashboard tends to reuse the same queries more often, with no need to push queries down to the data source. Users are simply using the visualizations more or less as they are configured by the author of the analysis file.

    If data in the data source is not updated live or during the use of the analysis file, or if analyzing live data is not part of the use case, it's recommended to increase the default time of the live query cache to 60-120 minutes. This ensures that even sporadic interaction with the dashboard, and the resulting queries, will make use of the live query cache.

    In-memory

    Spotfire's in-memory data engine ingests data as soon as it's delivered by the database drivers, the data source APIs, or the file reader. There are, however, capabilities and settings to consider.

    Data transformations that are part of the data loading step (part of the data source), or applied as a subsequent data transformation step, take time to process. Reviewing your data transformations might speed up the data loading step.

    When loading star schemas, consider whether it is more effective to join the tables you need in the data source, or to ingest them into Spotfire's data engine as separate tables and then work with Spotfire's relations. Connectors by default flatten/join star schemas in the data source before load, which makes connectors easy to work with but can be time consuming and increases the size of the table that is loaded.

    Narrow data sets in general load more quickly and are more suitable for analytics. Wide data sets (thousands of columns) in general take more time to load. In addition, they often require a potentially time-consuming Unpivot data transformation to turn them into a narrow data set.
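    As an illustration, with hypothetical column names, this is what the wide-to-narrow transformation looks like in pandas, which is conceptually what Spotfire's Unpivot does:

        import pandas as pd

        # Hypothetical wide table: one row per region, one column per quarter.
        wide = pd.DataFrame({
            "region": ["North", "South"],
            "q1_amount": [100, 80],
            "q2_amount": [120, 90],
        })

        # Unpivot (melt) into a narrow table: one row per region and quarter.
        narrow = wide.melt(id_vars=["region"], var_name="quarter", value_name="amount")
        print(narrow)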

    How many rows can Spotfire extract into data tables?

    In-memory data tables: This depends on how much RAM you have available, but the limit is often between 200M and 500M rows, depending on the number of columns.

    In-database data tables: The number of rows and columns is limited to what is needed to draw visualizations and filters. The data tables hold only aggregated slices of the data; the rows themselves are kept in the data source.

    Besides saving data tables as CSV files, which have limitations in file size, can users save big data tables in Spotfire file formats for later use?

    If you need to save data to a file, use the Spotfire Binary Data File format (SBDF). For more information, see Working with large data volumes.
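    As a minimal sketch, SBDF files can also be written and read from Python. This assumes the "spotfire" package from PyPI and its sbdf module; the module and function names here are based on that package and should be verified against its documentation.

        import pandas as pd
        from spotfire import sbdf

        df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

        # Write the table to SBDF for later reuse in Spotfire.
        sbdf.export_data(df, "example.sbdf")

        # Read it back into a pandas data frame.
        df2 = sbdf.import_data("example.sbdf")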

    How many columns do Spotfire's in-memory data tables support?

    This depends on how much RAM you have available, but typically multiple thousands of columns are supported.


