  • Spotfire® Parallel Data Loading FAQ


    Starting with TIBCO Spotfire 11.4 LTS, data is loaded into the Spotfire data engine, and processed, on parallel background threads when opening an analysis. In previous releases, data was loaded and processed in sequence. This article answers the most common questions related to this significant change in how data is brought into Spotfire.

    What's loaded in parallel?

    • Data sources

    • Custom data sources

    • Calculated columns

    • Data transformations

    • Hierarchies

    • Add columns/rows

    Note: Queries against in-database data are issued from the visualizations, and data functions and on-demand data sources are executed separately in parallel after the regular data loading, as in previous releases.

    When can I expect increased performance?

    The biggest performance increases will be in analyses with multiple data sources. 

    Example 1

    | Condition            | Spotfire 10.10 (hh:mm:ss) | Spotfire 11.4 (hh:mm:ss) | Comment |
    |----------------------|---------------------------|--------------------------|---------|
    | IL1 (mssql, 2.2GB)   | 00:07:58                  | 00:07:56                 | Expect similar performance for analyses with a single data source. |
    | IL2 (gpdb1, 250MB)   | 00:00:19                  | 00:00:17                 |         |
    | IL3 (gpdb2, 3.5GB)   | 00:02:55                  | 00:02:44                 |         |
    | IL1 with IL2         | 00:08:17                  | 00:07:56                 | With two data sources, we could see improvements to the total load time. |
    | IL1 with IL3         | 00:10:55                  | 00:07:58                 |         |
    | IL2 with IL3         | 00:03:15                  | 00:02:46                 |         |
    | IL1 with IL2 and IL3 | 00:11:15                  | 00:07:54                 | In certain cases, like in this one, we basically get three data sources for the waiting time of one. |
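    The combined rows in Example 1 are consistent with a simple rule of thumb: loading in sequence costs roughly the sum of the individual load times, while loading in parallel costs roughly the time of the slowest source. The sketch below checks that rule against the single-source timings from the table. The function names are illustrative only, and the model is an assumption that ignores contention at the database and on the network.

```python
from datetime import timedelta

# Single-source load times copied from Example 1 (hh:mm:ss).
SINGLE_10_10 = {"IL1": "00:07:58", "IL2": "00:00:19", "IL3": "00:02:55"}
SINGLE_11_4 = {"IL1": "00:07:56", "IL2": "00:00:17", "IL3": "00:02:44"}


def parse_hms(text: str) -> timedelta:
    """Convert an hh:mm:ss string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def sequential_estimate(sources: list[str]) -> timedelta:
    """Assumed model for sequential loading: the individual times add up."""
    return sum((parse_hms(SINGLE_10_10[s]) for s in sources), timedelta())


def parallel_estimate(sources: list[str]) -> timedelta:
    """Assumed model for parallel loading: the slowest source dominates."""
    return max(parse_hms(SINGLE_11_4[s]) for s in sources)


if __name__ == "__main__":
    combo = ["IL1", "IL2", "IL3"]
    print("sequential estimate:", sequential_estimate(combo))
    print("parallel estimate:  ", parallel_estimate(combo))
```

    For IL1 with IL2 and IL3, the estimates come out at 11:12 and 7:56, close to the measured 11:15 and 7:54 in the table.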

    Example 2

    Testing showed generally faster load times. In some cases the improvement was only about 20%, but in one case loading was about 60% faster (the load time went from 1:14 min to 0:30 min), and in another case it was roughly 70% faster (from 3:25 min to 1:08 min).
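    The percentages in Example 2 read most naturally as reductions in load time. A small worked calculation, with hypothetical helper names, shows how to derive such a figure from two timings so that your own measurements can be compared on the same basis.

```python
def to_seconds(text: str) -> int:
    """Convert an m:ss string such as '1:14' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def reduction_percent(before: str, after: str) -> float:
    """Percentage by which the load time dropped from `before` to `after`."""
    b, a = to_seconds(before), to_seconds(after)
    return 100.0 * (b - a) / b


if __name__ == "__main__":
    for before, after in [("1:14", "0:30"), ("3:25", "1:08")]:
        print(f"{before} -> {after}: {reduction_percent(before, after):.1f}% shorter load time")
```

    The two cases quoted above work out to about 59% and 67%, which the article rounds to 60% and 70%.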

    Has the individual data table load speed been improved?

    Parallel data loading operates on the data source level, so a data table built from a single data source is not expected to load noticeably faster. Remember that the final data tables that visualizations are built on can consist of data from a single data source or from many joined data sources. Parallel data loading gives the largest performance increase when loading data tables that use many joined data sources, or when loading multiple data tables in one analysis.

    Are multiple on-demand tables loaded in parallel?

    Yes, multiple on-demand data tables in an analysis file are loaded in parallel. However, on-demand data tables were also loaded in parallel in versions before 11.4, so there is no change in performance there.

    Are data table data sources computed in parallel?

    When I create a new data table from a table that's already loaded into the analysis, will that creation also factor into the mix when opening the file up?

    Data table data sources are calculated in parallel, yes. Any calculated columns based on the data table data source are also calculated in parallel.

    Note: Data functions are executed (in parallel) after data has been loaded.

    Are data table data sources displayed in the progress dialog?

    Yes.

    Will data load in parallel from multiple tables in the same database?

    Yes, the tables will be loaded in parallel, but you will need to test whether your database delivers data faster in parallel or sequentially.

    Are we changing the default behavior?

    Yes. Data is by default loaded in parallel, starting with 11.4 LTS.

    Can I disable parallel data loading and use the classic loading method?

    Yes. Using the installed client, it is possible to switch off the parallel data loading entirely under Tools > Options > Document > Data loading, or, for a single analysis, through File > Document properties > Compatibility Settings.

    An administrator can also control the preferences for a group in Administration Manager > Preferences > Application > DataImportPreferences > ForceSerialDocumentDataLoading.

    The ForceSerialDocumentDataLoading preference overrides the Tools > Options preference.

    Custom data sources can be specified to execute on the application thread by overriding the API property AlwaysLoadOnApplicationThread.

    Will data functions load data in parallel?

    Yes. They have always executed in parallel.

    Any considerations when doing performance comparisons?

    When testing data loading performance, keep in mind that connector caching and information link caching are available. Remember this in particular if you rerun a test and get vastly improved performance the second time.

    With many databases you should measure multiple times, as performance will vary from time to time.
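    One way to follow both pieces of advice is to time several runs and report the first run separately from the later ones, since only the first is guaranteed to be unaffected by caches that the test itself has warmed. The sketch below is a generic timing harness; `load` is a hypothetical callable that you would supply to trigger the load under test, and it does not use any Spotfire API.

```python
import statistics
import time
from typing import Callable


def time_runs(load: Callable[[], object], runs: int = 5) -> dict[str, float]:
    """Time `load` several times; keep the first (possibly cold) run separate."""
    if runs < 2:
        raise ValueError("runs must be at least 2 to separate cold and warm runs")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        load()
        durations.append(time.perf_counter() - start)
    warm = durations[1:]  # later runs may benefit from caching
    return {
        "first_run": durations[0],
        "warm_min": min(warm),
        "warm_median": statistics.median(warm),
        "warm_max": max(warm),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the real load you want to measure.
    print(time_runs(lambda: time.sleep(0.05), runs=4))
```

    A large gap between `first_run` and `warm_median` suggests caching rather than a genuine difference between loading methods.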

    I'm loading 10 files using Add rows into one Spotfire data table. Will Parallel load improve performance?

    Yes. If you have ten files and add them as rows to a single data table, you should see a performance improvement.

    "How" parallel - how many data sources at once? Any config/hardware factors that affect this (like cores, etc)? Can it be configured?

    This is based on the number of available cores and dependencies between data tables. For example, a data table data source cannot be loaded before the source data table has been loaded. Information links are limited by the maximum number of connections. Currently, this cannot be configured.
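    To illustrate how the core count and dependencies together bound the degree of parallelism, here is a minimal sketch of dependency-aware scheduling on a thread pool. It is not Spotfire's implementation; the function name, the `load` callable, and the table names are all hypothetical. A source is submitted only once every source it depends on has finished, and the pool size defaults to the number of cores.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable, Optional


def load_in_parallel(
    dependencies: dict[str, set[str]],
    load: Callable[[str], object],
    max_workers: Optional[int] = None,
) -> list[str]:
    """Load every source, starting each one only after its prerequisites.

    `dependencies` maps a source name to the names it depends on.
    Returns the names in the order in which their loads completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    finished: list[str] = []
    pending = {}
    with ThreadPoolExecutor(max_workers=max_workers or os.cpu_count()) as pool:
        while sorter.is_active():
            # Submit everything whose prerequisites are already loaded.
            for name in sorter.get_ready():
                pending[pool.submit(load, name)] = name
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                future.result()  # re-raise any error from the load
                sorter.done(name)  # unblocks sources that depend on this one
                finished.append(name)
    return finished


if __name__ == "__main__":
    deps = {"sales": set(), "customers": set(), "sales_summary": {"sales"}}
    print(load_in_parallel(deps, lambda name: name))
```

    In this example `sales` and `customers` can load at the same time, while `sales_summary`, which stands in for a data table data source, waits for `sales`.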

    Any additional factors to consider when this is done in Web Player/Automation Services?

    Just as in previous releases, any number of concurrent users can open any number of analysis files at the same time.

    Is there any documentation/considerations about increased load at the external data source, and how that can be managed?

    For information links, this can be managed with the setting for maximum number of connections. Otherwise, the load is comparable to that generated by visualizations: 12 visualizations on a page will generate 12 concurrent queries.
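    The effect of a maximum-connections setting can be pictured as a semaphore in front of the data source: any number of queries may be requested, but only a fixed number run at once and the rest wait in line. The sketch below shows the idea only; the class and function names are hypothetical and it is unrelated to how Spotfire Server enforces its own limit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class ConnectionLimiter:
    """Allow at most `max_connections` queries to run at the same time."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous queries observed

    def run(self, query: Callable[[], object]) -> object:
        with self._slots:  # blocks while all connections are in use
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return query()
            finally:
                with self._lock:
                    self._active -= 1


def run_queries(queries: list, max_connections: int) -> tuple[list, int]:
    """Run all queries concurrently, capped at `max_connections` at a time."""
    limiter = ConnectionLimiter(max_connections)
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(limiter.run, queries))
    return results, limiter.peak


if __name__ == "__main__":
    twelve = [lambda i=i: i * i for i in range(12)]
    results, peak = run_queries(twelve, max_connections=4)
    print(results, "peak concurrency:", peak)
```

    With 12 queries and a limit of 4, the data source never sees more than 4 at once, at the cost of the remaining queries waiting for a free slot.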

    Is this applicable for all ways of accessing data - Connectors, Information links, ODBC etc?

    Yes, all imported data will be loaded in parallel.

    Is there an example of how much troubleshooting/timing detail can be found in the progress dialog clipboard data?

    No, and this can vary over time. As new functionality is added to Spotfire, more information will become available.

    As this may increase the risk for a traffic jam at the underlying data source side, is it easy to tell when a data source is waiting in line to get started vs. when data is being imported?

    Yes, it is easy to see which data sources are currently executing, which are waiting in line, and which have finished.

