TIBCO Spotfire Primer - Blog 2 - Hybrid in-database/in-memory aggregations


    This is the second in a series of blogs covering various aspects of the brand new book - TIBCO Spotfire: A Comprehensive Primer, by Andrew Berridge.

    The first blog post introduced Chapter 5, "Practical Applications of Spotfire Visualizations". This post covers some content from Chapter 10, "Information Links and Data Connectors" - specifically the part of the chapter that discusses the use of Spotfire's custom expressions when working with in-database data.

    But first, a quick recap on in-memory versus in-database analytics. I quote from the book:

    Spotfire supports in-memory and in-database analytics. So far in this book, we have only been using in-memory data and analytics. Spotfire's in-memory data engine is highly optimized for working with small, medium, and large datasets, but eventually, your computer is likely to run out of memory if you have very large datasets. People often ask me: how large is a very large dataset? Or how much data can Spotfire work with in-memory? For me, this is one of the most difficult questions to answer!

    It really depends on how many rows and how many columns of data you are working with. I generally find that the Spotfire in-memory data engine can handle hundreds of millions of rows of data with tens of columns, or hundreds of thousands of rows with thousands of columns. If a dataset is bigger than that, then I recommend using in-database analytics or data on demand. Both of these concepts are detailed in this chapter. When you load data into Spotfire's in-memory data engine, an entire copy of the data is loaded into the memory of your computer (or web server). The in-memory data engine is designed specifically for analytics: it compresses data, removes redundant columns, and swaps data to disk if memory runs low. The in-memory data engine can aggregate data at any level and can also show the raw data or details of individual rows. Using the in-memory data engine gives you the ability to use Spotfire's rich set of expressions that you'll have seen in Chapter 7, Source Data is Never Enough.

    In-database analytics pushes the analytic power to the underlying database system. Spotfire can, via its connectors, query databases to get the information it needs to draw visualizations on extremely large datasets.

    Imagine a bar chart showing global sales of all product categories for an extremely large consumer products company. You could try to load all that data into Spotfire, but I doubt it would work. Imagine the sheer volume of data: all the transactions for every store that sold every product. The transactional data could run into the billions or even trillions of rows. The load time would be significant and the IT infrastructure to support the transfer of all that data would be impractical.

    A much better solution would be to execute a query against the database that just returns the information that Spotfire needs to draw the bar chart: it could be a query that returns quantities sold grouped by product categories over time. Then, all Spotfire needs to hold in memory is the result of that query, which is what is shown on the bar chart. The resultant dataset is only a few hundred or thousand rows. Large database systems, particularly those that are so-called big data solutions, are highly optimized to return query results quickly on such large datasets.
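    To make this concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table and column names echo the hypothetical sales example in this post; the row counts and values are invented purely for illustration (a real deployment would, of course, query the production database through a Spotfire connector, not sqlite):

```python
import sqlite3

# Build a toy "large" transactional table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SalesData (category TEXT, sales REAL)")
rows = [("Beverages", 1.0), ("Snacks", 2.0)] * 50_000  # 100,000 transaction rows
conn.executemany("INSERT INTO SalesData VALUES (?, ?)", rows)

# The aggregation runs inside the database; only the tiny summary
# result set crosses the wire to the client.
summary = conn.execute(
    "SELECT category, SUM(sales) FROM SalesData GROUP BY category"
).fetchall()

print(summary)  # two summary rows instead of 100,000 transaction rows
```

    The point of the sketch is the shape of the traffic: 100,000 rows stay in the database, and only two aggregated rows come back to the client - which is exactly what Spotfire's connectors arrange for you.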

    Let's take it one step further in our hypothetical scenario. Now that the bar chart is shown, how do you drill into that data with details visualizations? Consider that the bar chart just shows what's going on in the data. You're highly likely to want to know why something is happening, which you would do by looking at the raw data in some way. Fortunately, Spotfire makes this pretty easy too. It uses a concept called data on demand. Data on demand is used to retrieve data from a database based on something changing in Spotfire. It can be configured to retrieve details of marked rows, filtered rows, fixed values, values stored in document properties, or the results of a custom expression.

    Data on demand can be used to perform another in-database analytics query or it can retrieve the raw row values into memory.

    Part of the challenge with in-database analytics is that the full suite of Spotfire custom expressions is not available to you. Fancy using OVER expressions, for example, to calculate a cumulative sum? This can appear challenging at first sight. This is because the Spotfire data connectors query the underlying database directly - they build queries that extract just enough summary information to visualize the data at some aggregated level - but the database knows nothing of Spotfire's OVER expressions, so you can't use them directly.

    Let's go back to the hypothetical scenario described above. You could imagine that Spotfire needs to show total sales for each month - the query passed to the database might be something like this:

     SELECT Sum(Sales) as Sales, month([saleDate]) as [Month]
     FROM SalesData
     GROUP BY month([saleDate])

     

    This query sums the total sales for each month and would produce a bar chart that looks like this:

    [Image: bar chart of sales by month]

    Remember - you, as the user, do not have to enter the SQL query - Spotfire does all that for you automatically when you choose what you want to show on each of the axes!

    Now, let's say we wanted to calculate the cumulative sum. It could be that the database doesn't support cumulative sums, or is rather inefficient at calculating them. For example, a quick Internet search on "cumulative sum, SQL Server" reveals some interesting insights. Apparently, a lot of database administrators will be very upset if you try to calculate something like this inefficiently on their database! It seems that calculating a cumulative sum is rather difficult for SQL Server...

    Instead, let's use Spotfire's custom expression language to calculate the cumulative sum. After all, we know the value of each individual bar; we just need to sum them over time. We'll be using the standard axis selector to choose the cumulative sum aggregation, then inspecting the resultant custom expression to find out what's going on under the hood.

    We can choose Cumulative Sum like this:

    [Image: choosing Cumulative Sum from the axis selector]

    The Spotfire visualization will now update like this:

    [Image: bar chart updated to show the cumulative sum]

    Now this visualization is using the combined power of in-database and in-memory calculations - first of all, the database is being queried to return the raw sums for each month; then Spotfire is calculating the cumulative sum in memory. Therefore, we get the best of both worlds: a small amount of data is returned to Spotfire; the cumulative sum calculation is performed efficiently in memory, rather than inefficiently within the database.

    Inspecting the custom expression yields the following:

     Sum([sales]) THEN Sum([Value]) OVER (AllPrevious([Axis.X]))

     

    So, what's going on here? The key is the THEN keyword. It leads to what is known as a post-aggregation expression - i.e., there is an initial aggregation, followed by a further aggregation. Using post-aggregation expressions makes OVER expressions much easier to work with than they would otherwise be. You can read more about OVER expressions in Chapter 7 of the book, "Source Data is Never Enough".

    The initial expression of Sum([sales]) is passed directly to the underlying database (Spotfire will automatically add the necessary SELECT, Group By, etc. clauses to the SQL). The database will return a small number of rows to Spotfire as an in-memory data table for visualizing the bar chart. Although the table isn't shown to you (it's hidden), you can imagine it like this:

    [Image: the hidden in-memory table of summed sales per month]

    The post-aggregation part of the expression then operates on the Value column (in this case, the sum of sales). It's simply:

     Sum([Value]) OVER (AllPrevious([Axis.X]))

    All this is doing is performing a cumulative sum. The cumulative sum is calculated in-memory against the (hidden) data table that drives the visualization.
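    The two-step calculation is easy to sketch in plain Python. The monthly sums below stand in for the hidden table returned by the in-database aggregation (the values are invented for illustration):

```python
from itertools import accumulate

# Monthly sums, as returned by the in-database Sum([sales]) aggregation
# (hypothetical values).
monthly_sales = [120.0, 95.0, 140.0, 110.0]

# Post-aggregation step, analogous to
# Sum([Value]) OVER (AllPrevious([Axis.X])):
# each bar becomes the running total of itself and all previous bars.
cumulative = list(accumulate(monthly_sales))

print(cumulative)  # [120.0, 215.0, 355.0, 465.0]
```

    Note how cheap the second step is: it touches only the handful of aggregated rows, never the underlying transactions.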

    Now that we understand how the combination of hybrid in-database and in-memory aggregations works, we can apply all sorts of other post-aggregation expressions to the value axis of the bar chart - moving averages, year-on-year differences, etc. You can even apply post-aggregation expressions to streaming (live) data!
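    For instance, a trailing three-month moving average over the same kind of monthly sums could be sketched like this (the window size and values are illustrative; in Spotfire itself you would pick the moving average from the axis selector, which builds the corresponding OVER expression for you):

```python
# Monthly sums from the in-database aggregation (hypothetical values).
monthly_sales = [120.0, 95.0, 140.0, 110.0]

# Trailing three-month moving average, with the window shrinking at the
# start of the series - broadly the behaviour of a "last n periods"
# post-aggregation over the value axis.
moving_avg = []
for i in range(len(monthly_sales)):
    window = monthly_sales[max(0, i - 2): i + 1]
    moving_avg.append(sum(window) / len(window))

print(moving_avg)
```

    Again, the calculation runs over the few aggregated rows held in memory, so it stays fast regardless of how many raw transactions sit behind them.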

    It's critically important to understand that post-aggregation expressions can only be used on the value axis of a visualization, and even then only if the x-axis is categorical (or binned). That's something I hadn't realized myself in the past!

    To summarize:

    Spotfire gives us the ability to perform hybrid in-database and in-memory aggregations. We need to use in-database aggregations when working with large data volumes. When we do this, we can then operate on the data returned by the database using Spotfire's hugely powerful expression language. By combining in-database and in-memory aggregations, we get the best of all worlds: fast performance on large data, small volumes of data being transferred, and powerful, flexible calculations.

    The TIBCO Community page on the book is a great starting place for further exploration of topics covered in the book. Another related and very useful page on the community is: TIBCO Spotfire® Enablement Hub - this is a launchpad for all sorts of great stuff! 

    This is the second of a series of blog articles on the book - watch this space for more excerpts and related concepts. I'll be picking various topics and discussing them - providing introductions to those topics in some cases; expanding on the content in the book in others.

    Pick up a copy of the book today from Amazon



