Introduction
Web-based visual analytics with streaming data allows organizations to easily and cost-effectively distribute real-time dashboards and analytical applications to a large number of users. This increases awareness of the current state of operations, helps identify opportunities and problems sooner, and enables faster action based on more recent information. Overall, visual analytics with streaming data provides a new opportunity for optimizing operations in businesses such as logistics, transportation, financial services, energy and manufacturing. But to do this, you must be able to cost-effectively put real-time information in the hands of business users. Spotfire is known for giving business users self-service dashboards and visual analytics, and since Spotfire 10.0 this includes streaming data. Since version 10.6, Spotfire also provides the ability to scale out real-time dashboards and applications to large numbers of users through their web browsers. Visual analytics for streaming data in Spotfire web clients is designed to scale to a large number of users, up to thousands of users. Exactly how it scales depends on a number of factors, such as the data update rate, the visualizations used and many other factors outlined later in this document. The document also provides some guidelines on how to appropriately size machines running the Spotfire web player with streaming data.
Further, the document provides an overview of the system and how different features of a Spotfire dashboard affect performance and resource usage. We also give some guidelines and best practices that you can take into account when you design a Spotfire dashboard.
We also give some advice on what measurements you can make to estimate the sizing for a specific dashboard and how to interpret the results.
Finally, we describe the results from some indicative measurements that we have made in our lab. Please note that if you intend to size a system, Spotfire recommends that you do profiling with your specific data. The measurements described in this document are to be seen as examples only, and may not reflect behavior in your environment, usage patterns, or data.
System Overview
The following diagram shows an overview of Spotfire for streaming on the web player.
To use Spotfire with streaming you will need an instance of Spotfire Data Streams (hereafter abbreviated SFDS).
SFDS can receive events from a large set of sources. The events are processed and the result is eventually stored in data tables inside SFDS. From the Spotfire web player's perspective, SFDS is an external database just like any other database that Spotfire can connect to. A streaming visualization in Spotfire will generate a LiveQL query (LiveQL is an SQL-like streaming query language) which it sends to SFDS. The result of that query is rendered in the visualization.
This means that most of the computations on the streaming data are performed by SFDS. However, the rendering of visualizations is done by Spotfire. Some dashboards will put most of the load on Spotfire and only a modest load on SFDS, while others may put a lot of load on SFDS. It depends on several factors and we will return to this topic throughout the document.
It is also worth pointing out that Spotfire and Spotfire Data Streams are independent systems. You may have several instances of SFDS that contain streaming data from different parts of your business, and you may have several instances of Spotfire that serve different geographies, business units or use cases. This document briefly mentions some considerations for how different Spotfire features and usage patterns affect the load on SFDS, but it does not cover sizing of SFDS.
Continuous Queries
A key difference between SFDS and ordinary databases or data warehouses is the support for continuous queries (a.k.a. streaming queries).
When a continuous query is issued, for example when a visualization is configured, SFDS will first compute an initial result based on the current data in the SFDS table. The initial result is returned to and rendered by Spotfire. When new data from the data source streams into SFDS, it will incrementally compute a new result of the query and send an incremental update of the result to Spotfire, if needed. When Spotfire receives an incremental update for a query, it will by default trigger a rendering of the visualization.
The size of a query result corresponds to the number of visual elements in the Spotfire visualization. For example, for a bar chart with five bars, the initial result will just contain the heights of the five bars and their labels. Thus the size of the query result may be very small even if the amount of data that is stored in SFDS is large. Also, the size of the incremental updates corresponds to the number of visual elements that are changed. Thus the cost in Spotfire is independent of the amount of data that is stored in SFDS.
The cost in SFDS of computing the initial result and the incremental results of a query depends on a number of factors. Naturally, it depends on the amount of data that is stored in the SFDS data table, how often it is updated and the configuration used. But it also depends on the form and complexity of the query.
Query Sharing in Spotfire web player
If several users view the same data (i.e. they use the same filtering or the same limiting marking), then they will require identical queries. The Spotfire web player will detect that and will not issue two identical queries to SFDS. However, this assumes that the users have the same credentials for SFDS (i.e. use a common service account or no authentication). We will discuss this more in the section Query sharing.
SFDS will also detect if it receives two identical queries and share the recurring computation of the query result.
Controlling update frequency
SFDS periodically computes and, by default, sends query updates every second, if needed. There is one exception: a Spotfire table visualization displays the underlying rows in the SFDS data table. To do that it uses a so-called unaggregated query. The updates for unaggregated queries are sent from SFDS to Spotfire immediately.
By default, Spotfire will render a visualization (if needed) once every second. However, it is possible to reduce the rendering frequency in Spotfire so that visualizations are rendered less often. That can make a huge difference in the amount of CPU that the web player consumes. Thus, if your use case does not require updates every second, you should definitely consider reducing the update frequency.
Finally, it is also possible to configure SFDS to compute and send updates less often. That will have the effect that Spotfire renders less frequently, but it will also lead to spikes in web player CPU usage. Make sure to read the section Reducing frequency so that you understand the consequences before you use that approach to reduce the resource needs of SFDS.
Rendering visualizations
The cost of rendering depends on how often visualizations are rendered, but it naturally also depends on the number of visualizations and the number of visual elements in the visualizations.
Spotfire only renders visualizations that are visible. Thus, to reduce the cost of rendering, you may want to avoid placing many visualizations on the same page and instead split the dashboard into more pages with fewer visualizations on each page, if this is acceptable for your use case.
It is also important to note that Spotfire will not render visualizations if the browser tab is not visible to the user. Thus, when you estimate the number of simultaneous users, you should estimate how many users will simultaneously have the dashboard visible in their browser.
Detecting whether a browser tab is visible to the user is based on standardized browser APIs that signal if the browser tab is not active, if the browser window is minimized, or if the user locks the screen. Unfortunately, the current major browsers do not correctly signal that the computer is locked. Also note that if a user has an old browser, these APIs may not be supported.
It is also worth noting that Spotfire will only issue queries towards SFDS for visualizations that are currently visible.
Finally, note that Spotfire does not share the rendering of a visualization between multiple users of the same dashboard, even if the users look at identical data (i.e. use the same filtering).
User interactions
Spotfire analyses are by default very interactive, allowing end-user marking, drill-down and filtering. When a user interacts with Spotfire, the load on the web player naturally increases. However, that cost will typically be dwarfed by the cost of rendering the streaming updates.
Switching pages
For example, if the user switches to a new page, then Spotfire will need to render all the visualizations on that page. However, Spotfire will render all those visualizations for every streaming update anyway.
The situation is different for SFDS. When a user switches to a new page, Spotfire will issue the queries for the visualizations on that page. SFDS will need to compile those queries and compute their initial results. Compiling a query and computing the initial result can be costly, and it affects the time it takes for a visualization to be rendered after switching to another page.
SFDS keeps a cache of compiled queries, so if a user switches back and forth between pages, there is no repeated cost for compiling the queries.
Filtering and marking
Similarly, when a user filters or drills down using marking, then all the affected visualizations will issue new queries. Again, this primarily increases the load on SFDS and does not affect the web player that much.
Filtering and drill-down also have another very important implication. If each user filters or marks differently, then this leads to different queries for each user. The effect for the web player is not that dramatic; it leads to more open queries towards SFDS, but the amount of rendering will not be affected. However, it may affect the load on SFDS dramatically. It is no longer likely that a query can be shared between users or that a query will be found in the query cache. Thus, if there is extensive use of filtering or drill-down, the number of queries is likely to be proportional to the number of users, and you need to take the number of users into account when you size SFDS. In contrast, if all users view the same data, then a small instance of SFDS can handle many users.
Rendering Filters
Filters also issue queries in order to render the values in the filter. For example, a list box filter queries SFDS for the values in that column. A filter will need to be rendered whenever the result of its query changes.
It is often the case that the values of a column do not change. Consider, for example, a list box filter which shows the products from a table of sales transactions. Even though the table is updated frequently with new sales transactions, the set of products will change rarely. Thus you can neglect the cost of such a filter.
In contrast, if you have a range filter, then the range filter will issue a query for the minimum and maximum value in the column. Suppose that column contains the timestamps for the sales transactions. The maximum value for that column is likely to change whenever a new sales transaction arrives and thus the range filter will need to be rendered frequently.
Brush linking
Brush linking is when a marking in one visualization is rendered as a marking in another visualization. It is a great analytical feature since it helps you understand relationships in your data.
For example, suppose that a table of sales transactions has information about which product the transactions concern and the business unit that is responsible for the sold product. If you mark a product in one visualization that shows the sales for each product, then the business unit that is responsible for the product can be shown as marked in another visualization for the sales of each business unit.
This requires that Spotfire sends a query to SFDS. There is naturally a cost for that query itself, but the query will also trigger a rendering if the result of the query changes. In the example above, the query result is not likely to change since the responsibility of a product rarely changes.
Whenever you use brush linking, you should consider whether it is actually needed. Brush linking is on by default, so dashboards may contain it even though it may not be strictly necessary.
Sizing Spotfire
In order to give end users of streaming visual analytics applications and dashboards a good user experience, it is important to size the system and its components appropriately. This section gives some advice on how to size Spotfire so that it scales to the estimated number of users. Note that sizing these kinds of systems is a complex task. In practice, you may have to make an initial sizing, measure, and then adapt the system as the measurements indicate.
Also note that this advice is primarily for streaming data. If your dashboard uses a combination of streaming and non-streaming data, then you also need to account for the requirements of the non-streaming data.
The following diagram shows a simplified overview of the different components of a Spotfire deployment with SFDS.
You can size Spotfire by controlling the number of Spotfire servers (TSS1...TSSn) and the number of Spotfire web player service instances (WP1...WPn).
Technically, a web player instance is similar to the Spotfire Analyst client (the installed Windows client), but it can handle many users and analyses. The computationally intensive work, including the rendering, takes place in the web player. In general, analyses using streaming data will use more CPU and less memory than analyses using in-memory data. See the section Memory vs CPU.
All traffic between the end user’s browser and a web player goes through a TSS. Note that an analysis with streaming visualizations generates much more network traffic than an ordinary Spotfire analysis. That traffic will put more load on the TSS. However, the web player is more likely to be the bottleneck, so we focus on the web player in the rest of this section.
When a user opens an analysis, the user is routed to a specific web player instance. The details of the routing algorithm and the ways to configure it are beyond the scope of this document (see the Spotfire documentation for more information). However, we note that it is important that the routing works well so that the load is evenly distributed over the web players.
Estimating usage
The most difficult thing when sizing a system before deployment is often estimating how much, and in what way, the system will be used.
An important thing to keep in mind when you estimate the usage is that Spotfire will only render visualizations that are visible. In particular, no rendering is done for users whose browser window is currently not visible. See the section Rendering visualizations.
However, when you calculate how much memory the web player will need, it is important to also include users that temporarily do not have the browser tab visible.
Another thing to keep in mind is that the different pages of the dashboard may put different loads on the system. Thus, to be accurate, you need to estimate how much the different pages will be used and take that into account.
It may also matter what the user marks or filters because that may affect the number of visual elements in the visualizations.
Hardware for the web players
Memory vs CPU
As a rule of thumb, streaming visualizations consume relatively more CPU and less memory than visualizations based on data that is imported into the Spotfire data engine. The reason for this is twofold. Firstly, imported data consumes memory. Secondly, the rendering that is induced by streaming data updates consumes more CPU. Thus it is very likely that a single web player serving analyses with streaming data that updates frequently will be able to handle fewer analyses than a web player serving analyses using in-memory data, and will therefore need less memory.
However, note that by reducing the update frequency of the streaming data, a single web player may be able to handle a large number of users, and in such cases memory may still become the bottleneck. An example of this can be found in our measurements. See the section Running low on memory.
Also recall that the visualizations are only rendered if the browser window is visible (see the Section Rendering visualizations), but they still use up memory in the web player. Thus if most users do not have the browser tab visible at all times, then the web player may be able to handle the CPU load for a large number of users. Again memory may become the bottleneck.
Many cores vs few cores
An important decision when you decide which hardware to use is whether you should buy powerful machines with many cores or more machines with fewer cores.
If you have an analysis which imports a large amount of data, then it is usually a good idea to put that analysis under scheduled updates. If you put an analysis under scheduled updates then all users of that analysis will be able to share the imported data provided that they are routed to the same web player. Thus it may be a good idea to have a powerful machine with many cores that can handle as many users as possible.
However, there are also a number of drawbacks if you run a single web player on a powerful machine with many cores.
If a single web player handles many users, then the .NET heap may become very large. Occasionally the memory manager may need to pause execution to garbage collect that large heap. These pauses tend to be longer (but occur less frequently) when the heap is larger. The pauses can be more noticeable with streaming data because the user expects periodic updates.
Another potential problem when using a powerful machine is that it can lead to contention. Contention means that you cannot utilize the CPU fully even though there is work to be done. The reason behind this is, simply put, that some threads cannot execute because they are waiting for a lock that is currently held by another thread. If you have many cores, then this is much more likely to be a problem.
Finally, machines with many cores are often so called NUMA machines. Simply put, it means that the memory is partitioned so that some parts of the memory can be accessed quickly by some cores but slowly by others. This particularly affects locks because they need to do synchronized memory access. The effect is that a single web player sometimes cannot effectively use all cores.
We have made some measurements in the section Contention which illustrate that the latter two problems can have a large impact on the number of users that a web player can handle. We strongly recommend that if you have a machine with many cores, you run several web player instances on that machine in order to minimize those issues. Naturally, some of the advantages of these powerful machines, such as data sharing between users, go away if you run multiple web players on the same machine. Thus, in general, we believe that it is most often better to buy several less powerful machines than a single very powerful one.
Sizing the web player
If you need to plan how much web player hardware you will need for a particular dashboard, we recommend that you take the following steps.
First you need to make a preliminary decision on which hardware you will use.
Secondly, you will need to measure how many users of the dashboard one such web player machine can handle before the CPU is exhausted.
Note that you most likely cannot accurately estimate how many resources will be needed without actually measuring on your specific dashboard. You can learn from the measurements that we report on in this document, and draw some conclusions based on that, but you cannot expect them to accurately reflect the performance in your use case because there are so many factors that affect it, such as data and dashboard characteristics and usage patterns.
After you have measured, you can then simply calculate how many web players you need, based on the estimated usage.
You may also want to adjust the amount of memory that each web player machine should have based on the results. Note that when you decide on how much memory you will need, you should base it on how many users will have the analysis open simultaneously, regardless of whether the browser tab is visible or not.
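As a minimal sketch of this calculation, the Python snippet below turns a measured per-machine capacity into a machine count and a rough memory figure. All numbers are placeholders (assumptions for illustration only); replace them with your own measurements and usage estimates.

```python
import math

# Placeholder assumptions - replace with your own measurements and estimates:
measured_users_per_web_player = 45   # users one machine handled at ~80% CPU in your test
expected_visible_users = 300         # users expected to have the dashboard visible at peak
expected_open_users = 500            # users expected to have the analysis open (visible or hidden)
memory_per_open_document_gb = 0.2    # rough per-user memory footprint observed in your test

# Number of web player machines needed for the CPU load (driven by visible users).
web_players_needed = math.ceil(expected_visible_users / measured_users_per_web_player)

# Rough memory needed per machine, based on all open documents, not only visible ones.
open_users_per_machine = math.ceil(expected_open_users / web_players_needed)
memory_needed_per_machine_gb = open_users_per_machine * memory_per_open_document_gb

print(f"Web player machines needed: {web_players_needed}")
print(f"Approx. memory needed per machine: {memory_needed_per_machine_gb:.1f} GB (plus OS and service overhead)")
```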
Simulating streaming data
The best approach is to use real production data for your measurements. An important thing to consider is that the events that update the data in SFDS are not necessarily evenly distributed throughout the day. Thus the time of day at which you do the measurements may be important.
If you don't have production data and need to simulate the data, then it is important that the visualizations contain roughly the same amount of visual elements as you expect them to do with production data.
It is also important that the data is updated so that the visualizations are rendered as often as you expect them to do for production data. Note that the visualizations are not necessarily rendered just because the data in the SFDS table is updated since it might not affect the query for the visualization.
Simulating users
In order to simulate users you may simply open many instances of the dashboard manually, as long as you keep the browser windows visible; there are also tools that can be used for this purpose. However, keep in mind that if you run many browser windows on the same machine, manually or using a tool, the browser machine may become the bottleneck. The web player will notice if the clients cannot receive the rendered results and will skip rendering some updates of a visualization if the browser cannot process the results fast enough. The effect is that the web player will use less CPU, and the result of such a measurement may therefore be misleading. It is important that you validate the results using some key performance counters. We will discuss that further in the section Validating the measurements.
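If you automate the user simulation, one possible approach is to drive real browsers with a tool such as Selenium. The sketch below assumes Python, Selenium and Chrome; the dashboard URL, the number of users per machine and the ramp timings are placeholders, authentication handling is omitted, and you still need to verify with the performance counters that the simulated browsers keep up.

```python
import random
import time
from selenium import webdriver  # pip install selenium; requires a matching browser driver

DASHBOARD_URL = "https://spotfire.example.com/spotfire/wp/analysis?file=/Streaming/SalesDashboard"  # placeholder
USERS_ON_THIS_MACHINE = 10      # keep this low enough that the browsers do not become the bottleneck
RAMP_UP_SECONDS = 15 * 60       # spread the starts uniformly over the ramp-up window
HOLD_SECONDS = 25 * 60          # keep the load constant before ramping down

drivers = []
start_offsets = sorted(random.uniform(0, RAMP_UP_SECONDS) for _ in range(USERS_ON_THIS_MACHINE))

test_start = time.time()
for offset in start_offsets:
    # Wait until this simulated user's randomly chosen start time.
    time.sleep(max(0.0, test_start + offset - time.time()))
    driver = webdriver.Chrome()
    driver.get(DASHBOARD_URL)   # the browser window must stay visible, or the web player skips rendering
    drivers.append(driver)

# Hold the load for a while, then ramp down by closing the browsers.
time.sleep(HOLD_SECONDS)
for driver in drivers:
    driver.quit()
```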
It may be tricky to simulate user interactions such as filtering and marking. We believe that for streaming data it is most often possible to size the web player machines without taking any user interactions into account, simply because the cost of rendering dominates the CPU usage. Note that this may not necessarily be true if you limit the update frequency so that the visualizations render rarely.
It is necessary to take into account that the different pages may require different amounts of CPU. You can handle that simply by running separate tests for the separate pages of the dashboard. You just need to save different versions of the dashboard with different active pages. You will then need to somehow average the results based on an estimate of which pages are most frequently used.
If the dashboard has a details visualization, then you need to make sure to mark something so that the details visualization is not empty. You may also want to make sure that you apply a typical filtering.
When you simulate users, it may complicate matters if you have to use multiple user accounts with different credentials. You can work around that if you use the same SFDS user account for all users or set up an SFDS test server with no authentication. However, if you run the tests in that way it may affect the outcome since the web player query sharing will kick in. Query sharing typically does not have dramatic effects on the web player. See the section Query sharing for an example. However, if the intention is that all users in production will use different accounts, then we recommend that you turn off query sharing when you do the measurements.
Another aspect to take into account is if you expect that most users will mark and filter differently. Even though the rendering typically dominates, so that the user interaction itself can be ignored, it will have the effect that the web players will have more unique queries towards SFDS to handle. Those additional queries add an overhead whenever a new incremental update for a query is received by the web player. You can simulate the additional queries by turning off query sharing. Then each visualization instance will have its own query even though it is identical to the corresponding query for another visualization instance. That will create a similar load on the web player as the multiple different queries that you can expect from users marking or filtering differently.
Finally, we recommend that you ramp up the number of users in a controlled and reasonable way during your simulations, especially if you make measurements using more than one web player machine. If you ramp up the number of users too quickly, it may have the effect that the users are not routed evenly. The reason is that the routing is affected by the CPU usage, and it may take a few seconds before an analysis consumes CPU since it takes some time for the initial query results to arrive.
We also recommend that once you have ramped the users up to the desired level, you keep that level for a while before you ramp down the number of users. To be able to draw firm conclusions from the measurements, we advise you to first run a test with a relatively small number of users and validate the results. You can then estimate roughly how many users you will need to get to about 80% CPU usage.
You should then rerun the tests and ramp the number of users to that level and keep it there for a while. You can then validate that you get a steady state of about 80% CPU when the number of users is constant. You can then use this number when you calculate how many web players that you will need.
Do not try to find the number of users that will lead to 100% CPU usage. The system will not perform well when it is close to 100% CPU.
Validating the measurements
It is important that you validate the results of the measurements using Spotfire performance counters, especially if you simulate users automatically. Otherwise you may draw conclusions based on flawed measurements.
We will give an overview in this section, but we will return to this in more detail in the sections about our measurements and show examples of what the performance counters should and shouldn’t look like.
Logging the relevant web player performance counters
A small set of performance counters is logged by default every 120 seconds to a service log on the web player machine, named PerformanceCounterLog.txt. For the kind of analysis we want to do, more detailed logging is required. It is possible to enable the DEBUG level for the performance counter log, which will give us what we need (see the Spotfire documentation for more information). A much larger set of counters will then be logged, every 15 seconds by default. Both the set of counters and the frequency are configurable.
The performance counter log files can be imported into Spotfire directly. If more than one file is needed, just import the first one and then add the others to the same table. Analyzing the counters in Spotfire is easily done with, for example, a scatter plot that is trellised on counter category, counter name and counter instance. Make sure to use multiple scales on the Y-axis.
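If you prefer to sanity-check the counters with a script instead of (or in addition to) importing them into Spotfire, a sketch along these lines may help. It assumes Python with pandas; the separator and column names are assumptions about the PerformanceCounterLog.txt layout, so adjust them to match your files, and likewise the number of visualizations per page.

```python
import glob
import pandas as pd

# Assumed layout of the counter log: delimited columns for timestamp, counter category,
# counter name, counter instance and value. Adjust the separator and names to your files.
frames = [
    pd.read_csv(path, sep=";", names=["timestamp", "category", "counter", "instance", "value"],
                parse_dates=["timestamp"])
    for path in glob.glob("PerformanceCounterLog*.txt")
]
log = pd.concat(frames, ignore_index=True)

def series(counter_name):
    rows = log[log["counter"] == counter_name]
    return rows.set_index("timestamp")["value"].sort_index()

open_docs = series("# open documents")
updates = series("# streaming property updates per second")
renders = series("Visualization renderings per second")

# For simple visualizations whose query result changes every second, the expected
# number of updates per second is roughly visualizations-per-page * open documents.
VISUALIZATIONS_PER_PAGE = 4  # adjust to your dashboard
expected_updates = VISUALIZATIONS_PER_PAGE * open_docs

print("Peak open documents:", open_docs.max())
print("Peak updates/s (measured vs expected):", updates.max(), expected_updates.max())
# If renderings per second falls clearly below updates per second at high load,
# the web player (or the simulated browsers) cannot keep up.
print("Peak renderings/s:", renders.max())
```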
Open documents
An important thing to validate is that the expected number of documents was actually opened. You can do that by looking at the performance counter "# open documents". You should verify that it has ramped up to the expected number of documents and stayed there for a while before it ramped down as expected.
If you have multiple web player services serving the same analysis, make sure to verify that the analyses are distributed between the services as expected. If not, you may be ramping up too fast.
Updates per second
Another thing that you should verify is the counter "# streaming property updates per second". This is a measure of how many streaming updates Spotfire takes into account. If you reduce the update frequency using throttling in Spotfire (see the section Reducing frequency), then this number should be reduced.
You should verify that the value of the counter is proportional to the number of users. Thus it should grow linearly if you scale the number of users linearly. If the machine becomes overloaded, then you may see that the counter does not grow linearly when you increase the number of users.
The value of the counter may vary quite a lot over time if you measure on actual production data. However, if you simulate the data and it actually updates every second, then you can calculate what you expect the value to be and check that it has the expected value.
Simple visualizations will generate one query. If the result of that query is updated every second, then the value of the counter should be the number of visualizations on the page multiplied by the number of users.
Note that the map chart generates multiple queries and that filters and brush linking also generate queries, so it is not always possible to estimate the value of the counter easily.
Renderings per second
Another counter that you should verify is "Visualization renderings per second". You should verify that the value of the counter is proportional to the number of users. Thus it should grow linearly if you scale the number of users linearly. If the machine becomes overloaded, then you will see that the counter does not grow linearly when you increase the number of users.
The counter often corresponds directly to "# streaming property updates per second" if the web player is not overloaded. If the web player gets overloaded, it will skip renderings and this counter will fall behind "# streaming property updates per second".
Note that if you use a filter then "Visualization renderings per second" may not directly correspond to "# streaming property updates per second" because the query for rendering the filter affects "# streaming property updates per second", but it does not lead to a visualization rendering.
CPU load
The counter "Web Player average processor %" shows the amount of CPU that the web player process consumes and the counter "Total average processor %" shows the total amount of CPU that is consumed on that machine. These counters should normally grow linearly with the 3 number of users. It may be the case that they grow super linear when the CPU load gets close to a 100%.
It can be the case that the value of the counter grows sublinear and never gets close to a 100% CPU even if the number of users is increased. If that is the case, you will see that "Visualization renderings per second" does not grow as expected when the number of users increase. This is a sign of contention and often happens on powerful machine with many cores. If this happens, then you should try to increase the number of web player services that run on that machine.
Available Memory
You may also want to look at the counter "Available Bytes". If this counter becomes too low, you have too little memory. When available memory drops to around 15%, Spotfire starts to page out data (possibly while paging in data that is actually needed at the same time), and when it gets even lower, Windows will start paging and the system will work very poorly. In those cases the system will not be able to perform well and you cannot draw any accurate conclusions about CPU usage.
You can also use this counter to estimate the amount of memory that you will need. When you do your simulations, then we recommend that all users have their browser tab visible. However, in real life you will likely have users who temporarily have their browser tab hidden. You want to make sure that there is enough memory available for those users.
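A minimal sketch of that estimate, assuming you have recorded "Available Bytes" (here converted to GB) at two known user counts during your test; all numbers below are placeholders:

```python
# Placeholders: available memory (in GB) observed at two different user counts during the test.
available_gb_at_low = 12.0
available_gb_at_high = 8.0
users_low = 10
users_high = 45

total_machine_gb = 16.0
reserve_fraction = 0.15  # keep headroom above the level where paging starts

# Approximate per-user memory cost, assuming roughly linear growth with open documents.
gb_per_user = (available_gb_at_low - available_gb_at_high) / (users_high - users_low)

usable_gb = available_gb_at_high - total_machine_gb * reserve_fraction
extra_users_memory_allows = int(usable_gb / gb_per_user)

print(f"Approx. memory per open document: {gb_per_user:.2f} GB")
print(f"Additional open (possibly hidden) documents the remaining memory allows: {extra_users_memory_allows}")
```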
Spotfire Server (TSS) Hardware
In general, in web player scenarios, each interaction generates requests to and from a web player that pass through the TSS instances. In a streaming scenario, each update can be seen as an interaction, thus increasing the traffic that the TSS instances need to handle. This traffic mainly consumes network and CPU, some memory, and generates few or no additional database queries to the TSS database.
TSS database queries related to the web player are typically performed when a user logs in, opens and saves an analysis file, but since the database is such a crucial part of the system, it is important to monitor how increased traffic affects the database.
Above a certain threshold, the TSS typically doesn’t make effective use of more memory for web player traffic, so when the concurrent traffic increases, the number of cores and CPU single thread performance become more important.
Similarly, it is usually better to increase the number of TSS instances and have a load balancer that the users connect through, than to have one machine with "too many" cores, e.g. more than 16, if such a machine comes with a large amount of memory that is not used much by web player traffic, as discussed above.
If the expected or future traffic is to a large extent unknown or hard to predict, it is typically a good idea to have a load balancer anyway, to make it easier to add another TSS instance when needed. There are other good reasons for having a load balancer or reverse proxy in front of the TSS instances as well, apart from being able to change the number of instances or the hardware more easily, but that is not discussed in this document.
Important TSS performance counters and metrics to look at when validating the measurements
- MEMORY::Memory.HeapMemoryUsage.used — the memory used by the TSS JVM.
- OS::OperatingSystem.ProcessCpuLoad and OS::OperatingSystem.SystemCpuLoad — the CPU load of the TSS process and of the whole system, respectively.
- SpotfireServer::DataSource.CurrentActiveCount::server.default — the number of currently active database connections to the TSS database. A higher sustained number can point to long-running jobs that hold a connection, or to a lot of activity that uses the database; as discussed above, this is not typically caused by web player traffic corresponding to interactions or streaming updates.
- WebServices::Requests.current::WebPlayer — the number of concurrent requests related to web player traffic.
Summary of Experimental Results
This section contains the results of some measurements that we have done. We provide them here to illustrate and support some of the advice in this document. For sizing your system it is important that you make your own measurements using your own data, analyses, and so on.
The dashboard
We have made our measurements on a simple dashboard which has just one page with four visualizations.
The dashboard contains a line chart, a bar chart, a scatter plot and a cross table.
The dashboard shows simulated sales data. The simulation updates the underlying data in SFDS once every second by adding one row and removing another.
The dashboard has no filtering so the queries that are issued for the visualizations depend on all of the data in the simulation. Thus SFDS will incrementally compute a new result for all of the queries every second. The result is almost always different from the previous result, so in practice Spotfire will receive new results for all of the queries every second.
The number of visual elements in the visualizations varies just slightly over time depending on the data in SFDS. Thus we can expect that the CPU that is required to render a visualization is the same for every update.
All users will generate identical queries to SFDS since there is no filtering or marking. We used a service account in SFDS for all of the users, so the web player will share all of the queries between the users.
The intention with this dashboard is to provide stimuli for making streaming-related performance measurements, so it may not be representative of your application. For example, the dashboard has only one page and it does not use any imported data.
The hardware
TSS hardware
We used two TSS instances, running on servers with 16 GB RAM and 8 physical cores each (16 logical cores) on Windows Server 2019. As a load balancer, HAProxy 1.8 was used, running on a server with 4 physical/logical cores and 8 GB RAM on CentOS 7.7.
Web player hardware
We used two different kinds of machines for the web player in our experiments. These were machines that were available in our lab; we did not choose them because they were particularly suited, and this should not be taken as a recommendation to use similar machines.
We used 5 small machines. They each had 16 GB RAM, 4 physical cores (8 logical cores), and they were running Windows Server 2019. (Standard desktop machines with Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz)
We also used one large NUMA machine (a Dell PowerEdge R630 server with 2 x Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz). It had 256 GB RAM, 2x18 physical cores (72 logical cores), two NUMA nodes, and was running Windows Server 2019.
Note that the number of cores is not necessarily a good measure of the capacity of a machine. We ran an experiment on an old machine with 4 physical cores and it was considerably worse than the small machines that we used in the experiments that we summarize here.
Simulating users
In addition, we used several different kinds of machines to simulate the end users. Simulating the end users required more hardware than the TSS and web player themselves, since each simulated user actually runs a web browser just like a real user. In total we used a cluster of up to 21 machines, although these machines were generally weaker than the machines used for the TSS and the web player.
Summary of results
We summarize the results of our measurements in this section and provide more details in the following sections.
The bottlenecks in many of our simulations were the machines that simulated the end users. We were able to simulate enough users to overload the web player, both when we used five small machines and when we ran the web player on the single large machine. However, we were not able to put enough load on the system for the two TSSs to become overloaded.
One small machine
When we ran the web player on the small machine it was able to handle 45 simultaneous users having the analysis visible. We estimate that it has enough memory so that it can handle at least a hundred users having the browser tab hidden. There was no major difference between running one or two service instances on this machine. See the section The hardware for hardware details.
The large NUMA machine
We ran five instances of the web player on the large NUMA machine. The web player was then able to handle 200 users simultaneously having the analysis visible. We estimate that it has enough memory to handle several thousands of users having the browser tab hidden.
It is crucial to run multiple web players on the large NUMA machine. If we run just one web player, then it runs into contention problems and can only handle 75 simultaneous users.
We provide more details for the experiments in the section The large NUMA machine.
Five small machines
We also made an experiment to validate that it was possible to scale out by adding more web player machines. We used 5 small machines (same hardware as the one that was used earlier) which ran one web player each. The result was that it scaled perfectly. Each web player could still handle 45 users and the system as a whole could handle 225 users.
Reducing frequency
We made two experiments. We reduced the update frequency so that the visualizations updated every third second and every tenth second.
The result was that the web player could handle almost three times and ten times as many users respectively. However, there was one exception. The small machine could not handle ten times as many users because it ran out of memory. The following table shows the number of users that the web player can handle in different cases. For reference, we also include the results for the default setting of updating every second.
              | Every second | Every third second | Every tenth second
Small machine | 40 (45)      | 100 (135)          | Out of memory
Large machine | 180 (200)    | 540 (600)          | 1800 (2000)
Details of Experimental Results
In this section we will describe more details on some of the experimental results. We will explain some of the settings in more detail and we will illustrate how we validate the experiments using performance counters.
One Small machine
This section details the experiment where we ran one web player on a small machine that we describe in the section web player hardware. The result is that one small machine can handle 45 users. We will use this as an example of the methodology that we recommend in the section Validating the measurements.
The following graph shows the key performance counters that are described below.
We ramped up to 45 users over 15 minutes, kept a constant load for 25 minutes, and then ramped the users down over 15 minutes. The users are started at random points in time using a uniform random distribution, which is why the graph showing the number of users is not a perfectly straight line.
It is crucial to verify that the users have successfully opened the document by looking at the performance counter “# open documents”. It is still possible to draw conclusions about scalability even if a few users have failed to open the document so it does not have to be perfect.
It is also very important to verify the counter “# streaming property updates per second”. The dashboard has four visualizations which issue one query each. The simulation of the data ensures that the result of all queries will be updated every second. Thus, in this case we verified that there were approximately 4*45 updates per second when all users have started.
If the value of this counter does not match the expected value, then the root cause may be that the browser tabs are not visible, that a web socket connection has not been set up properly, or that the machines that run the browsers for the simulated users are overloaded.
Note that the value of the performance counter is oscillating somewhat. The reason for that is that the value of the performance counter is sampled every 15 seconds. The updates come in bursts and if the performance counter is sampled in the middle of a burst then it will naturally vary.
The next thing we verified was the performance counter for “Visualization renderings per second”. It should scale proportionally to the number of users. For this dashboard we could verify that it matched “# streaming property updates per second” since the dashboard has no filters or brush linking.
If “Visualization renderings per second” does not scale as expected, then it is most likely because the web player cannot handle the amount of users. However, it can also be the case that the machines that run the browsers for the simulated users are overloaded.
Since these performance counters looked as expected, we concluded that the simulation ran correctly and that the system was able to cope with that number of users. The next thing we checked was how much CPU was used. We recommend that the CPU usage does not exceed 90%.
If the amount of CPU is low then you should increase the load by simulating more users. It is usually possible to guess roughly how many users that the system will be able to handle by simply assuming that the CPU load will scale linearly with the number of users. However, this is not always the case because of contention so we recommend that you actually increase the load rather than guessing. We will show an example of contention and the remedies in the section Contention. It may also be the case that the system has too little memory to handle more users. We will show an example of that in the section Running low on memory.
The counter “Available Bytes” shows how much memory is available on the machine. In this case we estimate that the machine has enough memory to handle at least a hundred users (though the CPU can only handle 45 of them having the browser tab visible at the same time).
The large NUMA machine
This section details the experiment where we ran the web player on the large NUMA machine that we describe in the section web player hardware.
Contention
One key result is that there are serious contention problems when running one web player on that large machine. We instead ran 5 web players on the same machine which eliminated the contention problem. The machine could then handle 200 users.
The following graph with the key performance counters, illustrates the contention problem that we saw when running just one web player on the large NUMA machine.
The simulated users are able to open the document correctly and the counter “# streaming property updates per second” looks good too.
However, the counter for renderings clearly does not scale proportionally with the number of users. The CPU counters show that CPU usage does not grow proportionally with the number of users either, yet the CPU usage continues to grow even though the number of renderings actually drops with increased load. This pattern is typical of a contention issue.
The workaround for a contention issue is to run more web players on the same machine. The following graph shows one of the web player instances when we ran 5 web players with 200 users.
Note that the counter “Total average processor %” shows the total CPU usage on the machine and “Web Player average processor %” shows the usage for the process of a particular web player instance. Note that it is important to verify that all the web player instances work as expected and that the number of open documents sums up to the expected number of users.
One key thing to keep an eye on is that the users are routed so that each of the web player instances has roughly an equal number of users. If the users are not evenly distributed, then the contention problem may show up on web players that get to handle too many users.
The routing is based on the load of the machine rather than the load that a particular web player instance handles, which may be problematic in this case. It is possible to work around it by modifying the threshold for when a machine is considered to be strained. We did this by setting an artificial limit on the amount of memory. The effect is that all the services are considered to be strained even when there is no load. That affects the routing so that users are routed evenly between the services.
Lowering the strained level actually makes sense in the streaming use case. With in-memory data it is beneficial to ensure that as many users as possible are routed to the same service to utilize the highly effective data sharing, but with streaming this is less important than the fact that spreading the users results in better CPU utilization. It is also possible to achieve this by adding the analysis to scheduled updates, but that may add an overhead.
During this test, where the large web player server was fully utilized in terms of CPU, the two TSS servers running on much smaller hardware managed the load without any problems. See the graph below when running 200 users with streaming updates every second. Running 10 times more users with updates every 10 seconds gives similar results.
Finally, we should mention that we tried to run more than 5 web players on the machine, but that had no additional positive effect. The best number of services depends on the hardware and especially on the number of cores, but also on the usage pattern so there is no general rule that works in all cases.
Reducing frequency
In this section we discuss the experiments to reduce the rendering frequency in Spotfire.
The setting is per data connection. Thus if you want visualizations from the same SFDS data table to update with different frequencies, then you need to create multiple data connections.
To change the setting for the frequency, you open the “Data Connection Properties” dialog in the “Data” menu. You then select the streaming data connection and press “Settings”. In the new dialog you select the tab “Streaming Data Settings”. You can then choose to limit the update frequency.
There is also a check box “Distribute the updates over the interval to balance load between users”. We strongly recommend that you keep this checkbox checked.
If you uncheck this option then Spotfire will render the visualization for all users at the same point in time. Thus if you reduce the update frequency to, say once every tenth second, then the load on the server will peak every tenth second. This may have dramatic effects if this dashboard is used by many users and it is the only dashboard that is handled by the web player.
The effect of the spikes in load is that, once every tenth second, Spotfire will be very busy. If you load the web player so that it consumes, say, 80% CPU, then it will be overloaded for 8 seconds and then idle for 2 seconds. This will lead to slow response times for any user interactions that take place at the spike in load. Another effect is that all the renderings at the spike compete for the same resources, so the renderings will be delayed by up to 8 seconds. Thus you should not uncheck this option unless you have a lot of spare resources on the web player, or if you run a mix of dashboards so that the spikes do not matter.
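As a back-of-the-envelope illustration of that effect, the small sketch below (with placeholder numbers) contrasts synchronized and distributed updates:

```python
update_interval_s = 10     # visualizations update once every tenth second
average_cpu_load = 0.8     # the web player averages 80% CPU over the interval

# If updates are NOT distributed, all rendering work for the interval arrives at once,
# so the server is saturated for roughly average_cpu_load * update_interval_s seconds.
busy_burst_s = average_cpu_load * update_interval_s
idle_s = update_interval_s - busy_burst_s
print(f"Synchronized updates: ~{busy_burst_s:.0f} s fully busy, then ~{idle_s:.0f} s idle each interval")

# If updates ARE distributed over the interval, the same work is spread out,
# so the CPU stays at roughly the average load the whole time.
print(f"Distributed updates: ~{average_cpu_load:.0%} CPU continuously, no long burst")
```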
If you keep the option checked, then the updates will be distributed at different points in time for different users. Thus if an important event takes place, then some users will see the result faster than others. However, it is fair in the sense that different users will see different important events first depending on when they take place.
This feature relies on the fact that Spotfire will receive updates from SFDS every second. It is possible to configure SFDS to compute and send updates less often. Thus if you configure SFDS to do that it may have serious impacts on the performance of the web player.
If the user filters or marks in a master visualization, then the web player will issue new queries to SFDS. The web player will immediately render the initial result when it receives it from SFDS so that the user will get a response as soon as possible. The web player will also immediately render all other streaming visualizations (from the same SFDS instance) if the rendering of the visualization has been put on hold because of the reduced rendering frequency. The reason for that is that the user would otherwise often see inconsistent visualizations that render data from different points in time. This feature also relies on the fact that Spotfire will receive updates from SFDS every second.
Running low on memory
In most use cases where streaming data is used, CPU is the limiting factor, but there can be cases where memory is the limiting factor that causes the system to fail to handle the load.
In one of the experiments where we reduced the rendering frequency so that the web player could handle more users, it turned out that the small web player machine had too little memory to handle all those users. See the blue line in the graph below; it indicates when the amount of available RAM went below the threshold where the services started to page out data to disk.
After some time (the red line), the data was needed and had to be paged back in. Since this heavy paging in and out consumes both CPU and disk, the service will not be able to keep up the rendering frequency, which becomes very clear once the paging-in has started. If it gets even worse, Windows will start to page out parts of the process.