  Splunk Cache Drop Alerting for Spotfire Scheduled Updates


    This article explains an alerting mechanism to notify your administration team when a Spotfire dashboard drops out of RAM cache so they can act on it promptly before users report it and before the issue cascades. The principles could also be applied to logging frameworks other than Splunk.

    This article has been kindly and expertly contributed by Spotfire user Paul Hallimond and reviewed by the Spotfire team. If you are a Spotfire user interested in contributing community articles about solutions that you developed that can help other Spotfire users, please contact community@spotfire.com and we will help publish your content!

    Problem Statement

    In Spotfire, you can create scheduled updates to cache dashboards in RAM so that very large dashboards containing millions of rows of data load quickly and are responsive for users. Some of these dashboards can take a long time to run and can cache 100 GB+ of data in RAM. From time to time a dashboard can drop out of RAM for various reasons, such as multiple refresh failures, memory exhaustion, or VM host failure.

    The issue is that end users often notice this before your administration team does. Worse, if users keep trying to open the dashboards, they generate additional query load on the backend data sources and on the Spotfire environment itself, which can cascade into a larger issue.

    Solution Description

    This article explains an alerting mechanism to notify your administration team when a dashboard drops out of RAM cache so they can act on it promptly before users report it and before the issue cascades. This is especially useful in large clustered environments where the cached documents are stored on multiple nodes.

    Solution Assumptions

    You have the Splunk Universal Forwarder installed on all web player nodes. However, the principles of this solution could also be applied to logging frameworks other than Splunk. This solution was created with the products below; details may vary with other versions as things change over time.

    • Spotfire 12.0 LTS
    • Splunk Enterprise 9.1.2

    Please use the Spotfire Community Forum and reference this article if you have any questions or if you have additional ideas to contribute.

    Solution Approach

    Step 1 - Identify how to detect the cache drop

    The foundation of the alert will be built using information in the Spotfire logs from the web player tier. In addition, a new log file is needed to identify all the dashboards that are currently running as a scheduled update.

    The existing log file is named DocumentCacheStatisticsLog, and the key field to use is referenceCount, which is a count of the concurrent open references to a document. For a document running successfully as a scheduled update, this value should always be greater than zero.

    The new custom log that is required, which we named ScheduledUpdates, was created by editing the existing log4net.config file and adding a new appender section for the logger Spotfire.Dxp.Web.Library.ScheduledUpdates. More details on how to edit the log4net.config configuration file can be found here:


    https://docs.tibco.com/pub/spotfire_server/latest/doc/html/TIB_sfire_server_tsas_admin_help/server/topics/customizing_the_service_logging_configuration.html


    The new section we added is shown below. If you have more than one node running the web player, you will need to make this change on each node.

    <!-- New Appender for Logger Spotfire.Dxp.Web.Library.ScheduledUpdates -->
    <appender name="ScheduledUpdateAppender" type="log4net.Appender.RollingFileAppender">
        <PreserveLogFileNameExtension value="true"/>
        <file type="log4net.Util.PatternString" value="..\..\logs\ScheduledUpdates%property{serviceIdWithPeriod}.txt"/>
        <appendToFile value="true"/>
        <rollingStyle value="Size"/>
        <maxSizeRollBackups value="4"/>
        <maximumFileSize value="500MB"/>
        <staticLogFileName value="false"/>
        <layout type="log4net.Layout.PatternLayout">
            <conversionPattern value="%-5level %date [%property{pid}, %thread, %property{user}] %logger - %message%newline"/>
        </layout>
    </appender>
    <logger name="Spotfire.Dxp.Web.Library.ScheduledUpdates" additivity="false">
        <appender-ref ref="ScheduledUpdateAppender"/>
        <level value="INFO"/>  <!-- Change INFO to DEBUG for DEBUG-level Scheduled Update logging -->
    </logger>


    After completing the changes, make sure the logs are being created as expected, especially the new custom log file.
    These changes are needed for each node in your environment's web player tier and require a service restart to take effect.

    Step 2 - Splunk Universal Forwarder Configuration

    The assumption is that you have worked with your Splunk administrator to set up the access and indexes you need to be able to proceed with the steps below.

    The Splunk Universal Forwarder is used to forward the logs from each web player node in an environment to a central Splunk location, and from there we can create our cache drop alerting.

    To set this up we need to make certain configuration changes as described below:

    The Splunk Universal Forwarder configuration can be found under: <Splunk Installation directory>\SplunkUniversalForwarder
    To allow a new application such as Spotfire to forward log files to Splunk, a new folder is needed under: <Splunk Installation directory>\SplunkUniversalForwarder\etc\apps
    So to add Spotfire to Splunk, we create a new folder there, for example 200_Spotfire, where we define our configuration files for sending Spotfire logs to Splunk.

    Next, you can copy the configuration files from another application to get started, but you will eventually have a local folder containing the various configuration files.
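
    For reference, with the example application name above, the key configuration file described next would end up at a path like the following (a sketch; your drive and folder names will differ):

        <Splunk Installation directory>\SplunkUniversalForwarder\etc\apps\200_Spotfire\local\inputs.conf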

    The key file we need to adjust is inputs.conf. This is where we define which log files to forward to Splunk. In our case, we want to send the files described above from each web player node. Note the wildcard in the paths below: when a log file grows large enough, Spotfire renames it and starts a new one.

    Your values will depend on your Spotfire and Splunk installation; below is an example.

    [monitor://E:\tibco\tsnm\12.0.8\nm\logs\DocumentCacheStatisticsLog.*.txt]
    disabled = false
    index = entrpt
    sourcetype = entrpt:spotfire:documentcachestatistics
    
    [monitor://E:\tibco\tsnm\12.0.8\nm\logs\ScheduledUpdates.*.txt]
    disabled = false
    index = entrpt
    sourcetype = entrpt:spotfire:scheduledupdates

    After these files have been set up, the Splunk Forwarder service needs to be restarted.
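
    For example, on a Windows web player node the forwarder can be restarted from an elevated command prompt (the path is a sketch; adjust it to your own installation directory):

        cd /d "<Splunk Installation directory>\SplunkUniversalForwarder\bin"
        splunk.exe restart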

    Step 3 - Log Ingestion Validation

    You will need to spend some time validating that the log files are being ingested correctly so you can be confident any reporting or alerting based on that data is accurate. Initially this will all be done in the Splunk UI.

    Splunk UI Validation:

    A good first step is to validate that each of the sourcetypes contains data and that, within those sourcetypes, you are seeing data for each of the Spotfire web players you are ingesting from.

    You can quickly determine whether data for each Spotfire node (the host field) is coming through into Splunk via the search UI. The host names shown are examples only.
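
    For example, a simple search along the lines of the sketch below (assuming the entrpt index and the sourcetypes defined in Step 2) returns the event count and the most recent event per host and sourcetype, which quickly shows whether every web player node is reporting:

    index=entrpt (sourcetype="entrpt:spotfire:documentcachestatistics" OR sourcetype="entrpt:spotfire:scheduledupdates")
    | stats count, latest(_time) as lastEvent by host, sourcetype
    | convert ctime(lastEvent)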

    (Screenshot: log ingestion validation in the Splunk search UI, showing data per host.)

    More information about basic searches in Splunk can be found here:

    https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchTutorial/Startsearching

    Spotfire Validation:

    Another approach is to extract some of the ingested data directly out of Splunk, load that extract together with the same data taken directly from the log file(s) into Spotfire, and build a quick comparison report.
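
    One way to pull such an extract out of Splunk is to reuse the field extraction from Step 4 and export the results as a CSV file from the search UI. A sketch of that search (the rex pattern is the same one used later in this article) might look like this:

    index=entrpt sourcetype="entrpt:spotfire:documentcachestatistics" earliest=-1d@d latest=-0d@d
    | rex field=_raw "(?<Level>[^;]+)\s;(?<HostName>[^;]+);(?<Timestamp>[^;]+);(?<UTCTimestamp>[^;]+);(?<uri_path>[^;]+);(?<ModifiedOn>[^;]+);(?<ReferenceCount>[^;]+);(?<InstanceId>[^;]+);(?<ServiceId>[^;]+)"
    | table Timestamp, HostName, uri_path, ReferenceCount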

    Below is an example of what that might look like.

    (Screenshot: Spotfire comparison report validating data ingested into Splunk against the source log files.)

    Step 4 - Creating the searches that will be used in the Cache Drop Alert

    Again, refer to the Splunk documentation for building searches:

    https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchTutorial/Startsearching

    For our example we are answering two questions.

    • Get a list of all dashboards that are scheduled
    index=entrpt sourcetype="entrpt:spotfire:scheduledupdates" earliest=-1d@d latest=-0d@d | rex field=_raw "(?<uri_path>/.*?)\s(?<rid>[a-z0-9\-]+)," | table uri_path

     

    • Of those dashboards, determine which have dropped out of cache

    This search uses a join so that it only matches dashboards that are running on a schedule, as identified by the search above.

    index=entrpt sourcetype="entrpt:spotfire:documentcachestatistics"
    | rex field=_raw "(?<Level>[^;]+)\s;(?<HostName>[^;]+);(?<Timestamp>[^;]+);(?<UTCTimestamp>[^;]+);(?<uri_path>[^;]+);(?<ModifiedOn>[^;]+);(?<ReferenceCount>[^;]+);(?<InstanceId>[^;]+);(?<ServiceId>[^;]+)"
    | stats avg(ReferenceCount) as avgReferenceCount by uri_path
    | search avgReferenceCount <=-1
    | join uri_path
        [search index=entrpt sourcetype="entrpt:spotfire:scheduledupdates" earliest=-1d@d latest=-0d@d
        | rex field=_raw "(?<uri_path>/.*?)\s(?<rid>[a-z0-9\-]+),"
        | table uri_path]
    | sort uri_path
    | dedup uri_path
    | table uri_path, avgReferenceCount


    Once you are satisfied with the search and the results you are getting, the next step is to build the alert.

    Step 5 - Creating an alert

    Assuming you have the permissions needed, the easiest way to create the alert is directly from the search screen.

    Details can be found here:

    https://docs.splunk.com/Documentation/SplunkCloud/latest/Alert/Definescheduledalerts

    Using the above guide you can create an alert that will notify your administration team via email, including the names of any dashboards that have dropped out of RAM cache.

    Below is an example of what that might look like.


    (Screenshots: Splunk alert settings and email trigger action configuration.)
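
    Behind the scenes, an alert created through the UI is stored as a stanza in savedsearches.conf. A minimal sketch of what that stanza might contain is shown below; the alert name, schedule, and email address are hypothetical examples, and the search setting would hold the full cache drop search from Step 4:

    [Spotfire Cache Drop Alert]
    # search = <the cache drop search from Step 4>
    enableSched = 1
    cron_schedule = 0 7 * * *
    dispatch.earliest_time = -1d@d
    dispatch.latest_time = now
    counttype = number of events
    relation = greater than
    quantity = 0
    action.email = 1
    # hypothetical distribution list for the administration team
    action.email.to = spotfire-admins@example.com
    action.email.include.results_link = 1
    action.email.sendresults = 1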
