Grid Monitoring Accelerator

Introduction

The Grid Monitoring Accelerator provides a reference architecture and code assets for monitoring and managing computational data grids. It uses rule processing and data science models to predict and alert on anomalies before they prevent a processing run from completing, giving operations staff the opportunity to intervene in time.

Download

The Grid Monitoring Accelerator can be downloaded from the Exchange.

Business Scenario

A data grid is a software architecture that allows for highly distributed processing. It is often applied where there are large amounts of data and computations can be broken down into small, individual units of work. The individual results are then aggregated to produce a final computed result. Data grids can be located on a single site with many physical or virtual machines, or distributed geographically. Monitoring and managing the performance of data grids is a complex problem.

Data grids are managed by supervising software, such as TIBCO GridServer®. The supervisor can capture telemetry about the performance of the individual engines, brokers, and drivers that compose the grid, and present this information for analytics purposes. This telemetry can provide insights into grid health and performance.

Concepts

The Accelerator was written specifically with TIBCO GridServer® as the data source, but its principles can apply to any generic grid supervisor, provided the data can be supplied in the correct format.

For TIBCO GridServer®, the following components are involved:

Grid Clients -- components that submit service requests into the grid, also known as Drivers
Engines -- processes that host and run services on grid nodes, the workers
Brokers -- components that provide request queuing, scheduling, and load-balancing, as well as Engine management
Directors -- components that assign Grid Clients to Brokers based on policies, such as the installed capabilities of the Broker's Engines and how busy the Engines are

The Accelerator captures telemetry from each of these components and transforms it into a standard data format. The data can then be viewed on live dashboards implemented using TIBCO Spotfire®. In addition, the Accelerator builds a task state model for each of the submitted tasks. There are 3 different task notifications used to determine state:

Task Submitted -- the task has been submitted to the grid for processing
Task Assigned -- the task has been allocated to an engine for execution
Task Completed -- the engine has completed executing the task

Under normal processing, these 3 events occur in sequence and in a timely manner. A gap between Submitted and Assigned means the task was queued because the grid was too busy to accept it at that time. Tasks can also be rescheduled or reassigned, both of which indicate non-optimal grid health.
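The task state model described above can be illustrated with a minimal Python sketch. It is not the Accelerator's implementation: the class name `TaskState`, the event names, the timestamp units (seconds), and the `max_delay_s` threshold are all hypothetical, chosen only to show how the three notifications could be combined into a per-task state with a queue delay and a reassignment count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskState:
    """Hypothetical per-task state built from the three task notifications."""
    task_id: str
    submitted_at: Optional[float] = None
    assigned_at: Optional[float] = None   # time of the FIRST assignment
    completed_at: Optional[float] = None
    reassignments: int = 0

    def apply(self, event: str, timestamp: float) -> None:
        """Update the state from one notification (event names are assumed)."""
        if event == "submitted":
            self.submitted_at = timestamp
        elif event == "assigned":
            if self.assigned_at is None:
                self.assigned_at = timestamp
            else:
                # A second Assigned event means the task moved to another engine.
                self.reassignments += 1
        elif event == "completed":
            self.completed_at = timestamp
        else:
            raise ValueError(f"unknown event: {event}")

    def queue_delay(self) -> Optional[float]:
        """Seconds between Submitted and the first Assigned, if both are known."""
        if self.submitted_at is None or self.assigned_at is None:
            return None
        return self.assigned_at - self.submitted_at

    def was_queued(self, max_delay_s: float = 1.0) -> bool:
        """True when the Submitted-to-Assigned gap exceeds the assumed threshold."""
        delay = self.queue_delay()
        return delay is not None and delay > max_delay_s
```

A caller would create one `TaskState` per task identifier and feed it notifications in arrival order; a non-zero `reassignments` count or a `was_queued` result of `True` would then be a candidate signal of degraded grid health.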

Since data grids produce different types of events, with many dozens of parameters per individual event, it becomes difficult to inspect the data manually, or even to build simple rules-based systems to detect anomalies. Anomaly detection models can automate this process. By applying unsupervised modeling techniques to grid data streams, outliers can be identified and flagged to operations staff for investigation.

Benefits and Business Value

Data grids are used for complex calculations in large global financial institutions. These platforms are critical for nightly reconciliation of positions and reporting to government regulators. Failure to report in a timely manner can result in fines and costly adverse publicity.

When grids go wrong, it is often difficult to detect the problem early enough to take corrective action. Since the underlying engines execute code created by data analysts and programmers, that code is subject to the same quality control issues as any other piece of software. Memory leaks, crashing nodes, and incomplete calculations are all issues that can adversely impact grid health. The Accelerator provides an intelligent platform for capturing grid telemetry and presenting it to operations staff in a way that flags potential issues before they consume a large amount of time and processing power.

Technical Scenario

The Accelerator demonstrates grid monitoring using a recorded dataset produced from a real TIBCO GridServer® implementation. Using a recorded dataset allows users to try out the Accelerator without having to spin up an entire data grid. In a real implementation, an integration between the data grid and the Accelerator would be necessary.

A Spotfire® dashboard is provided to show key grid metrics and task states. The Accelerator also executes an anomaly detection model in Python to produce an anomaly score called Loss MAE. Once this value exceeds a configurable threshold, the grid state is declared anomalous, which signals operations staff to begin investigating.
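The threshold check on the anomaly score might look like the sketch below. The source does not say how Loss MAE is computed, so this assumes it is the mean absolute error between an observed telemetry vector and the model's reconstruction of it, as is common with reconstruction-based anomaly detectors. The function names, the sample values, and the threshold are hypothetical.

```python
from typing import Sequence


def loss_mae(observed: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean absolute error between observed and reconstructed values (assumed definition)."""
    if len(observed) == 0 or len(observed) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(o - r) for o, r in zip(observed, reconstructed))
    return total / len(observed)


def is_anomalous(score: float, threshold: float) -> bool:
    """Flag the grid state once the score exceeds the configurable threshold."""
    return score > threshold


# Illustrative values only; real inputs would come from grid telemetry and the model.
observed = [1.0, 2.0, 3.0]
reconstructed = [1.0, 2.5, 2.0]
score = loss_mae(observed, reconstructed)
flagged = is_anomalous(score, threshold=0.25)
```

In practice the threshold would be tuned against scores observed during known-healthy runs, so that normal variation stays below it.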
