Introduction
As a Spotfire® product manager, I am often asked how to access and analyze data in Amazon S3 and Microsoft Azure Data Lake Storage Gen2 (ADLS). As a Spotfire user you are used to self-service connectivity and probably expect to access the files and data you need, when you need it, including loading files directly from Amazon S3 and ADLS into Spotfire's in-memory engine. This is possible with custom extensions or scripting, for example List and Download Items from AWS S3 Buckets in Spotfire, but it is not supported natively. If you are interested in having this capability available natively, please vote and comment on the idea below.
Idea: Load Parquet, ORC and Avro data files from Amazon S3 and HDFS (link)
This article describes options available today based on utilizing a computing layer between Spotfire and Amazon S3 and ADLS.
Computing layer or direct file access?
First, you should ask yourself why the files are stored in S3 or ADLS in the first place. If they were extracted from traditional databases, you may be better off giving analysts and data scientists access directly to those sources, or through virtual cached views. TIBCO® Data Virtualization is designed to provide such virtual views on top of data sources that were not originally designed to be a data warehouse or dimensioned for interactive analytic workloads.
But for this article let's assume that the original data source for analytics is the files in S3 and ADLS.
There are some challenges with giving analysts and data scientists direct file access:
- As files grow in size and number, importing them into Spotfire takes longer. There is no way to query the data without moving it out of S3 or ADLS row by row.
- New data must be appended to existing files, or files must be overwritten. Otherwise, business users need to replace Spotfire data tables and reload data to analyze the most recent files.
- Granular access control to rows and columns in files can be difficult to configure and maintain.
This is when the idea of utilizing a computing layer comes into play and it has certain benefits:
- Analysts connect to data tables as if the files were tables in a traditional relational database.
- Data engineers can prepare files as tables for analysts using cluster computing power, for example to execute heavy unions, joins, or data transformations.
- No need to move data around. In-database analytics enables visual data discovery on the largest data sets by pushing interactive queries into the computing layer.
- The computing layer often provides granular role based row and column access control.
- As with loading files into Spotfire, the computing layer and the storage layer stay separate. This means they can be scaled up and down independently: to save costs, to make queries run faster, or to handle larger data sets at the same performance.
Amazon S3 computing layers
Below are some of the computing layers which offer native support for direct querying from Spotfire to files in Amazon S3. Which one you choose depends on what works best for your organization. What they have in common is that they work well and are easy to use with Spotfire. They all map S3 files to external data tables in the computing layer of your choice. The tables are then accessed using a Spotfire connector.
Amazon Athena
Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data directly in Amazon S3. Athena is serverless, and you pay only for the queries you run. Athena scales automatically, so the Spotfire user experience should remain good even with large datasets and complex queries.
Note: Spotfire currently doesn't include a native Athena or Presto connector. Please vote and comment on the idea of adding one here. Amazon Athena and Presto connectivity is established with Spotfire Information Services and a JDBC driver. This approach enables you to extract data into Spotfire up front or on demand, but unlike connectors it doesn't support data discovery based on interactive push-down queries.
Amazon Athena documentation
- What is Amazon Athena?
- Spotfire Information Designer
- Athena data source template for Spotfire Information Services
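As a sketch of how S3 files are mapped to a queryable table in Athena, the Hive-style DDL below defines an external table over Parquet files. The bucket name, table name, and columns are illustrative placeholders, not from this article:

```sql
-- Illustrative Athena external table over Parquet files in S3.
-- Bucket, table, and column names are placeholders.
CREATE EXTERNAL TABLE sales (
  order_id BIGINT,
  region   STRING,
  amount   DOUBLE
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';
```

Once defined, the table can be queried with standard SQL from the Athena console or, as described above, reached from Spotfire through Information Services and the Athena JDBC driver.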
Amazon Redshift Spectrum
Amazon Redshift Spectrum extends Amazon Redshift and enables S3 data to be queried using external tables. External tables can be defined in Amazon Redshift, the AWS Glue Data Catalog, Amazon Athena, or an Apache Hive metastore.
Amazon Redshift Spectrum documentation
- Getting Started with Amazon Redshift Spectrum
- Creating External Tables for Amazon Redshift Spectrum
- Spotfire Connector for Amazon Redshift
- Spotfire Custom queries
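A minimal sketch of the Spectrum setup is an external schema pointing at a data catalog, plus an external table over S3 files. The database, IAM role, bucket, and column names below are assumed placeholders:

```sql
-- Illustrative: external schema backed by the data catalog.
-- Database name, IAM role ARN, and bucket are placeholders.
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- External table over Parquet files in S3.
CREATE EXTERNAL TABLE spectrum_demo.sales (
  order_id BIGINT,
  amount   DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';
```

The external table then appears alongside regular Redshift tables and can be reached with the Spotfire connector for Amazon Redshift.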
Apache Drill
Apache Drill supports running queries on S3 files directly as tables or as custom views defined in SQL.
Apache Drill documentation
- S3 Storage Plugin for Apache Drill
- Running SQL Queries on Amazon S3 (blog and video)
- Spotfire Connector for Apache Drill
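With Drill, files can be queried in place without defining tables first. As a hedged example, assuming an S3 storage plugin named `s3` has been configured (the path and columns are placeholders):

```sql
-- Illustrative: query a Parquet file in S3 directly as a table.
-- Assumes a storage plugin named `s3`; path and columns are placeholders.
SELECT region, SUM(amount) AS total_amount
FROM s3.`sales/2020/sales.parquet`
GROUP BY region;
```

Such a query can also be saved as a view in Drill, which then appears as a table to the Spotfire connector for Apache Drill.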
Apache Hive
In general, we have seen better interactive query performance with other SQL engines than with Hive. If you choose Hive, it is highly recommended to use Tez as the underlying execution engine and to enable LLAP. Hive provides a few alternatives for S3 connectivity:
- An external Hive table can represent S3 data (aligned with most other computing layers)
- HDFS can be replaced with S3 (increased separation of storage and computing layers)
- S3 data can be moved to HDFS and queried using Hive tables.
When using external tables, data is temporarily staged in HDFS while a query runs.
Apache Hive documentation
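The first alternative, an external Hive table representing S3 data, can be sketched as follows. The `s3a://` scheme, bucket, and columns are illustrative assumptions:

```sql
-- Illustrative Hive external table over S3 via the s3a connector.
-- Bucket, table, and column names are placeholders.
CREATE EXTERNAL TABLE sales (
  order_id BIGINT,
  region   STRING,
  amount   DOUBLE
)
STORED AS PARQUET
LOCATION 's3a://my-example-bucket/sales/';
```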
Apache Spark SQL
Apache Spark SQL reads S3 data using the hadoop-aws connector in Hadoop or vendor-specific connectors. For more information, please refer to the Spark SQL - Integration with Cloud Infrastructures documentation.
Apache Spark SQL documentation
- Spark SQL - Integration with Cloud Infrastructures
- Hadoop-AWS module: Integration with Amazon Web Services
- Databricks - Amazon S3
- Cloudera: Accessing Data Stored in Amazon S3 through Spark
- Spotfire Connector for Apache Spark SQL
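As a sketch, a Spark SQL table backed by S3 files through the hadoop-aws (`s3a`) connector can be defined in SQL; the path and table name below are placeholders:

```sql
-- Illustrative: Spark SQL table over Parquet files in S3.
-- Assumes hadoop-aws and S3 credentials are configured; names are placeholders.
CREATE TABLE sales
USING parquet
OPTIONS (path 's3a://my-example-bucket/sales/');
```

The table can then be queried interactively from the Spotfire connector for Apache Spark SQL, with queries pushed down to the cluster.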
Dremio
Dremio can query S3 data with options to load data asynchronously, and cache and refresh data on a schedule.
Dremio documentation
Google BigQuery Omni (currently in private alpha)
Goolge BigQuery Omni enables you to connect to S3 data and query it without moving data out of AWS.
Google BigQuery documentation
Impala
From the Impala documentation linked below: "Impala can designate S3-based tables as either internal (managed by Impala) or external, by using the syntax CREATE TABLE or CREATE EXTERNAL TABLE respectively. When you drop an internal table, the files associated with the table are removed, even if they are on S3 storage. When you drop an external table, the files associated with the table are left alone, and are still available for access by other tools or components."
Impala documentation
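The internal-versus-external distinction quoted above can be sketched in DDL; the bucket, tables, and columns are illustrative placeholders:

```sql
-- Illustrative internal (managed) table: files removed on DROP TABLE.
CREATE TABLE managed_sales (order_id BIGINT, amount DOUBLE)
STORED AS PARQUET
LOCATION 's3a://my-example-bucket/managed_sales/';

-- Illustrative external table: files left in place on DROP TABLE.
CREATE EXTERNAL TABLE ext_sales (order_id BIGINT, amount DOUBLE)
STORED AS PARQUET
LOCATION 's3a://my-example-bucket/sales/';
```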
MemSQL
MemSQL uses Pipelines to access files in S3, HDFS, Azure Blob Storage, and file systems. Background Pipelines continuously add data from new and updated S3 files to the corresponding MemSQL databases and tables.
MemSQL documentation
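A minimal sketch of such a pipeline is shown below. The pipeline name, bucket path, region, and target table are placeholders, and the credentials are intentionally elided:

```sql
-- Illustrative S3 pipeline continuously loading files into a table.
-- Names, region, and credentials are placeholders.
CREATE PIPELINE sales_pipeline AS
LOAD DATA S3 'my-example-bucket/sales'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE sales
FIELDS TERMINATED BY ',';

START PIPELINE sales_pipeline;
```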
Snowflake
Snowflake loads data from S3 and other external sources using external tables, in bulk using the COPY command, or continuously in micro-batches using Snowpipe.
Snowflake documentation
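The bulk and continuous paths can be sketched as follows. The stage, pipe, table, bucket, and credentials are illustrative placeholders, and Parquet loading assumes the table columns match the file schema:

```sql
-- Illustrative: a stage pointing at S3 (names and credentials are placeholders).
CREATE STAGE sales_stage
  URL = 's3://my-example-bucket/sales/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Bulk load with COPY.
COPY INTO sales
FROM @sales_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Continuous micro-batch loading with Snowpipe.
CREATE PIPE sales_pipe AUTO_INGEST = TRUE AS
  COPY INTO sales
  FROM @sales_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```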
Teradata Vantage NOS
Teradata Vantage 2.0 introduced support for external data with the Native Object Store (NOS) feature.
Teradata documentation
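As a hedged sketch of NOS, a foreign table can point at an object store location; the table name, bucket, and path below are assumed placeholders, and real deployments also need an authorization object for credentials:

```sql
-- Illustrative Teradata NOS foreign table over S3.
-- Table name, bucket, and path are placeholders; authorization omitted.
CREATE FOREIGN TABLE sales_s3
USING (
  LOCATION ('/s3/my-example-bucket.s3.amazonaws.com/sales/')
);
```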
Microsoft Azure Data Lake Storage Gen2 computing layers
Microsoft Azure Synapse Analytics
You can use Spotfire's connector for Azure Synapse Analytics to query data in ADLS.
- ADLS > COPY > Microsoft Azure Synapse Analytics
- ADLS > Azure Databricks > Azure Synapse Analytics
- ADLS > Azure Data Factory > Azure Synapse Analytics
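The first path, loading ADLS files into Synapse with COPY, can be sketched as below. The storage account, container, and table are illustrative placeholders, and the example assumes a managed identity is authorized on the storage account:

```sql
-- Illustrative COPY statement loading Parquet files from ADLS Gen2
-- into a Synapse table. Account, container, and table are placeholders.
COPY INTO dbo.sales
FROM 'https://myaccount.dfs.core.windows.net/mycontainer/sales/*.parquet'
WITH (
  FILE_TYPE = 'PARQUET',
  CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```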
Cloudera Hive and Impala
You can use Spotfire's Hive and Impala connectors to query data in ADLS.
Starburst (Presto/Trino)
You can use Spotfire Information Links (JDBC) to query ADLS data via Starburst.
Teradata NOS
You can query ADLS data using the Spotfire connector for Teradata.
TIBCO Data Virtualization (TDV)
You can use Spotfire's connector for TDV to query data in ADLS.
Considerations
These are some of the considerations when adding a computing layer:
- Caching of S3/ADLS files into cluster in-memory tables. This may improve the Spotfire user experience by reducing query times on large data, compared to querying large S3/ADLS files directly.
- The level of self-service for analysts is reduced compared to a future capability that could load files directly into Spotfire. Either a data engineer needs to prepare external tables for analysts, or analysts need permissions to do this themselves.
- Increased cost and complexity compared to maintaining only S3/ADLS and Spotfire.