Introduction
As a Spotfire® product manager, I am often asked how to access and analyze data in Amazon S3 and Microsoft Azure Data Lake Storage Gen2 (ADLS). As a Spotfire user you are used to self-service connectivity and probably expect to access the files and data you need, when you need it, including loading files directly from Amazon S3 and ADLS into Spotfire's in-memory engine. This is possible with custom extensions or scripting, for example List and Download Items from AWS S3 Buckets in Spotfire, but it is not supported natively. If you are interested in having this capability available natively, please vote and comment on the idea below.
Idea: Load Parquet, ORC and Avro data files from Amazon S3 and HDFS (link)
This article describes options available today based on utilizing a computing layer between Spotfire and Amazon S3 and ADLS.
Computing layer or direct file access?
First, you should ask yourself why the files are stored in S3 or ADLS in the first place. If they were extracted from traditional databases, you may be better off giving analysts and data scientists access directly to those sources, or through virtual cached views. TIBCO® Data Virtualization is designed to provide such virtual views on top of data sources that were not originally designed to be a data warehouse or dimensioned for interactive analytic workloads.
But for this article let's assume that the original data source for analytics is the files in S3 and ADLS.
There are some challenges with giving analysts and data scientists direct file access:
- As files grow in size and number, importing them into Spotfire takes longer. There is no way to query the data without moving it out of S3 or ADLS row by row.
- New data must be appended to existing files, or files must be overwritten. Otherwise, business users need to replace Spotfire data tables and reload data to analyze the most recent files.
- Granular access control to rows and columns in files can be difficult to configure and maintain.
This is when the idea of utilizing a computing layer comes into play and it has certain benefits:
- Analysts connect to data tables as if the files were tables in a traditional relational database.
- Data engineers can prepare files as tables for analysts using cluster computing power, for example to execute heavy unions, joins, or data transformations.
- No need to move data around. In-database analytics enables visual data discovery on the largest data sets by pushing interactive queries into the computing layer.
- The computing layer often provides granular role based row and column access control.
- As with loading files into Spotfire, the computing layer and the storage layer stay separate. This means they can be scaled up and down independently: to save costs, to make queries run faster, or to handle larger data sets at the same performance.
Amazon S3 computing layers
Below are some of the computing layers which offer native support for direct querying from Spotfire to files in Amazon S3. Which one you choose depends on what works best for your organization. What they have in common is that they work well and are easy to use with Spotfire. They all map S3 files to external data tables in the computing layer of your choice. The tables are then accessed using a Spotfire connector.
Amazon Athena
Amazon Athena is an interactive query service based on Presto that makes it easy to analyze data directly in Amazon S3. Athena is serverless, and you pay only for the queries you run. Athena scales automatically, so the Spotfire user experience should remain good even with large datasets and complex queries.
Note: Spotfire currently doesn't include a native Athena or Presto connector. Please vote and comment on the idea of adding one here. Amazon Athena and Presto connectivity is established with Spotfire Information Services and a JDBC driver. This approach enables you to extract data into Spotfire up front or on demand, but unlike connectors it doesn't support data discovery based on interactive push-down queries.
Amazon Athena documentation
- What is Amazon Athena?
- Spotfire Information Designer
- Athena data source template for Spotfire Information Services
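As a sketch of how S3 files are mapped to a queryable table in Athena, the Hive-style DDL below defines an external table over Parquet files. The bucket name, table name, and columns are illustrative placeholders, not from this article:

```sql
-- Illustrative Athena external table over Parquet files in S3.
-- Bucket, table, and column names are placeholders.
CREATE EXTERNAL TABLE sales (
  order_id BIGINT,
  region   STRING,
  amount   DOUBLE
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';
```

Once defined, the table can be queried with standard SQL from the Athena console or, as described above, reached from Spotfire through Information Services and the Athena JDBC driver.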
Amazon Redshift Spectrum
Amazon Redshift Spectrum extends Amazon Redshift and enables S3 data to be queried using external tables. External tables can be defined in Amazon Redshift, the AWS Glue Data Catalog, Amazon Athena, or an Apache Hive metastore.
Amazon Redshift Spectrum documentation
- Getting Started with Amazon Redshift Spectrum
- Creating External Tables for Amazon Redshift Spectrum
- Spotfire Connector for Amazon Redshift
- Spotfire Custom queries
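A minimal sketch of the Spectrum setup is an external schema pointing at a data catalog, plus an external table over S3 files. The database, IAM role, bucket, and column names below are assumed placeholders:

```sql
-- Illustrative: external schema backed by the data catalog.
-- Database name, IAM role ARN, and bucket are placeholders.
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- External table over Parquet files in S3.
CREATE EXTERNAL TABLE spectrum_demo.sales (
  order_id BIGINT,
  amount   DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';
```

The external table then appears alongside regular Redshift tables and can be reached with the Spotfire connector for Amazon Redshift.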
Apache Drill
Apache Drill supports running queries on S3 files directly as tables or as custom views defined in SQL.
Apache Drill documentation
- S3 Storage Plugin for Apache Drill
- Running SQL Queries on Amazon S3 (blog and video)
- Spotfire Connector for Apache Drill
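With Drill, files can be queried in place without defining tables first. As a hedged example, assuming an S3 storage plugin named `s3` has been configured (the path and columns are placeholders):

```sql
-- Illustrative: query a Parquet file in S3 directly as a table.
-- Assumes a storage plugin named `s3`; path and columns are placeholders.
SELECT region, SUM(amount) AS total_amount
FROM s3.`sales/2020/sales.parquet`
GROUP BY region;
```

Such a query can also be saved as a view in Drill, which then appears as a table to the Spotfire connector for Apache Drill.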
Apache Hive
In general, we have seen better interactive query performance with other SQL engines than with Hive. If you choose Hive, it is highly recommended to use Tez as the underlying execution engine and to enable LLAP. Hive provides a few alternatives for S3 connectivity:
- An external Hive table can represent S3 data (aligned with most other computing layers)
- HDFS can be replaced with S3 (increased separation of storage and computing layers)
- S3 data can be moved to HDFS and queried using Hive tables.
When using external tables, data is temporarily staged in HDFS while a query runs.
Apache Hive documentation
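The first alternative, an external Hive table representing S3 data, can be sketched as follows. The `s3a://` scheme, bucket, and columns are illustrative assumptions:

```sql
-- Illustrative Hive external table over S3 via the s3a connector.
-- Bucket, table, and column names are placeholders.
CREATE EXTERNAL TABLE sales (
  order_id BIGINT,
  region   STRING,
  amount   DOUBLE
)
STORED AS PARQUET
LOCATION 's3a://my-example-bucket/sales/';
```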
Apache Spark SQL
Apache Spark SQL reads S3 data using the hadoop-aws connector in Hadoop or vendor-specific connectors. For more information, please refer to the Spark SQL - Integration with Cloud Infrastructures documentation.
Apache Spark SQL documentation
- Spark SQL - Integration with Cloud Infrastructures
- Hadoop-AWS module: Integration with Amazon Web Services
- Databricks - Amazon S3
- Cloudera: Accessing Data Stored in Amazon S3 through Spark
- Spotfire Connector for Apache Spark SQL
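As a sketch, a Spark SQL table backed by S3 files through the hadoop-aws (`s3a`) connector can be defined in SQL; the path and table name below are placeholders:

```sql
-- Illustrative: Spark SQL table over Parquet files in S3.
-- Assumes hadoop-aws and S3 credentials are configured; names are placeholders.
CREATE TABLE sales
USING parquet
OPTIONS (path 's3a://my-example-bucket/sales/');
```

The table can then be queried interactively from the Spotfire connector for Apache Spark SQL, with queries pushed down to the cluster.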
Dremio
Dremio can query S3 data with options to load data asynchronously, and cache and refresh data on a schedule.
Dremio documentation
Google BigQuery Omni (currently in private alpha)
Goolge BigQuery Omni enables you to connect to S3 data and query it without moving data out of AWS.
Google BigQuery documentation
Impala
From the Impala documentation linked below: "Impala can designate S3-based tables as either internal (managed by Impala) or external, by using the syntax CREATE TABLE or CREATE EXTERNAL TABLE respectively. When you drop an internal table, the files associated with the table are removed, even if they are on S3 storage. When you drop an external table, the files associated with the table are left alone, and are still available for access by other tools or components."
Impala documentation
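The internal-versus-external distinction quoted above can be sketched in DDL; the bucket, tables, and columns are illustrative placeholders:

```sql
-- Illustrative internal (managed) table: files removed on DROP TABLE.
CREATE TABLE managed_sales (order_id BIGINT, amount DOUBLE)
STORED AS PARQUET
LOCATION 's3a://my-example-bucket/managed_sales/';

-- Illustrative external table: files left in place on DROP TABLE.
CREATE EXTERNAL TABLE ext_sales (order_id BIGINT, amount DOUBLE)
STORED AS PARQUET
LOCATION 's3a://my-example-bucket/sales/';
```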
MemSQL
MemSQL uses Pipelines to access files in S3, HDFS, Azure Blob Storage, and file systems. Background Pipelines continuously add data from new and updated S3 files to the corresponding MemSQL databases and tables.
MemSQL documentation
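A minimal sketch of such a pipeline is shown below. The pipeline name, bucket path, region, and target table are placeholders, and the credentials are intentionally elided:

```sql
-- Illustrative S3 pipeline continuously loading files into a table.
-- Names, region, and credentials are placeholders.
CREATE PIPELINE sales_pipeline AS
LOAD DATA S3 'my-example-bucket/sales'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE sales
FIELDS TERMINATED BY ',';

START PIPELINE sales_pipeline;
```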
Snowflake
Snowflake loads data from S3 and other external sources using external tables, in bulk using the COPY command, or continuously in micro-batches using Snowpipe.
Snowflake documentation
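The bulk and continuous paths can be sketched as follows. The stage, pipe, table, bucket, and credentials are illustrative placeholders, and Parquet loading assumes the table columns match the file schema:

```sql
-- Illustrative: a stage pointing at S3 (names and credentials are placeholders).
CREATE STAGE sales_stage
  URL = 's3://my-example-bucket/sales/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Bulk load with COPY.
COPY INTO sales
FROM @sales_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Continuous micro-batch loading with Snowpipe.
CREATE PIPE sales_pipe AUTO_INGEST = TRUE AS
  COPY INTO sales
  FROM @sales_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```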
Teradata Vantage NOS
Teradata Vantage 2.0 introduced support for external data with the Native Object Store (NOS) feature.
Teradata documentation
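As a hedged sketch of NOS, a foreign table can point at an object store location; the table name, bucket, and path below are assumed placeholders, and real deployments also need an authorization object for credentials:

```sql
-- Illustrative Teradata NOS foreign table over S3.
-- Table name, bucket, and path are placeholders; authorization omitted.
CREATE FOREIGN TABLE sales_s3
USING (
  LOCATION ('/s3/my-example-bucket.s3.amazonaws.com/sales/')
);
```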
Microsoft Azure Data Lake Storage Gen2 computing layers
Microsoft Azure Synapse Analytics
You can use Spotfire's connector for Azure Synapse Analytics to query data in ADLS.
- ADLS > COPY > Microsoft Azure Synapse Analytics
- ADLS > Azure Databricks > Azure Synapse Analytics
- ADLS > Azure Data Factory > Azure Synapse Analytics
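The first path, loading ADLS files into Synapse with COPY, can be sketched as below. The storage account, container, and table are illustrative placeholders, and the example assumes a managed identity is authorized on the storage account:

```sql
-- Illustrative COPY statement loading Parquet files from ADLS Gen2
-- into a Synapse table. Account, container, and table are placeholders.
COPY INTO dbo.sales
FROM 'https://myaccount.dfs.core.windows.net/mycontainer/sales/*.parquet'
WITH (
  FILE_TYPE = 'PARQUET',
  CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```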
Cloudera Hive and Impala
You can use Spotfire's Hive and Impala connectors to query data in ADLS.
Starburst (Presto/Trino)
You can use Spotfire Information Links (JDBC) to query ADLS data via Starburst.
Teradata NOS
You can query ADLS data using the Spotfire connector for Teradata.
TIBCO Data Virtualization (TDV)
You can use Spotfire's connector for TDV to query data in ADLS.
Considerations
These are some of the considerations when adding a computing layer:
- Caching of S3/ADLS files into cluster in-memory tables. This may improve the Spotfire user experience by reducing query times on large data, compared to querying large S3/ADLS files directly.
- The level of self-service for analysts is reduced compared to a future capability that could load files directly into Spotfire. Either a data engineer needs to prepare external tables for analysts, or analysts need permissions to do this themselves.
- Increased cost and complexity compared to maintaining only S3/ADLS and Spotfire.