Create Spark Structured Streaming applications using the open source connector for Amazon Kinesis Data Streams | AWS


Apache Spark is a big data engine known for its in-memory computing capabilities and is commonly used for large-scale data analytics. One of the key features of Apache Spark is its ability to handle iterative algorithms and interactive queries efficiently. Apache Spark can also be used to process streaming data from various sources, including Amazon Kinesis Data Streams, for tasks like clickstream analysis and fraud detection. Amazon Kinesis Data Streams is a serverless streaming data service that simplifies the process of capturing, processing, and storing data streams of any size.

The new open-source Amazon Kinesis Data Streams Connector for Spark Structured Streaming allows users to leverage the latest Spark Data Sources API. This connector supports enhanced fan-out for dedicated read throughput and faster stream processing. It enables users to consume and produce records from and to Kinesis Data Streams using Amazon EMR.

The Kinesis Data Streams connector for Spark Structured Streaming supports both the provisioned and on-demand capacity modes of Kinesis Data Streams. It is built on the Spark Data Sources API V2, so it benefits from the optimizations Spark applies to V2 data sources. The connector comes pre-installed on Amazon EMR 7.1 and above, so there is no need to build or download additional packages there. For other Apache Spark platforms, the connector is available as a public JAR file that can be referenced directly when submitting a Spark Structured Streaming job.

The connector supports two types of consumers for Kinesis Data Streams: shared throughput and dedicated throughput (enhanced fan-out). Users can choose the consumer type that fits their requirements without writing additional code. The connector can also be used as a sink to produce records to a Kinesis data stream. For checkpoints and continuity, it supports multiple storage options, including Amazon DynamoDB, Amazon S3, and HDFS.
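To illustrate how the consumer type can be a configuration choice rather than a code change, here is a minimal PySpark-oriented sketch. The format name `aws-kinesis` and the option keys (`kinesis.streamName`, `kinesis.region`, `kinesis.endpointUrl`, `kinesis.startingposition`, `kinesis.consumerType`, `kinesis.consumerName`) are written from memory of the connector's GitHub README and should be verified there before use. The helper names `kinesis_source_options` and `read_kinesis`, and the endpoint URL pattern, are hypothetical and used only for this example.

```python
from typing import Dict, Optional


def kinesis_source_options(
    stream_name: str,
    region: str,
    enhanced_fan_out: bool = False,
    consumer_name: Optional[str] = None,
    starting_position: str = "LATEST",
) -> Dict[str, str]:
    """Build reader options for the Kinesis connector (hypothetical helper).

    Option keys are assumptions based on the connector's README; check them
    against the version of the connector you actually deploy.
    """
    options = {
        "kinesis.streamName": stream_name,
        "kinesis.region": region,
        # Assumed regional endpoint pattern for Kinesis Data Streams.
        "kinesis.endpointUrl": f"https://kinesis.{region}.amazonaws.com",
        "kinesis.startingposition": starting_position,
        # Shared throughput polls with GetRecords; enhanced fan-out
        # subscribes with SubscribeToShard.
        "kinesis.consumerType": "SubscribeToShard" if enhanced_fan_out else "GetRecords",
    }
    if enhanced_fan_out:
        # Enhanced fan-out consumers are registered under a name.
        if not consumer_name:
            raise ValueError("consumer_name is required for enhanced fan-out")
        options["kinesis.consumerName"] = consumer_name
    return options


def read_kinesis(spark, options: Dict[str, str]):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("aws-kinesis").options(**options).load()
```

With this shape, switching from shared throughput to enhanced fan-out means changing two arguments, for example `kinesis_source_options("clicks", "us-east-1", enhanced_fan_out=True, consumer_name="clicks-efo")`, while the code that calls `read_kinesis` stays the same.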

The connector also supports cross-account processing, where the Kinesis data stream lives in one AWS account and the Spark Structured Streaming application runs in another. This requires setting up AWS Identity and Access Management (IAM) trust policies between the two accounts.
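As a rough sketch of what the cross-account setup might look like on the application side, the snippet below adds role-assumption settings to a dictionary of reader options. The option keys `kinesis.stsRoleArn` and `kinesis.stsSessionName` are assumptions recalled from the connector's README, not confirmed by this article, and the helper name `with_cross_account_role` is hypothetical. The IAM role in the stream's account must separately trust the account that runs the Spark application.

```python
from typing import Dict


def with_cross_account_role(
    options: Dict[str, str],
    role_arn: str,
    session_name: str = "spark-kinesis-cross-account",
) -> Dict[str, str]:
    """Return a copy of `options` with role-assumption settings added.

    `role_arn` is the IAM role in the account that owns the stream.
    The option keys below are assumptions; verify them in the connector docs.
    """
    if not role_arn.startswith("arn:") or ":role/" not in role_arn:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    merged = dict(options)  # leave the caller's dictionary untouched
    merged["kinesis.stsRoleArn"] = role_arn
    merged["kinesis.stsSessionName"] = session_name
    return merged
```

Keeping the role settings in a separate helper means the same base options can be reused for same-account and cross-account reads.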

Overall, the Kinesis Data Streams connector for Spark Structured Streaming simplifies the process of consuming and producing records from and to Kinesis Data Streams. It enhances the capabilities of Spark Structured Streaming and provides users with a powerful tool for building high-throughput streaming applications. The connector is open source under the Apache 2.0 license and can be accessed on GitHub for further exploration.

Article Source
https://aws.amazon.com/blogs/big-data/build-spark-structured-streaming-applications-with-the-open-source-connector-for-amazon-kinesis-data-streams/