Run Apache Spark and Apache Iceberg write jobs 2x faster with Amazon EMR | Amazon Web Services

Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining API compatibility with open source Apache Spark and the Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts, and AWS Glue use the optimized runtimes.

In this post, we demonstrate the write performance benefits of using the Amazon EMR 7.12 runtime for Spark and Iceberg compared to open source Spark 3.5.6 with Iceberg 1.10.0 tables on a 3TB merge workload.

Write Benchmark Methodology

Our benchmarks demonstrate that Amazon EMR 7.12 can run 3TB merge workloads over 2 times faster than open source Spark 3.5.6 with Iceberg 1.10.0, delivering significant improvements for data ingestion and ETL pipelines while providing the advanced features of Iceberg including ACID transactions, time travel, and schema evolution.
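For readers unfamiliar with merge workloads, the following is a minimal sketch of the kind of statement such a job issues against an Iceberg table. It is not the benchmark's actual query: the table name `demo.db.sales`, the source view `sales_updates`, the key column `sale_id`, and the helper `build_merge_sql` are all hypothetical names chosen for illustration. The sketch only builds the SQL text, and assumes a Spark session already configured with an Iceberg catalog and the Iceberg SQL extensions would be used to run it.

```python
def build_merge_sql(target: str, source: str, key: str) -> str:
    """Build a Spark SQL MERGE INTO statement that upserts `source` rows into `target`.

    All arguments are illustrative placeholders, not names from the benchmark.
    """
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        # Rows whose key already exists in the target are overwritten.
        "WHEN MATCHED THEN UPDATE SET * "
        # Rows with a new key are appended.
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table, source view, and key column.
merge_sql = build_merge_sql("demo.db.sales", "sales_updates", "sale_id")
print(merge_sql)

# Assumption: `spark` is a SparkSession with an Iceberg catalog named `demo`
# and the Iceberg SQL extensions enabled. The statement would then run as:
# spark.sql(merge_sql)
```

A merge like this combines reads (matching source rows to existing target rows) with writes (rewriting or appending data files), which is why it is a common stand-in for data ingestion and ETL write paths.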

Benchmark workload

To evaluate the write performance improvements in…

https://aws.amazon.com/blogs/big-data/run-apache-spark-and-apache-iceberg-write-jobs-2x-faster-with-amazon-emr/