Amazon EMR runtime for Apache Spark is a performance-optimized runtime that is 100% API compatible with open-source Apache Spark. It offers faster out-of-the-box performance than Apache Spark through improved query plans, faster queries, and tuned defaults. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS Outposts all use this optimized runtime, which is 4.5 times faster than Apache Spark 3.5.1 and has 2.8 times better price-performance, based on an industry-standard TPC-DS benchmark at 3 TB scale.
Amazon EMR 7.0 and 7.1 include 35 optimizations added since EMR 6.9, covering physical plan operators, query planning, and a reduction in requests to Amazon S3. Java 17 is the default Java runtime as of Amazon EMR 7.0, a choice made after rigorous testing and performance tuning.
Comparative testing of Amazon EMR 7.1 against Apache Spark 3.5.1 and EMR 6.9 on the TPC-DS 3 TB dataset showed Amazon EMR 7.1 running 4.5 times faster than Apache Spark 3.5.1 and 1.9 times faster than EMR 6.9. The cost comparison follows the same pattern, with Amazon EMR 7.1 showing a 2.8 times cost improvement over Apache Spark 3.5.1 and a 1.7 times cost improvement over EMR 6.9.
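The relationship between a speedup figure and a cost figure can be illustrated with a small calculation. The following is a minimal sketch, assuming that the cost of a benchmark run is simply its runtime multiplied by an hourly cluster rate. The function names and every input number are hypothetical and are not the measurements from the source article.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    return baseline_seconds / candidate_seconds


def run_cost(runtime_seconds: float, hourly_rate_usd: float) -> float:
    """Cost of one run, assuming cost = runtime (in hours) x hourly cluster rate."""
    return runtime_seconds / 3600.0 * hourly_rate_usd


# Hypothetical inputs chosen only for illustration.
baseline_runtime = 9000.0   # seconds, standing in for an open-source Spark run
candidate_runtime = 2000.0  # seconds, standing in for an optimized-runtime run
baseline_rate = 10.0        # USD per cluster-hour
candidate_rate = 16.0       # USD per cluster-hour (higher, to model a service fee)

runtime_ratio = speedup(baseline_runtime, candidate_runtime)
cost_ratio = (
    run_cost(baseline_runtime, baseline_rate)
    / run_cost(candidate_runtime, candidate_rate)
)

print(f"speedup: {runtime_ratio:.2f}x")
print(f"cost improvement: {cost_ratio:.2f}x")
```

With these made-up inputs the cost improvement comes out smaller than the speedup because the candidate cluster is billed at a higher hourly rate. That is one possible way a speedup and a smaller cost improvement can coexist, not a statement about how the article's figures were derived.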
The benchmark comparison between Apache Spark 3.5.1 and Amazon EMR Spark involved configuring clusters on Amazon EC2, running the TPC-DS benchmark on both platforms, and collecting performance metrics. Amazon EMR 7.1 outperformed Apache Spark 3.5.1 in both runtime and cost efficiency.
Readers who want to run the TPC-DS benchmark with Amazon EMR Spark themselves need to set up clusters, submit jobs, and analyze the results. The source article provides detailed instructions for deployment, execution, and result analysis.
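As a rough illustration of the result-analysis step, the sketch below aggregates per-query runtimes into a total and a geometric mean, two summary statistics commonly used for benchmark suites. The query names, the timings, and the `summarize` helper are all hypothetical. The source article's own tooling and output format are not reproduced here.

```python
import math


def summarize(query_seconds: dict[str, float]) -> dict[str, float]:
    """Return the total runtime and geometric mean of per-query runtimes."""
    if not query_seconds:
        raise ValueError("no query timings supplied")
    times = list(query_seconds.values())
    if any(t <= 0 for t in times):
        raise ValueError("query timings must be positive")
    total = sum(times)
    # Geometric mean computed in log space to avoid overflow on long suites.
    geomean = math.exp(sum(math.log(t) for t in times) / len(times))
    return {"total_seconds": total, "geomean_seconds": geomean}


# Hypothetical timings (seconds) for two runs of the same three queries.
baseline = {"q1": 40.0, "q2": 90.0, "q3": 10.0}
candidate = {"q1": 10.0, "q2": 30.0, "q3": 5.0}

base_stats = summarize(baseline)
cand_stats = summarize(candidate)

total_speedup = base_stats["total_seconds"] / cand_stats["total_seconds"]
geomean_speedup = base_stats["geomean_seconds"] / cand_stats["geomean_seconds"]

print(f"total runtime speedup: {total_speedup:.2f}x")
print(f"geometric mean speedup: {geomean_speedup:.2f}x")
```

The total is dominated by the longest queries, while the geometric mean weights every query's relative change equally, so reporting both gives a fuller picture than either alone.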
In conclusion, Amazon EMR continues to enhance the EMR runtime for Apache Spark, delivering consistent year-over-year performance improvements. Readers can subscribe to the Big Data Blog RSS feed to stay updated on the latest performance enhancements, configuration practices, and tuning tips for the EMR runtime for Apache Spark. Ashok Chintalapati and Steve Kooncé from Amazon Web Services contribute their expertise to the ongoing development and improvement of Amazon EMR for Apache Spark.
Article Source
https://aws.amazon.com/blogs/big-data/run-apache-spark-3-5-1-workloads-4-5-times-faster-with-amazon-emr-runtime-for-apache-spark/