AWS Glue Data Catalog now supports column-level aggregation statistics for Apache Iceberg tables, which speeds up queries in Amazon Redshift Spectrum. Apache Iceberg is an open table format that brings ACID transactions to data lakes, along with features such as time travel, schema evolution, and hidden partitioning. The Data Catalog stores table metadata and supports Iceberg tables, including automatic compaction of files for faster operations. Column statistics for Iceberg tables follow the Puffin file format specification and are stored on Amazon S3, where various engines can use them.
Column statistics for Iceberg tables are generated with the Theta Sketch algorithm from Apache DataSketches, which estimates the number of distinct values (NDV) in a column. Query optimizers rely on NDV to choose efficient plans. Theta Sketch estimates NDV without storing every distinct value, which keeps space requirements low. Iceberg's Puffin files store these statistics as blobs, including the Theta sketches, so that engines can use them to optimize query plans.
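To make the idea of estimating NDV without storing every value concrete, here is a minimal sketch of a K-Minimum-Values estimator, a simpler relative of the Theta Sketch family. It is an illustration only, not the Apache DataSketches implementation and not the code AWS Glue runs; the class name `KMVSketch`, the choice of SHA-256 as the hash, and the default `k` are all assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch (hypothetical helper, not DataSketches)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # negated hashes, so the largest retained hash is at index 0
        self._members = set()   # hashes currently retained, used to skip duplicates

    @staticmethod
    def _hash(value):
        # Map the value to a pseudo-uniform float in [0, 1) using the first 8 digest bytes.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, value):
        h = self._hash(value)
        if h in self._members:
            return  # already retained; duplicates do not change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while fewer than k distinct hashes seen
        theta = -self._heap[0]             # the k-th smallest hash value
        return (self.k - 1) / theta


sketch = KMVSketch(k=1024)
for i in range(50_000):
    sketch.update(f"customer-{i}")
    sketch.update(f"customer-{i}")  # duplicates are ignored
ndv_estimate = sketch.estimate()
```

The sketch keeps at most `k` hash values regardless of how many rows it sees, which is the space saving the paragraph above describes. The estimate is approximate, with relative error shrinking roughly as `1 / sqrt(k)`.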
To demonstrate the benefits of column statistics, tests were run with Redshift Spectrum on the TPC-DS dataset. Ten selected queries showed performance improvements ranging from 31.5% to 489.1%. The query plans were also compared in detail to show how column statistics affect them.
To automate the generation of Iceberg table column statistics, you can combine AWS Lambda with Amazon EventBridge Scheduler. This keeps statistics up to date without manual intervention. The process involves configuring a Lambda function that starts a column statistics run and setting up a time-based schedule that invokes it at regular intervals.
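The Lambda side of this automation might look like the following sketch. It assumes the function calls the AWS Glue `StartColumnStatisticsTaskRun` API through boto3 (`start_column_statistics_task_run`); the event keys (`database`, `table`, `columns`, `role`) and the `STATS_ROLE_ARN` environment variable are hypothetical names chosen for this example, not something the original post prescribes.

```python
import os


def build_request(event):
    """Translate a (hypothetical) scheduler payload into Glue API arguments."""
    request = {
        "DatabaseName": event["database"],
        "TableName": event["table"],
        # IAM role that Glue assumes to read the table; assumed to come from
        # the event or from an environment variable set on the function.
        "Role": event.get("role") or os.environ["STATS_ROLE_ARN"],
    }
    if event.get("columns"):
        # Optional: restrict the run to specific columns.
        request["ColumnNameList"] = list(event["columns"])
    return request


def lambda_handler(event, context):
    # Imported here so build_request can be exercised without the AWS SDK installed.
    import boto3

    glue = boto3.client("glue")
    response = glue.start_column_statistics_task_run(**build_request(event))
    return {"ColumnStatisticsTaskRunId": response["ColumnStatisticsTaskRunId"]}
```

An EventBridge Scheduler schedule would then invoke this function with a JSON payload such as `{"database": "iceberg_db", "table": "store_sales"}` (placeholder names), passing one payload per table that needs refreshed statistics.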
After completing the walkthrough, clean up any resources you no longer need. The new feature in the Data Catalog improves query performance and can reduce costs. Users are encouraged to try it out and provide feedback for further enhancements. Visit the AWS Glue Data Catalog documentation for more information.
The authors, including Solutions Architect Sotaro Hikita and Principal Big Data Architect Noritaka Sekiyama, have worked on various big data technologies and software artifacts to support customers. They are focused on improving query performance and building efficient systems for data lakes. Senior Software Development Engineer Kyle Duong and Senior Product Manager Sandeep Adwankar bring their expertise in building big data technologies and translating business requirements into products.
Article Source
https://aws.amazon.com/blogs/big-data/accelerate-query-performance-with-apache-iceberg-statistics-on-the-aws-glue-data-catalog/