How Slack achieved operational excellence for Spark on Amazon EMR using generative AI | Amazon Web Services

vm_admin

4 months ago

At Slack, our data platform processes terabytes of data each day using Apache Spark on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), powering the insights that drive strategic decision-making across the organization.

As our data volume expanded, so did our performance challenges. With traditional monitoring tools, we couldn’t effectively manage our systems when Spark jobs slowed down or costs spiraled out of control. We were stuck searching through cryptic logs, making educated guesses about resource allocation, and watching our engineering teams spend hours on manual tuning that should have been automated.

That’s why we built something better: a detailed metrics framework designed specifically for Spark’s unique challenges. This is a visibility system that gives us granular insights into application behavior, resource usage, and job-level performance patterns we never had before. We’ve achieved 30–50% cost reductions and 40–60% faster job…

https://aws.amazon.com/blogs/big-data/how-slack-achieved-operational-excellence-for-spark-on-amazon-emr-using-generative-ai/