Evaluating AI agents: Real-world lessons from building agentic systems at Amazon | Amazon Web Services

The generative AI industry has shifted from large language model (LLM)-driven applications to agentic AI systems, a fundamental change in how AI capabilities are architected and deployed. Early generative AI applications relied primarily on LLMs to generate text directly in response to prompts. The industry has since moved from that static prompt-response paradigm toward autonomous agent frameworks for building dynamic, goal-oriented systems capable of tool orchestration, iterative problem-solving, and adaptive task execution in production environments.

We have witnessed this evolution at Amazon: since 2025, thousands of agents have been built across Amazon organizations. Single-model benchmarks remain a crucial foundation for assessing individual LLM performance in LLM-driven applications, but agentic AI systems require a fundamental shift in evaluation methodology. The new paradigm assesses not only the…

https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/