As organizations scale their generative AI workloads on Amazon Bedrock, operational visibility into inference performance and resource consumption becomes critical. Teams running latency-sensitive applications must understand how quickly models begin generating responses. Teams managing high-throughput workloads must understand how their requests consume quota so they can avoid unexpected throttling. Until now, gaining this visibility required custom client-side instrumentation or reactive troubleshooting after issues occurred.
Today, we’re announcing two new Amazon CloudWatch metrics for Amazon Bedrock: TimeToFirstToken and EstimatedTPMQuotaUsage. These metrics give you server-side visibility into streaming latency and quota consumption. They are emitted automatically for every successful inference request at no additional cost, with no API changes or opt-in required, and are available now in the AWS/Bedrock CloudWatch namespace.
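Because the metrics land in a standard CloudWatch namespace, you can retrieve them like any other CloudWatch metric. The sketch below builds a GetMetricData request payload for both metrics; the ModelId dimension and the statistic choices are illustrative assumptions, not confirmed details of the new metrics, so verify them against the metric dimensions emitted in your account.

```python
# Sketch: assembling a CloudWatch GetMetricData payload for the new
# Bedrock metrics. The ModelId dimension and statistics shown here are
# assumptions for illustration; check the actual dimensions in your
# AWS/Bedrock namespace before using them.

def bedrock_metric_query(query_id, metric_name, stat, model_id=None, period=300):
    """Build one MetricDataQuery entry for the AWS/Bedrock namespace."""
    metric = {"Namespace": "AWS/Bedrock", "MetricName": metric_name}
    if model_id:
        # Hypothetical dimension filter to scope results to one model.
        metric["Dimensions"] = [{"Name": "ModelId", "Value": model_id}]
    return {
        "Id": query_id,
        "MetricStat": {"Metric": metric, "Period": period, "Stat": stat},
    }

queries = [
    # Tail latency of first-token generation for streaming responses.
    bedrock_metric_query("ttft_p99", "TimeToFirstToken", "p99"),
    # Peak estimated tokens-per-minute quota consumption per period.
    bedrock_metric_query("tpm_usage", "EstimatedTPMQuotaUsage", "Maximum"),
]

# In practice you would pass these queries to the CloudWatch API, e.g.:
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=queries, StartTime=start, EndTime=end)
```

Using GetMetricData rather than GetMetricStatistics lets you fetch both metrics in a single call and mix statistics (a percentile for latency, a maximum for quota headroom) per query.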
In this post, we…