NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference

By Claudio Masolo
Publication Date: 2026-01-31 09:00:00

Microsoft and NVIDIA have released Part 2 of their collaboration on running NVIDIA Dynamo for large language model inference on Azure Kubernetes Service (AKS). The first announcement targeted a raw throughput of 1.2 million tokens per second on distributed GPU systems. This latest release shifts the focus to developer velocity and operational efficiency, delivered through automated resource planning and dynamic scaling.

The new capabilities center on two integrated components: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. Together, these tools address the “rate matching” challenge in disaggregated serving. The challenge arises when teams split an inference workload into prefill operations, which process the input context, and decode operations, which generate output tokens, and run the two phases on separate GPU pools. Without the right tools, teams spend a lot of time working out how many GPUs to allocate to each phase.
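To make the rate-matching problem concrete, here is a minimal back-of-envelope sketch in Python. It is not how Dynamo Planner computes its allocations. The `Workload` class, the `rate_match` function, and every number in it are hypothetical, and it ignores latency SLOs, batching effects, and KV-cache transfer between pools. It shows only the basic arithmetic: each pool must sustain the token rate that the request stream imposes on it.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical traffic description (all values are illustrative)."""
    requests_per_s: float  # request arrival rate
    input_tokens: int      # average prompt length per request
    output_tokens: int     # average generated length per request


def gpus_needed(token_rate: float, per_gpu_tokens_per_s: float) -> int:
    """Smallest GPU count whose combined throughput covers token_rate."""
    return math.ceil(token_rate / per_gpu_tokens_per_s)


def rate_match(workload: Workload,
               prefill_tps_per_gpu: float,
               decode_tps_per_gpu: float) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) so neither pool is the bottleneck.

    The per-GPU throughput figures are assumed to come from profiling;
    here they are simply passed in as inputs.
    """
    prefill_rate = workload.requests_per_s * workload.input_tokens
    decode_rate = workload.requests_per_s * workload.output_tokens
    return (gpus_needed(prefill_rate, prefill_tps_per_gpu),
            gpus_needed(decode_rate, decode_tps_per_gpu))


if __name__ == "__main__":
    w = Workload(requests_per_s=10.0, input_tokens=4000, output_tokens=500)
    prefill_gpus, decode_gpus = rate_match(
        w, prefill_tps_per_gpu=8000.0, decode_tps_per_gpu=2000.0)
    print(f"prefill GPUs: {prefill_gpus}, decode GPUs: {decode_gpus}")
```

Even in this simplified form, the right split depends on prompt length, output length, and per-GPU throughput, all of which change with the model and the parallelization strategy. That dependence is why finding the allocation by hand is slow.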

The Dynamo Planner Profiler is a pre-deployment simulation tool that automates the search for the best configuration. Instead of manually testing various parallelization strategies and GPU counts, which can consume hours of GPU time, developers define their requirements in a DynamoGraphDeploymentRequest (DGDR) manifest. The profiler then runs an automated sweep of the configuration space, testing different tensor parallelism sizes for both the prefill and decode stages. This helps find…