NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes | NVIDIA Technical Blog

By Schwinn Saereesitthipitak
Publication Date: 2026-05-27 23:09:00

The cold-start problem

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However, cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests.

This delay increases the risk of service level agreement (SLA) violations during traffic spikes, as the system cannot scale quickly enough to absorb sudden increases in demand.

For a single-GPU vLLM (v0.20.0) workload, the cold-start latency breaks down as follows:

To significantly reduce startup time, we are introducing NVIDIA Dynamo Snapshot, our checkpoint/restore approach for AI inference workloads on Kubernetes. In this post, we describe the design choices and optimizations behind our early prototype, which brings startup times for single-GPU workloads close to the speed of light, that is, the practical limit set by the underlying hardware.

This is the first post in a series on fast startup in Dynamo.

CRIU and cuda-checkpoint

A running inference worker’s checkpointable state has two components:

  • Device state (GPU-side): CUDA contexts, streams, device memory, virtual address mappings, and so on. This state is not visible to the host. To serialize it, we use the checkpointing capability of the CUDA driver (also exposed through the cuda-checkpoint command-line tool) to dump the device state to CPU…