NVIDIA
Dynamo Snapshot-Agent
Container

Dynamo Snapshot Agent enables CRIU-based checkpoint and restore for GPU inference workloads running on NVIDIA Dynamo

Overview

The Dynamo Snapshot Agent container is a pre-built Kubernetes DaemonSet image that enables CRIU-based checkpoint and restore for GPU inference workloads running on NVIDIA Dynamo. It reduces cold-start times for large models from minutes to seconds by capturing initialized application state (with the model already loaded on the GPU) and restoring it on demand into new pods.

Experimental Feature: Dynamo Snapshot is currently in preview. The DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Key Components

  • CRIU (Checkpoint/Restore in User-space): Process-level checkpoint and restore engine (v4.2) that captures full application state, including memory, file descriptors, and process trees.
  • NVIDIA cuda-checkpoint: GPU state checkpoint and restore utility that works alongside CRIU to capture and restore CUDA contexts, allocations, and device state.
  • Snapshot Agent: Go-based DaemonSet binary that watches for checkpoint-source and restore-target pods via Kubernetes labels, orchestrates the CRIU dump and cuda-checkpoint workflows, and writes checkpoint tars to shared storage.
  • nsrestore: Companion binary that runs inside placeholder containers via nsenter to apply rootfs overlays and execute CRIU and CUDA restore operations.
  • Kubernetes-Native Workflow: Integrates with the Dynamo Operator via DynamoCheckpoint Custom Resources and pod labels (nvidia.com/snapshot-is-checkpoint-source, nvidia.com/snapshot-is-restore-target) for fully automated checkpoint lifecycle management.
  • Helm Chart: Namespace-scoped Helm chart installs the DaemonSet, checkpoint storage PVC, RBAC, and seccomp profile.

For more information about Dynamo Snapshot, please refer to the Snapshot documentation and the GitHub repository.
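
The two pod labels named in the component list can be used to see which pods the agent would treat as checkpoint sources or restore targets. The following is a minimal sketch, assuming kubectl access to the cluster. The function names and the "source"/"target" role names are hypothetical and exist only for this example. The selector matches on the label key alone, because the expected label values are not stated here.

```shell
# Map a role name (hypothetical, for this example only) to the label key
# named in the component list.
label_for_role() {
  case "$1" in
    source) echo "nvidia.com/snapshot-is-checkpoint-source" ;;
    target) echo "nvidia.com/snapshot-is-restore-target" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# List pods in all namespaces that carry the label, whatever its value.
list_snapshot_pods() {
  key="$(label_for_role "$1")" || return 1
  kubectl get pods --all-namespaces --selector "$key" --show-labels
}
```

For example, `list_snapshot_pods source` lists candidate checkpoint-source pods, and `list_snapshot_pods target` lists candidate restore targets.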

Release Info

For the complete release history, including CUDA support and architecture details, see the Release Artifacts page. The snapshot agent container is available for the x86_64 (AMD64) architecture only, because cuda-checkpoint does not have an ARM64 binary.

Getting Started

  1. Select the Tags tab and locate the container image release that you want to run.
  2. In the Pull Tag column, click the icon to copy the docker pull command.
  3. Open a command prompt and paste the pull command. Ensure the pull completes successfully.
  4. Deploy the snapshot agent on your Kubernetes cluster using the Helm chart:
     helm upgrade --install snapshot oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-snapshot \
       --namespace ${NAMESPACE} \
       --create-namespace \
       --set storage.pvc.create=true
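
After the install, a quick check along the following lines may help confirm that the release and its resources exist. This is a sketch under assumptions: it reuses the release name snapshot and the NAMESPACE variable from the command above, the function names are hypothetical, and the DaemonSet and PVC names are whatever the chart defines, so nothing is hard-coded.

```shell
# Fail early if NAMESPACE is unset or empty, since the helm command expands it.
require_namespace() {
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set" >&2
    return 1
  fi
}

# Show the release status and list the DaemonSet, PVC, and pods in the namespace.
verify_install() {
  require_namespace || return 1
  helm status snapshot --namespace "$NAMESPACE"
  kubectl get daemonset,pvc --namespace "$NAMESPACE"
  kubectl get pods --namespace "$NAMESPACE" -o wide
}
```

Calling `verify_install` after the helm command should show one agent pod per eligible node once the DaemonSet has rolled out.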

For next steps, including checkpoint configuration, DynamoCheckpoint CRD usage, and end-to-end restore workflows, please refer to the Snapshot guide.

Prerequisites

  • Dynamo Platform/Operator installed on a Kubernetes cluster with x86_64 (AMD64) GPU nodes
  • NVIDIA driver 580.xx or newer
  • containerd runtime
  • vLLM or SGLang backend (TensorRT-LLM is not supported yet)
  • ReadWriteMany storage for cross-node restore
  • Permission to run a privileged DaemonSet with hostPID, hostIPC, and hostNetwork enabled
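
Two of these prerequisites, the CPU architecture and the driver version, can be checked directly on a GPU node. The following is a rough preflight sketch, not an official tool: the function names are hypothetical, and it assumes nvidia-smi is on the PATH of the node where it runs. The remaining prerequisites (container runtime, backend, storage, and permissions) are not covered.

```shell
# Minimum driver major version, taken from the prerequisites list (580.xx).
MIN_DRIVER_MAJOR=580

# Succeeds if a driver version string such as "580.65.06" has major >= 580.
driver_ok() {
  major="${1%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric major version
  esac
  [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# Succeeds only for x86_64, the single supported architecture.
arch_ok() {
  [ "$1" = "x86_64" ]
}

# Check the local node; intended to be run on a GPU node.
preflight() {
  arch="$(uname -m)"
  arch_ok "$arch" || { echo "unsupported architecture: $arch"; return 1; }
  command -v nvidia-smi >/dev/null 2>&1 || { echo "nvidia-smi not found"; return 1; }
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  driver_ok "$driver" || { echo "driver $driver is older than ${MIN_DRIVER_MAJOR}.xx"; return 1; }
  echo "ok: $arch, driver $driver"
}
```

Source the file on a node and call `preflight`. It reports the first failed check and returns a non-zero status.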

Limitations

  • LLM workers only: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
  • Single-GPU only: Multi-GPU setups may work on basic hardware configurations but are not officially supported yet.
  • Network state: Active TCP connections cannot be checkpointed.
  • Architecture: x86_64 (AMD64) only, because cuda-checkpoint does not have an ARM64 binary.
  • Security: Runs as a privileged DaemonSet (required for CRIU and cuda-checkpoint). Workload pods do not need to be privileged.

Support Matrix

Please refer to the support matrix and feature matrix for detailed hardware, architecture, and backend support information.

Backend        Snapshot Support
vLLM           Supported
SGLang         Supported
TensorRT-LLM   Not yet supported

License

NVIDIA Dynamo is released under the Apache-2.0 open-source license, making it freely available for development, research, and deployment.

Publisher: NVIDIA
Latest Tag: 1.2.1
Updated: June 13, 2026 UTC
Compressed Size: 4.95 GB
Multinode Support: No
Multi-Arch Support: Yes