NVIDIA

Dynamo SGLang Runtime

Container

NVIDIA

Dynamo SGLang Runtime

The Dynamo SGLang runtime image is a containerized build of Dynamo + SGLang which serves as the base runtime environment for sglang based inference with Dynamo's distributed inference framework.

Overview

The Dynamo SGLang runtime container is a pre-packaged, Docker-based environment tailored for running NVIDIA Dynamo with the SGLang backend for high-performance, modular large language model (LLM) inference and serving. It packages all necessary dependencies, runtime components, and optimizations to streamline deployment and ensure consistency across development and production environments. Quick Links: Key Components | Release Info | Getting Started | Support

Key Components

SGLang Backend: Fast serving framework for large language models and vision language models with co-designed backend runtime and frontend language for faster, more controllable model interaction.
Disaggregated Serving (P/D): Separates prefill and decode phases across specialized workers for improved throughput and latency optimization.
Planner: SLA-aware request scheduling that routes requests based on latency targets and system load.
KV Router: Intelligent request routing with prefix-aware caching to maximize KV cache reuse across workers.
NIXL (KV Transfer Library): High-performance GPU-to-GPU memory transfer for distributed KV cache operations.
OpenAI-Compatible Frontend: HTTP API server compatible with OpenAI's chat completions and completions endpoints.
Kubernetes-Native Infrastructure: Service discovery via EndpointSlices and transport-agnostic request plane (TCP default) enable deployment without external dependencies. etcd and NATS remain available as optional alternatives for non-Kubernetes environments. For more information about Dynamo features, please refer to the GitHub repository and documentation.

Release Info

For the complete release history including SGLang versions, CUDA support, and architecture details, see the Release Artifacts page. Pre-built containers are available for both x86_64 (AMD64) and ARM64 architectures. CUDA 13 experimental variants are also available.

Getting Started

Select the Tags tab and locate the container image release that you want to run.
In the Pull Tag column, click the icon to copy the docker pull command.
Open a command prompt and paste the pull command. Ensure the pull completes successfully.
Run the container:

docker run --gpus all -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:<version>

For next steps, including deployment options and examples, please refer to the Dynamo README.

Support Matrix

Please refer to the support matrix for detailed hardware, architecture, and model support information.

Related Containers

vLLM Runtime - Broadest model and feature coverage
TensorRT-LLM Runtime - Maximum inference performance
Dynamo Frontend - Standalone frontend with EndpointPicker (EPP)
Kubernetes Operator - K8s deployment automation

License

NVIDIA Dynamo is released under the Apache-2.0 open-source license, making it freely available for development, research, and deployment.

Technical Support

Documentation: Dynamo Documentation
GitHub Issues: Dynamo GitHub Issues
Release Notes: GitHub Releases

Publisher

NVIDIA

Latest Tag1.2.1-efa-amd64

UpdatedJune 13, 2026 UTC

Compressed Size15.89 GB

Multinode SupportNo

Multi-Arch SupportYes

System

signed images

Labels

AI DL High Performance Computing Inference Infrastructure Software ML NSPECT-9EST-K1WZ NVIDIA AI