NVIDIA
NVIDIA
Dynamo Tensorrt-LLM Runtime
Container
NVIDIA
NVIDIA
Dynamo Tensorrt-LLM Runtime

The Dynamo TensorRT-LLM runtime image is a containerized build of Dynamo + TensorRT-LLM which serves as the base runtime environment for tensorrt-llm based inference with Dynamo's distributed inference framework.

Overview

The Dynamo TensorRT-LLM runtime container is a pre-built, Docker-based environment designed to run NVIDIA Dynamo with the TensorRT-LLM backend for maximum inference performance on NVIDIA GPUs. It packages all necessary dependencies, runtime components, and optimizations to streamline deployment and ensure consistency across development and production environments. Quick Links: Key Components | Release Info | Getting Started | Support

Key Components

  • TensorRT-LLM Backend: Open-source library for optimizing Large Language Model (LLM) inference with state-of-the-art optimizations for maximum performance on NVIDIA GPUs.
  • Disaggregated Serving (P/D): Separates prefill and decode phases across specialized workers for improved throughput and latency optimization.
  • Planner: SLA-aware request scheduling that routes requests based on latency targets and system load.
  • KV Router: Intelligent request routing with prefix-aware caching to maximize KV cache reuse across workers.
  • NIXL (KV Transfer Library): High-performance GPU-to-GPU memory transfer for distributed KV cache operations.
  • OpenAI-Compatible Frontend: HTTP API server compatible with OpenAI's chat completions and completions endpoints.
  • Kubernetes-Native Infrastructure: Service discovery via EndpointSlices and transport-agnostic request plane (TCP default) enable deployment without external dependencies. etcd and NATS remain available as optional alternatives for non-Kubernetes environments. For more information about Dynamo features, please refer to the GitHub repository and documentation.

Release Info

For the complete release history including TensorRT-LLM versions, CUDA support, and architecture details, see the Release Artifacts page. Pre-built containers are available for both x86_64 (AMD64) and ARM64 architectures.

Getting Started

  1. Select the Tags tab and locate the container image release that you want to run.
  2. In the Pull Tag column, click the icon to copy the docker pull command.
  3. Open a command prompt and paste the pull command. Ensure the pull completes successfully.
  4. Run the container:
docker run --gpus all -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:<version>

For next steps, including deployment options and examples, please refer to the Dynamo README.

Support Matrix

Please refer to the support matrix for detailed hardware, architecture, and model support information.

Related Containers

License

NVIDIA Dynamo is released under the Apache-2.0 open-source license, making it freely available for development, research, and deployment.

Technical Support

Publisher
NVIDIA
NVIDIA
Latest Tag1.2.1
UpdatedJune 13, 2026 UTC
Compressed Size18.34 GB
Multinode SupportNo
Multi-Arch SupportYes

NVIDIA uses cookies to improve your experience on our web site. We and our third-party partners also use cookies and other tools to collect and record information you provide as well as information about your interactions with our websites for performance improvement, analytics, and to assist in marketing efforts. By clicking "Accept All", you consent to our use of cookies and other tools as described in our Cookie Policy. You can manage your cookie settings by clicking on "Manage Settings." By continuing to use this site or by clicking one of the buttons below, you agree to our Terms of Service (which contains important waivers). Please see our Privacy Policy for more information on our privacy practices.