The Dynamo vLLM runtime container is a pre-built, Docker-based environment for running NVIDIA Dynamo with the vLLM backend for high-performance, distributed large language model (LLM) inference. It packages all necessary dependencies, runtime components, and optimizations to streamline deployment and ensure consistency across development and production environments.
Key Components
vLLM Backend: Provides fast, efficient LLM inference, leveraging vLLM’s optimized attention and KV cache management.
Dynamo Core Services: Includes the HTTP API server, request router, and worker processes for prefill and decode phases.
Supporting Services: Integrates with etcd and NATS for distributed coordination and messaging.
OpenAI-Compatible Frontend: Exposes an HTTP API compatible with OpenAI’s endpoints for easy integration.
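As a rough sketch of what the OpenAI-compatible frontend accepts, the request below targets the chat completions route. The host and port are assumptions (the listen address depends on how the frontend is configured); the model name matches the example used later on this page.

# Hypothetical request; adjust the host and port to match your frontend configuration.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'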
For more information about Dynamo features, refer to the GitHub repository.
Select the Tags tab and locate the container image release that you want to run.
In the Pull Tag column, click the icon to copy the docker pull command.
Open a command prompt and paste the pull command. Docker begins pulling the container image; ensure the pull completes successfully before proceeding to the next step.
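The copied command has the general shape shown below; the registry path and tag here are placeholders, since the exact values come from the Pull Tag column.

# Placeholder image path and tag; use the exact command copied from NGC.
docker pull nvcr.io/<org>/<image>:<tag>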
From the root of the Dynamo repository, start the required services (etcd and NATS) using Docker Compose:
docker compose -f deploy/docker-compose.yml up -d
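One way to confirm both services are up before starting the container, assuming the default client and monitoring ports are exposed by the compose file:

# List the compose-managed containers and check their status.
docker compose -f deploy/docker-compose.yml ps
# etcd answers on its client port (2379 by default).
curl http://localhost:2379/health
# NATS serves monitoring info on port 8222 when monitoring is enabled.
curl http://localhost:8222/varz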
Run the container image and start Dynamo inside it:
dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
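Put together, a full invocation might look like the sketch below. The image placeholder and flags are assumptions: --gpus all exposes the host GPUs to vLLM, and --network host lets the container reach the etcd and NATS services started above.

# Illustrative; substitute the image and tag you pulled earlier.
docker run --rm -it --gpus all --network host \
  <image>:<tag> \
  dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B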
For more examples, refer to the examples directory in the repository.
Refer to the support matrix to learn more about current hardware and architecture support. Note that Dynamo currently provides pre-built containers for x86_64 only.
NVIDIA Dynamo is released under the Apache-2.0 open-source license, making it freely available for development, research, and deployment.
To report bugs or ask questions, open an issue on Dynamo GitHub Issues.