vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

NGC Containers are one of the easiest ways to get started with vLLM. The vLLM NGC Container comes with all dependencies included, providing an easy place to start developing and deploying common applications, such as conversational AI, natural language processing (NLP), recommenders, and computer vision.

The vLLM NGC Container is optimized for GPU acceleration, and contains a validated set of libraries that enable and optimize GPU performance.

Prerequisites

Using the vLLM NGC Container requires the host system to have Docker Engine, NVIDIA GPU drivers, and the NVIDIA Container Toolkit installed.

For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.

No other installation, compilation, or dependency management is required. It is not necessary to install the NVIDIA CUDA Toolkit.

Running vLLM Using Docker

To run a container, issue the appropriate command as explained in the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide and specify the registry, repository, and tags. For more information about using NGC, refer to the NGC Container User Guide.

If you have Docker 19.03 or later, a typical command to launch the container is:

docker run --gpus all -it --rm nvcr.io/nvidia/vllm:xx.xx-py3

If you have Docker 19.02 or earlier, a typical command to launch the container is:

nvidia-docker run -it --rm nvcr.io/nvidia/vllm:xx.xx-py3

Where:

xx.xx is the container version. For example, 25.09.

vLLM can be deployed in a client–server configuration. Start the HTTP inference server inside the container:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
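The server may need some time to load model weights before it accepts requests. The following is a minimal sketch of a readiness check, assuming the server answers GET /health on port 8000 of the local host. The URL, the helper names is_healthy and wait_until_ready, and the retry timings are illustrative assumptions, not part of the container documentation.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=2.0):
    """Return True if a GET to the (assumed) health URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=is_healthy, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() before sending the first request avoids connection errors while the model is still loading. The probe and sleep parameters are injectable so the loop can be exercised without a running server.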

From a client, issue a text-generation request by POST-ing to /v1/chat/completions with a JSON body containing the model name and the chat messages:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'
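The same request can be issued from Python using only the standard library. This is a sketch under stated assumptions: the server is reachable at http://localhost:8000, and the helper names build_chat_request and chat are hypothetical. The reply is read from choices[0].message.content, which is where OpenAI-style chat responses place the generated text.

```python
import json
import urllib.request

# Assumed defaults: the server started above, reachable on local port 8000.
BASE_URL = "http://localhost:8000"
MODEL = "nvidia/Llama-3.1-8B-Instruct-FP8"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    # OpenAI-style responses carry the generated text in this field.
    return reply["choices"][0]["message"]["content"]
```

With the server running, chat("What is NVIDIA famous for?") returns the model's answer as a string. Building the request separately from sending it makes the payload easy to inspect.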

Notes:

Using some Hugging Face models requires Hugging Face authentication:

hf auth login --token [YOUR_TOKEN]

Refer to this guide (https://huggingface.co/docs/hub/en/security-tokens#user-access-tokens) for how to generate Hugging Face tokens.

On Jetson Thor and NVIDIA DGX Spark, if you run into out-of-memory (OOM) issues when starting the server, you might want to clear the memory cache before running the vLLM server (writing to /proc/sys/vm/drop_caches requires root privileges):

sync && echo 3 > /proc/sys/vm/drop_caches
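To check whether dropping the caches freed memory, one option is to compare the MemAvailable value in /proc/meminfo before and after the command. The sketch below assumes a Linux host; the helper names parse_meminfo and available_gib are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; the unit suffix is discarded.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_gib(meminfo_path="/proc/meminfo"):
    """Return MemAvailable in GiB, read from the given meminfo file."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    return info["MemAvailable"] / (1024 ** 2)
```

Calling available_gib() once before and once after the drop_caches command shows how much memory was reclaimed.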

See /workspace/README.md inside the container for information on getting started and customizing your vLLM image.

You might want to pull in data and model descriptions from locations outside the container for use by vLLM. To accomplish this, the easiest method is to mount one or more host directories as Docker bind mounts. For example:

docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/vllm:xx.xx-py3
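When several directories need to be mounted, assembling the command programmatically can reduce typos. The following sketch builds the same docker run invocation as an argument list; the function name docker_run_command, the example paths, and the example tag are hypothetical. Note that Docker treats a -v source that is not an absolute path as a named volume, so host directories are best given as absolute paths.

```python
import shlex


def docker_run_command(tag, mounts=None, image="nvcr.io/nvidia/vllm"):
    """Return the docker argv list for launching the container.

    mounts maps host directories to container directories (bind mounts).
    """
    cmd = ["docker", "run", "--gpus", "all", "-it", "--rm"]
    for host_dir, container_dir in (mounts or {}).items():
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(f"{image}:{tag}")
    return cmd


# Hypothetical paths and tag, for illustration only.
example = docker_run_command("25.09-py3", {"/data/models": "/workspace/models"})
print(shlex.join(example))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues for paths that contain spaces.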

What Is In This Container?

For the full list of contents, see the vLLM Container Release Notes.

This container image contains the complete source of the version of vLLM in /opt/vllm. It is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
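To confirm which vLLM build is installed in the container's Python environment, the package metadata can be queried. This is a small sketch, assuming the installed distribution is registered under the name vllm; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package="vllm"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```

Run inside the container, this should print the vLLM version string; outside it, where vLLM is not installed, it prints None.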

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs and contains software for GPU acceleration, which is listed in the vLLM Container Release Notes.

The software stack in this container has been validated for compatibility, and does not require any additional installation or compilation from the end user. This container can help accelerate your deep learning workflow from end to end.

Security Common Vulnerabilities and Exposures (CVEs)

Please review the Security Scanning tab on NGC to view the latest security scan results. For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning tab.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

License

By pulling and using the container, you accept the terms and conditions of this End User License Agreement and Product-Specific Terms.

Publisher

NVIDIA

Latest Tag: 26.05.post1-py3

Updated: June 2, 2026 UTC

Compressed Size: 8.94 GB

Multinode Support: Yes

Multi-Arch Support: Yes

System

signed images

Labels

AI Conversational AI DL DLFW High Performance Computing HPC / Supercomputing Inference ML Natural Language Processing Natural Language Understanding NSPECT-EQZO-3F6K NVIDIA AI Question Answering Translation
