
vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

NGC Containers are one of the easiest ways to get started with vLLM. The vLLM NGC Container comes with all dependencies included, providing an easy place to start developing and deploying common applications, such as conversational AI, natural language processing (NLP), recommenders, and computer vision.

The vLLM NGC Container is optimized for GPU acceleration, and contains a validated set of libraries that enable and optimize GPU performance.

Prerequisites

Using the vLLM NGC Container requires the host system to have the following installed:

  • Docker Engine
  • NVIDIA GPU Drivers
  • NVIDIA Container Toolkit

For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.

No other installation, compilation, or dependency management is required. It is not necessary to install the NVIDIA CUDA Toolkit.

Running vLLM Using Docker

To run a container, issue the appropriate command as explained in the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide and specify the registry, repository, and tags. For more information about using NGC, refer to the NGC Container User Guide.

If you have Docker 19.03 or later, a typical command to launch the container is:

docker run --gpus all -it --rm nvcr.io/nvidia/vllm:xx.xx-py3

If you have Docker 19.02 or earlier, a typical command to launch the container is:

nvidia-docker run -it --rm nvcr.io/nvidia/vllm:xx.xx-py3

Where:

  • xx.xx is the container version. For example, 25.09.

vLLM can be deployed in a client–server configuration. Start the HTTP inference server inside the container:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
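
Loading the model can take a while, so a client may want to wait until the server answers before sending requests. The following is a minimal sketch, not part of the container. It assumes the server exposes a `/health` endpoint and listens on port 8000 (the port used in the request example on this page). The helper names `server_ready` and `wait_for_server` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_for_server(base_url: str, attempts: int = 60, delay: float = 5.0,
                    probe=server_ready) -> bool:
    """Poll the server until it is ready or the attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # Back off before the next probe.
    return False
```

For example, `wait_for_server("http://localhost:8000")` returns `True` once the health check succeeds, or `False` after all attempts have failed. The `probe` parameter exists only so that the polling logic can be exercised without a running server.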

From a client, issue a chat request by sending a POST to /v1/chat/completions with a JSON body containing the model name and the messages:

curl -X 'POST' \
'http://localhost:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'
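
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command above. It assumes the server is reachable at the base URL you pass in and that the response follows the OpenAI chat-completions layout (`choices[0].message.content`). The function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Return the assistant text from a chat-completions response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt to a running server and return the reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(resp.read())
```

With the server from the previous step running, `ask("http://localhost:8000", "nvidia/Llama-3.1-8B-Instruct-FP8", "What is NVIDIA famous for?")` returns the generated text. Keeping request construction and response parsing in separate functions makes both easy to check without a live server.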

Notes:

  1. Some Hugging Face models require authentication. Log in with an access token before starting the server:
hf auth login --token [YOUR_TOKEN]

See the Hugging Face guide at https://huggingface.co/docs/hub/en/security-tokens#user-access-tokens for how to generate an access token.

  2. On Jetson Thor and NVIDIA DGX Spark, if the server runs out of memory (OOM) at startup, try dropping the page cache before launching the vLLM server (this requires root privileges):
sync && echo 3 > /proc/sys/vm/drop_caches
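
One way to judge whether dropping caches is likely to help is to look at how much memory the page cache currently holds. The sketch below parses text in the format of Linux `/proc/meminfo` and reports three standard fields in GiB. It is an illustration only: the function names are hypothetical, and it does not drop any caches itself.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name: <number> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary_gib(text: str) -> dict:
    """Return total, available, and page-cache memory in GiB."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 * 1024
    return {
        key: round(info.get(key, 0) / kib_per_gib, 2)
        for key in ("MemTotal", "MemAvailable", "Cached")
    }
```

On a Linux host, `memory_summary_gib(open("/proc/meminfo").read())` gives a quick view: a large `Cached` value next to a small `MemAvailable` value suggests the page cache is worth clearing before starting the server.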

See /workspace/README.md inside the container for information on getting started and customizing your vLLM image.

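You might want to pull in data and model descriptions from locations outside the container for use by vLLM. The easiest way to do this is to mount one or more host directories as Docker bind mounts. Docker expects container_dir to be an absolute path, and treats a local_dir that is not an absolute path as a named volume rather than a host directory. For example: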

docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/vllm:xx.xx-py3
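
When the launch command is generated by a script, the mount arguments can be assembled programmatically. The sketch below builds the argument list for the command above from a container version and a mapping of host directories to container paths. The function name `docker_run_command` and its parameters are hypothetical, and the code only constructs the command; it does not run Docker.

```python
from pathlib import Path
from typing import Dict, List, Optional


def docker_run_command(tag: str, mounts: Optional[Dict[str, str]] = None) -> List[str]:
    """Build the argv for launching the vLLM container with bind mounts.

    tag is the container version (the xx.xx part of the image tag).
    mounts maps a host directory to an absolute path inside the container.
    """
    command = ["docker", "run", "--gpus", "all", "-it", "--rm"]
    for host_dir, container_dir in (mounts or {}).items():
        if not container_dir.startswith("/"):
            raise ValueError(f"container path must be absolute: {container_dir}")
        # Docker treats a non-absolute host path as a named volume,
        # so resolve it to an absolute path first.
        host_path = Path(host_dir).expanduser().resolve()
        command += ["-v", f"{host_path}:{container_dir}"]
    command.append(f"nvcr.io/nvidia/vllm:{tag}-py3")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Building a list instead of a single string avoids shell quoting problems when a directory name contains spaces.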

What Is In This Container?

For the full list of contents, see the vLLM Container Release Notes.

This container image includes the complete source of its bundled vLLM version in /opt/vllm. vLLM is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
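
To confirm which vLLM build a given Python environment picks up, the installed version and import location can be queried without importing the package itself. This is a minimal sketch using the standard library; the helper name `describe_package` is hypothetical.

```python
import importlib.metadata
import importlib.util
from typing import Optional


def describe_package(name: str) -> dict:
    """Report the installed version and import location of a package."""
    try:
        version: Optional[str] = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # No installed distribution metadata under this name.
        version = None
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else None
    return {"name": name, "version": version, "location": location}
```

Inside the container, `describe_package("vllm")` should report a location under the dist-packages directory named above. Outside the container, both fields are `None` if vLLM is not installed.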

The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs, and contains the following software for GPU acceleration:

The software stack in this container has been validated for compatibility, and does not require any additional installation or compilation from the end user. This container can help accelerate your deep learning workflow from end to end.

Link to Open Source Code

Suggested Reading

For the latest Release Notes, see the vLLM Release Notes.

For a full list of the supported software and specific versions that come packaged with this framework based on the container image, see the Frameworks Support Matrix.

For more information about vLLM, including tutorials, documentation, and examples, see:

Security Common Vulnerabilities and Exposures (CVEs)

Please review the Security Scanning tab on NGC to view the latest security scan results. For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning tab.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

License

By pulling and using the container, you accept the terms and conditions of this End User License Agreement and Product-Specific Terms.

Publisher: NVIDIA
Latest Tag: 26.05.post1-py3
Updated: June 2, 2026 UTC
Compressed Size: 8.94 GB
Multinode Support: Yes
Multi-Arch Support: Yes
