vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
NGC Containers are one of the easiest ways to get started with vLLM. The vLLM NGC Container comes with all dependencies included, providing an easy place to start developing and deploying common applications, such as conversational AI, natural language processing (NLP), recommenders, and computer vision.
The vLLM NGC Container is optimized for GPU acceleration, and contains a validated set of libraries that enable and optimize GPU performance.
Prerequisites
Using the vLLM NGC Container requires the host system to have the following installed:
- Docker Engine
- NVIDIA GPU Drivers
- NVIDIA Container Toolkit
For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.
No other installation, compilation, or dependency management is required. It is not necessary to install the NVIDIA CUDA Toolkit.
Running vLLM Using Docker
To run a container, issue the appropriate command as explained in the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide and specify the registry, repository, and tags. For more information about using NGC, refer to the NGC Container User Guide.
If you have Docker 19.03 or later, a typical command to launch the container is:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:xx.xx-py3
If you have Docker 19.02 or earlier, a typical command to launch the container is:
nvidia-docker run -it --rm nvcr.io/nvidia/vllm:xx.xx-py3
Where:
xx.xx is the container version. For example, 25.09.
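The choice between the two launch commands depends only on the Docker version. A minimal Python sketch of that decision follows, assuming the version string looks like the one Docker reports (for example `24.0.7`). The helper names `parse_docker_version` and `launch_command` are hypothetical and not part of any NVIDIA tooling.

```python
import re
import shlex


def parse_docker_version(text):
    """Extract (major, minor) from a Docker version string such as '19.03.5'."""
    match = re.match(r"\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized Docker version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def launch_command(docker_version, tag):
    """Return the launch command (as an argument list) for a Docker version."""
    image = f"nvcr.io/nvidia/vllm:{tag}-py3"
    if parse_docker_version(docker_version) >= (19, 3):
        # Docker 19.03 and later expose GPUs through the --gpus flag.
        return ["docker", "run", "--gpus", "all", "-it", "--rm", image]
    # Older releases go through the nvidia-docker wrapper instead.
    return ["nvidia-docker", "run", "-it", "--rm", image]


# Print the command for an example Docker version and container tag.
print(shlex.join(launch_command("24.0.7", "25.09")))
```

The argument list can be passed to `subprocess.run` as-is, which avoids shell quoting problems.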
vLLM can be deployed in a client–server configuration. Start the HTTP inference server inside the container:
python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
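Loading the model can take a while, so a client may need to wait until the server accepts requests. A minimal sketch of a readiness check in Python follows. It assumes the server listens on port 8000 and answers `GET /v1/models` (the model-listing route of the OpenAI-compatible API) once it is ready. The helper names, the timeout, and the polling interval are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or HTTP error: not ready yet.
        return False


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0, probe=server_is_up):
    """Poll the server until the probe succeeds or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_ready("http://localhost:8000")` returns `True` as soon as the server responds and `False` if the timeout expires first. The `probe` parameter exists only so the polling loop can be exercised without a running server.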
From a client, issue a text-generation request by POST-ing to /v1/chat/completions with a JSON body containing the model name and the chat messages:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"messages": [{"role":"user", "content": "What is NVIDIA famous for?"}]
}'
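The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above. It assumes the server is reachable at `http://localhost:8000` and that the response follows the OpenAI chat-completions layout (`choices[0].message.content`). The function names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="nvidia/Llama-3.1-8B-Instruct-FP8",
                       base_url="http://localhost:8000"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the text of the first completion."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout described in the lead-in.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is NVIDIA famous for?"))` prints the generated reply. Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs.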
Notes:
- Some Hugging Face models require Hugging Face (HF) authentication:
hf auth login --token [YOUR_TOKEN]
Refer to the Hugging Face user access tokens guide (https://huggingface.co/docs/hub/en/security-tokens#user-access-tokens) to learn how to generate an HF token.
- On Jetson Thor and NVIDIA DGX Spark, if the server runs out of memory (OOM) at startup, you can drop the Linux page cache before starting the vLLM server. Writing to /proc/sys/vm/drop_caches requires root privileges:
sync && echo 3 > /proc/sys/vm/drop_caches
See /workspace/README.md inside the container for information on getting started and customizing your vLLM image.
To use data and model files stored outside the container, the easiest method is to mount one or more host directories as Docker bind mounts. For example:
docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/vllm:xx.xx-py3
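Docker expects the host side of a bind mount to be an absolute path. A bare name without a slash is interpreted as a named volume instead. A minimal sketch that builds the command above with absolute host paths follows. The helper name `docker_run_with_mounts` and the example paths are hypothetical.

```python
import os
import shlex


def docker_run_with_mounts(tag, mounts):
    """Build a `docker run` argument list with one -v flag per (host, container) pair."""
    cmd = ["docker", "run", "--gpus", "all", "-it", "--rm"]
    for host_dir, container_dir in mounts:
        # Expand ~ and resolve relative paths so Docker sees an absolute host path.
        host_abs = os.path.abspath(os.path.expanduser(host_dir))
        cmd += ["-v", f"{host_abs}:{container_dir}"]
    cmd.append(f"nvcr.io/nvidia/vllm:{tag}-py3")
    return cmd


# Example: mount a models directory from the home directory at /models.
print(shlex.join(docker_run_with_mounts("25.09", [("~/models", "/models")])))
```

Several directories can be mounted by passing more pairs in `mounts`. Each pair produces its own `-v` flag.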
What Is In This Container?
For the full list of contents, see the vLLM Container Release Notes.
This container image contains the complete source of the included vLLM version in /opt/vllm. vLLM is pre-built and installed in the default system Python environment (/usr/local/lib/python3.12/dist-packages/vllm) in the container image. Visit vLLM Docs to learn more about vLLM.
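To confirm which vLLM build is active inside the container, the installed version and import location can be queried from Python. A small sketch using only the standard library follows. The helper name `describe_package` is hypothetical, and the location reported on a given image may differ from the path quoted above.

```python
import importlib.metadata
import importlib.util


def describe_package(name):
    """Return the version and import location of a package, or None if it is absent."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata is registered under this name.
        version = None
    return {"version": version, "location": spec.origin}


# Prints None when vLLM is not installed in the current environment.
print(describe_package("vllm"))
```

Because `find_spec` only locates the package and does not import it, the check is quick and does not initialize the GPU.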
The NVIDIA vLLM Container is optimized for use with NVIDIA GPUs, and contains the following software for GPU acceleration:
- CUDA
- cuBLAS
- NVIDIA cuDNN
- NVIDIA NCCL (optimized for NVLink)
- NVIDIA Data Loading Library (DALI)
- PyTorch
The software stack in this container has been validated for compatibility, and does not require any additional installation or compilation from the end user. This container can help accelerate your deep learning workflow from end to end.
Suggested Reading
For the latest Release Notes, see the vLLM Release Notes.
For a full list of the supported software and specific versions that come packaged with this framework based on the container image, see the Frameworks Support Matrix.
For more information about vLLM, including tutorials, documentation, and examples, see the vLLM Docs.
Security Common Vulnerabilities and Exposures (CVEs)
Please review the Security Scanning tab on NGC to view the latest security scan results. For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning tab.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
License
By pulling and using the container, you accept the terms and conditions of this End User License Agreement and Product-Specific Terms.