TensorRT-LLM Release

Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Publisher: NVIDIA
Latest Tag: 0.20.0rc4
Modified: May 24, 2025
Compressed Size: 27.85 GB
Multinode Support: No
Multi-Arch Support: Yes
0.20.0rc4 (Latest) Security Scan Results

Scan results are available for Linux / amd64 and Linux / arm64; see the Security Scanning tab for the detailed reports.


Overview

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Visit the official GitHub repository for more information: https://github.com/NVIDIA/TensorRT-LLM.
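
As a quick illustration of the Python API, the high-level LLM API can be used along the following lines. This is a minimal sketch: the model name is only an example, and any supported Hugging Face checkpoint or pre-built TensorRT-LLM engine path can be passed instead.

from tensorrt_llm import LLM, SamplingParams

# Example checkpoint; substitute any supported model or engine directory.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# Generate a completion for each prompt and print the generated text.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)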

Running TensorRT-LLM Using Docker

If you have Docker 19.03 or later, a typical command to launch the container is:

docker run --gpus all -it --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorrt-llm/release:x.xx.x

If you have Docker 19.02 or earlier, a typical command to launch the container is:

nvidia-docker run -it --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorrt-llm/release:x.xx.x

where x.xx.x is the container version, for example, 0.20.0. TensorRT-LLM binaries are installed at /app/tensorrt_llm, and users can import it as a Python module:

$ python
>>> import tensorrt_llm
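
For example, the installed version can be checked from the same Python session:

>>> print(tensorrt_llm.__version__)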

Developing TensorRT-LLM Using Docker

To develop TensorRT-LLM using Docker, users can mount their local code directory into the container.

# Clone the repository
git clone git@github.com:NVIDIA/TensorRT-LLM.git <repo>

# Launch the docker container, bind-mounting the local repository path (${path})
# into the container at the same location
nvidia-docker run -it --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --mount type=bind,source=${path},target=${path} \
  nvcr.io/nvidia/tensorrt-llm/release:x.xx.x

After entering the container, users can build the wheel so that their modifications take effect.

# Build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py

Refer to the documentation https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html#build-tensorrt-llm for more information.

Build TensorRT-LLM Docker Container

Refer to the documentation https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html#building-a-tensorrt-llm-docker-image for detailed instructions.

Suggested Reading

For the latest release notes, see the TensorRT-LLM Release Notes.

For a full list of supported models and hardware, see the Support Matrix.

For more information about TensorRT-LLM, including documentation, source code, and examples, see https://github.com/NVIDIA/TensorRT-LLM and https://nvidia.github.io/TensorRT-LLM/latest/index.html.

Security CVEs

To review known CVEs on this image, refer to the Security Scanning tab on this page.

License

By pulling and using the container, you accept the terms and conditions of this End User License Agreement and Product-Specific Terms.