Architectures: Linux / amd64, Linux / arm64
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies, including power and clock management. It can be used standalone by infrastructure teams and integrates easily into cluster management, resource scheduling, and monitoring products from NVIDIA partners.
This container image provides a standalone DCGM service. Clients can connect to the DCGM container to access functionality such as GPU health checks or telemetry. See the tags for the available image flavors.
The standalone DCGM container exposes the nv-hostengine service on port 5555. Clients, which interact with DCGM through libdcgm.so, can connect to this port to access the functionality provided by DCGM. This section presents some common use cases.
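As a quick connectivity check, a client with the dcgmi CLI installed (it ships with the DCGM packages) can list the GPUs visible to the containerized nv-hostengine. This is a minimal sketch that assumes the container's port 5555 is published on localhost, as in the examples that follow:
$ # List the GPUs managed by the remote nv-hostengine (5555 is the default port)
$ dcgmi discovery --host localhost -l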
In this scenario, the standalone DCGM container is started with the following command, where port 5555 is mapped to the host so that clients outside the container can reach the nv-hostengine service. Note that gathering profiling metrics requires the SYS_ADMIN capability to be added to the container:
$ docker run --gpus all \
--cap-add SYS_ADMIN \
-p 5555:5555 \
nvidia/k8s/dcgm:2.2.3-ubuntu20.04
A client such as dcgmi dmon can now stream GPU telemetry to the console.
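For example, the following sketch streams two common fields, GPU temperature (field ID 150) and power usage (field ID 155), sampled once per second; the localhost address assumes the port mapping shown above:
$ # Stream GPU temperature (150) and power draw (155) every 1000 ms
$ dcgmi dmon --host localhost -e 150,155 -d 1000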
In this scenario, where DCGM is used only for health monitoring, the container doesn't need any additional capabilities and can run unprivileged:
$ docker run --gpus all \
-p 5555:5555 \
nvidia/k8s/dcgm:2.2.3-ubuntu20.04
Clients connecting to the DCGM container can now access the DCGM health-reporting APIs.
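For instance, a client could enable the built-in background health watches and then poll for results. This is a sketch using dcgmi, assuming the default GPU group and the port mapping shown above:
$ # Enable all background health watches for the default group
$ dcgmi health --host localhost -s a
$ # Run a health check and report any incidents
$ dcgmi health --host localhost -c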
For more information and documentation on DCGM, visit the product page: https://developer.nvidia.com/dcgm.