DCGM

Description: Manage and monitor GPUs in cluster environments.
Publisher: NVIDIA
Latest Tag: 3.0.4-1-ubi8
Modified: December 1, 2022
Compressed Size: 596.08 MB
Multinode Support: No
Multi-Arch Support: Yes

3.0.4-1-ubi8 (Latest) Scan Results: Linux / arm64, Linux / amd64

Introduction

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies such as power and clock management. Infrastructure teams can use it standalone, and it integrates with cluster management tools, resource schedulers, and monitoring products from NVIDIA partners.

This container image implements a standalone DCGM service. Clients can connect to the DCGM container to access the relevant functionality such as GPU health or telemetry. See the tags for the image flavors available.

Usage

The standalone DCGM container exposes the nv-hostengine service on port 5555. Clients, which interact with DCGM through libdcgm.so, connect to this port to access the functionality DCGM provides. This section presents some common use cases.

Access GPU Telemetry

In this scenario, the standalone DCGM container is started with the following command. Port 5555 is published on the host so that other clients can reach the nv-hostengine service running in the container. Note that gathering profiling metrics requires granting the SYS_ADMIN capability to the container:

$ docker run --gpus all \
   --cap-add SYS_ADMIN \
   -p 5555:5555 \
   nvidia/k8s/dcgm:2.2.3-ubuntu20.04

Now a client such as dcgmi dmon can stream GPU telemetry and metrics to the console.
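
As a minimal sketch of the client side, the script below builds and runs a dcgmi dmon command against the container started above. It assumes that the dcgmi client tools are installed on the machine running the script, that the container is reachable at localhost:5555, and that field IDs 150, 155 and 203 (GPU temperature, power usage and GPU utilization) are the metrics of interest. The DCGM_HOST and DCGM_PORT variables and the dmon_cmd helper are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Assumption: the DCGM container is running and publishes port 5555 on this
# address. Override DCGM_HOST / DCGM_PORT (hypothetical names) if it runs elsewhere.
DCGM_HOST="${DCGM_HOST:-localhost}"
DCGM_PORT="${DCGM_PORT:-5555}"

# dmon_cmd FIELD_IDS COUNT
# Prints a dcgmi dmon command line that samples the given comma-separated
# field IDs COUNT times from the remote nv-hostengine.
dmon_cmd() {
    printf 'dcgmi dmon --host %s:%s -e %s -c %s\n' \
        "$DCGM_HOST" "$DCGM_PORT" "$1" "$2"
}

# 150 = GPU temperature, 155 = power usage, 203 = GPU utilization
cmd=$(dmon_cmd 150,155,203 5)
echo "$cmd"

if command -v dcgmi >/dev/null 2>&1; then
    # Left unquoted on purpose so the shell splits the string into arguments.
    $cmd || echo "dcgmi could not reach ${DCGM_HOST}:${DCGM_PORT}" >&2
else
    echo "dcgmi not found; install the DCGM client tools to run the command above" >&2
fi
```

Profiling fields can be requested the same way through -e, provided the container was started with SYS_ADMIN as described above.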

GPU Health

In this scenario, DCGM does not need any additional capabilities and can run unprivileged:

$ docker run --gpus all \
   -p 5555:5555 \
   nvidia/k8s/dcgm:2.2.3-ubuntu20.04

The DCGM APIs for reporting health can now be accessed through clients connecting to the DCGM container.
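
One way to exercise the health APIs from the command line is through dcgmi health, sketched below. The script assumes that dcgmi is installed on the client, that the container is reachable at localhost:5555, and that omitting -g makes dcgmi operate on its default group of all GPUs. The DCGM_ADDR variable and the health_cmd helper are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Assumption: the DCGM container publishes nv-hostengine on this address.
DCGM_ADDR="${DCGM_ADDR:-localhost:5555}"

# health_cmd ARGS
# Prints a dcgmi health command line aimed at the remote nv-hostengine.
health_cmd() {
    printf 'dcgmi health --host %s %s\n' "$DCGM_ADDR" "$1"
}

set_watches=$(health_cmd "-s a")   # enable all health watches
run_check=$(health_cmd "-c")       # report the current health status

echo "$set_watches"
echo "$run_check"

if command -v dcgmi >/dev/null 2>&1; then
    # Left unquoted on purpose so the shell splits each string into arguments.
    if $set_watches && $run_check; then
        :
    else
        echo "dcgmi health failed against ${DCGM_ADDR}" >&2
    fi
else
    echo "dcgmi not found; install the DCGM client tools to run the commands above" >&2
fi
```

Watches are enabled first because a health check only reports on subsystems that are being watched. To watch a subset rather than everything, replace the a flag with the flags listed in dcgmi health --help.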

Suggested Reading

For more information on DCGM and documentation, visit the product page: https://developer.nvidia.com/dcgm.

License