DCGM

Description: Manage and Monitor GPUs in Cluster Environments.
Publisher: NVIDIA
Latest Tag: 3.3.6-1-ubi9
Modified: May 20, 2024
Compressed Size: 988.61 MB
Multinode Support: No
Multi-Arch Support: Yes
Security scan results for 3.3.6-1-ubi9 (Latest) are published for Linux / arm64 and Linux / amd64.

Introduction

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners.

This container image implements a standalone DCGM service. Clients can connect to the DCGM container to access the relevant functionality such as GPU health or telemetry. See the tags for the image flavors available.

Usage

The standalone DCGM container exposes the nv-hostengine service on port 5555. Clients, which interact with DCGM through libdcgm.so, connect to this port to access the functionality DCGM provides. This section presents some common use cases.

Access GPU Telemetry

In this scenario, the standalone DCGM container is started with the following command. Port 5555 is published to the host so that other clients can reach the nv-hostengine service running in the container. Note that gathering profiling metrics requires granting the SYS_ADMIN capability to the container:

$ docker run --gpus all \
   --cap-add SYS_ADMIN \
   -p 5555:5555 \
   nvidia/k8s/dcgm:2.2.3-ubuntu20.04

Now a client such as dcgmi dmon can stream GPU telemetry and metrics to the console.
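
As a rough illustration of the client side, the sketch below streams two device fields from the host engine published on port 5555. It assumes dcgmi is installed on the client machine and that the container is reachable at localhost:5555; the DCGM_HOST variable and the choice of fields (150 for GPU temperature, 155 for power usage) are illustrative assumptions, not part of the original instructions.

```shell
#!/bin/sh
# Assumed address of the nv-hostengine published by the container above.
DCGM_HOST="${DCGM_HOST:-localhost:5555}"

# Stream field 150 (GPU temperature) and 155 (power usage),
# sampling every 1000 ms and stopping after 5 samples.
DMON_CMD="dcgmi dmon --host $DCGM_HOST -e 150,155 -d 1000 -c 5"

if command -v dcgmi >/dev/null 2>&1; then
    # Run the client; report a connection problem instead of failing silently.
    $DMON_CMD || echo "dcgmi could not reach $DCGM_HOST" >&2
else
    # dcgmi is not installed here, so only show what would be run.
    echo "dcgmi not installed; would run: $DMON_CMD"
fi
```

Dropping the -c option lets dmon keep streaming until it is interrupted.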

GPU Health

In this scenario, DCGM does not need any additional capabilities and can run unprivileged:

$ docker run --gpus all \
   -p 5555:5555 \
   nvidia/k8s/dcgm:2.2.3-ubuntu20.04

The DCGM APIs for reporting health can now be accessed through clients connecting to the DCGM container.
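
One way a client might exercise the health functionality is through the dcgmi health subcommand, sketched below. This assumes dcgmi is available on the client, that the container is reachable at localhost:5555, and that group 0 is the default group containing all GPUs (worth confirming with dcgmi group -l); DCGM_HOST and GROUP_ID are hypothetical variable names used only for this example.

```shell
#!/bin/sh
# Assumed address of the nv-hostengine published by the container above.
DCGM_HOST="${DCGM_HOST:-localhost:5555}"
# Assumed to be the default all-GPUs group; verify with `dcgmi group -l`.
GROUP_ID="${GROUP_ID:-0}"

# Enable all health watches on the group, then query current health.
SET_CMD="dcgmi health --host $DCGM_HOST -g $GROUP_ID -s a"
CHECK_CMD="dcgmi health --host $DCGM_HOST -g $GROUP_ID -c"

if command -v dcgmi >/dev/null 2>&1; then
    # Only check health if the watches were set successfully.
    if $SET_CMD; then
        $CHECK_CMD
    else
        echo "could not set health watches on $DCGM_HOST" >&2
    fi
else
    # dcgmi is not installed here, so only show what would be run.
    echo "dcgmi not installed; would run: $SET_CMD"
    echo "dcgmi not installed; would run: $CHECK_CMD"
fi
```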

Suggested Reading

For more information on DCGM and documentation, visit the product page: https://developer.nvidia.com/dcgm.

License