Architectures: Linux / amd64, Linux / arm64
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies, including power and clock management. It can be used standalone by infrastructure teams and integrates easily into cluster management, resource scheduling, and monitoring products from NVIDIA partners.
This container image provides a standalone DCGM service. Clients can connect to the DCGM container to access functionality such as GPU health checks or telemetry. See the tags for the available image flavors.
The standalone DCGM container exposes the nv-hostengine service on port 5555. Clients, which interact with DCGM through libdcgm.so, can connect to this port to access the functionality provided by DCGM. This section presents some common use cases.
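As a quick connectivity check, a client with the dcgmi CLI installed (it ships with the DCGM packages) can list the GPUs visible to the containerized nv-hostengine. This is a minimal sketch that assumes the container's port 5555 is published on localhost, as in the examples that follow:
$ # List the GPUs managed by the remote nv-hostengine (5555 is the default port)
$ dcgmi discovery --host localhost -l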
In this scenario, the standalone DCGM container is started with the following command, where port 5555 is mapped to the host so that clients outside the container can reach the nv-hostengine service. Note that gathering profiling metrics requires the SYS_ADMIN capability to be added to the container:
$ docker run --gpus all \
--cap-add SYS_ADMIN \
-p 5555:5555 \
nvidia/k8s/dcgm:2.2.3-ubuntu20.04
A client such as dcgmi dmon can now stream GPU telemetry to the console.
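For example, the following sketch streams two common fields, GPU temperature (field ID 150) and power usage (field ID 155), sampled once per second; the localhost address assumes the port mapping shown above:
$ # Stream GPU temperature (150) and power draw (155) every 1000 ms
$ dcgmi dmon --host localhost -e 150,155 -d 1000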
In this scenario, where DCGM is used only for health monitoring, the container doesn't need any additional capabilities and can run unprivileged:
$ docker run --gpus all \
-p 5555:5555 \
nvidia/k8s/dcgm:2.2.3-ubuntu20.04
Clients connecting to the DCGM container can now access the DCGM health-reporting APIs.
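For instance, a client could enable the built-in background health watches and then poll for results. This is a sketch using dcgmi, assuming the default GPU group and the port mapping shown above:
$ # Enable all background health watches for the default group
$ dcgmi health --host localhost -s a
$ # Run a health check and report any incidents
$ dcgmi health --host localhost -c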
For more information and documentation on DCGM, visit the product page: https://developer.nvidia.com/dcgm.