DCGM Exporter

Description

Monitor GPUs in Kubernetes using NVIDIA DCGM. This is an exporter for a Prometheus monitoring solution in Kubernetes.

Publisher

NVIDIA

Latest Tag

4.2.3-4.3.0-ubi9

Modified

August 1, 2025

Compressed Size

230.56 MB

Multinode Support

Multi-Arch Support

Yes

4.2.3-4.3.0-ubi9 (Latest) Security Scan Results

Linux / amd64

Linux / arm64

Overview

Monitoring stacks usually consist of a collector, a time-series database to store metrics, and a visualization layer. A popular open-source stack pairs Prometheus with Grafana, which serves as the visualization tool for building rich dashboards. Prometheus is typically deployed alongside kube-state-metrics, which exposes cluster-level metrics for Kubernetes API objects, and node_exporter, which exposes node-level metrics such as CPU utilization.

NVIDIA DCGM

NVIDIA DCGM is a set of tools for managing and monitoring NVIDIA GPUs in large-scale, Linux-based cluster environments. It is a low-overhead tool that performs a variety of functions, including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting.

DCGM Exporter

DCGM-Exporter is a Prometheus exporter that monitors GPU health and collects GPU metrics. It uses the DCGM Go bindings to collect GPU telemetry and exposes the resulting metrics to Prometheus through an HTTP endpoint (/metrics). DCGM-Exporter can be run standalone or deployed as part of the NVIDIA GPU Operator.
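
To make the /metrics endpoint more concrete, the following is a minimal sketch of how a client might read it by hand. It assumes the endpoint serves the standard Prometheus text exposition format and that the exporter is reachable at http://localhost:9400/metrics; the address, the host name node-a, and the values in SAMPLE are illustrative assumptions, not real exporter output. The names parse_metrics, fetch_metrics, and Sample are hypothetical helpers written for this example. In a real deployment, Prometheus scrapes and parses the endpoint itself.

```python
import re
import urllib.request
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# metric_name{label="value",...} value   (the label block is optional)
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative payload only: host name and values are made up for this sketch.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 42
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 55
"""


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines into samples.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Escape sequences inside label values are kept as-is, not unescaped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(value)))
    return samples


def fetch_metrics(url: str = "http://localhost:9400/metrics") -> str:
    """Fetch the raw metrics text. The default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Uses the embedded sample so the sketch runs without a GPU or exporter.
    for sample in parse_metrics(SAMPLE):
        print(sample.name, sample.labels.get("gpu"), sample.value)
```

Against a running exporter, parse_metrics(fetch_metrics()) would replace the embedded sample, with the URL adjusted to wherever the exporter is actually exposed.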

Usage

To use DCGM-Exporter, see the user guide.

License Agreements

By pulling and using the container, you accept the terms and conditions of this End User License Agreement.
