NVIDIA
NVSM-Aggregator
Container

NVIDIA System Management (NVSM) Aggregator is a multinode software framework for monitoring a cluster of NVIDIA DGX nodes.

The NVSM (NVIDIA System Management) Aggregator Container is a Docker container that serves as a central management interface for monitoring multiple nodes within a cluster. It provides a unified view of system health, metrics, and management capabilities across all connected nodes.

Key Features

  • Centralized monitoring of multiple DGX systems
  • Real-time health monitoring and alerting
  • Cluster-wide metrics collection
  • Unified management interface
  • MQTT-based communication with compute nodes
  • Prometheus/Grafana integration for metrics visualization

Architecture

Core Components

  1. NVSM Core Service

    • Central management service
    • Handles node communication
    • Processes system metrics
    • Manages alerts and notifications
  2. MQTT Server

    • Communication broker for compute nodes
    • Handles real-time data exchange
    • Manages node connections
    • Port: 8883
  3. Metrics Exporter

    • Exports system metrics in Prometheus format
    • Provides cluster-wide monitoring data
    • Port: 9123
  4. Grafana Dashboard (Optional)

    • Visualizes system metrics
    • Customizable dashboards
    • Port: 3000
  5. Prometheus (Optional)

    • Metrics storage and querying
    • Time-series database
    • Port: 9090
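
Since the Metrics Exporter publishes data in Prometheus format, a scraped payload can be inspected with a few lines of Python. The following is a minimal sketch, not the exporter's own tooling: the metric name `example_gpu_temperature_celsius` and the node labels are hypothetical placeholders, and the metric names actually exposed on port 9123 may differ. It handles only plain sample lines and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload for illustration only.
EXAMPLE = '''# HELP example_gpu_temperature_celsius Illustrative metric only.
# TYPE example_gpu_temperature_celsius gauge
example_gpu_temperature_celsius{node="dgx-01",gpu="0"} 41.0
example_gpu_temperature_celsius{node="dgx-02",gpu="0"} 57.5
'''

samples = parse_prometheus_text(EXAMPLE)
# Pick the sample with the highest value across all nodes.
hottest = max(samples, key=lambda sample: sample[2])
```

In practice a Prometheus server would scrape and store these samples; a parser like this is mainly useful for quick checks against the exporter endpoint.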

Container Specifications

System Requirements

  • x86 server for hosting
  • Docker and docker-compose

Network Configuration

  • MQTT Server: 8883/tcp
  • Grafana: 3000/tcp
  • Metrics Exporter: 9123/tcp
  • Prometheus: 9090/tcp
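
Before connecting compute nodes, it can help to confirm that the ports listed above are reachable from another machine. The sketch below uses only the Python standard library; the function names are illustrative and any host name passed to them is an assumption about your deployment. A successful TCP connection shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Ports as listed in the Network Configuration section.
AGGREGATOR_PORTS = {
    "MQTT Server": 8883,
    "Grafana": 3000,
    "Metrics Exporter": 9123,
    "Prometheus": 9090,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_aggregator(host, ports=AGGREGATOR_PORTS):
    """Map each service name to whether its port accepted a connection."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

For example, `check_aggregator("aggregator.example.com")` (a placeholder host name) returns a dictionary such as `{"MQTT Server": True, ...}`. Grafana and Prometheus are optional components, so a closed port for either may simply mean they were not deployed.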

Storage Requirements

  • Configuration: /etc/nvsm/aggregator

Node Management

Node Connection

  • Secure MQTT communication
  • Certificate-based authentication
  • Node health monitoring

Supported Operations

  • System health monitoring
  • Alert management
  • Metrics collection
  • Configuration management
  • Node provisioning
  • Inventory management

Security Features

  • Encrypted communication (MQTT over TLS)
  • Certificate-based authentication
  • Secure credential storage
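
To illustrate what MQTT over TLS with certificate-based authentication involves on the client side, the following sketch builds a TLS context with Python's standard `ssl` module. It is an assumption-laden example rather than the NVSM implementation: the function name and its parameters are hypothetical, and the certificate file locations used by a real deployment are not specified here.

```python
import ssl


def build_mqtt_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context suitable for a TLS-secured broker.

    ca_file:   CA bundle used to verify the broker (system defaults if None).
    cert_file: client certificate presented to the broker (optional).
    key_file:  private key matching cert_file (optional).
    """
    # Verifies the server certificate and host name by default.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file is not None:
        # The client certificate is what enables certificate-based authentication.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

A context like this would typically be handed to an MQTT client library when connecting to port 8883. Which library the compute nodes use, and how it accepts the context, is outside the scope of this sketch.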

Monitoring Capabilities

  • System health status
  • GPU metrics
  • Storage metrics
  • Network metrics
  • Power consumption
  • Temperature monitoring
  • Fan speeds
  • Memory utilization

Integration Points

  • Prometheus for metrics storage
  • Grafana for visualization
  • Custom alerting systems
  • API access for tool integration
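
When the optional Prometheus component is deployed, external tools can read stored metrics through the standard Prometheus HTTP API (`/api/v1/query`). The sketch below builds an instant-query URL and extracts values from a response body. The base URL and the metric name in the sample response are hypothetical, and the helper names are illustrative only.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Return the Prometheus instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_values(response_body):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries a [timestamp, "value-as-string"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


# Hypothetical response, shaped like a Prometheus "vector" result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"node": "dgx-01"}, "value": [1700000000, "41"]},
        ],
    },
})

url = instant_query_url("http://localhost:9090/", "up")
values = vector_values(SAMPLE_RESPONSE)
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `vector_values` would complete the round trip against a running Prometheus instance.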

Publisher: NVIDIA
Latest Tag: latest
Updated: January 30, 2026 UTC
Compressed Size: 711.92 MB
Multinode Support: No
Multi-Arch Support: No