NVIDIA
NVSM-Aggregator
Container

NVIDIA System Management (NVSM) Aggregator is a multinode software framework for monitoring clusters of NVIDIA DGX nodes.

The NVSM (NVIDIA System Management) Aggregator Container is a Docker container that serves as a central management interface for monitoring multiple nodes within a cluster. It provides a unified view of system health, metrics, and management capabilities across all connected nodes.

Key Features

  • Centralized monitoring of multiple DGX systems
  • Real-time health monitoring and alerting
  • Cluster-wide metrics collection
  • Unified management interface
  • MQTT-based communication with compute nodes
  • Prometheus/Grafana integration for metrics visualization

Architecture

Core Components

  1. NVSM Core Service

    • Central management service
    • Handles node communication
    • Processes system metrics
    • Manages alerts and notifications
  2. MQTT Server

    • Communication broker for compute nodes
    • Handles real-time data exchange
    • Manages node connections
    • Port: 8883
  3. Metrics Exporter

    • Exports system metrics in Prometheus format
    • Provides cluster-wide monitoring data
    • Port: 9123
  4. Grafana Dashboard (Optional)

    • Visualizes system metrics
    • Customizable dashboards
    • Port: 3000
  5. Prometheus (Optional)

    • Metrics storage and querying
    • Time-series database
    • Port: 9090
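
Because the Metrics Exporter publishes data in the Prometheus text format, its output can be read by any tool that understands that format. The following is a minimal sketch of parsing such text in Python. The metric name `example_gpu_temperature_celsius` and the labels `node` and `gpu` are invented for illustration and are not documented NVSM names; the parser is simplified and ignores timestamps and escaped characters in label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
# Simplified: timestamps are ignored and escaped quotes are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload: metric and label names are assumptions.
EXAMPLE = """\
# HELP example_gpu_temperature_celsius Illustrative metric only.
# TYPE example_gpu_temperature_celsius gauge
example_gpu_temperature_celsius{node="dgx-01",gpu="0"} 54
example_gpu_temperature_celsius{node="dgx-02",gpu="0"} 61.5
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(EXAMPLE):
        print(name, labels, value)
```

In practice the text would come from the exporter on port 9123; the exact URL path it serves is not stated here, so the sketch works on a string instead.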

Container Specifications

System Requirements

  • x86 server for hosting
  • Docker and docker-compose

Network Configuration

  • MQTT Server: 8883/tcp
  • Grafana: 3000/tcp
  • Metrics Exporter: 9123/tcp
  • Prometheus: 9090/tcp
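
A quick way to confirm that the ports listed above are reachable from another machine is a plain TCP connection test. The sketch below uses only the Python standard library; the port numbers come from the list above, while the host name `aggregator.example.com` is a placeholder. A successful connection shows only that something is listening, not that the service behind it is healthy.

```python
import socket

# Ports as listed in this document.
AGGREGATOR_PORTS = {
    "MQTT Server": 8883,
    "Grafana": 3000,
    "Metrics Exporter": 9123,
    "Prometheus": 9090,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_aggregator(host, ports=AGGREGATOR_PORTS):
    """Return a {service name: reachable} mapping for the given host."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # Placeholder host name; replace with the aggregator's address.
    for name, reachable in check_aggregator("aggregator.example.com").items():
        print(f"{name}: {'open' if reachable else 'unreachable'}")
```

Grafana and Prometheus are optional components, so an unreachable result on ports 3000 or 9090 may simply mean they were not deployed.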

Storage Requirements

  • Configuration: /etc/nvsm/aggregator

Node Management

Node Connection

  • Secure MQTT communication
  • Certificate-based authentication
  • Node health monitoring

Supported Operations

  • System health monitoring
  • Alert management
  • Metrics collection
  • Configuration management
  • Node provisioning
  • Inventory management

Security Features

  • Encrypted communication (MQTT over TLS)
  • Certificate-based authentication
  • Secure credential storage
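
To make the combination of MQTT over TLS and certificate-based authentication more concrete, here is a minimal sketch of the TLS settings a client might prepare before connecting to the broker on port 8883. It assumes mutual TLS with a CA file, a client certificate, and a client key; the parameter names and the idea that each node holds its own certificate are assumptions, since the actual file locations and provisioning flow are not described here.

```python
import ssl


def build_mqtt_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context that verifies the broker certificate.

    ca_file:   CA bundle used to verify the broker (system defaults if None).
    cert_file: client certificate for certificate-based authentication.
    key_file:  private key matching cert_file.
    All three paths are hypothetical inputs supplied by the caller.
    """
    # SERVER_AUTH means "this side is a client authenticating a server";
    # it enables certificate verification and hostname checking.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file is not None:
        # Present a client certificate so the broker can authenticate us.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


if __name__ == "__main__":
    ctx = build_mqtt_tls_context()
    print(ctx.verify_mode, ctx.check_hostname, ctx.minimum_version)
```

A context built this way can be passed to an MQTT client library that accepts an `ssl.SSLContext`, for example through paho-mqtt's `tls_set_context`.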

Monitoring Capabilities

  • System health status
  • GPU metrics
  • Storage metrics
  • Network metrics
  • Power consumption
  • Temperature monitoring
  • Fan speeds
  • Memory utilization

Integration Points

  • Prometheus for metrics storage
  • Grafana for visualization
  • Custom alerting systems
  • API access for tool integration
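
As an illustration of how an external tool or custom alerting script might consume the stored metrics, the sketch below issues an instant query against the Prometheus HTTP API (`/api/v1/query`). It assumes the optional Prometheus component is deployed and reachable on port 9090 and behaves like a standard Prometheus server; the base URL is a placeholder, and `up` is a built-in Prometheus series rather than an NVSM-specific metric.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(base_url, promql, timeout=5.0):
    """Run an instant query and return the list of result series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # Placeholder address; replace with the host running Prometheus.
    print(build_query_url("http://aggregator.example.com:9090", "up"))
```

Separating URL construction from the network call keeps the query logic easy to check without a running server.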

Publisher: NVIDIA
Latest Tag: latest
Updated: January 30, 2026 UTC
Compressed Size: 711.92 MB
Multinode Support: No
Multi-Arch Support: No