NVIDIA System Management (NVSM) Aggregator is a multinode software framework for monitoring a cluster of NVIDIA DGX nodes.
The NVSM (NVIDIA System Management) Aggregator Container is a Docker container that serves as a central management interface for monitoring multiple nodes within a cluster. It provides a unified view of system health, metrics, and management capabilities across all connected nodes.
Key Features
- Centralized monitoring of multiple DGX systems
- Real-time health monitoring and alerting
- Cluster-wide metrics collection
- Unified management interface
- MQTT-based communication with compute nodes
- Prometheus/Grafana integration for metrics visualization
Architecture
Core Components
- NVSM Core Service
  - Central management service
  - Handles node communication
  - Processes system metrics
  - Manages alerts and notifications
- MQTT Server
  - Communication broker for compute nodes
  - Handles real-time data exchange
  - Manages node connections
  - Port: 8883
- Metrics Exporter
  - Exports system metrics in Prometheus format
  - Provides cluster-wide monitoring data
  - Port: 9123
- Grafana Dashboard (Optional)
  - Visualizes system metrics
  - Customizable dashboards
  - Port: 3000
- Prometheus (Optional)
  - Metrics storage and querying
  - Time-series database
  - Port: 9090
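Because the Metrics Exporter publishes Prometheus-format text on port 9123, a script can read it without going through Prometheus. The following is a minimal sketch rather than NVIDIA tooling: the plain-HTTP scheme and the `/metrics` path follow common Prometheus convention and are assumptions here, and the function names are illustrative.

```python
from urllib.request import urlopen


def parse_prometheus_text(text):
    """Parse simple Prometheus exposition lines into (series, value) pairs.

    Handles 'name value' and 'name{labels} value' lines, skips comments
    and blank lines, and ignores an optional trailing timestamp.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Split at the closing brace so spaces inside label values are safe.
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        samples.append((series, float(fields[0])))
    return samples


def fetch_metrics(host, port=9123, path="/metrics", timeout=5.0):
    """Fetch and parse metrics from the exporter.

    Assumption: the exporter serves plain HTTP at `path`; adjust the scheme
    and path to match the actual deployment.
    """
    with urlopen(f"http://{host}:{port}{path}", timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("aggregator.example.com")` (a placeholder host name) would return a list of series/value pairs that can be filtered or forwarded to another system.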
Container Specifications
System Requirements
- x86 server for hosting
- Docker and docker-compose
Network Configuration
- MQTT Server: 8883/tcp
- Grafana: 3000/tcp
- Metrics Exporter: 9123/tcp
- Prometheus: 9090/tcp
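Before connecting nodes or dashboards, it can help to confirm that these ports are reachable from the machine that needs them. This is a small sketch using only the Python standard library; the host name in the usage note is a placeholder, and a successful TCP connection only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Ports as listed in the Network Configuration section above.
AGGREGATOR_PORTS = {
    "MQTT Server": 8883,
    "Grafana": 3000,
    "Metrics Exporter": 9123,
    "Prometheus": 9090,
}


def check_ports(host, ports=AGGREGATOR_PORTS, timeout=2.0):
    """Return {service_name: True/False} for TCP reachability of each port."""
    results = {}
    for name, port in ports.items():
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    # "localhost" is a placeholder; use the aggregator host's address.
    for service, reachable in check_ports("localhost").items():
        print(f"{service}: {'reachable' if reachable else 'unreachable'}")
```

Grafana and Prometheus are optional components, so an unreachable result for ports 3000 or 9090 is expected when they are not deployed.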
Storage Requirements
- Configuration: /etc/nvsm/aggregator
Node Management
Node Connection
- Secure MQTT communication
- Certificate-based authentication
- Node health monitoring
Supported Operations
- System health monitoring
- Alert management
- Metrics collection
- Configuration management
- Node provisioning
- Inventory management
Security Features
- Encrypted communication (MQTT over TLS)
- Certificate-based authentication
- Secure credential storage
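To illustrate what MQTT over TLS with certificate-based authentication involves on the client side, here is a hedged sketch that builds a TLS context with Python's standard `ssl` module. It does not reflect how NVSM itself provisions certificates: the function name is illustrative, the file arguments are hypothetical, and whether nodes present client certificates is an assumption drawn from the feature list above.

```python
import ssl


def build_mqtt_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context suitable for an MQTT-over-TLS connection.

    ca_file:   CA bundle used to verify the broker (system defaults if None).
    cert_file: client certificate, if the broker requires one (assumption).
    key_file:  private key matching cert_file.
    """
    # SERVER_AUTH means "this client verifies a server"; it enables
    # certificate verification and host name checking by default.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # Present a client certificate for mutual TLS.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

A context like this can be handed to an MQTT client library (for example, paho-mqtt accepts one through `tls_set_context`) before connecting to the broker on port 8883.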
Monitoring Capabilities
- System health status
- GPU metrics
- Storage metrics
- Network metrics
- Power consumption
- Temperature monitoring
- Fan speeds
- Memory utilization
Integration Points
- Prometheus for metrics storage
- Grafana for visualization
- Custom alerting systems
- API access for tool integration
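When the optional Prometheus component is deployed, external tools can pull stored metrics through Prometheus's standard HTTP API (`/api/v1/query`). The sketch below shows one way to do that from Python; the helper names are illustrative, the base URL is a placeholder, and `up` is a generic PromQL expression rather than an NVSM-specific metric.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body):
    """Turn an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    pairs = []
    for item in payload["data"]["result"]:
        # Each vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, value = item["value"]
        pairs.append((item["metric"], float(value)))
    return pairs


def query_prometheus(base_url, promql, timeout=5.0):
    """Run an instant query against a Prometheus server and parse the result."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_values(resp.read().decode("utf-8"))
```

For instance, `query_prometheus("http://localhost:9090", "up")` would list the scrape targets Prometheus knows about together with their current `up` value.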
Publisher: NVIDIA
Latest Tag: latest
Updated: January 30, 2026 UTC
Compressed Size: 711.92 MB
Multinode Support: No
Multi-Arch Support: No