Linux / amd64
NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center.
For DGX Servers, it includes active health monitoring, system alerts, and log generation.
For DGX Station, it is limited to using the CLI to check the health of the system and obtain diagnostic information.
The v1.0.0-21.07.x release is the first release of the NVSM containers. It supports only three operations: show health, dump health, and show versions.
The "show health" command can be used to quickly assess overall system health.
To run show health, start nvsm from inside the pod. Once the nvsm> prompt appears, enter the command directly:
nvsm> show health
This command will print a summary of the system state.
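Reaching the nvsm> prompt first requires a shell inside the pod. The following is a minimal sketch of how that step might be scripted, assuming the pod is managed by Kubernetes and kubectl is available on the workstation. The helper name pod_shell_command, the pod name "nvsm-pod", the "default" namespace, and the /bin/sh shell path are illustrative assumptions, not values taken from the NVSM documentation.

```python
import shlex


def pod_shell_command(pod, namespace="default", shell="/bin/sh"):
    """Build the kubectl argv that opens an interactive shell in a pod.

    The pod name, namespace, and shell path are placeholders (assumptions);
    substitute the values used in your own cluster.
    """
    return ["kubectl", "exec", "-it", "-n", namespace, pod, "--", shell]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(pod_shell_command("nvsm-pod")))
```

After the shell opens, nvsm is started from inside the pod as described above, and the supported commands are entered at the nvsm> prompt.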
The "dump health" command produces a health report file suitable for attaching to support tickets.
To run dump health, start nvsm from inside the pod. Once the nvsm> prompt appears, enter the command directly:
nvsm> dump health
This command creates a .tar.xz file, which can be copied out of the pod and then analyzed or attached to support tickets.
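Copying the archive out of the pod and taking a first look at its contents could be sketched as follows. This assumes a Kubernetes-managed pod and kubectl on the workstation. The helper names, the pod name, and the file paths are hypothetical. The actual location and name of the generated file are not stated here, so take them from the output of dump health.

```python
import tarfile


def kubectl_cp_command(pod, remote_path, local_path, namespace="default"):
    """Build the kubectl argv that copies a file from a pod to the local machine.

    All arguments are placeholders (assumptions); remote_path must be the
    path of the archive that dump health actually produced.
    """
    return ["kubectl", "cp", f"{namespace}/{pod}:{remote_path}", local_path]


def list_health_dump(archive_path):
    """Return the member names of a local .tar.xz archive without extracting it."""
    with tarfile.open(archive_path, mode="r:xz") as archive:
        return archive.getnames()
```

Listing the members before extraction is a quick way to confirm that the copy completed and the archive is readable before attaching it to a ticket.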
The "show versions" command reports the versions of the packages and firmware installed on the system.
To run show versions, start nvsm from inside the pod. Once the nvsm> prompt appears, enter the command directly:
nvsm> show versions
This command prints the versions of the software and hardware components on the system.
Complete NVSM Documentation is available here.
License here.