nvsm
Description: A Helm chart for deploying NVIDIA System Management software on DGX nodes
Publisher: NVIDIA
Latest Version: 1.0.1
Compressed Size: 3.66 KB
Modified: June 3, 2025

NVIDIA System Management Support with GPU Operator

NVIDIA System Management (NVSM) is a software package for monitoring NVIDIA DGX nodes in a data center. The NVIDIA GPU Operator is a software stack that leverages the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. This document provides instructions for deploying the containerized NVSM module along with the NVIDIA GPU Operator.

The NVSM pod is deployed as a DaemonSet on nodes labeled with nvidia.com/gpu.nvsm.deploy: "true". Prior to deployment, all DGX nodes must be labeled accordingly, as shown below.
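
For example, the label can be applied with kubectl; the node name below is a placeholder:

# kubectl label node <NODE_NAME> nvidia.com/gpu.nvsm.deploy=true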


Deploying NVSM Container

The NVSM pod is deployed in a Kubernetes or OpenShift environment using a Helm chart. The Helm chart can be fetched via:

helm fetch https://helm.ngc.nvidia.com/nvidia/nvsm/nvsm-1.0.1.tgz --username='$oauthtoken' --password=NGC_API_KEY

where NGC_API_KEY is your NGC API key. Prior to deployment, values.yaml can be modified to point to the correct container image and tag for the deployment.
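
As a sketch, the fetched archive can be unpacked so that values.yaml is editable before installing; per standard Helm conventions the chart is assumed to unpack into a directory named after the chart:

# tar -xzf nvsm-1.0.1.tgz
# cd nvsm
# vi values.yaml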

Once fetched and unpacked, the Helm chart can be installed from the chart directory via:

# helm install nvidia-nvsm .
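
A release namespace can also be specified at install time; RELEASE_NAMESPACE is a placeholder:

# helm install nvidia-nvsm . -n RELEASE_NAMESPACE --create-namespace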

Verify NVSM Container on All DGX Nodes

Once the NVSM container is deployed using the Helm chart, the NVSM pod should be running on each DGX node in the release namespace specified:

# oc get pods -n RELEASE_NAMESPACE
# or
# kubectl get pods -n RELEASE_NAMESPACE

Review the output of this command to ensure that NVSM pods are running on all DGX nodes.
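
To confirm which node each pod is scheduled on, the wide output format can be used:

# kubectl get pods -n RELEASE_NAMESPACE -o wide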


NVSM Commands

To perform a maintenance task on a specific node, open a shell in the NVSM container running on that node. The name of the NVSM pod on a given node can be found in the get pods output.

# oc exec -it -n RELEASE_NAMESPACE <POD_NAME> -- /bin/bash
# or
# kubectl exec -it -n RELEASE_NAMESPACE <POD_NAME> -- /bin/bash
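
To locate the NVSM pod running on a particular node, the pod list can be filtered by node name (the node name is a placeholder):

# kubectl get pods -n RELEASE_NAMESPACE -o wide --field-selector spec.nodeName=<NODE_NAME>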

This provides a shell in the target container. To run NVSM commands, first initialize the NVSM core by running the nvsm command. This initializes the NVSM core and provides an NVSM command prompt for running the other supported commands. Initialization can take a couple of minutes.

# nvsm

Initializing NVSM Core...

nvsm->

With the initial version of containerized NVSM, three commands are primarily supported:

  • nvsm show version: Shows various software/firmware versions on the DGX server.
  • nvsm show health: Provides a summary of system health.
  • nvsm dump health: Creates a snapshot of various system components for offline analysis and diagnosis.
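
These commands can also be invoked directly from the container shell without entering the interactive prompt, which is convenient for scripting; this assumes the containerized NVSM supports the same one-shot invocation form as NVSM on a DGX host:

# nvsm show health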

nvsm dump health will create a tarball within the NVSM container. Once the dump operation is complete, the file can be copied to the master node and shared with NVIDIA Enterprise support for analysis.

# oc cp RELEASE_NAMESPACE/<POD_NAME>:/tmp/<DUMP_FILE> .
# or
# kubectl cp RELEASE_NAMESPACE/<POD_NAME>:/tmp/<DUMP_FILE> .

Either command copies the dump file out of the container to the current working directory.
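
If the exact dump file name is not known, it can be listed from outside the container before copying, per the /tmp path used in the cp commands above:

# kubectl exec -n RELEASE_NAMESPACE <POD_NAME> -- ls /tmp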


Documentation

  • Official Product Documentation

License

  • License