NVIDIA System Management (NVSM) is a software package for monitoring NVIDIA DGX nodes in a data center. The NVIDIA GPU Operator is a software stack that leverages the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. This document provides instructions for deploying the containerized NVSM module along with the NVIDIA GPU Operator.
The NVSM pod is deployed as a DaemonSet on nodes that are marked with the label
nvidia.com/gpu.nvsm.deploy: "true". Prior to deployment, all DGX nodes must be labelled accordingly.
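For example, the label can be applied to a node as follows (the node name dgx-node-01 is a placeholder; substitute your own):

```shell
# Label a DGX node so the NVSM DaemonSet schedules a pod on it.
# "dgx-node-01" is a placeholder node name; substitute your own.
kubectl label nodes dgx-node-01 nvidia.com/gpu.nvsm.deploy=true
```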
The NVSM pod is deployed in a Kubernetes or OpenShift environment using a Helm chart. The Helm chart can be fetched via:
helm fetch https://helm.ngc.nvidia.com/nvidia/charts/nvsm-1.0.0.tgz --username='$oauthtoken' --password=NGC_API_KEY
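Once fetched, the chart archive can be extracted so that values.yaml can be edited (the extracted directory name nvsm is assumed from the chart name; check the archive contents):

```shell
# Extract the fetched chart archive and change into the chart directory.
# The directory name "nvsm" is assumed from the chart name.
tar xzf nvsm-1.0.0.tgz
cd nvsm
```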
Prior to deployment, values.yaml can be modified to point to the correct container image and tag to be used for the deployment.
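Alternatively, the image repository and tag can be overridden at install time with --set. The value keys below (image.repository, image.tag) and the placeholder values are assumptions; confirm the actual keys against the chart's values.yaml:

```shell
# Install the chart, overriding the container image and tag inline.
# YOUR_REGISTRY and YOUR_TAG are placeholders; the value keys are
# illustrative and should be verified against the chart's values.yaml.
helm install nvidia-nvsm . \
  --set image.repository=YOUR_REGISTRY/nvsm \
  --set image.tag=YOUR_TAG
```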
Once fetched, the helm chart can be installed via,
#helm install nvidia-nvsm .
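To install into a dedicated release namespace (the namespace name nvsm below is a placeholder), the namespace can be supplied explicitly:

```shell
# Install the chart into its own namespace, creating it if necessary.
# "nvsm" is a placeholder namespace name.
helm install nvidia-nvsm . -n nvsm --create-namespace
```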
Once the NVSM container is deployed using the Helm chart, the NVSM pod should be running on each DGX node under the release namespace specified:
#oc get pods -n RELEASE_NAMESPACE or
#kubectl get pods -n RELEASE_NAMESPACE
Review the output of this command to ensure that NVSM pods are running on all DGX nodes.
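To confirm that one NVSM pod is scheduled on every DGX node, the node placement can be displayed with the -o wide option:

```shell
# Show which node each NVSM pod is scheduled on.
# RELEASE_NAMESPACE is the namespace used at helm install time.
kubectl get pods -n RELEASE_NAMESPACE -o wide
```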
For any maintenance task to be performed on a specific node, the user must enter the NVSM container running on that node. The pod name associated with NVSM on a given node can be found in the get pods output.
#oc exec -it NVSM_POD_NAME -n RELEASE_NAMESPACE -- /bin/bash or
#kubectl exec -it NVSM_POD_NAME -n RELEASE_NAMESPACE -- /bin/bash
This provides a shell in the target container. To run NVSM commands, the NVSM core must first be initialized by running the command nvsm. This command initializes the NVSM core and provides an nvsm command prompt from which the other supported NVSM commands can be run. Initialization can take a couple of minutes.
Initializing NVSM Core...
With the initial version of containerized NVSM, three commands are primarily supported:
nvsm show version: This command shows the various software/firmware versions on the DGX server.
nvsm show health: This command provides a summary of system health.
nvsm dump health: This command creates a snapshot of various system components for offline analysis and diagnosis.
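Inside the container, a typical session might look like the following sketch (the nvsm-> prompt shown in the comments is illustrative; no command output is reproduced):

```shell
# Inside the NVSM container: initialize the NVSM core, which opens
# the nvsm command prompt after a couple of minutes.
nvsm
# From the nvsm prompt, run the supported commands, e.g.:
#   nvsm-> show version
#   nvsm-> show health
#   nvsm-> dump health
```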
nvsm dump health creates a tarball within the NVSM container. Once the dump operation is complete, the file can be copied to the master node and shared with NVIDIA Enterprise Support for analysis.
#oc cp RELEASE_NAMESPACE/NVSM_POD_NAME:DUMP_FILE_PATH /tmp/ or
#kubectl cp RELEASE_NAMESPACE/NVSM_POD_NAME:DUMP_FILE_PATH /tmp/
This command can be used to copy the dump file out of the container.
Refer to the official NVSM product documentation for further details.