Description

Enables your containerized applications to be profiled by NVIDIA DevTools applications (currently, only using NVIDIA Nsight Systems).

Publisher

NVIDIA

Latest Version

1.0.7

Compressed Size

27.52 KB

Modified

April 2, 2025

NOTICE: NVIDIA DevTools Sidecar Injector has Moved

⚠️ This project has been integrated into the Nsight Operator and this project is deprecated.

Find the current Nsight Operator Helm chart (which now includes the NVIDIA DevTools Sidecar Injector) here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/helm-charts/nsight-operator

NVIDIA DevTools Sidecar Injector

The NVIDIA DevTools Sidecar Injector enables your containerized applications to be profiled by NVIDIA DevTools applications (currently, only using Nsight Systems). This solution leverages a Kubernetes dynamic admission controller to inject an init container, volumes with the NVIDIA DevTools application and its configurations, environment variables, and a security context upon the creation or update of your Pod.

Prerequisites

Docker
kubectl version v1.19+
Helm v3
Access to a Kubernetes v1.19+ cluster with the admissionregistration.k8s.io/v1 API enabled. Verify that by running the following command:

kubectl api-versions | grep admissionregistration.k8s.io/v1

The result should be:

admissionregistration.k8s.io/v1

Note: Additionally, the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers should be added and listed in the correct order in the admission-control flag of kube-apiserver. Please refer to the Kubernetes documentation. It is likely that this is set by default if your cluster is running on EKS, AKS, OKE or GKE.

Installation

Configure installation
Install the NVIDIA DevTools Sidecar Injector (in this example, the configuration values were saved in custom_values.yaml):

helm install -f custom_values.yaml \
    devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.7.tgz

Installation configuration

The NVIDIA DevTools Sidecar can be customized to suit particular needs. Most likely, you will need to configure the profile.devtoolArgs, profile.injectionMatch, profile.volumes, and profile.volumeMounts values. A values file can be used for setting these parameters.

Sample custom_values.yaml. This configuration enables profiling for any instance of yourawesomeapp found in Pods selected for injection.

# Nsight Systems profiling configuration
profile:
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --start-later true -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
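
The injectionMatch pattern above can be tried out locally before installing the chart. The following is a minimal sketch, assuming the pattern is matched against the full command line of a process and that the regex engine behaves like Python's re module for this pattern; neither detail is confirmed by this page. The sample command lines are hypothetical. Note that "\\b" in the double-quoted YAML value becomes the regex token \b (word boundary) once the YAML is parsed.

```python
import re

# The pattern as it looks after YAML unescaping ("\\b" in the values file becomes \b).
INJECTION_MATCH = r"^(?!.*nsys( |$)).*\byourawesomeapp.*$"


def is_profiled(command_line):
    """Return True if the command line matches the injectionMatch pattern."""
    return re.search(INJECTION_MATCH, command_line) is not None


# Hypothetical command lines, used only to exercise the pattern.
samples = [
    "/opt/bin/yourawesomeapp --port 8080",
    "nsys launch /opt/bin/yourawesomeapp",
    "/usr/bin/python3 train.py",
]
for command in samples:
    print(is_profiled(command), command)
```

The negative lookahead excludes any command line that already contains nsys, so an application that is already wrapped by Nsight Systems is not matched a second time.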

Sample custom_values_launch.yaml. This configuration injects Nsight Systems for later profiling into any instance of yourawesomeapp found in Pods selected for injection. The nsys_k8s.py script can then be used to start and stop collection.

# Nsight Systems profiling configuration
profile:
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "launch"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"

Sample custom_values_extended.yaml:

# Nsight Systems profiling configuration
profile:
  # A volume to store profiling results. It can be omitted, but in this case, the results will be lost after the pod
  # deletion and they will not be in the common location.
  # You may skip this section if you already have a shared volume for all the profiling pods.
  volumes:
    [
      {
        "name": "nsys-output-volume",
        "persistentVolumeClaim": { "claimName": "CSP-managed-disk" },
      },
    ]
  volumeMounts:
    [{ "name": "nsys-output-volume", "mountPath": "/mnt/nsys/output" }]
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --start-later false --duration 20 --kill none -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"

# Node configurations which should be performed. Currently, only kernel.perf_event_paranoid is supported.
machineConfig:
  - name: kernel.perf_event_paranoid
    value: -1

Configuration values

Variable	Description	Default value
profile.devtoolArgs	The parameters for Nsight Systems used during profiling. A comprehensive list of available parameters is provided in the Nsight Systems User Guide. Placeholders within these parameters are substituted with their actual values during execution. It is recommended to include the {TIMESTAMP} and {UID} placeholders in the output file name to keep file names unique; otherwise, the report may be overwritten or not generated at all. Example: profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep
profile.injectionMatch	The regex used to match the application that is to be profiled.	`^(?!/bin/)(?!/sbin/)(?!/usr/bin/)(?!/usr/sbin/)(?!.*nsys( | $))(?!.*cat( | $)).*$`
profile.volumes	Additional volumes that will be injected into profiled containers. Can be useful for storing profiling results.
profile.volumeMounts	Volume mounts that will be injected into profiled containers. Can be useful for storing profiling results.
profile.env	Environment variables that will be injected into profiled containers.
sidecarImage.image	NVIDIA DevTools Sidecar image URL can be specified in case of custom registry usage (if the NVIDIA registry is not available). In the case of a private registry, the imagePullSecrets should also be specified.	The default Sidecar nvcr.io URL
devtoolBinariesImage.image	NVIDIA DevTools Binaries image URL can be specified in the case of custom registry usage (if the NVIDIA registry is not available). In the case of a private registry, the imagePullSecrets should also be specified.	The default Nsight Systems nvcr.io URL
imagePullSecrets	List of references to secrets within the same namespace for pulling Sidecar and DevTools binaries images. These secrets must be available in all namespaces containing pods that require profiling, as well as in the "nvidia-devtools-sidecar-injector" namespace.	None
privileged	Enables profiled containers to be run in privileged mode (can be used to collect GPU metrics).	None
capabilities	Enables profiled containers to be run with specific capabilities (for instance SYS_ADMIN can be used to collect GPU metrics)	None
machineConfig	Array of name/value pairs (system configurations) that should be updated on target nodes before profiling (currently, only kernel.perf_event_paranoid is supported). More info about kernel.perf_event_paranoid. To prevent the NVIDIA DevTools Sidecar Injector from updating node configurations, set machineConfig: null in the custom_values.yaml file.	[{ name: kernel.perf_event_paranoid, value: 2 }]
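
The recommendation to keep {TIMESTAMP} and {UID} in the output file name can be checked before a values file is deployed. The following is a minimal sketch of such a check; the helper name is hypothetical and not part of the chart or of nsys_k8s.py. It assumes the output path is passed with the Nsight Systems -o or --output option and does not validate any other argument.

```python
import re
import shlex

# Placeholders the configuration table recommends for unique report names.
RECOMMENDED = ("{TIMESTAMP}", "{UID}")


def missing_unique_placeholders(devtool_args):
    """Return the recommended placeholders absent from the -o/--output value.

    Returns an empty list when no output option is present (for example, "launch").
    """
    tokens = shlex.split(devtool_args)
    output = None
    for index, token in enumerate(tokens):
        if token in ("-o", "--output") and index + 1 < len(tokens):
            output = tokens[index + 1]
        elif token.startswith("--output="):
            output = token[len("--output="):]
    if output is None:
        return []
    # "%{UID}" would be an environment variable lookup, so it does not count as "{UID}".
    return [name for name in RECOMMENDED
            if not re.search(r"(?<!%)" + re.escape(name), output)]


print(missing_unique_placeholders("profile -o /home/auto_{PROCESS_NAME}.nsys-rep"))
```

An empty result means both placeholders are present, or that the arguments do not set an output path at all.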

Supported placeholders

Placeholder	Replacement
`{UID}`	A random alphanumeric string (8 characters).
`{PROCESS_NAME}`	The profiled process name.
`{PROCESS_ID}`	The profiled process ID.
`{TIMESTAMP}`	The UNIX timestamp (in ms).
`%{ANY ENVIRONMENT VARIABLE}`	The value of the "ANY ENVIRONMENT VARIABLE" environment variable inside the container. The POD_FULLNAME and CONTAINER_NAME environment variables are set by the NVIDIA DevTools Sidecar injection.
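
To make the substitution rules in the table concrete, the following sketch expands a devtoolArgs-style template the way the table describes. It is an illustration only, not the injector's actual implementation: the function name, the handling of unknown names (left untouched), and the sample POD_FULLNAME and CONTAINER_NAME values are all assumptions.

```python
import random
import re
import string
import time


def expand_placeholders(template, process_name, process_id, env):
    """Expand the documented placeholders in a devtoolArgs-style template."""
    builtin = {
        "UID": "".join(random.choices(string.ascii_letters + string.digits, k=8)),
        "PROCESS_NAME": process_name,
        "PROCESS_ID": str(process_id),
        "TIMESTAMP": str(int(time.time() * 1000)),  # UNIX time in milliseconds
    }

    def replace(match):
        # "%{NAME}" reads an environment variable; "{NAME}" reads a built-in value.
        source = env if match.group(1) else builtin
        # Unknown names are left untouched in this sketch.
        return source.get(match.group(2), match.group(0))

    return re.sub(r"(%?)\{([^{}]+)\}", replace, template)


# Hypothetical values for the variables set by the Sidecar injection.
container_env = {"POD_FULLNAME": "my-pod", "CONTAINER_NAME": "main"}
template = "/mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
print(expand_placeholders(template, "yourawesomeapp", 1234, container_env))
```

Because {TIMESTAMP} and {UID} change on every expansion, two processes with the same name in the same container still produce distinct report paths.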

Enabling profiling on target resources

To enable automatic Sidecar injection for all Pods in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the namespace.

kubectl label namespaces <namespace name> nvidia-devtools-sidecar-injector=enabled

To enable automatic Sidecar injection for a specific resource in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the resource.

kubectl label <resource type> <resource name> nvidia-devtools-sidecar-injector=enabled

At this point, any new Pod will be considered for injection based on its labels and the injectionMatch value.

Existing resources

An already running Pod cannot be injected; you must restart it to enable profiling. Likewise, if you remove the label or set the Pod label to disabled, you must restart the affected Pods to remove the Sidecar injection.

Resource with more than one replica

kubectl rollout restart <resource type>/<resource name>

For example:

kubectl rollout restart deployment/amazing-service

Resource with only one replica

kubectl scale <resource type>/<resource name> --replicas=0
kubectl scale <resource type>/<resource name> --replicas=1

For example:

kubectl scale deployment/amazing-service --replicas=0
kubectl scale deployment/amazing-service --replicas=1

Control profiling

Profiling can be controlled using the nsys_k8s.py script. The script can be found in NVIDIA DevTools Sidecar Injector Resources.

nsys_k8s

This script facilitates the execution of Nsight Systems commands within profiled containers of Kubernetes Pods. It also provides a convenient way to download profiling results. nsys_k8s searches for Pods that are labeled for profiling and looks for active Nsight Systems sessions launched by the Sidecar in them. The script supports filtering Pods using field selectors.

Prerequisites for nsys_k8s

Python 3.6 or higher
Python dependencies installed (pip install -r requirements.txt)

Usage

Nsight Systems commands

The script supports executing Nsight Systems commands within containers of Kubernetes pods, with optional filters for targeting specific namespaces, containers, and pods. Nsight Systems commands are executed only on pods that have active Nsight Systems sessions. The general command structure is as follows:

./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]

Argument	Description
`--field-selector`	(Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See the Kubernetes documentation on field selectors.
`nsys_arguments...`	The Nsight Systems command and arguments you wish to execute, for example, start --sampling-frequency=5000. For commands that support the --output argument, if this argument is not present, it will be generated based on the profile.devtoolArgs Helm option value.

Do not specify the session name in nsys_arguments - it will be obtained automatically.
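
When the script is driven from automation, the general command structure above can be assembled programmatically. The following sketch only builds and prints the command line; it does not run it. The helper name is hypothetical, and the field selector value (metadata.namespace=default) is just an example of a standard Kubernetes field selector.

```python
import shlex


def build_nsys_k8s_command(nsys_arguments, field_selector=None, script="./nsys_k8s.py"):
    """Assemble: ./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]"""
    command = [script]
    if field_selector:
        command += ["--field-selector", field_selector]
    # No session name is added here: the script obtains it automatically.
    command += ["nsys"] + list(nsys_arguments)
    return command


command = build_nsys_k8s_command(
    ["start", "--sampling-frequency=5000"],
    field_selector="metadata.namespace=default",
)
# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed directly to subprocess.run if the command should be executed rather than printed.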

`download` command

The script supports the download command to provide a convenient way for downloading profiling results from profiled Pods.

./nsys_k8s.py [--field-selector SELECTOR] download [destination]

Argument	Description
`--field-selector`	(Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See the Kubernetes documentation on field selectors.
`destination`	The path for the directory into which the profiling results will be downloaded.
`--remove-source`	(Optional) Delete source files from Pods after downloading them.

`check` command

The script supports the check command to provide a convenient way to check whether the NVIDIA DevTools Sidecar has been injected into a specific Pod.

./nsys_k8s.py check [-n namespace] [pod]

Argument	Description
`-n`	(Optional) The namespace of the Pod to check.
`pod`	The name of the Pod to check.

Additional configuration options

Updating configuration

Sidecar Injector configurations can be modified after the installation. Please note, however, that the configuration of already injected Pods will not be updated until they are restarted.

helm upgrade -f custom_values.yaml \
    devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.7.tgz

Separate configurations

Sidecar Injector configurations can be customized for individual namespaces or Pods by using a ConfigMap named nvidia-devtools-sidecar-injector-custom.

Sample separate_configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^(?!.*nsys( |$)).*\\byourotherawesomeapp.*$"
    }
    }
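
Hand-writing the JSON-style body of injectionconfig.yaml makes it easy to get backslash escaping wrong, because inside a double-quoted string a single backslash followed by b denotes a backspace character rather than the regex word boundary. The following sketch generates the body with Python's json module so that the escaping is handled automatically. It assumes the injector accepts the JSON-style syntax shown in the sample above.

```python
import json

config = {
    "devtoolArgs": "profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
    # Raw string: \b here is the regex word boundary, not a backspace character.
    "injectionMatch": r"^(?!.*nsys( |$)).*\byourotherawesomeapp.*$",
}

# json.dumps doubles the backslash, so the serialized text contains \\b.
payload = json.dumps(config, indent=2)
print(payload)

# Parsing the text again yields the original regex unchanged.
assert json.loads(payload) == config
```

The printed text can be pasted under the injectionconfig.yaml key, indented to match the block scalar.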

GPU Metrics

GPU Metrics Samples can only be collected by one process per GPU. The most straightforward way to avoid collisions is to collect GPU metrics from a single custom DaemonSet per node. The following resources configuration can be used to achieve that:

kubectl apply -f ./gpu_metrics_resources.yaml

gpu_metrics_resources.yaml:

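apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-metrics-collector
  namespace: example-gpu-metrics-ns
  labels:
    nvidia-devtools-sidecar-injector: enabled
spec:
  # apps/v1 requires a selector that matches the Pod template labels.
  # The "app: gpu-metrics-collector" label is an illustrative choice.
  selector:
    matchLabels:
      app: gpu-metrics-collector
  template:
    metadata:
      labels:
        app: gpu-metrics-collector
    spec:
      containers:
      - name: gpu-metrics-ubuntu-container
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        securityContext:
          privileged: true
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists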
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  namespace: example-gpu-metrics-ns
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile --start-later false --gpu-metrics-device=all -o /mnt/nsys/output/auto_gpu_metrics_%{POD_FULLNAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^sleep infinity$"
    }

The ConfigMap customizes the profiling parameters for the DaemonSet so that GPU metrics are collected. Pods started by this DaemonSet can be controlled with the nsys_k8s.py script.

Amazon AWS EFA Network Counters

Sampling Amazon AWS EFA network counters requires additional configuration. The /sys/class/infiniband//ports/*/hw_counters/ directory is not mounted into a container by default, so it must be mounted into the container from the host machine.

Sample custom_values_efa_mount.yaml with the required volumes:


profile:
  # Files inside /sys/class/infiniband directory contain relative symbolic links to /sys/devices
  volumes:
    [
      {
        "name": "sys-class-infiniband",
        "hostPath": { "path": "/sys/class/infiniband", "type": "Directory" }
      },
      {
        "name": "sys-class-devices",
        "hostPath": { "path": "/sys/devices", "type": "Directory" }
      }
    ]
  volumeMounts:
    [
      { 
        "name": "sys-class-infiniband",
        "mountPath": "/mnt/nv/sys/class/infiniband",
        "readOnly": true
      },
      {
        "name": "sys-class-devices",
        "mountPath": "/mnt/nv/sys/devices",
        "readOnly": true
      }
    ]
  # Enable and configure the EFA metrics plugin to collect metrics from a non-default sysfs location.
  devtoolArgs: "profile --enable efa_metrics,-efa-counters-sysfs=\"/mnt/nv/sys\" -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^/usr/bin/python3 /usr/local/bin/torchrun.*$"
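
After the volumes are mounted, it can be useful to confirm from inside the container that the counters are visible under the non-default sysfs location. The following is a minimal sketch, assuming the usual sysfs layout of one directory per device and per port below class/infiniband, and assuming each counter file holds a single integer. The function name is hypothetical, and /mnt/nv/sys is the mount root used in the sample above.

```python
import glob
import os


def read_hw_counters(sysfs_root="/mnt/nv/sys"):
    """Return {counter path relative to sysfs_root: integer value}."""
    # Assumed layout: class/infiniband/<device>/ports/<port>/hw_counters/<counter>
    pattern = os.path.join(
        sysfs_root, "class", "infiniband", "*", "ports", "*", "hw_counters", "*"
    )
    counters = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as counter_file:
                value = int(counter_file.read().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or are not plain integers.
            continue
        counters[os.path.relpath(path, sysfs_root)] = value
    return counters


if __name__ == "__main__":
    for name, value in read_hw_counters().items():
        print(name, "=", value)
```

An empty result suggests that the host directories are not mounted, or that the relative symbolic links into /sys/devices cannot be resolved inside the container.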

Uninstall

Perform the following steps to uninstall the NVIDIA DevTools Sidecar Injector:

helm uninstall devtools-sidecar-injector

This will automatically delete all the resources created by devtools-sidecar-injector and remove all the nvidia-devtools-sidecar-injector labels from all the labeled resources.

Additionally, you can delete only the labels from all resources labeled with nvidia-devtools-sidecar-injector=enabled to clean up the resources from injection:

kubectl get all --all-namespaces -l nvidia-devtools-sidecar-injector=enabled -o custom-columns=:.metadata.name,NS:.metadata.namespace,KIND:.kind --no-headers | while read name namespace kind; do kubectl label $kind $name -n $namespace nvidia-devtools-sidecar-injector-; done

Troubleshooting

General errors

If a Pod is not injected with the sidecar container as expected, check the following items:

The nvidia-devtools-sidecar-injector Pod in the nvidia-devtools-sidecar-injector namespace is in the Running state and has produced no error logs.
Check that the target Pod was correctly injected: ./nsys_k8s.py check [-n namespace] [pod]

GPU metrics collection error

Check that no other application is collecting GPU metrics on the target Pod. Possible causes include:
- Another injection with the --gpu-metrics-device option enabled. In that case, you can use a report from that injection or modify the configurations to ensure that only one Pod is running with the GPU metrics option.
- If you have the GPU Operator installed, its nvidia-dcgm-exporter DaemonSet collects GPU metrics. If you are not using it, you can temporarily disable it:

kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'

To re-enable it, run the following command:

kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'

OpenShift

Initialization of Pods with the profiler injected can be slower on OpenShift clusters during the first-time setup (post-configuration). This is due to the more complex mechanism required for node configuration, specifically the updating of kernel.perf_event_paranoid.
