NVIDIA DevTools Sidecar Injector

Description
The NVIDIA DevTools Sidecar enables your containerized applications to be profiled by NVIDIA DevTools applications (currently, only using NVIDIA Nsight Systems).
Publisher
NVIDIA
Latest Version
1.0.0
Compressed Size
6.02 KB
Modified
March 14, 2024

NVIDIA DevTools Sidecar Injector

The NVIDIA DevTools Sidecar Injector enables your containerized applications to be profiled by NVIDIA DevTools applications (currently, only using Nsight Systems). This solution leverages a Kubernetes dynamic admission controller to inject an init container, volumes with the NVIDIA DevTools application and its configurations, environment variables, and a security context upon the creation or update of your Pod.
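
To make the mechanism more concrete, the following Python sketch shows the general shape of a response that a mutating admission webhook returns to the API server. It is not the injector's actual code. The function name, the init container, and the volume are hypothetical. Only the AdmissionReview response fields (uid, allowed, patchType, and a base64-encoded JSON Patch) follow the Kubernetes admission API.

```python
import base64
import json


def build_admission_response(review, init_container, volume):
    """Build an AdmissionReview response that adds an init container and a volume.

    `review` is the AdmissionReview request sent by the API server.
    `init_container` and `volume` are hypothetical objects to inject.
    """
    spec = review["request"]["object"].get("spec", {})
    patch = []
    for field, item in (("initContainers", init_container), ("volumes", volume)):
        if spec.get(field):
            # The list already exists: append to it.
            patch.append({"op": "add", "path": f"/spec/{field}/-", "value": item})
        else:
            # The list is missing: create it with a single element.
            patch.append({"op": "add", "path": f"/spec/{field}", "value": [item]})
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


if __name__ == "__main__":
    # Hypothetical request for a Pod that already has one volume.
    request = {"request": {"uid": "1234", "object": {"spec": {"volumes": [{"name": "data"}]}}}}
    response = build_admission_response(
        request,
        {"name": "example-init", "image": "example.invalid/init:latest"},
        {"name": "example-devtools", "emptyDir": {}},
    )
    print(json.dumps(response, indent=2))
```

The real injector also adds environment variables and a security context; the sketch omits them to stay short.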

Prerequisites

  • Docker
  • kubectl version v1.19+
  • Helm v3
  • Access to a Kubernetes v1.19+ cluster with the admissionregistration.k8s.io/v1 API enabled. Verify that by running the following command:
kubectl api-versions | grep admissionregistration.k8s.io/v1

The result should be:

admissionregistration.k8s.io/v1
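
The grep pattern also matches admissionregistration.k8s.io/v1beta1, so a stricter check compares whole lines. The sketch below shows one way to script that check in Python. The helper names are hypothetical, and the second function assumes kubectl is on the PATH and configured for the target cluster.

```python
import subprocess


def has_api_version(api_versions_output, wanted="admissionregistration.k8s.io/v1"):
    """Return True if `wanted` appears as an exact line in `kubectl api-versions` output."""
    lines = (line.strip() for line in api_versions_output.splitlines())
    return wanted in lines


def cluster_has_api_version(wanted="admissionregistration.k8s.io/v1"):
    """Run `kubectl api-versions` and check its output for `wanted`."""
    result = subprocess.run(
        ["kubectl", "api-versions"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return has_api_version(result.stdout, wanted)


if __name__ == "__main__":
    print("admissionregistration.k8s.io/v1 available:", cluster_has_api_version())
```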

Note: Additionally, the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers must be enabled in kube-apiserver through the --enable-admission-plugins flag (named --admission-control in older Kubernetes releases). Please refer to the Kubernetes documentation for details. They are likely enabled by default if your cluster runs on EKS, AKS, OKE, or GKE.

Installation

  1. Configure the installation (see Installation configuration).

  2. Install the NVIDIA DevTools Sidecar Injector (in this example, the configuration values were saved in custom_values.yaml):

helm install -f custom_values.yaml \
    devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.0.tgz

Installation configuration

The NVIDIA DevTools Sidecar can be customized to suit particular needs. Most likely, you will need to configure the profile.devtoolArgs, profile.injectionMatch, profile.volumes, and profile.volumeMounts values. A values file can be used for setting these parameters.

Sample custom_values.yaml. This configuration enables profiling for any instance of yourawesomeapp found in Pods selected for injection.


# Nsight Systems profiling configuration
profile:
  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile -f true --start-later true --trace nvtx,cuda -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
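
The injectionMatch pattern can be tried out locally before it is deployed. The sketch below uses Python's re module and assumes two things that the source does not state: that the injector's regex engine treats lookaheads and \b the same way Python does, and that the pattern is tested against the full process command line. Note that the YAML value "\\b" is unescaped to the regex token \b (word boundary). The function name and the sample command lines are hypothetical.

```python
import re

# In the double-quoted YAML string, "\\b" becomes the regex token \b.
INJECTION_MATCH = re.compile(r"^(?!.*nsys( |$)).*\byourawesomeapp.*$")


def should_profile(command_line):
    """Return True if the command line matches the injectionMatch pattern."""
    return INJECTION_MATCH.match(command_line) is not None


if __name__ == "__main__":
    samples = [
        "python3 yourawesomeapp.py --epochs 3",  # application name after a space
        "/opt/bin/yourawesomeapp",               # application name after a slash
        "nsys profile yourawesomeapp",           # excluded: already running under nsys
        "notyourawesomeapp",                     # excluded: no word boundary before the name
    ]
    for sample in samples:
        print(f"{should_profile(sample)!s:5} {sample}")
```

The negative lookahead keeps the injector from profiling the nsys process itself, which would otherwise match because its command line contains the application name.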

Sample custom_values_launch.yaml. This configuration injects Nsight Systems, for later profiling, into any instance of yourawesomeapp found in Pods selected for injection. The nsys_k8s.py script can then be used to start and stop collection.


# Nsight Systems profiling configuration
profile:
  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "launch --trace nvtx,cuda"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"

Sample custom_values_extended.yaml:


# Nsight Systems profiling configuration
profile:
  # A volume to store profiling results. It can be omitted, but in this case, the results will be lost after the pod
  # deletion and they will not be in the common location.
  # You may skip this section if you already have a shared volume for all the profiling pods.
  volumes:
    [
      {
        "name": "nsys-output-volume",
        "persistentVolumeClaim": { "claimName": "CSP-managed-disk" },
      },
    ]
  volumeMounts:
    [{ "name": "nsys-output-volume", "mountPath": "/mnt/nsys/output" }]
  # The arguments for the Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile -f true --start-later false --duration 20 --kill none --backtrace dwarf --trace nvtx,cuda -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"

Configuration values

| Variable | Description | Default value |
| --- | --- | --- |
| profile.devtoolArgs | The Nsight Systems parameters used during profiling. The Nsight Systems User Guide provides a comprehensive list of available parameters. Placeholders within these parameters are substituted with their actual values during execution. It is recommended to include the {TIMESTAMP} and {UID} placeholders in the output file name to keep file names unique. Otherwise, the report may be overwritten or not generated at all. Example: profile -f true --trace nvtx,cuda -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep | |
| profile.injectionMatch | The regex used to match the application that is to be profiled. | See the pattern below the table. |
| profile.volumes | Additional volumes that will be injected into profiled containers. They may be useful for storing profiling results. | |
| profile.volumeMounts | Volume mounts that will be injected into profiled containers. They may be useful for storing profiling results. | |
| sidecarImage.image | The NVIDIA DevTools Sidecar image URL. It can be specified when a custom registry is used (if the NVIDIA registry is not available). For a private registry, imagePullSecrets should also be specified. | The default Sidecar nvcr.io URL |
| devtoolBinariesImage.image | The NVIDIA DevTools Binaries image URL. It can be specified when a custom registry is used (if the NVIDIA registry is not available). For a private registry, imagePullSecrets should also be specified. | The default Nsight Systems nvcr.io URL |
| imagePullSecrets | List of references to secrets within the same namespace for pulling the Sidecar and DevTools binaries images. These secrets must be available in all namespaces containing Pods that require profiling, as well as in the "nvidia-devtools-sidecar-injector" namespace. | None |
| privileged | Enables profiled containers to be run in privileged mode (can be used to collect GPU metrics). | None |
| capabilities | Enables profiled containers to be run with specific capabilities (for instance, SYS_ADMIN can be used to collect GPU metrics). | None |

Default value of profile.injectionMatch:

^(?!/bin/)(?!/sbin/)(?!/usr/bin/)(?!/usr/sbin/)(?!.*nsys( | $))(?!.*cat( | $)).*$

Supported placeholders

| Placeholder | Replacement |
| --- | --- |
| {UID} | A random alphanumeric string (8 characters). |
| {PROCESS_NAME} | The profiled process name. |
| {PROCESS_ID} | The profiled process ID. |
| {TIMESTAMP} | The UNIX timestamp (in ms). |
| %{ANY ENVIRONMENT VARIABLE} | The value of the ANY ENVIRONMENT VARIABLE environment variable inside a container. The POD_FULLNAME and CONTAINER_NAME environment variables are set by the NVIDIA DevTools Sidecar injection. |

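
The following Python sketch illustrates how these placeholders turn an output template into a concrete file name. It is an illustration only, not the Sidecar's implementation. The function name is hypothetical, and leaving unknown placeholders untouched is an assumption.

```python
import re
import secrets
import string
import time

# Matches {NAME} (built-in placeholder) and %{NAME} (environment variable).
_PLACEHOLDER = re.compile(r"(%?)\{([^{}]+)\}")


def expand_placeholders(template, process_name, process_id, env,
                        uid=None, timestamp_ms=None):
    """Expand {UID}, {PROCESS_NAME}, {PROCESS_ID}, {TIMESTAMP}, and %{ENV_VAR}."""
    if uid is None:
        alphabet = string.ascii_letters + string.digits
        uid = "".join(secrets.choice(alphabet) for _ in range(8))
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    builtins = {
        "UID": uid,
        "PROCESS_NAME": process_name,
        "PROCESS_ID": str(process_id),
        "TIMESTAMP": str(timestamp_ms),
    }

    def substitute(match):
        table = env if match.group(1) == "%" else builtins
        # Assumption: unknown placeholders are left as they are.
        return table.get(match.group(2), match.group(0))

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = ("/home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}"
                "_{TIMESTAMP}_{UID}.nsys-rep")
    # Hypothetical values for the variables set by the Sidecar injection.
    env = {"POD_FULLNAME": "default_web-0", "CONTAINER_NAME": "main"}
    print(expand_placeholders(template, "yourawesomeapp", 4242, env))
```

Because {TIMESTAMP} and {UID} differ on every run, two processes with the same name in the same container still produce distinct report files.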
Enabling profiling on target resources

To enable automatic Sidecar injection for all Pods in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the namespace.

kubectl label namespaces <namespace name> nvidia-devtools-sidecar-injector=enabled

To enable automatic Sidecar injection for a specific resource in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the resource.

kubectl label <resource type> <resource name> nvidia-devtools-sidecar-injector=enabled

At this point, any new Pod will be considered for injection based on its labels and on injectionMatch.

Existing resources

A Pod that is already running cannot be injected. To enable profiling, you must restart the Pod. Likewise, if you remove the label or set it to disabled, you must restart the affected Pods to remove the Sidecar injection.

Resource with more than one replica

kubectl rollout restart <resource type>/<resource name>

For example:

kubectl rollout restart deployment/amazing-service

Resource with only one replica

kubectl scale <resource type>/<resource name> --replicas=0
kubectl scale <resource type>/<resource name> --replicas=1

For example:

kubectl scale deployment/amazing-service --replicas=0
kubectl scale deployment/amazing-service --replicas=1
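
The choice between the two restart procedures can be scripted. The Python sketch below builds the kubectl commands described in this section from a replica count. The function names and the example resource are hypothetical, and running the commands assumes kubectl is configured for the target cluster.

```python
import subprocess


def restart_commands(resource_type, resource_name, replicas):
    """Return the kubectl commands that restart a resource for (re)injection."""
    target = f"{resource_type}/{resource_name}"
    if replicas <= 0:
        # Nothing is running, so there is nothing to restart.
        return []
    if replicas > 1:
        return [["kubectl", "rollout", "restart", target]]
    # Single replica: scale down to zero, then back up to one.
    return [
        ["kubectl", "scale", target, "--replicas=0"],
        ["kubectl", "scale", target, "--replicas=1"],
    ]


def restart(resource_type, resource_name, replicas):
    """Run the restart commands in order, stopping at the first failure."""
    for argv in restart_commands(resource_type, resource_name, replicas):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in restart_commands("deployment", "example-service", 3):
        print(" ".join(argv))
```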

Control profiling

Profiling can be controlled using the nsys_k8s.py script. The script can be found in NVIDIA DevTools Sidecar Injector Resources.

nsys_k8s

This script facilitates the execution of Nsight Systems commands within profiled containers of Kubernetes Pods. Additionally, it provides a convenient method for downloading profiling results. nsys_k8s searches for Pods that are labeled for profiling and looks for active Nsight Systems sessions launched by the Sidecar in them. The script supports filtering Pods with field selectors.

Prerequisites for nsys_k8s

  • Python 3.6 or higher
  • Python dependencies installed (pip install -r requirements.txt)

Usage

Nsight Systems commands

The script supports executing Nsight Systems commands within containers of Kubernetes pods, with optional filters for targeting specific namespaces, containers, and pods. Nsight Systems commands are executed only on pods that have active Nsight Systems sessions. The general command structure is as follows:

./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]

| Argument | Description |
| --- | --- |
| --field-selector | (Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See Field selectors in the Kubernetes documentation. |
| nsys_arguments... | The Nsight Systems command and arguments you wish to execute. For example, start --sampling-frequency=5000. For commands that support the --output argument, if this argument is not present, the --output value is generated from the profile.devtoolArgs Helm option value. |

Do not specify the session name in nsys_arguments; it is obtained automatically.

download command

The script supports the download command, which provides a convenient way to download profiling results from profiled Pods.

./nsys_k8s.py [--field-selector SELECTOR] download [destination]

| Argument | Description |
| --- | --- |
| --field-selector | (Optional) Filter Kubernetes objects to identify those from which profiling results will be downloaded, based on the value(s) of one or more resource fields. See Field selectors in the Kubernetes documentation. |
| destination | The path of the directory into which the profiling results will be downloaded. |
| --remove-source | (Optional) Delete source files from Pods after downloading them. |

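
A typical session starts collection, stops it, and then downloads the reports. The Python sketch below assembles those three invocations from the command structures shown in this section. The helper name, the field selector value, and the destination directory are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex


def nsys_k8s_command(action_args, field_selector=None, script="./nsys_k8s.py"):
    """Build an nsys_k8s.py argument list with an optional field selector."""
    argv = [script]
    if field_selector:
        argv += ["--field-selector", field_selector]
    return argv + list(action_args)


# Hypothetical selector: only Pods in the "demo" namespace.
SELECTOR = "metadata.namespace=demo"

workflow = [
    nsys_k8s_command(["nsys", "start", "--sampling-frequency=5000"], SELECTOR),
    nsys_k8s_command(["nsys", "stop"], SELECTOR),
    nsys_k8s_command(["download", "./nsys-results"], SELECTOR),
]

if __name__ == "__main__":
    for argv in workflow:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Using the same selector for all three steps keeps the download limited to the Pods on which collection was started.
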
check command

The script supports the check command, which provides a convenient way to check whether the NVIDIA DevTools Sidecar Injector has been injected into a specific Pod.

./nsys_k8s.py check [-n namespace] [pod]

| Argument | Description |
| --- | --- |
| -n | (Optional) The namespace of the Pod to check. |
| pod | The name of the Pod to check. |

Additional configuration options

Updating configuration

Sidecar Injector configurations can be modified after the installation. Please note, however, that the configuration of already injected Pods will not be updated until the Pods are restarted and the ConfigMaps are deleted from their namespaces (kubectl delete cm -n <namespace name> nvidia-devtools-sidecar-injector-custom).

helm upgrade -f custom_values.yaml \
    devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.0.tgz

Separate configurations

Sidecar Injector configurations can be customized for individual namespaces or Pods. To do so, create a ConfigMap named nvidia-devtools-sidecar-injector-custom in the target namespace.

Sample separate_configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile -f true --trace cuda -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^(?!.*nsys( |$)).*\\byourotherawesomeapp.*$"
    }
    }
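
Hand-writing the embedded configuration makes escaping mistakes easy: inside a double-quoted string, a single backslash before b denotes a backspace character, so the regex token \b must be written with two backslashes. The Python sketch below generates the ConfigMap with json.dumps, which applies that escaping automatically. The function name and the namespace are hypothetical. The sketch assumes the injector parses the embedded document as YAML or JSON, and it relies on kubectl accepting JSON manifests.

```python
import json


def build_custom_configmap(namespace, devtool_args, injection_match):
    """Return a ConfigMap manifest (as a dict) with a per-namespace configuration."""
    inner = json.dumps(
        {"devtoolArgs": devtool_args, "injectionMatch": injection_match},
        indent=2,
    )
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "nvidia-devtools-sidecar-injector-custom",
            "namespace": namespace,
            "labels": {"app": "nvidia-devtools-sidecar-injector"},
        },
        "data": {"injectionconfig.yaml": inner},
    }


if __name__ == "__main__":
    manifest = build_custom_configmap(
        "example-namespace",  # hypothetical namespace
        "profile -f true --trace cuda -o /mnt/nsys/output/"
        "auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
        r"^(?!.*nsys( |$)).*\byourotherawesomeapp.*$",
    )
    # Save the output to a file and apply it with: kubectl apply -f <file>
    print(json.dumps(manifest, indent=2))
```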

GPU Metrics

GPU metrics samples can only be collected by one process per GPU. The most straightforward way to avoid collisions is to collect GPU metrics from a single custom DaemonSet, which runs one Pod per node. The following resource configuration can be used to achieve that:

kubectl apply -f ./gpu_metrics_resources.yaml

gpu_metrics_resources.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-metrics-collector
  namespace: example-gpu-metrics-ns
  labels:
    nvidia-devtools-sidecar-injector: enabled
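spec:
  # apps/v1 requires a selector that matches the Pod template labels.
  # The "app: gpu-metrics-collector" label is an illustrative choice.
  selector:
    matchLabels:
      app: gpu-metrics-collector
  template:
    metadata:
      labels:
        app: gpu-metrics-collector
    spec: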
      containers:
      - name: gpu-metrics-ubuntu-container
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        securityContext:
          privileged: true
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  namespace: example-gpu-metrics-ns
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile -f true --start-later false --gpu-metrics-device=all -s system-wide -o /mnt/nsys/output/auto_gpu_metrics_%{POD_FULLNAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^sleep infinity$"
    }

The ConfigMap customizes the profiling parameters for the DaemonSet so that GPU metrics are collected. Pods started by this DaemonSet can be controlled with the nsys_k8s.py script.

Uninstall

Perform the following steps to uninstall the NVIDIA DevTools Sidecar Injector:

helm uninstall devtools-sidecar-injector

This will not automatically delete some resources, so they should be deleted manually. Replace <namespace name> with the namespace where profiled Pods are running:

kubectl delete mutatingwebhookconfiguration nvidia-devtools-sidecar-injector-webhook
kubectl delete cm -n <namespace name> nvidia-devtools-sidecar-injector
kubectl delete cm -n <namespace name> nvidia-devtools-sidecar-injector-custom

Additionally, you can remove the nvidia-devtools-sidecar-injector=enabled label from all resources that carry it:

kubectl get all --all-namespaces -l nvidia-devtools-sidecar-injector=enabled -o custom-columns=:.metadata.name,NS:.metadata.namespace,KIND:.kind --no-headers | while read -r name namespace kind; do kubectl label "$kind" "$name" -n "$namespace" nvidia-devtools-sidecar-injector-; done

Troubleshooting

General errors

If a Pod is not injected with the Sidecar container as expected, check the following items:

  1. The nvidia-devtools-sidecar-injector Pod in the nvidia-devtools-sidecar-injector namespace is in the Running state and has produced no error logs.
  2. Check that the target Pod was correctly injected: ./nsys_k8s.py check [-n namespace] [pod]

GPU metrics collection error

  1. Check that no other application collects GPU metrics on the target Pod. For example:
    • Another injection with the --gpu-metrics-device option enabled. In that case, you can use a report from that injection or modify the configurations to ensure that only one Pod is running with the GPU metrics option.
    • If the GPU Operator is installed, its nvidia-dcgm-exporter DaemonSet collects GPU metrics (see the nvidia-dcgm-exporter documentation). If you are not using it, you can temporarily disable it:
kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'

To re-enable it, run:

kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
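
The two patch commands use different patch formats. The first is a strategic merge patch (the kubectl patch default) that adds a nodeSelector matching no node, so the DaemonSet schedules no Pods. The second is a JSON Patch that removes that key again. The Python sketch below builds both patch bodies with json.dumps, which avoids shell quoting mistakes. The helper name is hypothetical, and the sketch only prints the commands.

```python
import json
import shlex

# Strategic merge patch: add a nodeSelector that no node satisfies.
DISABLE_PATCH = json.dumps(
    {"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}
)

# JSON Patch: remove the nodeSelector key added above.
ENABLE_PATCH = json.dumps(
    [{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]
)


def patch_command(patch, patch_type=None):
    """Build the kubectl patch argument list for the nvidia-dcgm-exporter DaemonSet."""
    argv = ["kubectl", "-n", "gpu-operator", "patch", "daemonset", "nvidia-dcgm-exporter"]
    if patch_type:
        argv += ["--type", patch_type]
    return argv + ["-p", patch]


if __name__ == "__main__":
    for argv in (patch_command(DISABLE_PATCH), patch_command(ENABLE_PATCH, "json")):
        print(" ".join(shlex.quote(arg) for arg in argv))
```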