The NVIDIA DevTools Sidecar Injector enables your containerized applications to be profiled by NVIDIA DevTools applications (currently, only Nsight Systems is supported). This solution leverages a Kubernetes dynamic admission controller to inject an init container, volumes with the NVIDIA DevTools application and its configurations, environment variables, and a security context upon the creation or update of your Pod.
Ensure your cluster has the admissionregistration.k8s.io/v1 API enabled. Verify that by running the following command:

kubectl api-versions | grep admissionregistration.k8s.io/v1

The result should be:

admissionregistration.k8s.io/v1
Note: Additionally, the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers should be added and listed in the correct order in the admission-control flag of kube-apiserver. Please refer to the Kubernetes documentation. This is likely set by default if your cluster is running on EKS, AKS, OKE, or GKE.
Install the chart using a values file (custom_values.yaml):

helm install -f custom_values.yaml \
devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.7.tgz
The NVIDIA DevTools Sidecar can be customized to suit particular needs. Most likely, you will need to configure the profile.devtoolArgs, profile.injectionMatch, profile.volumes, and profile.volumeMounts values. A values file can be used for setting these parameters.
Sample custom_values.yaml. This configuration enables profiling for any instance of yourawesomeapp found in injection Pods:
# Nsight Systems profiling configuration
profile:
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --start-later true -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
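The injectionMatch value is a regular expression; assuming it is tested against the command line of each process in injected containers (as the samples here suggest), a pattern can be sanity-checked before installing the chart. A minimal standalone sketch using the sample yourawesomeapp pattern:

```python
import re

# The injectionMatch pattern from the sample above: match any command line
# containing "yourawesomeapp", but never one containing the nsys binary itself.
pattern = re.compile(r"^(?!.*nsys( |$)).*\byourawesomeapp.*$")

print(bool(pattern.match("/usr/bin/yourawesomeapp --serve")))  # True
print(bool(pattern.match("nsys profile yourawesomeapp")))      # False
print(bool(pattern.match("/usr/bin/someotherapp")))            # False
```

Note how the negative lookahead excludes command lines that already invoke nsys, which prevents the injector from profiling its own profiler process.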
Sample custom_values_launch.yaml. This configuration injects Nsight Systems for later profiling of any instance of yourawesomeapp found in injection Pods; nsys_k8s.py can then be used to start/stop collection:
# Nsight Systems profiling configuration
profile:
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "launch"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
Sample custom_values_extended.yaml:
# Nsight Systems profiling configuration
profile:
  # A volume to store profiling results. It can be omitted, but in this case, the results will be lost after the pod
  # deletion and they will not be in the common location.
  # You may skip this section if you already have a shared volume for all the profiling pods.
  volumes:
    [
      {
        "name": "nsys-output-volume",
        "persistentVolumeClaim": { "claimName": "CSP-managed-disk" },
      },
    ]
  volumeMounts:
    [{ "name": "nsys-output-volume", "mountPath": "/mnt/nsys/output" }]
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --start-later false --duration 20 --kill none -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
# Node configurations which should be performed. Currently, only kernel.perf_event_paranoid is supported.
machineConfig:
  - name: kernel.perf_event_paranoid
    value: -1
Variable | Description | Default value |
---|---|---|
profile.devtoolArgs | The parameters for Nsight Systems used during profiling are detailed in the Nsight Systems User Guide. A comprehensive list of available parameters is provided there. Placeholders within these parameters will be substituted with their actual values during execution. It is recommended to include {TIMESTAMP} and {UID} placeholders in the output file name to keep filenames unique. Otherwise, the report may be overwritten or not generated at all. Example: profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep | |
profile.injectionMatch | The regex used to match the application that is to be profiled. | ^(?!/bin/)(?!/sbin/)(?!/usr/bin/)(?!/usr/sbin/)(?!.*nsys( \|$))(?!.*cat( \|$)).*$ |
profile.volumes | Additional volumes that will be injected into profiled containers. Can be useful for storing profiling results. | |
profile.volumeMounts | Volume mounts that will be injected into profiled containers. Can be useful for storing profiling results. | |
profile.env | Environment variables that will be injected into profiled containers. | |
sidecarImage.image | NVIDIA DevTools Sidecar image URL can be specified in case of custom registry usage (if the NVIDIA registry is not available). In the case of a private registry, the imagePullSecrets should also be specified. | The default Sidecar nvcr.io URL |
devtoolBinariesImage.image | NVIDIA DevTools Binaries image URL can be specified in the case of custom registry usage (if the NVIDIA registry is not available). In the case of a private registry, the imagePullSecrets should also be specified. | The default Nsight Systems nvcr.io URL |
imagePullSecrets | List of references to secrets within the same namespace for pulling Sidecar and DevTools binaries images. These secrets must be available in all namespaces containing pods that require profiling, as well as in the "nvidia-devtools-sidecar-injector" namespace. | None |
privileged | Enables profiled containers to be run in privileged mode (can be used to collect GPU metrics). | None |
capabilities | Enables profiled containers to be run with specific capabilities (for instance SYS_ADMIN can be used to collect GPU metrics) | None |
machineConfig | Array of name/value pairs (system configurations) which should be updated before profiling on target nodes (currently, only kernel.perf_event_paranoid is supported). More info about kernel.perf_event_paranoid. To prevent the NVIDIA DevTools Sidecar Injector from updating node configurations, set machineConfig: null in the custom_values.yaml file. | [{ name: kernel.perf_event_paranoid, value: 2 }] |
Placeholder | Replacement |
---|---|
{UID} | A random alphanumeric string (8 symbols) |
{PROCESS_NAME} | The profiled process name |
{PROCESS_ID} | The profiled process id |
{TIMESTAMP} | The UNIX timestamp (in ms) |
%{ANY ENVIRONMENT VARIABLE} | The "ANY ENVIRONMENT VARIABLE" environment variable inside a container. The POD_FULLNAME and CONTAINER_NAME environment variables are set by the NVIDIA DevTools Sidecar injection |
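The substitution rules above can be illustrated with a small standalone sketch. This is a hypothetical re-implementation for illustration only; the sidecar performs the actual expansion internally:

```python
import re
import secrets
import string
import time

def expand_placeholders(template: str, process_name: str, process_id: int, env: dict) -> str:
    """Illustrative expansion of the placeholder syntax described in the table above."""
    # {UID}: a random alphanumeric string of 8 symbols keeps filenames unique.
    uid = "".join(secrets.choice(string.ascii_lowercase + string.digits) for _ in range(8))
    result = (template
              .replace("{UID}", uid)
              .replace("{PROCESS_NAME}", process_name)
              .replace("{PROCESS_ID}", str(process_id))
              .replace("{TIMESTAMP}", str(int(time.time() * 1000))))
    # %{VAR} resolves to the value of the container environment variable VAR.
    return re.sub(r"%\{([^}]+)\}", lambda m: env.get(m.group(1), ""), result)

env = {"POD_FULLNAME": "default_my-pod", "CONTAINER_NAME": "main"}
print(expand_placeholders(
    "auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
    "yourawesomeapp", 1234, env))
```

Running the sketch produces a report name such as auto_yourawesomeapp_default_my-pod_main_&lt;timestamp&gt;_&lt;uid&gt;.nsys-rep, which is the shape used in the devtoolArgs samples throughout this page.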
To enable automatic Sidecar injection for all Pods in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the namespace:

kubectl label namespaces <namespace name> nvidia-devtools-sidecar-injector=enabled
To enable automatic Sidecar injection for a specific resource in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the resource:

kubectl label <resource_type> <resource_name> nvidia-devtools-sidecar-injector=enabled
At this point, any new pod will be considered for injection based on its labels and injectionMatch. An already running pod cannot be injected; you must restart the pod to enable profiling. By the same token, if you remove the label or set it to disabled, you will need to restart the pod to remove the Sidecar injection.
Resource with more than one replica
kubectl rollout restart <resource type>/<resource name>
For example:
kubectl rollout restart deployment/amazing_service
Resource with only one replica
kubectl scale <resource type>/<resource name> --replicas=0
kubectl scale <resource type>/<resource name> --replicas=1
For example:
kubectl scale deployment/amazing_service --replicas=0
kubectl scale deployment/amazing_service --replicas=1
Profiling can be controlled using the nsys_k8s.py script, which can be found in NVIDIA DevTools Sidecar Injector Resources. This script facilitates the execution of Nsight Systems commands within profiled containers of Kubernetes pods. Additionally, it provides a convenient method for downloading profiling results. nsys_k8s searches for Pods that are labeled for profiling and looks for active Nsight Systems sessions launched by the Sidecar in them.
The script supports Pod filtering using field selectors and requires its dependencies to be installed (pip install -r requirements.txt). The script supports executing Nsight Systems commands within containers of Kubernetes pods, with optional filters for targeting specific namespaces, containers, and pods. Nsight Systems commands are executed only on pods that have active Nsight Systems sessions. The general command structure is as follows:
./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]
Argument | Description |
---|---|
--field-selector | (Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See Field selectors. |
nsys_arguments... | Specify the Nsight Systems command and arguments you wish to execute. For example, start --sampling-frequency=5000. For commands that support the --output argument, if it is not present, it will be generated based on the profile.devtoolArgs Helm option value. |

Do not specify the session name in nsys_arguments; it will be obtained automatically.
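When scripting around nsys_k8s.py, the command structure above can be assembled programmatically. A minimal sketch; the build_nsys_k8s_cmd helper is hypothetical and not part of the script itself:

```python
import subprocess

def build_nsys_k8s_cmd(nsys_arguments, field_selector=None):
    """Assemble the invocation described above:
    ./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]"""
    cmd = ["./nsys_k8s.py"]
    if field_selector:
        cmd += ["--field-selector", field_selector]
    return cmd + ["nsys"] + list(nsys_arguments)

cmd = build_nsys_k8s_cmd(["start", "--sampling-frequency=5000"],
                         field_selector="spec.nodeName=gpu-node-1")
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment to execute against the cluster
```

The field selector value spec.nodeName=gpu-node-1 is an illustrative example; any selector supported by Kubernetes field selectors can be passed through.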
download command

The script supports the download command to provide a convenient way of downloading profiling results from profiled Pods.
./nsys_k8s.py [--field-selector SELECTOR] download [destination]
Argument | Description |
---|---|
--field-selector | (Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See Field selectors. |
destination | The path of the directory into which the profiling results will be downloaded. |
--remove-source | (Optional) Delete source files from Pods after downloading them. |
check command

The script supports the check command to provide a convenient way to check whether the NVIDIA DevTools Sidecar Injector is injected into a specific Pod.
./nsys_k8s.py check [-n namespace] [pod]
Argument | Description |
---|---|
-n | (Optional) The namespace of the Pod to check. |
pod | The name of the Pod to check. |
Sidecar Injector configurations can be modified after the installation. Please note, however, that the configuration of already injected Pods will not be updated until they are restarted.
helm upgrade -f custom_values.yaml \
devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.7.tgz
Sidecar Injector configurations can be customized for an individual namespace or for individual pods. To do that, use a ConfigMap named nvidia-devtools-sidecar-injector-custom.
Sample separate_configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^(?!.*nsys( |$)).*\\byourotherawesomeapp.*$"
    }
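Note that injectionconfig.yaml embeds a JSON document, so regex backslashes must be doubled: in a JSON string, `"\b"` is a backspace escape, while `"\\b"` decodes to the two-character sequence `\b` that the regex engine interprets as a word boundary. A quick standalone check:

```python
import json
import re

# "\\b" in the JSON source decodes to backslash + b, i.e. the regex
# word-boundary token, not a backspace character.
cfg = json.loads(r'{"injectionMatch": "^(?!.*nsys( |$)).*\\byourotherawesomeapp.*$"}')
pattern = re.compile(cfg["injectionMatch"])

print(bool(pattern.match("/opt/yourotherawesomeapp --run")))  # True
```

With a single backslash, the decoded pattern would contain a literal backspace and silently fail to match anything.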
GPU Metrics Samples can only be collected by one process per GPU. The most straightforward way to avoid collisions is to collect GPU metrics from a single custom DaemonSet per node. The following resources configuration can be used to achieve that:
kubectl apply -f ./gpu_metrics_resources.yaml
gpu_metrics_daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-metrics-collector
  namespace: example-gpu-metrics-ns
  labels:
    nvidia-devtools-sidecar-injector: enabled
spec:
  selector:
    matchLabels:
      app: gpu-metrics-collector
  template:
    metadata:
      labels:
        app: gpu-metrics-collector
    spec:
      containers:
        - name: gpu-metrics-ubuntu-container
          image: ubuntu:22.04
          command: ["sleep", "infinity"]
          securityContext:
            privileged: true
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  namespace: example-gpu-metrics-ns
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile --start-later false --gpu-metrics-device=all -o /mnt/nsys/output/auto_gpu_metrics_%{POD_FULLNAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^sleep infinity$"
    }
The ConfigMap customizes the profiling parameters for the DaemonSet (ensuring that GPU Metrics are collected). Pods started by this DaemonSet will be controllable by the nsys_k8s.py script.
Amazon AWS EFA Network Counters require additional configuration to be sampled. The /sys/class/infiniband/*/ports/*/hw_counters/ directory is not mounted into a container by default, so it should be mounted into the container from the host machine.
Sample custom_values_efa_mount.yaml with the required volumes:
profile:
  # Files inside the /sys/class/infiniband directory contain relative symbolic links to /sys/devices
  volumes:
    [
      {
        "name": "sys-class-infiniband",
        "hostPath": { "path": "/sys/class/infiniband", "type": "Directory" }
      },
      {
        "name": "sys-class-devices",
        "hostPath": { "path": "/sys/devices", "type": "Directory" }
      }
    ]
  volumeMounts:
    [
      {
        "name": "sys-class-infiniband",
        "mountPath": "/mnt/nv/sys/class/infiniband",
        "readOnly": true
      },
      {
        "name": "sys-class-devices",
        "mountPath": "/mnt/nv/sys/devices",
        "readOnly": true
      }
    ]
  # Enable and configure the EFA metrics plugin to collect metrics from a non-default sysfs location.
  devtoolArgs: "profile --enable efa_metrics,-efa-counters-sysfs=\"/mnt/nv/sys\" -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^/usr/bin/python3 /usr/local/bin/torchrun.*$"
Perform the following steps to uninstall the NVIDIA DevTools Sidecar Injector:
helm uninstall devtools-sidecar-injector
This will automatically delete all the resources created by devtools-sidecar-injector and remove all the nvidia-devtools-sidecar-injector labels from all the labeled resources.
Additionally, you can delete only the labels from all resources labeled with nvidia-devtools-sidecar-injector=enabled to clean up the resources from injection:
kubectl get all --all-namespaces -l nvidia-devtools-sidecar-injector=enabled -o custom-columns=:.metadata.name,NS:.metadata.namespace,KIND:.kind --no-headers | while read name namespace kind; do kubectl label $kind $name -n $namespace nvidia-devtools-sidecar-injector-; done
If you find that a pod is not injected with the sidecar container as expected, check the following items:
- The nvidia-devtools-sidecar-injector Pod in the nvidia-devtools-sidecar-injector namespace is in the running state and no error logs have been produced.
- Run ./nsys_k8s.py check [-n namespace] [pod] to verify the injection status of the specific Pod.
If GPU metrics are not collected, another Pod may already be injected with the --gpu-metrics-device option (GPU Metrics Samples can only be collected by one process per GPU). In that case, you can use a report from that injection or modify the configurations to ensure only one Pod is running with the GPU metrics option. Another common source of conflicts is the nvidia-dcgm-exporter (documentation) DaemonSet, which collects GPU metrics. If you are not using it, you can temporarily disable it:

kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
To enable it again, run the following command:
kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
Initialization of Pods with the profiler injected can be slower on OpenShift clusters during the first-time setup (post-configuration). This is due to the more complex mechanism required for node configuration, specifically the updating of kernel.perf_event_paranoid.