The NVIDIA DevTools Sidecar Injector enables your containerized applications to be profiled by NVIDIA DevTools applications (currently, only Nsight Systems is supported). This solution leverages a Kubernetes dynamic admission controller to inject an init container, volumes with the NVIDIA DevTools application and its configurations, environment variables, and a security context upon the creation or update of your Pod.
Ensure your cluster has the admissionregistration.k8s.io/v1 API enabled. Verify that by running the following command:

kubectl api-versions | grep admissionregistration.k8s.io/v1

The result should be:

admissionregistration.k8s.io/v1
Note: Additionally, the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers should be added and listed in the correct order in the admission-control flag of kube-apiserver. Please refer to the Kubernetes documentation. This is likely set by default if your cluster is running on EKS, AKS, OKE, or GKE.
Install the chart using a values file (custom_values.yaml):

helm install -f custom_values.yaml \
devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.7.tgz
The NVIDIA DevTools Sidecar can be customized to suit particular needs. Most likely, you will need to configure the profile.devtoolArgs, profile.injectionMatch, profile.volumes, and profile.volumeMounts values. A values file can be used for setting these parameters.
Sample custom_values.yaml. This configuration enables profiling for any instance of yourawesomeapp found in injection Pods:
# Nsight Systems profiling configuration
profile:
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --start-later true -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
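The injectionMatch value is a regular expression; assuming it is tested against the command line of each process in injected containers (as the samples here suggest), a pattern can be sanity-checked before installing the chart. A minimal standalone sketch using the sample yourawesomeapp pattern:

```python
import re

# The injectionMatch pattern from the sample above: match any command line
# containing "yourawesomeapp", but never one containing the nsys binary itself.
pattern = re.compile(r"^(?!.*nsys( |$)).*\byourawesomeapp.*$")

print(bool(pattern.match("/usr/bin/yourawesomeapp --serve")))  # True
print(bool(pattern.match("nsys profile yourawesomeapp")))      # False
print(bool(pattern.match("/usr/bin/someotherapp")))            # False
```

Note how the negative lookahead excludes command lines that already invoke nsys, which prevents the injector from profiling its own profiler process.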
Sample custom_values_launch.yaml. This configuration injects Nsight Systems for later profiling of any instance of yourawesomeapp found in injection Pods; nsys_k8s.py can then be used to start/stop collection:
# Nsight Systems profiling configuration
profile:
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "launch"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
Sample custom_values_extended.yaml:
# Nsight Systems profiling configuration
profile:
  # A volume to store profiling results. It can be omitted, but in this case, the results will be lost after the pod
  # deletion and they will not be in the common location.
  # You may skip this section if you already have a shared volume for all the profiling pods.
  volumes:
    [
      {
        "name": "nsys-output-volume",
        "persistentVolumeClaim": { "claimName": "CSP-managed-disk" },
      },
    ]
  volumeMounts:
    [{ "name": "nsys-output-volume", "mountPath": "/mnt/nsys/output" }]
  # The arguments for Nsight Systems. The placeholders will be replaced with the actual values.
  devtoolArgs: "profile --start-later false --duration 20 --kill none -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^(?!.*nsys( |$)).*\\byourawesomeapp.*$"
# Node configurations which should be performed. Currently, only kernel.perf_event_paranoid is supported.
machineConfig:
  - name: kernel.perf_event_paranoid
    value: -1
Variable | Description | Default value |
---|---|---|
profile.devtoolArgs | The parameters for Nsight Systems used during profiling are detailed in the Nsight Systems User Guide. A comprehensive list of available parameters is provided there. Placeholders within these parameters will be substituted with their actual values during execution. It is recommended to include {TIMESTAMP} and {UID} placeholders in the output file name to keep filenames unique. Otherwise, the report may be overwritten or not generated at all. Example: profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep | |
profile.injectionMatch | The regex used to match the application that is to be profiled. | ^(?!/bin/)(?!/sbin/)(?!/usr/bin/)(?!/usr/sbin/)(?!.*nsys( \|$))(?!.*cat( \|$)).*$ |
profile.volumes | Additional volumes that will be injected into profiled containers. Can be useful for storing profiling results. | |
profile.volumeMounts | Volume mounts that will be injected into profiled containers. Can be useful for storing profiling results. | |
profile.env | Environment variables that will be injected into profiled containers. | |
sidecarImage.image | NVIDIA DevTools Sidecar image URL can be specified in case of custom registry usage (if the NVIDIA registry is not available). In the case of a private registry, the imagePullSecrets should also be specified. | The default Sidecar nvcr.io URL |
devtoolBinariesImage.image | NVIDIA DevTools Binaries image URL can be specified in the case of custom registry usage (if the NVIDIA registry is not available). In the case of a private registry, the imagePullSecrets should also be specified. | The default Nsight Systems nvcr.io URL |
imagePullSecrets | List of references to secrets within the same namespace for pulling Sidecar and DevTools binaries images. These secrets must be available in all namespaces containing pods that require profiling, as well as in the "nvidia-devtools-sidecar-injector" namespace. | None |
privileged | Enables profiled containers to be run in privileged mode (can be used to collect GPU metrics). | None |
capabilities | Enables profiled containers to be run with specific capabilities (for instance SYS_ADMIN can be used to collect GPU metrics) | None |
machineConfig | Array of name/value pairs (system configurations) which should be updated before profiling on target nodes (currently, only kernel.perf_event_paranoid is supported). More info about kernel.perf_event_paranoid. To prevent the NVIDIA DevTools Sidecar Injector from updating node configurations, set machineConfig: null in the custom_values.yaml file. | [{ name: kernel.perf_event_paranoid, value: 2 }] |
Placeholder | Replacement |
---|---|
{UID} | A random alphanumeric string (8 symbols) |
{PROCESS_NAME} | The profiled process name |
{PROCESS_ID} | The profiled process id |
{TIMESTAMP} | The UNIX timestamp (in ms) |
%{ANY ENVIRONMENT VARIABLE} | The "ANY ENVIRONMENT VARIABLE" environment variable inside a container. The POD_FULLNAME and CONTAINER_NAME environment variables are set by the NVIDIA DevTools Sidecar injection |
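The substitution rules above can be illustrated with a small standalone sketch. This is a hypothetical re-implementation for illustration only; the sidecar performs the actual expansion internally:

```python
import re
import secrets
import string
import time

def expand_placeholders(template: str, process_name: str, process_id: int, env: dict) -> str:
    """Illustrative expansion of the placeholder syntax described in the table above."""
    # {UID}: a random alphanumeric string of 8 symbols keeps filenames unique.
    uid = "".join(secrets.choice(string.ascii_lowercase + string.digits) for _ in range(8))
    result = (template
              .replace("{UID}", uid)
              .replace("{PROCESS_NAME}", process_name)
              .replace("{PROCESS_ID}", str(process_id))
              .replace("{TIMESTAMP}", str(int(time.time() * 1000))))
    # %{VAR} resolves to the value of the container environment variable VAR.
    return re.sub(r"%\{([^}]+)\}", lambda m: env.get(m.group(1), ""), result)

env = {"POD_FULLNAME": "default_my-pod", "CONTAINER_NAME": "main"}
print(expand_placeholders(
    "auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
    "yourawesomeapp", 1234, env))
```

Running the sketch produces a report name such as auto_yourawesomeapp_default_my-pod_main_&lt;timestamp&gt;_&lt;uid&gt;.nsys-rep, which is the shape used in the devtoolArgs samples throughout this page.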
To enable automatic Sidecar injection for all Pods in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the namespace:

kubectl label namespaces <namespace name> nvidia-devtools-sidecar-injector=enabled
To enable automatic Sidecar injection for a specific resource in a namespace, add the nvidia-devtools-sidecar-injector=enabled label to the resource:

kubectl label <resource_type> <resource_name> nvidia-devtools-sidecar-injector=enabled
At this point, any new pod will be considered for injection based on its labels and injectionMatch. An already running pod cannot be injected; you must restart the pod to enable profiling. By the same token, if you remove the label or set it to disabled, you will need to restart the pod to remove the Sidecar injection.
Resource with more than one replica
kubectl rollout restart <resource type>/<resource name>
For example:
kubectl rollout restart deployment/amazing_service
Resource with only one replica
kubectl scale <resource type>/<resource name> --replicas=0
kubectl scale <resource type>/<resource name> --replicas=1
For example:
kubectl scale deployment/amazing_service --replicas=0
kubectl scale deployment/amazing_service --replicas=1
Profiling can be controlled using the nsys_k8s.py script, which can be found in NVIDIA DevTools Sidecar Injector Resources. This script facilitates the execution of Nsight Systems commands within profiled containers of Kubernetes pods. Additionally, it provides a convenient method for downloading profiling results. nsys_k8s searches for Pods that are labeled for profiling and looks for active Nsight Systems sessions launched by the Sidecar in them.
The script supports Pod filtering using field selectors and requires its dependencies to be installed (pip install -r requirements.txt). The script supports executing Nsight Systems commands within containers of Kubernetes pods, with optional filters for targeting specific namespaces, containers, and pods. Nsight Systems commands are executed only on pods that have active Nsight Systems sessions. The general command structure is as follows:
./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]
Argument | Description |
---|---|
--field-selector | (Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See Field selectors. |
nsys_arguments... | Specify the Nsight Systems command and arguments you wish to execute. For example, start --sampling-frequency=5000. For commands that support the --output argument, if it is not present, it will be generated based on the profile.devtoolArgs Helm option value. |

Do not specify the session name in nsys_arguments; it will be obtained automatically.
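When scripting around nsys_k8s.py, the command structure above can be assembled programmatically. A minimal sketch; the build_nsys_k8s_cmd helper is hypothetical and not part of the script itself:

```python
import subprocess

def build_nsys_k8s_cmd(nsys_arguments, field_selector=None):
    """Assemble the invocation described above:
    ./nsys_k8s.py [--field-selector SELECTOR] nsys [nsys_arguments...]"""
    cmd = ["./nsys_k8s.py"]
    if field_selector:
        cmd += ["--field-selector", field_selector]
    return cmd + ["nsys"] + list(nsys_arguments)

cmd = build_nsys_k8s_cmd(["start", "--sampling-frequency=5000"],
                         field_selector="spec.nodeName=gpu-node-1")
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment to execute against the cluster
```

The field selector value spec.nodeName=gpu-node-1 is an illustrative example; any selector supported by Kubernetes field selectors can be passed through.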
download command

The script supports the download command to provide a convenient way of downloading profiling results from profiled Pods.
./nsys_k8s.py [--field-selector SELECTOR] download [destination]
Argument | Description |
---|---|
--field-selector | (Optional) Filter Kubernetes objects to identify those on which an Nsight Systems command will be executed, based on the value(s) of one or more resource fields. See Field selectors. |
destination | The path of the directory into which the profiling results will be downloaded. |
--remove-source | (Optional) Delete source files from Pods after downloading them. |
check command

The script supports the check command to provide a convenient way to check whether the NVIDIA DevTools Sidecar Injector is injected into a specific Pod.
./nsys_k8s.py check [-n namespace] [pod]
Argument | Description |
---|---|
-n | (Optional) The namespace of the Pod to check. |
pod | The name of the Pod to check. |
Sidecar Injector configurations can be modified after the installation. Please note, however, that the configuration of already injected Pods will not be updated until they are restarted.
helm upgrade -f custom_values.yaml \
devtools-sidecar-injector https://helm.ngc.nvidia.com/nvidia/devtools/charts/devtools-sidecar-injector-1.0.7.tgz
Sidecar Injector configurations can be customized for an individual namespace or for individual pods. To do that, use a ConfigMap named nvidia-devtools-sidecar-injector-custom.
Sample separate_configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile -o /mnt/nsys/output/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^(?!.*nsys( |$)).*\\byourotherawesomeapp.*$"
    }
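Note that injectionconfig.yaml embeds a JSON document, so regex backslashes must be doubled: in a JSON string, `"\b"` is a backspace escape, while `"\\b"` decodes to the two-character sequence `\b` that the regex engine interprets as a word boundary. A quick standalone check:

```python
import json
import re

# "\\b" in the JSON source decodes to backslash + b, i.e. the regex
# word-boundary token, not a backspace character.
cfg = json.loads(r'{"injectionMatch": "^(?!.*nsys( |$)).*\\byourotherawesomeapp.*$"}')
pattern = re.compile(cfg["injectionMatch"])

print(bool(pattern.match("/opt/yourotherawesomeapp --run")))  # True
```

With a single backslash, the decoded pattern would contain a literal backspace and silently fail to match anything.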
GPU Metrics Samples can only be collected by one process per GPU. The most straightforward way to avoid collisions is to collect GPU metrics from a single custom DaemonSet per node. The following resources configuration can be used to achieve that:
kubectl apply -f ./gpu_metrics_resources.yaml
gpu_metrics_daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-metrics-collector
  namespace: example-gpu-metrics-ns
  labels:
    nvidia-devtools-sidecar-injector: enabled
spec:
  selector:
    matchLabels:
      app: gpu-metrics-collector
  template:
    metadata:
      labels:
        app: gpu-metrics-collector
    spec:
      containers:
        - name: gpu-metrics-ubuntu-container
          image: ubuntu:22.04
          command: ["sleep", "infinity"]
          securityContext:
            privileged: true
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-devtools-sidecar-injector-custom
  namespace: example-gpu-metrics-ns
  labels:
    app: nvidia-devtools-sidecar-injector
data:
  injectionconfig.yaml: |
    {
      "devtoolArgs": "profile --start-later false --gpu-metrics-device=all -o /mnt/nsys/output/auto_gpu_metrics_%{POD_FULLNAME}_{TIMESTAMP}_{UID}.nsys-rep",
      "injectionMatch": "^sleep infinity$"
    }
The ConfigMap customizes the profiling parameters for the DaemonSet (ensuring that GPU Metrics are collected). Pods started by this DaemonSet will be controllable by the nsys_k8s.py script.
Amazon AWS EFA Network Counters require additional configuration to be sampled. The /sys/class/infiniband/*/ports/*/hw_counters/ directory is not mounted into a container by default, so it should be mounted into the container from the host machine.
Sample custom_values_efa_mount.yaml with the required volumes:
profile:
  # Files inside the /sys/class/infiniband directory contain relative symbolic links to /sys/devices
  volumes:
    [
      {
        "name": "sys-class-infiniband",
        "hostPath": { "path": "/sys/class/infiniband", "type": "Directory" }
      },
      {
        "name": "sys-class-devices",
        "hostPath": { "path": "/sys/devices", "type": "Directory" }
      }
    ]
  volumeMounts:
    [
      {
        "name": "sys-class-infiniband",
        "mountPath": "/mnt/nv/sys/class/infiniband",
        "readOnly": true
      },
      {
        "name": "sys-class-devices",
        "mountPath": "/mnt/nv/sys/devices",
        "readOnly": true
      }
    ]
  # Enable and configure the EFA metrics plugin to collect metrics from a non-default sysfs location.
  devtoolArgs: "profile --enable efa_metrics,-efa-counters-sysfs=\"/mnt/nv/sys\" -o /home/auto_{PROCESS_NAME}_%{POD_FULLNAME}_%{CONTAINER_NAME}_{TIMESTAMP}_{UID}.nsys-rep"
  # The regex to match applications to profile.
  injectionMatch: "^/usr/bin/python3 /usr/local/bin/torchrun.*$"
Perform the following steps to uninstall the NVIDIA DevTools Sidecar Injector:
helm uninstall devtools-sidecar-injector
This will automatically delete all the resources created by devtools-sidecar-injector and remove all the nvidia-devtools-sidecar-injector labels from all the labeled resources.
Additionally, you can delete only the labels from all resources labeled with nvidia-devtools-sidecar-injector=enabled to clean up the resources from injection:
kubectl get all --all-namespaces -l nvidia-devtools-sidecar-injector=enabled -o custom-columns=:.metadata.name,NS:.metadata.namespace,KIND:.kind --no-headers | while read name namespace kind; do kubectl label $kind $name -n $namespace nvidia-devtools-sidecar-injector-; done
If you find that a pod is not injected with the sidecar container as expected, check the following items:
- The nvidia-devtools-sidecar-injector Pod in the nvidia-devtools-sidecar-injector namespace is in the running state and no error logs have been produced.
- Run ./nsys_k8s.py check [-n namespace] [pod] to verify the injection status of the specific Pod.
If GPU metrics are not collected, another Pod may already be injected with the --gpu-metrics-device option (GPU Metrics Samples can only be collected by one process per GPU). In that case, you can use a report from that injection or modify the configurations to ensure only one Pod is running with the GPU metrics option. Another common source of conflicts is the nvidia-dcgm-exporter (documentation) DaemonSet, which collects GPU metrics. If you are not using it, you can temporarily disable it:

kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
To enable it again, run the following command:
kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
Initialization of Pods with the profiler injected can be slower on OpenShift clusters during the first-time setup (post-configuration). This is due to the more complex mechanism required for node configuration, specifically the updating of kernel.perf_event_paranoid.