The Nvidia Network Operator Helm chart provides an easy way to install, configure and manage the lifecycle of the Nvidia Mellanox Network Operator.
The Nvidia Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking, RDMA and GPUDirect for workloads in a Kubernetes cluster. The Network Operator works in conjunction with the GPU-Operator to enable GPUDirect RDMA on compatible systems.
The goal of the Network Operator is to manage all networking-related components to enable execution of RDMA and GPUDirect RDMA workloads in a Kubernetes cluster.
For more information please visit the official documentation.
The Nvidia Network Operator relies on the existence of specific node labels to operate properly, e.g. a label marking a node as having Nvidia networking hardware available. This can be achieved either by manually labeling Kubernetes nodes or by using Node Feature Discovery to perform the labeling.
To allow zero-touch deployment of the operator, the Helm chart can optionally deploy Node Feature Discovery in the cluster. This is enabled via the nfd.enabled chart parameter.
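If you label nodes manually instead of using NFD, a minimal sketch of labeling a node that carries Mellanox hardware could look as follows (the label key is listed in the node labels table further below; the value true mirrors what NFD would set and is an assumption here):
$ kubectl label node <node-name> feature.node.kubernetes.io/pci-15b3.present=true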
The Nvidia Network Operator can operate in unison with the SR-IOV Network Operator to enable SR-IOV workloads in a Kubernetes cluster. The Helm chart can optionally deploy the SR-IOV Network Operator in the cluster. This is enabled via the sriovNetworkOperator.enabled chart parameter.
For more information on how to configure SR-IOV in your Kubernetes cluster using the SR-IOV Network Operator, refer to the project's GitHub repository.
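For example, a sketch of enabling the SR-IOV Network Operator deployment at install time via --set (mirroring the standard install command shown later in this document):
$ helm install --set sriovNetworkOperator.enabled=true -n network-operator --create-namespace --wait network-operator mellanox/network-operator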
Helm provides an install script to copy the Helm binary to your system:
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 500 get_helm.sh
$ ./get_helm.sh
For additional information and methods for installing Helm, refer to the official Helm website.
# Add Repo
$ helm repo add mellanox https://mellanox.github.io/network-operator
$ helm repo update
# Install Operator
$ helm install -n network-operator --create-namespace --wait network-operator mellanox/network-operator
# View deployed resources
$ kubectl -n network-operator get pods
By default the network operator deploys Node Feature Discovery (NFD) in order to perform node labeling in the cluster and allow proper scheduling of Network Operator resources. If the nodes were already labeled by other means, it is possible to disable the deployment of NFD by setting the nfd.enabled=false chart parameter:
$ helm install --set nfd.enabled=false -n network-operator --create-namespace --wait network-operator mellanox/network-operator
| Label | Where |
|---|---|
| feature.node.kubernetes.io/pci-15b3.present | Nodes bearing Nvidia Mellanox Networking hardware |
| nvidia.com/gpu.present | Nodes bearing Nvidia GPU hardware |
Note: The labels which Network Operator depends on may change between releases.
Note: By default the operator is deployed without an instance of the NicClusterPolicy and MacvlanNetwork custom resources. The user is required to create them later with a configuration matching the cluster, or use chart parameters to deploy them together with the operator.
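For example, a sketch of deploying a NicClusterPolicy together with the operator via the deployCR chart parameter (documented in the chart parameters tables below):
$ helm install --set deployCR=true -n network-operator --create-namespace --wait network-operator mellanox/network-operator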
To install a development version of the Network Operator, clone the repository first and install the Helm chart from the local directory:
# Clone Network Operator Repository
$ git clone https://github.com/Mellanox/network-operator.git
# Update chart dependencies
$ cd network-operator/deployment/network-operator && helm dependency update
# Install Operator
$ helm install -n network-operator --create-namespace --wait network-operator ./network-operator
# View deployed resources
$ kubectl -n network-operator get pods
The Network Operator has Helm tests to verify the deployment. To run the tests, the following chart parameters must be set on helm install/upgrade: deployCR, rdmaSharedDevicePlugin, secondaryNetwork, as the tests depend on a NicClusterPolicy instance being deployed by Helm.
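A sketch of an install command that sets the parameters the tests rely on (the .deploy sub-keys follow the chart parameters tables below):
$ helm install -n network-operator --create-namespace --wait \
    --set deployCR=true \
    --set rdmaSharedDevicePlugin.deploy=true \
    --set secondaryNetwork.deploy=true \
    network-operator mellanox/network-operator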
Supported tests:
- rdmaSharedDevicePlugin.resources
- rping
Run the Helm test with the following command after deploying the network operator with Helm:
$ helm test -n network-operator network-operator --timeout=5m
Notes:
- It is recommended to use --timeout, which fails the test after exceeding the given timeout.
- The default PF used by the test is ens2f0; to override it, add --set test.pf=<pf_name> to helm install/upgrade.
- Tests should be run after the NicClusterPolicy custom resource state is Ready.
- Test pod logs can be inspected with kubectl logs -n <namespace> <test-pod-name>.
NOTE: Upgrade capabilities are currently limited. Additional manual actions are required when the containerized OFED driver is used.
Before starting the upgrade to a specific release version, please check the release notes for that version to ensure that no additional actions are required.
Since Helm doesn’t support auto-upgrade of existing CRDs, the user needs to follow a two-step process to upgrade the network-operator release:
helm search repo mellanox/network-operator -l
NOTE: add the --devel option if you want to list beta releases as well.
It is possible to retrieve updated CRDs from the Helm chart or from the release branch on GitHub. The example below shows how to download and unpack the Helm chart for a specified release and then apply the CRD updates from it.
helm pull mellanox/network-operator --version <VERSION> --untar --untardir network-operator-chart
NOTE: the --devel option is required if you want to use a beta release.
kubectl apply -f network-operator-chart/network-operator/crds \
-f network-operator-chart/network-operator/charts/sriov-network-operator/crds
Download the Helm values for the specific release:
helm show values mellanox/network-operator --version=<VERSION> > values-<VERSION>.yaml
Edit the values-<VERSION>.yaml file as required for your cluster.
The network operator has some limitations regarding which updates in the NicClusterPolicy it can handle automatically. If the configuration for the new release is different from the current configuration in the deployed release, some additional manual actions may be required.
Known limitations:
These limitations will be addressed in future releases.
NOTE: changes which were made directly in the NicClusterPolicy CR (e.g. with kubectl edit) will be overwritten by the Helm upgrade.
This step is required to prevent the old network-operator version from handling the updated NicClusterPolicy CR. This limitation will be removed in future network-operator releases.
kubectl scale deployment --replicas=0 -n network-operator network-operator
You have to wait for the network-operator POD to be removed before proceeding.
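For example, a sketch of waiting for the operator POD to terminate (the POD name placeholder is whatever kubectl reports for the operator deployment):
# Watch until the operator POD disappears
$ kubectl -n network-operator get pods -w
# Or, with the POD name known, block until it is deleted
$ kubectl -n network-operator wait --for=delete pod/<network-operator-pod-name> --timeout=2m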
NOTE: the network-operator will be automatically re-enabled by the helm upgrade command; you don't need to enable it manually.
helm upgrade -n network-operator network-operator mellanox/network-operator --version=<VERSION> -f values-<VERSION>.yaml
NOTE: the --devel option is required if you want to use a beta release.
NOTE: this operation is required only if the containerized OFED driver is in use.
When the containerized OFED driver is reloaded on a node, all PODs that use a secondary network based on NVIDIA Mellanox NICs will lose the network interface in their containers. To prevent an outage, remove all PODs that use a secondary network from the node before you reload the driver POD on it.
The helm upgrade command will only update the DaemonSet spec of the OFED driver to point to the new driver version. The OFED driver's DaemonSet will not automatically restart the driver PODs on the nodes because it uses the "OnDelete" updateStrategy. The old OFED version will keep running on a node until you explicitly delete the driver POD or reboot the node.
It is possible to remove all PODs with secondary networks from all cluster nodes and then restart OFED PODs on all nodes at once.
The alternative is to perform the upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver POD can be restarted on each node individually; in this case, PODs with secondary networks only need to be removed from that single node, and there is no need to stop PODs on all nodes.
Recommended sequence to reload the driver on a node: for each node, follow the steps below; when the OFED driver becomes ready, proceed with the same steps for the other nodes.
Removing the PODs that use a secondary network from the node can be done with the node drain command:
kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>
NOTE: replace <NODE_NAME> with -l "network.nvidia.com/operator.mofed.wait=false" if you want to drain all nodes at once.
Find the OFED driver POD name for the node:
kubectl get pod -l app=mofed-<OS_NAME> -o wide -A
example for Ubuntu 20.04: kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A
Delete OFED driver POD from the node
kubectl delete pod -n <DRIVER_NAMESPACE> <OFED_POD_NAME>
NOTE: replace <OFED_POD_NAME> with -l app=mofed-ubuntu20.04 if you want to remove the OFED PODs on all nodes at once.
A new version of the OFED POD will start automatically. After the OFED POD is ready on the node, you can make the node schedulable again.
The command below will uncordon the node (remove the node.kubernetes.io/unschedulable:NoSchedule taint) and return PODs to it.
kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"
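Putting the per-node steps together, a sketch of the full sequence for a single Ubuntu 20.04 node (node name, namespace and POD selector are placeholders that should match your deployment):
# Drain PODs that use a secondary network from the node
$ kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>
# Find the OFED driver POD running on that node, then delete it
$ kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A
$ kubectl delete pod -n <DRIVER_NAMESPACE> <OFED_POD_NAME>
# After the new OFED POD is Ready, make the node schedulable again
$ kubectl uncordon <NODE_NAME>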
In order to tailor the deployment of the network operator to your cluster needs, the following chart parameters are available.
| Name | Type | Default | Description |
|---|---|---|---|
| nfd.enabled | bool | True | Deploy Node Feature Discovery |
| sriovNetworkOperator.enabled | bool | False | Deploy SR-IOV Network Operator |
| psp.enabled | bool | False | Deploy Pod Security Policy |
| operator.repository | string | nvcr.io/nvidia/cloud-native | Network Operator image repository |
| operator.image | string | network-operator | Network Operator image name |
| operator.tag | string | None | Network Operator image tag; if None, the chart's appVersion will be used |
| operator.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the Network Operator image |
| deployCR | bool | false | Deploy NicClusterPolicy custom resource according to provided parameters |
These proxy parameters will translate to the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables to be used by the network operator and the relevant resources it deploys. A production cluster environment can deny direct access to the Internet and instead have an HTTP or HTTPS proxy available.
| Name | Type | Default | Description |
|---|---|---|---|
| proxy.httpProxy | string | None | Proxy URL to use for creating HTTP connections outside the cluster. The URL scheme must be http |
| proxy.httpsProxy | string | None | Proxy URL to use for creating HTTPS connections outside the cluster |
| proxy.noProxy | string | None | A comma-separated list of destination domain names, domains, IP addresses or other network CIDRs to exclude proxying |
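For example, a sketch of a values.yaml fragment for a cluster behind a proxy (the proxy URL and the noProxy entries are placeholders):
proxy:
  httpProxy: http://proxy.example.com:8080
  httpsProxy: http://proxy.example.com:8080
  noProxy: 10.0.0.0/8,.cluster.local,.svc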
| Name | Type | Default | Description |
|---|---|---|---|
| ofedDriver.deploy | bool | false | Deploy Mellanox OFED driver container |
| ofedDriver.repository | string | mellanox | Mellanox OFED driver image repository |
| ofedDriver.image | string | mofed | Mellanox OFED driver image name |
| ofedDriver.version | string | 5.5-1.0.3.2 | Mellanox OFED driver version |
| ofedDriver.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the Mellanox OFED driver image |
| ofedDriver.startupProbe.initialDelaySeconds | int | 10 | Mellanox OFED startup probe initial delay |
| ofedDriver.startupProbe.periodSeconds | int | 10 | Mellanox OFED startup probe interval |
| ofedDriver.livenessProbe.initialDelaySeconds | int | 30 | Mellanox OFED liveness probe initial delay |
| ofedDriver.livenessProbe.periodSeconds | int | 30 | Mellanox OFED liveness probe interval |
| ofedDriver.readinessProbe.initialDelaySeconds | int | 10 | Mellanox OFED readiness probe initial delay |
| ofedDriver.readinessProbe.periodSeconds | int | 30 | Mellanox OFED readiness probe interval |
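For example, a sketch of a values.yaml fragment that enables the containerized OFED driver and tunes its startup and liveness probes (the probe values shown are illustrative):
ofedDriver:
  deploy: true
  version: 5.5-1.0.3.2
  startupProbe:
    initialDelaySeconds: 10
    periodSeconds: 20
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 30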
| Name | Type | Default | Description |
|---|---|---|---|
| nvPeerDriver.deploy | bool | false | Deploy NVIDIA Peer memory driver container |
| nvPeerDriver.repository | string | mellanox | NVIDIA Peer memory driver image repository |
| nvPeerDriver.image | string | nv-peer-mem-driver | NVIDIA Peer memory driver image name |
| nvPeerDriver.version | string | 1.1-0 | NVIDIA Peer memory driver version |
| nvPeerDriver.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the NVIDIA Peer memory driver image |
| nvPeerDriver.gpuDriverSourcePath | string | /run/nvidia/driver | GPU driver sources root filesystem path (usually used in tandem with gpu-operator) |
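For example, a sketch of a values.yaml fragment enabling the NVIDIA Peer memory driver on a cluster where the GPU driver is provided by the gpu-operator (the source path matches the default above):
nvPeerDriver:
  deploy: true
  gpuDriverSourcePath: /run/nvidia/driver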
| Name | Type | Default | Description |
|---|---|---|---|
| rdmaSharedDevicePlugin.deploy | bool | true | Deploy RDMA Shared device plugin |
| rdmaSharedDevicePlugin.repository | string | nvcr.io/nvidia/cloud-native | RDMA Shared device plugin image repository |
| rdmaSharedDevicePlugin.image | string | k8s-rdma-shared-dev-plugin | RDMA Shared device plugin image name |
| rdmaSharedDevicePlugin.version | string | v1.2.1 | RDMA Shared device plugin version |
| rdmaSharedDevicePlugin.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the RDMA Shared device plugin image |
| rdmaSharedDevicePlugin.resources | list | See below | RDMA Shared device plugin resources |
Consists of a list of RDMA resources each with a name and selector of RDMA capable network devices to be associated with the resource. Refer to RDMA Shared Device Plugin Selectors for supported selectors.
resources:
- name: rdma_shared_device_a
vendors: [15b3]
deviceIDs: [1017]
ifNames: [enp5s0f0]
- name: rdma_shared_device_b
vendors: [15b3]
deviceIDs: [1017]
ifNames: [ib0, ib1]
| Name | Type | Default | Description |
|---|---|---|---|
| sriovDevicePlugin.deploy | bool | true | Deploy SR-IOV Network device plugin |
| sriovDevicePlugin.repository | string | ghcr.io/k8snetworkplumbingwg | SR-IOV Network device plugin image repository |
| sriovDevicePlugin.image | string | sriov-network-device-plugin | SR-IOV Network device plugin image name |
| sriovDevicePlugin.version | string | a765300344368efbf43f71016e9641c58ec1241b | SR-IOV Network device plugin version |
| sriovDevicePlugin.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the SR-IOV Network device plugin image |
| sriovDevicePlugin.resources | list | See below | SR-IOV Network device plugin resources |
Consists of a list of RDMA resources each with a name and selector of RDMA capable network devices to be associated with the resource. Refer to SR-IOV Network Device Plugin Selectors for supported selectors.
resources:
- name: hostdev
vendors: [15b3]
Note: The parameters listed are non-exhaustive; for the full list of chart parameters refer to the values.yaml file.
| Name | Type | Default | Description |
|---|---|---|---|
| secondaryNetwork.deploy | bool | true | Deploy Secondary Network |
Specifies components to deploy in order to facilitate a secondary network in Kubernetes. It consists of the following optionally deployed components:
| Name | Type | Default | Description |
|---|---|---|---|
| cniPlugins.deploy | bool | true | Deploy CNI Plugins Secondary Network |
| cniPlugins.image | string | plugins | CNI Plugins image name |
| cniPlugins.repository | string | ghcr.io/k8snetworkplumbingwg | CNI Plugins image repository |
| cniPlugins.version | string | v0.8.7-amd64 | CNI Plugins image version |
| cniPlugins.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the CNI Plugins image |
| Name | Type | Default | Description |
|---|---|---|---|
| multus.deploy | bool | true | Deploy Multus Secondary Network |
| multus.image | string | multus-cni | Multus image name |
| multus.repository | string | ghcr.io/k8snetworkplumbingwg | Multus image repository |
| multus.version | string | v3.8 | Multus image version |
| multus.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the Multus image |
| multus.config | string | `` | Multus CNI config; if empty, the config will be automatically generated from the CNI configuration file of the master plugin (the first file in lexicographical order in cni-conf-dir) |
| Name | Type | Default | Description |
|---|---|---|---|
| ipamPlugin.deploy | bool | true | Deploy IPAM CNI Plugin Secondary Network |
| ipamPlugin.image | string | whereabouts | IPAM CNI Plugin image name |
| ipamPlugin.repository | string | ghcr.io/k8snetworkplumbingwg | IPAM CNI Plugin image repository |
| ipamPlugin.version | string | v0.4.2-amd64 | IPAM CNI Plugin image version |
| ipamPlugin.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the IPAM CNI Plugin image |
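For example, a sketch of a values.yaml fragment that deploys the secondary network components but skips the Whereabouts IPAM plugin, assuming another IPAM solution is already in use (an illustrative assumption):
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: false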
As there are several parameters that must be provided to create the custom resource during operator deployment, it is recommended to use a configuration file. While it is possible to provide overrides for these parameters via the CLI, it would simply be cumbersome.
Below are several deployment examples, with values.yaml provided to Helm during installation of the network operator in the following manner:
$ helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator
Network Operator deployment with a specific version of OFED driver and a single RDMA resource mapped to the enp1 netdev.
values.yaml:
deployCR: true
ofedDriver:
deploy: true
version: 5.3-1.0.0.1
rdmaSharedDevicePlugin:
deploy: true
resources:
- name: rdma_shared_device_a
ifNames: [enp1]
Network Operator deployment with the default version of OFED and NV Peer mem driver, and the RDMA device plugin with two RDMA resources, the first mapped to enp1 and enp2, the second mapped to ib0.
values.yaml:
deployCR: true
ofedDriver:
deploy: true
nvPeerDriver:
deploy: true
rdmaSharedDevicePlugin:
deploy: true
resources:
- name: rdma_shared_device_a
ifNames: [enp1, enp2]
- name: rdma_shared_device_b
ifNames: [ib0]
Network Operator deployment with:
- RDMA shared device plugin with a single RDMA resource mapped to ib0
- Secondary network components: Multus CNI, CNI plugins and IPAM plugin
values.yaml:
deployCR: true
rdmaSharedDevicePlugin:
deploy: true
resources:
- name: rdma_shared_device_a
ifNames: [ib0]
secondaryNetwork:
deploy: true
multus:
deploy: true
cniPlugins:
deploy: true
ipamPlugin:
deploy: true
Network Operator deployment with the default version of the RDMA device plugin, with an RDMA resource mapped to Mellanox ConnectX-5 devices.
values.yaml:
deployCR: true
rdmaSharedDevicePlugin:
deploy: true
resources:
- name: rdma_shared_device_a
vendors: [15b3]
deviceIDs: [1017]