Multi-node training admission controller for DGX Cloud
NeMo DGXC Admission Controller Microservice Container
The NeMo DGXC Admission Controller microservice enables multi-node GPU training environments by managing the Kubernetes resources that high-performance computing workloads depend on. It handles the networking configuration for high-speed interconnects such as Elastic Fabric Adapter (EFA) on AWS, InfiniBand on Azure, and RDMA on OCI.
You can use this service to optimize GPU workloads across multiple nodes, configure high-performance networking for distributed training, and ensure proper resource allocation for AI training jobs in Kubernetes clusters.
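This page does not describe the microservice's internals, so the following is only a minimal sketch of the kind of work a Kubernetes mutating admission webhook for multi-node networking could do, not the service's actual implementation. It builds a JSON Patch that adds a fabric device limit to each container in a pod and wraps it in an `admission.k8s.io/v1` AdmissionReview response. The function names, the `provider` argument, and the Azure and OCI resource names are hypothetical; `vpc.amazonaws.com/efa` is the extended resource name advertised by the AWS EFA Kubernetes device plugin.

```python
import base64
import json

# Assumption: one extended resource name per cloud provider.
# Only the AWS entry is a real device-plugin resource name; the
# Azure and OCI entries are hypothetical placeholders.
FABRIC_RESOURCES = {
    "aws": "vpc.amazonaws.com/efa",
    "azure": "example.com/infiniband",  # hypothetical
    "oci": "example.com/rdma",          # hypothetical
}


def escape_pointer(token):
    """Escape a key for use in a JSON Pointer path (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def build_fabric_patch(pod, provider, devices_per_container=1):
    """Return JSON Patch operations that add a fabric device limit
    to every container that does not already declare one."""
    resource = FABRIC_RESOURCES[provider]
    quantity = str(devices_per_container)
    ops = []
    for i, container in enumerate(pod["spec"]["containers"]):
        base = f"/spec/containers/{i}/resources"
        resources = container.get("resources")
        if resources is None:
            # No resources block at all: create it with the limit inside.
            ops.append({"op": "add", "path": base,
                        "value": {"limits": {resource: quantity}}})
            continue
        limits = resources.get("limits")
        if limits is None:
            # A resources block exists but has no limits map yet.
            ops.append({"op": "add", "path": f"{base}/limits",
                        "value": {resource: quantity}})
        elif resource not in limits:
            # Add only the missing key; existing limits stay untouched.
            ops.append({"op": "add",
                        "path": f"{base}/limits/{escape_pointer(resource)}",
                        "value": quantity})
    return ops


def admission_response(review, ops):
    """Wrap patch operations in an AdmissionReview response."""
    response = {"uid": review["request"]["uid"], "allowed": True}
    if ops:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(
            json.dumps(ops).encode("utf-8")).decode("ascii")
    return {"apiVersion": "admission.k8s.io/v1",
            "kind": "AdmissionReview",
            "response": response}
```

In this sketch, a webhook handler would read the pod from `review["request"]["object"]`, call `build_fabric_patch` with the provider the cluster runs on, and return the result of `admission_response` as the HTTP response body. Pods that already request the fabric device produce no patch and are admitted unchanged.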
Resources
Note: Use, distribution, or deployment of this microservice in production requires an NVIDIA AI Enterprise License.
Governing Terms
The software and materials are governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products.