
NVIDIA Mission Control autonomous hardware recovery
NVIDIA Mission Control autonomous hardware recovery (AHR) automates the testing, diagnosis, and repair of various NVIDIA platforms and will be delivered as a component of the NVIDIA Mission Control (NMC) product. This repository contains various baseline, health checks, actions, alarms and runbooks for deploying to your AHR instances.
Usage
We use OpenTofu to deploy these changes to environments. In almost all cases, you will change into the appropriate hardware directory, like CHIP/GB200/Baseline and apply the OpenTofu from that location. It will pull in the appropriate modules for the deployment.
Create a terraform.tfvars file and copy it to the appropriate directory, as noted below. If the terraform.tfvars file is not present, you will be prompted for required inputs during tofu plan and tofu apply.
With the above file, run the following:
Support
For any questions, please feel free to reach out on slack in our #shoreline-customer-success channel.
License
By pulling and using the container, you accept the terms and conditions of this End User License Agreement and Product-Specific Terms.