NVIDIA
NVIDIA
NMC AHR
Resource
NVIDIA
NVIDIA
NMC AHR

NVIDIA Mission Control AHR config files and playbooks

Subscribe to get accessSubscribe to the product below to access this premium content:
NVIDIA Mission Control
NVIDIA Mission ControlNVIDIA Mission Control™ powers every aspect of AI factory operations — from developer workloads to infrastructure to facilities — with the skills of a world-class operations team delivered as software. It powers NVIDIA Blackwell™ data centers for the newest frontiers of AI, bringing instant agility to inference and training workloads and full-stack intelligence that delivers world-class infrastructure resiliency. Mission Control lets every enterprise run AI with hyperscale-grade efficiency so you can accelerate AI experimentation.
Note: You can gain access to hundreds more GPU-optimized artifacts by creating a free NGC account.
Already Subscribed?Log in
Subscribe Now

NVIDIA Mission Control autonomous hardware recovery

NVIDIA Mission Control autonomous hardware recovery (AHR) automates the testing, diagnosis, and repair of various NVIDIA platforms and will be delivered as a component of the NVIDIA Mission Control (NMC) product. This repository contains various baseline, health checks, actions, alarms and runbooks for deploying to your AHR instances.

Usage

We use OpenTofu to deploy these changes to environments. In almost all cases, you will change into the appropriate hardware directory, like CHIP/GB200/Baseline and apply the OpenTofu from that location. It will pull in the appropriate modules for the deployment.

Create a terraform.tfvars file and copy it to the appropriate directory, as noted below. If the terraform.tfvars file is not present, you will be prompted for required inputs during tofu plan and tofu apply.

# terraform.tfvars

# The name of the headnode
# Note: only one node is supported
headnode_name="headnode"

# The name of the Slurm node from where to submit slurm jobs
# Note: Only one node is supported
slurm_node_name="slurmname"

# The URL of the Shoreline API Endpoint
shoreline_url="https://your-instance.nvidia.com"

# The jwt for the Shoreline API, found in Access Controll
shoreline_token="<jwt>"

# Nvidia Container Registry token
nvcr_token="<token>"

With the above file, run the following:

# cd to the appropriate directory (with terraform.tfvars)
cd CHIP/<HARDWARE>/Baseline

tofu init

# if terraform.tfvars does not exist,
# you will be prompted for values
tofu plan
tofu apply

Support

For any questions, please feel free to reach out on slack in our #shoreline-customer-success channel.

License

By pulling and using the container, you accept the terms and conditions of this End User License Agreement and Product-Specific Terms.

Publisher
NVIDIA
NVIDIA
Latest Version2.3.28
UpdatedJune 2, 2026 UTC
Compressed Size20.49 MB