NVIDIA

nemo-automodel

Container

NVIDIA NeMo™ AutoModel accelerates LLM and VLM training and fine‑tuning with PyTorch DTensor‑native SPMD, day‑0 Hugging Face support, and optimized parallelism from single‑ to multi‑node scale.

What is the NeMo AutoModel Container?

NVIDIA NeMo™ AutoModel is an open-source, PyTorch DTensor-native SPMD training library within the NVIDIA NeMo Framework for scaling the training and fine-tuning of large language models (LLMs) and vision-language models (VLMs). Designed for flexibility, reproducibility, and scale, NeMo AutoModel supports both small-scale experiments and large multi-GPU, multi-node runs, enabling fast iteration in research and production environments.

NeMo AutoModel provides day-0 Hugging Face model support with PyTorch-native parallelism, custom-optimized kernels, and memory-efficient recipes—all while preserving the original checkpoint format for seamless use across the Hugging Face ecosystem.

What You Get with NVIDIA NeMo AutoModel Container

Built on PyTorch DTensor with SPMD (Single Program Multiple Data) architecture, NeMo AutoModel provides a production-grade, scalable training platform with comprehensive parallelism strategies, learning algorithms, and day-0 model support:

Advanced Parallelism Strategies

FSDP2 - PyTorch Fully Sharded Data Parallelism v2 for distributed training
HSDP - Hybrid Sharded Data Parallelism, built on FSDP2, for multi-node scaling
Tensor Parallelism (TP) - Partition model tensors across GPUs with DTensor
Context Parallelism (CP) - Split sequence contexts for extended context windows
Pipeline Parallelism (PP) - Torch-native pipelining composable with FSDP2 and DTensor (3D Parallelism)
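The strategies above compose by factoring the total GPU count into one mesh dimension per strategy. The following is a minimal sketch of that bookkeeping in plain Python. It is not NeMo AutoModel's API: the function name `mesh_shape` and the dimension ordering are assumptions made for illustration.

```python
def mesh_shape(world_size: int, tp: int = 1, cp: int = 1, pp: int = 1) -> dict:
    """Derive the data-parallel degree left over once TP, CP, and PP are fixed.

    Hypothetical helper for illustration; the ordering of the returned
    dimensions is an assumption, not a library convention.
    """
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*cp*pp={model_parallel}"
        )
    # Whatever is not consumed by TP/CP/PP is available for FSDP2/HSDP sharding.
    dp = world_size // model_parallel
    return {"pp": pp, "dp": dp, "cp": cp, "tp": tp}


# One node with 8 GPUs, tensor parallelism across pairs of GPUs: dp is 4.
single_node = mesh_shape(8, tp=2)

# 64 GPUs with TP=4 and PP=2: dp is 8.
multi_node = mesh_shape(64, tp=4, pp=2)
```

The product of all four degrees always equals the world size, which is why the same script can move between GPU counts by changing only these values.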

Learning Algorithms

Supervised Fine-Tuning (SFT) - Fine-tune models on instruction-following datasets for both LLMs and VLMs
Parameter-Efficient Fine-Tuning (PEFT) - Memory-efficient adaptation with LoRA for LLMs and VLMs
Pre-training - Support for model pre-training, including large MoE models (DeepSeek V3, Moonlight-16B-TE)
Knowledge Distillation - Distill knowledge from larger teacher models to smaller student models
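To make the memory argument for LoRA concrete, the sketch below counts trainable parameters for one linear layer under the standard LoRA formulation, where a frozen weight of shape (d_out, d_in) is adapted by two low-rank factors of rank r. The layer size and rank are illustrative assumptions, not NeMo AutoModel defaults.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative values only: a 4096 x 4096 projection adapted with rank 16.
full = full_param_count(4096, 4096)      # 16,777,216
lora = lora_param_count(4096, 4096, 16)  # 131,072
ratio = lora / full                      # 0.0078125, i.e. 1/128 of the layer
```

Because only the low-rank factors receive gradients and optimizer state, the savings grow with layer width and shrink as the rank increases.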

Advanced Features

Day-0 Hugging Face Support - Instantly fine-tune models instantiable via transformers (subject to dependency and feature compatibility). Refer to Model Coverage for details.
Sequence Packing - Packs multiple short sequences into each fixed-length sample to reduce padding and improve training throughput
FP8 Mixed Precision - FP8 training support with torchao for torch.compile-supported models
Distributed Checkpointing (DCP) - SafeTensors-based sharded checkpoints with merge/reshard utilities
MoE Model Support - Optimized kernels for Mixture-of-Experts models (DeepSeek V3, Qwen3, GPT-OSS)
SPMD Architecture - Same training script runs on 1 GPU or 1000+ by changing the mesh configuration
YAML-Driven Recipes - Minimal ceremony with config-based workflows, override any field via CLI
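The idea behind sequence packing, listed among the features above, can be sketched with a greedy first-fit-decreasing packer. This is a generic illustration, not NeMo AutoModel's implementation: the function names and the example lengths are assumptions, and a real packer also has to adjust attention masks and position IDs so packed sequences do not attend to each other.

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedily place each sequence length into the first bin with room (first-fit decreasing)."""
    bins: list[list[int]] = []
    for length in sorted(lengths, reverse=True):
        if length > max_len:
            raise ValueError(f"sequence of length {length} exceeds max_len={max_len}")
        for current in bins:
            if sum(current) + length <= max_len:
                current.append(length)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([length])
    return bins


def padding_fraction(num_rows: int, total_tokens: int, max_len: int) -> float:
    """Share of the batch that is padding when every row is padded to max_len."""
    return 1 - total_tokens / (num_rows * max_len)


# Hypothetical token counts for seven training samples.
lengths = [900, 300, 700, 100, 500, 200, 400]
packed = pack_sequences(lengths, max_len=1024)
# packed == [[900, 100], [700, 300], [500, 400], [200]]

unpacked_waste = padding_fraction(len(lengths), sum(lengths), 1024)  # 7 rows
packed_waste = padding_fraction(len(packed), sum(lengths), 1024)     # 4 rows
```

In this example the same 3,100 tokens fit into 4 rows instead of 7, so fewer padded positions are computed per step.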

Model Support

Comprehensive support for LLM and VLM families from Hugging Face Hub (1B to 671B parameters):

LLMs: Llama 3.x, Mistral, Mixtral, Qwen 2.5/3, Gemma 2/3, Phi 2/3/4, DeepSeek V3, GPT-OSS, Moonlight, Baichuan, Seed
VLMs: Qwen2.5-VL, Gemma-3-VL, Gemma-3n-VL, Phi-4-Vision
MoE Models: DeepSeek V3 (671B), GPT-OSS (20B-120B), Qwen3 MoE (30B), Mixtral-8x7B

Performance & Scale

NeMo AutoModel combines optimized kernels with efficient parallelism. Reported throughput and scaling characteristics include:

DeepSeek V3 (671B): 250 Model TFLOPs/sec/GPU, 1,002 tokens/sec/GPU on 256 GPUs (TE + DeepEP)
GPT-OSS (20B): 279 Model TFLOPs/sec/GPU, 13,058 tokens/sec/GPU on 8 GPUs (TE + DeepEP + FlexAttn)
Qwen3 MoE (30B): 212 Model TFLOPs/sec/GPU, 11,842 tokens/sec/GPU on 8 GPUs (TE + DeepEP)
Scalable Multi-Node Training: Supports training across hundreds of GPUs with efficient resource utilization
Memory Efficiency: Sequence packing and advanced parallelism techniques reduce memory overhead
Flexible Deployment: Works on single-node setups for experimentation and scales to multi-node clusters for production

Refer to the NVIDIA NeMo AutoModel Performance Summary for detailed performance benchmarks.
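The per-GPU throughput figures listed above can be turned into cluster-wide throughput by multiplying by the GPU count. The short sketch below does that arithmetic with the numbers quoted on this page; the helper name is an assumption, and the result is a derived estimate rather than a separately measured benchmark.

```python
# (tokens/sec/GPU, number of GPUs) as quoted in the list above.
benchmarks = {
    "DeepSeek V3 (671B)": (1002, 256),
    "GPT-OSS (20B)": (13058, 8),
    "Qwen3 MoE (30B)": (11842, 8),
}


def aggregate_tokens_per_sec(tokens_per_sec_per_gpu: int, num_gpus: int) -> int:
    """Cluster-wide throughput implied by a per-GPU throughput figure."""
    return tokens_per_sec_per_gpu * num_gpus


aggregate = {
    name: aggregate_tokens_per_sec(per_gpu, gpus)
    for name, (per_gpu, gpus) in benchmarks.items()
}
# DeepSeek V3 (671B): 256,512 tokens/sec
# GPT-OSS (20B):      104,464 tokens/sec
# Qwen3 MoE (30B):     94,736 tokens/sec
```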

Getting Started With NVIDIA NeMo AutoModel

Refer to the NVIDIA NeMo AutoModel documentation for step-by-step instructions.

Pull the Container

Select Get Container on this page.

Copy the latest tag image path.

  # example
  docker pull nvcr.io/nvidia/nemo-automodel:25.11

Refer to the NeMo AutoModel releases page for release notes and other available versions.

NeMo AutoModel GitHub

You can check out the project directly on GitHub at NVIDIA-NeMo/Automodel.

Developer Container

For developers who want access to features that have been implemented but are not yet included in a major release, please use our GitHub repository.

The NeMo AutoModel framework is actively developed with regular updates. If you encounter problems, please report the issue on GitHub and specify the version and container details.

Questions? Browse the current discussions on GitHub or submit a new question.

Found a bug? Report it as an issue on GitHub.

Documentation

More detailed documentation is available in the NeMo AutoModel User Guide, including comprehensive guides for:

Overview - Architecture and design philosophy
Installation - Setup and environment configuration
LLM Fine-Tuning - Supervised fine-tuning and PEFT for language models
VLM Fine-Tuning - Vision-language model training
Gradient Checkpointing - Memory optimization techniques
FP8 Training - Mixed-precision training configuration
Checkpointing - Distributed checkpoint management
Model Coverage - Complete list of supported models

License

Governing Terms: Your use of NeMo AutoModel is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products.

Publisher

NVIDIA

Latest Tag: 26.06

Updated: July 2, 2026 UTC

Compressed Size: 12.2 GB

Multinode Support: No

Multi-Arch Support: Yes

System

signed images

Labels

AI Deep Learning Examples Language Modeling NeMo NSPECT-VURV-H2II NVIDIA AI PyTorch