NVIDIA NeMo™ AutoModel accelerates LLM and VLM training and fine‑tuning with PyTorch DTensor‑native SPMD, day‑0 Hugging Face support, and optimized parallelism from single‑ to multi‑node scale.
What is the NeMo AutoModel Container?
NVIDIA NeMo™ AutoModel is an open-source, PyTorch DTensor-native SPMD training library within the NVIDIA NeMo Framework for scaling training and fine-tuning of large language models (LLMs) and vision-language models (VLMs). Designed for flexibility, reproducibility, and scale, it supports both small-scale experiments and large multi-GPU, multi-node runs in research and production environments.
NeMo AutoModel provides day-0 Hugging Face model support with PyTorch-native parallelism, custom-optimized kernels, and memory-efficient recipes—all while preserving the original checkpoint format for seamless use across the Hugging Face ecosystem.
What You Get with the NVIDIA NeMo AutoModel Container
Built on PyTorch DTensor with SPMD (Single Program Multiple Data) architecture, NeMo AutoModel provides a production-grade, scalable training platform with comprehensive parallelism strategies, learning algorithms, and day-0 model support:
Advanced Parallelism Strategies
- FSDP2 - PyTorch Fully Sharded Data Parallelism v2 for distributed training
- HSDP - Hybrid Sharded Data Parallelism, built on FSDP2, for multi-node scaling
- Tensor Parallelism (TP) - Partition model tensors across GPUs with DTensor
- Context Parallelism (CP) - Split sequence contexts for extended context windows
- Pipeline Parallelism (PP) - Torch-native pipelining composable with FSDP2 and DTensor (3D Parallelism)
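The strategies above compose by factoring the total GPU count into one dimension per strategy. The following is a minimal sketch of that arithmetic in plain Python, not the NeMo AutoModel API: the `MeshShape` class, the `rank_coordinates` helper, and the rank ordering (TP varying fastest) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshShape:
    """Hypothetical parallelism layout (illustrative, not a NeMo AutoModel type)."""

    pp: int = 1  # pipeline-parallel stages
    cp: int = 1  # context-parallel ranks
    tp: int = 1  # tensor-parallel ranks

    def data_parallel_size(self, world_size: int) -> int:
        """GPUs left for the data-parallel (FSDP2/HSDP) dimension."""
        model_parallel = self.pp * self.cp * self.tp
        if world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by pp*cp*tp={model_parallel}"
            )
        return world_size // model_parallel


def rank_coordinates(rank: int, world_size: int, shape: MeshShape) -> tuple:
    """Map a flat rank to (dp, pp, cp, tp) coordinates.

    Assumption: TP varies fastest, so each TP group occupies consecutive ranks.
    """
    shape.data_parallel_size(world_size)  # validates divisibility
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    tp_idx = rank % shape.tp
    rank //= shape.tp
    cp_idx = rank % shape.cp
    rank //= shape.cp
    pp_idx = rank % shape.pp
    dp_idx = rank // shape.pp
    return dp_idx, pp_idx, cp_idx, tp_idx


if __name__ == "__main__":
    shape = MeshShape(pp=2, cp=1, tp=4)
    print(shape.data_parallel_size(64))   # 64 / (2 * 1 * 4) = 8
    print(rank_coordinates(5, 64, shape))  # (0, 1, 0, 1)
```

Under these assumptions, the same layout description applies at any scale whose GPU count is divisible by `pp * cp * tp`; only the data-parallel dimension changes.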
Learning Algorithms
- Supervised Fine-Tuning (SFT) - Fine-tune models on instruction-following datasets for both LLMs and VLMs
- Parameter-Efficient Fine-Tuning (PEFT) - Memory-efficient adaptation with LoRA for LLMs and VLMs
- Pre-training - Support for model pre-training including large MoE models (DeepSeekV3, Moonlight-16B-TE)
- Knowledge Distillation - Distill knowledge from larger teacher models to smaller student models
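To make the memory argument for LoRA concrete, the sketch below counts trainable parameters for a single linear layer. It relies only on the standard LoRA formulation, in which a frozen weight of shape `d_out x d_in` is adapted by two low-rank factors of shapes `d_out x r` and `r x d_in`. The layer size and rank are illustrative assumptions, not NeMo AutoModel defaults.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in), bias ignored."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative sizes only: a 4096 x 4096 projection adapted at rank 8.
    d_in = d_out = 4096
    rank = 8
    full = full_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"LoRA (r={rank}):   {lora:,} trainable parameters")
    print(f"ratio: {lora / full:.4%}")
```

Because optimizer state and gradients are kept only for trainable parameters, a smaller trainable set reduces the memory needed beyond the frozen base weights.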
Advanced Features
- Day-0 Hugging Face Support - Instantly fine-tune models instantiable via transformers (subject to dependency and feature compatibility). Refer to Model Coverage for details.
- Sequence Packing - Higher training throughput by packing multiple short sequences into each fixed-length sample, which reduces padding
- FP8 Mixed Precision - FP8 training support with torchao for torch.compile-supported models
- Distributed Checkpointing (DCP) - SafeTensors-based sharded checkpoints with merge/reshard utilities
- MoE Model Support - Optimized kernels for Mixture-of-Experts models (DeepSeekV3, Qwen-3, GPT-OSS)
- SPMD Architecture - Same training script runs on 1 GPU or 1000+ by changing the mesh configuration
- YAML-Driven Recipes - Minimal ceremony with config-based workflows, override any field via CLI
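The sequence packing feature listed above can be illustrated with a generic greedy packer. This is a minimal sketch using first-fit-decreasing bin packing, which is a stand-in technique and not necessarily the algorithm NeMo AutoModel uses. The function names and the example lengths are hypothetical.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing.

    Returns a list of rows, each a list of indices into `lengths` whose
    total token count does not exceed `max_len`.
    """
    rows = []  # indices packed into each row
    free = []  # remaining token capacity of each row
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, more than max_len={max_len}")
        for row, room in enumerate(free):
            if n <= room:
                rows[row].append(i)
                free[row] -= n
                break
        else:
            # No existing row has room, so open a new one.
            rows.append([i])
            free.append(max_len - n)
    return rows


def padding_fraction(lengths, num_rows, max_len):
    """Fraction of token slots that hold padding rather than real tokens."""
    return 1 - sum(lengths) / (num_rows * max_len)


if __name__ == "__main__":
    lengths = [900, 300, 120, 700, 60, 500]  # hypothetical token counts
    max_len = 1024
    rows = pack_sequences(lengths, max_len)
    print(rows)  # [[0, 2], [3, 1], [5, 4]]
    print(f"unpacked padding: {padding_fraction(lengths, len(lengths), max_len):.1%}")
    print(f"packed padding:   {padding_fraction(lengths, len(rows), max_len):.1%}")
```

In this toy example, six padded rows shrink to three packed rows. A real implementation also has to keep packed sequences from attending to each other, typically through attention masking or per-sequence position IDs, which this sketch does not cover.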
Model Support
Comprehensive support for LLM and VLM families from Hugging Face Hub (1B to 671B parameters):
- LLMs: Llama 3.x, Mistral, Mixtral, Qwen 2.5/3, Gemma 2/3, Phi 2/3/4, DeepSeek V3, GPT-OSS, Moonlight, Baichuan, Seed
- VLMs: Qwen2.5-VL, Gemma-3-VL, Gemma-3n-VL, Phi-4-Vision
- MoE Models: DeepSeekV3 (671B), GPT-OSS (20B-120B), Qwen-3 MoE (30B), Mixtral-8x7B
Performance & Scale
NeMo AutoModel pairs optimized kernels with efficient parallelism. Reported results and scaling characteristics include:
- DeepSeek V3 (671B): 250 Model TFLOPs/sec/GPU, 1,002 tokens/sec/GPU on 256 GPUs (TE + DeepEP)
- GPT-OSS (20B): 279 Model TFLOPs/sec/GPU, 13,058 tokens/sec/GPU on 8 GPUs (TE + DeepEP + FlexAttn)
- Qwen3 MoE (30B): 212 Model TFLOPs/sec/GPU, 11,842 tokens/sec/GPU on 8 GPUs (TE + DeepEP)
- Scalable Multi-Node Training: Supports training across hundreds of GPUs with efficient resource utilization
- Memory Efficiency: Sequence packing and advanced parallelism techniques reduce memory overhead
- Flexible Deployment: Works on single-node setups for experimentation and scales to multi-node clusters for production
Refer to the NVIDIA NeMo AutoModel Performance Summary for detailed performance benchmarks.
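The per-GPU figures above can be turned into two derived quantities: aggregate cluster throughput and the implied model FLOPs per token. The sketch below does only that arithmetic. It assumes both per-GPU numbers in each row come from the same run and that throughput adds linearly across GPUs; the helper names are hypothetical.

```python
# (Model TFLOPs/sec/GPU, tokens/sec/GPU, number of GPUs), copied from the list above.
REPORTED = {
    "DeepSeek V3 (671B)": (250, 1002, 256),
    "GPT-OSS (20B)": (279, 13058, 8),
    "Qwen3 MoE (30B)": (212, 11842, 8),
}


def aggregate_tokens_per_sec(tokens_per_sec_per_gpu, num_gpus):
    """Cluster-wide tokens/sec, assuming per-GPU throughput adds linearly."""
    return tokens_per_sec_per_gpu * num_gpus


def gflops_per_token(tflops_per_gpu, tokens_per_sec_per_gpu):
    """Implied model GFLOPs spent per token (TFLOPs/sec divided by tokens/sec)."""
    return tflops_per_gpu * 1e3 / tokens_per_sec_per_gpu


if __name__ == "__main__":
    for name, (tflops, tokens, gpus) in REPORTED.items():
        total = aggregate_tokens_per_sec(tokens, gpus)
        per_token = gflops_per_token(tflops, tokens)
        print(f"{name}: {total:,} tokens/sec total, ~{per_token:.1f} GFLOPs/token")
```

For example, 1,002 tokens/sec/GPU on 256 GPUs corresponds to 256,512 tokens/sec in aggregate under the linear-scaling assumption.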
Getting Started With NVIDIA NeMo AutoModel
Refer to the NVIDIA NeMo AutoModel documentation for step-by-step instructions.
Pull the Container
- Select Get Container on this page.
- Copy the latest tag image path.
# example
docker pull nvcr.io/nvidian/nemo-automodel:25.11
Refer to the NeMo AutoModel releases page for release notes and other available versions.
NeMo AutoModel GitHub
You can check out the project directly on GitHub at NVIDIA-NeMo/Automodel.
Developer Container
For developers who want access to features that have been implemented but are not yet included in a major release, please use our GitHub repository.
The NeMo AutoModel framework is actively developed with regular updates. If you encounter problems, please report the issue on GitHub and specify the version and container details.
Questions? See the current discussions and submit a question.
Found a bug? Report it as an issue on GitHub.
Documentation
More detailed documentation is available in the NeMo AutoModel User Guide, including comprehensive guides for:
- Overview - Architecture and design philosophy
- Installation - Setup and environment configuration
- LLM Fine-Tuning - Supervised fine-tuning and PEFT for language models
- VLM Fine-Tuning - Vision-language model training
- Gradient Checkpointing - Memory optimization techniques
- FP8 Training - Mixed-precision training configuration
- Checkpointing - Distributed checkpoint management
- Model Coverage - Complete list of supported models
License
Governing Terms: Your use of NeMo AutoModel is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products.