NVIDIA

NeMo Curator

Container

NVIDIA NeMo™ Curator accelerates generative AI model development with GPU-powered data curation, offering comprehensive text, image, video, and audio processing for enterprise-scale deployments.

NVIDIA AI Enterprise Supported

What is the NeMo Curator Container?

NVIDIA NeMo™ Curator is a GPU-accelerated framework for curating generative AI training data across multiple modalities. It is designed for fast, scalable dataset preparation for generative AI use cases such as foundation language model pretraining, multimodal model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). By running on GPUs with Ray and RAPIDS, NeMo Curator significantly shortens data curation and helps prepare high-quality tokens that accelerate model convergence.

For enterprises building AI solutions, NeMo Curator provides a production-grade, scalable, end-to-end data curation platform that includes comprehensive text, image, video, and audio processing capabilities, as well as enterprise support to streamline adoption. Organizations can efficiently curate massive multimodal datasets for their AI operations, enhancing data quality, ensuring compliance, and ultimately driving greater value from their AI investments.

What You Get with NVIDIA NeMo Curator Container

At the heart of NeMo Curator is the combination of distributed data processing and GPU acceleration across multiple modalities. Built on a Ray-based architecture, NeMo Curator distributes work across the GPU resources and memory of multiple nodes. Using Ray and RAPIDS, it supports multi-node and multi-GPU data processing with autoscaling, which significantly reduces curation time. NeMo Curator also incorporates a range of curation techniques across all supported modalities:

Text Curation Capabilities

Download and Extraction (Common Crawl, Wikipedia, ArXiv)
Language Identification and Text Cleaning
Heuristic Filters (30+)
Classifier Filters (Domain, Quality, Safety, Prompt Task, Prompt Complexity, Educational Quality Classification)
GPU-Accelerated Deduplication (Exact, Fuzzy via MinHash LSH, Semantic)
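
To make the idea behind fuzzy deduplication via MinHash LSH concrete, here is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's implementation or API. The function names, the character 5-gram shingling, the 64 hash functions, and the 16 bands are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_perm=64, k=5):
    """Build a MinHash signature: the minimum hash value per seeded hash function."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents sharing any band bucket become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


# Hypothetical documents, used only to exercise the functions above.
docs = {
    "a": "GPU-accelerated data curation for large language models",
    "b": "GPU accelerated data curation for large language models!",
    "c": "0123456789 0123456789",
}
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
candidates = lsh_candidate_pairs(signatures)
```

In a full pipeline, candidate pairs would typically be verified (for example with `estimated_jaccard` against a threshold) before one document of each pair is dropped.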

Image Curation Capabilities

DALI-based High-Performance Image Loading from WebDataset tar shards
Embedding Creation and Classification (Aesthetic, NSFW)
GPU-Accelerated Semantic Deduplication
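
Semantic deduplication generally works on embeddings rather than raw content. The sketch below shows one simple variant, assuming embeddings are already computed: an item is kept unless its cosine similarity to an already kept item reaches a threshold. The function names, the 0.95 threshold, the greedy keep-first policy, and the toy 3-dimensional vectors are illustrative assumptions, not NeMo Curator's API or its actual algorithm.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy pass: keep an item only if it is not too similar to any kept item."""
    kept = []
    for item_id, vec in embeddings.items():
        if all(cosine_similarity(vec, embeddings[k]) < threshold for k in kept):
            kept.append(item_id)
    return kept


# Hypothetical embeddings; real ones would come from an embedding model.
embeddings = {
    "img_0": [1.0, 0.0, 0.0],
    "img_1": [0.99, 0.05, 0.0],  # nearly parallel to img_0
    "img_2": [0.0, 1.0, 0.0],
}
kept = semantic_dedup(embeddings)
print(kept)  # ['img_0', 'img_2']
```

Comparing every item against every kept item is quadratic, which is why large-scale approaches usually cluster embeddings first and compare only within each cluster.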

Video Curation Capabilities (New in 25.09)

Video Splitting (Fixed-stride and Scene-change Detection via TransNetV2)
Semantic Deduplication (K-means Clustering and Pairwise Similarity)
Content Filtering (Motion-based and Aesthetic Filtering)
Embedding Generation (InternVideo2 and Cosmos-Embed1 Models)
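
Fixed-stride splitting can be illustrated with a few lines of Python that compute clip boundaries from a video duration. This is a sketch of the general idea only. The function name, the 10-second clip length, and the rule that drops a trailing clip shorter than 2 seconds are assumptions, not NeMo Curator defaults.

```python
def fixed_stride_spans(duration_s, clip_len_s=10.0, min_clip_len_s=2.0):
    """Return (start, end) spans in seconds, cut at a fixed stride.

    A final remainder shorter than min_clip_len_s is discarded.
    """
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        if end - start >= min_clip_len_s:
            spans.append((start, end))
        start += clip_len_s
    return spans


print(fixed_stride_spans(25.0))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
print(fixed_stride_spans(21.0))  # [(0.0, 10.0), (10.0, 20.0)]
```

Scene-change detection differs in that the boundaries come from a model's predictions rather than from a fixed interval.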

Audio Curation Capabilities (New in 25.09)

Automatic Speech Recognition (ASR) using NeMo Framework Pretrained Models
Quality Assessment (Word Error Rate and Character Error Rate Calculation)
Speech Metrics (Duration Analysis and Speech Rate Metrics)
Text Integration via AudioToDocumentStage
JSONL Manifest Format Support
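
Word Error Rate and Character Error Rate are both edit distances normalized by the reference length. The following is a minimal pure-Python sketch of the standard definitions, not the code NeMo Curator runs. The function names are hypothetical, and the sketch assumes a non-empty reference and no text normalization beyond whitespace splitting.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"  # the ASR output dropped one word
print(wer(reference, hypothesis))  # 1 error over 6 words
print(cer(reference, hypothesis))  # 4 errors over 22 characters
```

A quality filter could then keep only the utterances whose WER falls below a chosen threshold.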

Performance & Scale

Note: The following performance numbers are from v25.07.

NeMo Curator delivers exceptional performance at scale:

Deduplicated 1.96 trillion tokens in 0.5 hours using 32 NVIDIA H100 GPUs (RedPajama V2 scale)
Up to 80% data reduction with significant improvements in downstream model performance
16x faster processing compared to alternative libraries for fuzzy deduplication
Enhanced GPU utilization through improved model-based classifier throughput with length-based sequence sorting
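
One common reason length-based sequence sorting helps classifier throughput is that each batch is padded to its longest sequence, so grouping sequences of similar length wastes fewer padded tokens. The source does not describe the mechanism, so treat this explanation and the sketch below as an assumption. The token lengths and batch size are invented for illustration.

```python
def padded_tokens(lengths, batch_size):
    """Total tokens processed when every batch is padded to its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for eight documents, mixing short and long inputs.
lengths = [12, 480, 15, 510, 9, 495, 20, 500]

unsorted_cost = padded_tokens(lengths, batch_size=4)
sorted_cost = padded_tokens(sorted(lengths), batch_size=4)

print(sum(lengths))   # 2041 real tokens
print(unsorted_cost)  # 4040 tokens including padding
print(sorted_cost)    # 2120 tokens including padding
```

In this toy case, sorting nearly halves the number of tokens the model has to process for the same inputs.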

Getting Started With NVIDIA NeMo Curator

Refer to the NVIDIA NeMo Curator documentation for step-by-step instructions on how to get started quickly with the NeMo Curator framework.

Installation Options

Docker Container: The latest NeMo Curator container (25.09) is available through the NGC Catalog:

docker pull nvcr.io/nvidia/nemo-curator:25.09

NeMo Curator GitHub: You can check out the project directly on GitHub at NVIDIA-NeMo/Curator.

Developer Container

For developers who want access to features that have been implemented but are not yet included in a major release, please use our GitHub repo.

The NeMo Curator framework is actively developed with regular updates. If you encounter problems, please report the issue on GitHub and specify the version and container details.

Questions? See the current discussions and submit a question.

Report a bug? You can report a bug here.

Known Limitations

Note: The following features are currently being refactored for Ray compatibility and will be available in future releases:

Synthetic Data Generation: Synthetic text generation features are being updated for Ray backend compatibility
PII Processing: Personally Identifiable Information removal tools are being migrated to the new architecture
Data Blending & Shuffling: Multi-source dataset blending and large-scale shuffling operations are under development

Documentation

More detailed documentation is available in the NeMo Curator User Guide, including comprehensive guides for each supported modality.

License

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

Get Help

Enterprise Support

Get access to knowledge base articles and support cases or submit a ticket.

Publisher: NVIDIA

License: NVIDIA proprietary

Latest Tag: 26.04

Updated: May 14, 2026 UTC

Compressed Size: 14.93 GB

Multinode Support: No

Multi-Arch Support: Yes

System

signed images

Labels

NSPECT-5LHI-MDN0