NeMo Curator

Description
NVIDIA NeMo™ Curator supports enterprise development of generative AI models with GPU-accelerated data curation, comprehensive text and image processing, and scalable synthetic data generation capabilities.
Publisher
NVIDIA
Latest Tag
25.07
Modified
July 25, 2025
Compressed Size
11.93 GB
Multinode Support
No
Multi-Arch Support
Yes

What is the NeMo Curator Container?

NVIDIA NeMo™ Curator is a GPU-accelerated, open-source framework for curating the data used to train generative AI models. It is designed for fast, scalable dataset preparation for use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). By running curation on GPUs with Dask and RAPIDS, NeMo Curator shortens curation time and helps produce the high-quality tokens that speed up model convergence.

For enterprises building AI solutions, NeMo Curator provides a production-grade, scalable, end-to-end data curation platform that includes comprehensive text and image processing capabilities as well as synthetic data generation pipelines and enterprise support to streamline adoption. Organizations can now efficiently curate massive datasets for their AI operations, enhancing data quality, ensuring compliance, and ultimately driving greater value from their AI investments.

What You Get with NVIDIA NeMo Curator Container

NeMo Curator combines distributed data processing with GPU acceleration. Built on Dask and RAPIDS, it spreads work across the GPUs and memory of multiple nodes, enabling multi-node, multi-GPU data processing that significantly reduces curation time. It includes the following curation techniques:

Text Curation Capabilities

  • Download and Extraction (Common Crawl, Wikipedia, ArXiv)
  • Language Identification and Text Cleaning
  • Heuristic Filters (30+)
  • Classifier Filters (Domain, Quality, Safety, Prompt Task, Prompt Complexity, Educational Quality Classification)
  • GPU-Accelerated Deduplication (Exact, Fuzzy via MinHash LSH, Semantic)
  • Downstream-task Decontamination
  • Personally Identifiable Information (PII) Redaction
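
To make the "Fuzzy via MinHash LSH" item above more concrete, here is a minimal CPU-only sketch of the general technique in plain Python. It is not the NeMo Curator API or its GPU implementation; the function names, the 3-word shingles, and the 64-hash / 16-band settings are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item; the seed selects the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value per hash function over all shingles."""
    items = shingles(text)
    if not items:
        return [0] * num_hashes
    return [min(seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures: dict[str, list[int]], bands: int = 16) -> dict[tuple, list[str]]:
    """Group document ids whose signatures agree on at least one band."""
    if not signatures:
        return {}
    rows = len(next(iter(signatures.values()))) // bands
    buckets: dict[tuple, list[str]] = {}
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    # Only buckets with more than one document yield duplicate candidates.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


# Hypothetical documents: "a" and "b" differ in one word, "c" is unrelated.
docs = {
    "a": "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn every single day",
    "b": "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn every single morning",
    "c": "gpu accelerated data curation pipelines filter classify and deduplicate large text corpora",
}
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
candidates = lsh_buckets(signatures)
```

Banding trades recall for speed: documents are compared only when they share a bucket, so most unrelated pairs are never scored. Exact deduplication is the simpler case of hashing the full text and grouping on identical hashes.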

Image Curation Capabilities

  • Embedding Creation and Classification (Aesthetic, NSFW)
  • GPU-Accelerated Semantic Deduplication
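
As a rough illustration of semantic deduplication, the sketch below drops items whose embeddings are nearly parallel to one already kept. It is a simplified greedy variant written in plain Python, not NeMo Curator's implementation; the embedding values, the `semantic_dedup` name, and the 0.95 threshold are assumptions for the example.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Keep an item only if it is not too similar to any item kept so far."""
    kept: list[str] = []
    for item_id, vector in embeddings.items():
        if all(cosine(vector, embeddings[other]) < threshold for other in kept):
            kept.append(item_id)
    return kept


# Hypothetical 2-D embeddings; real image embeddings have hundreds of dimensions.
embeddings = {
    "img_a": [1.0, 0.0],
    "img_b": [0.99, 0.05],  # nearly the same direction as img_a
    "img_c": [0.0, 1.0],
}
unique_ids = semantic_dedup(embeddings)
```

This greedy loop compares every item against every kept item, which is quadratic in the worst case. At scale, a common approach is to cluster the embeddings first and compare only within each cluster.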

Synthetic Data Generation

  • Pre-built Pipelines for LLM Fine-tuning/Pre-training
    • Prompt Generation (Open Q&A, Closed Q&A, Writing, Math, Coding)
    • Two-Turn Prompt Generation
    • Dialogue Generation
    • Entity Classification
    • Rewrite to Wikipedia Style
    • Generate Diverse QA Pairs
    • Generate Knowledge List
    • Distill Document
    • Extract Knowledge
  • Pre-built Pipelines for Text Retrieval
    • Q&A Evaluation Corpus Creation
    • Q&A Fine-tuning Corpus Creation (with Hard Negative Mining)
    • Model-Filters (Easiness, Answerability)
  • Tool Support (OpenAI API Compatible, Asynchronous Generation, Customizable Prompt Templates)
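
The combination of customizable prompt templates and asynchronous generation listed above can be pictured with the following sketch. It uses only the Python standard library and a stub in place of a real model endpoint; `PROMPT`, `fake_completion`, and `generate_all` are hypothetical names, not part of NeMo Curator.

```python
import asyncio
from string import Template

# A customizable prompt template; the wording is an illustrative assumption.
PROMPT = Template("Write $n open-ended questions about the topic: $topic")


async def fake_completion(prompt: str) -> str:
    """Stand-in for a request to an OpenAI API compatible endpoint."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[model output for: {prompt}]"


async def generate_all(topics: list[str], n: int = 3, max_concurrency: int = 4) -> list[str]:
    """Issue one request per topic, with a cap on requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(topic: str) -> str:
        async with semaphore:
            return await fake_completion(PROMPT.substitute(n=n, topic=topic))

    # gather preserves input order, so results line up with topics.
    return await asyncio.gather(*(generate_one(topic) for topic in topics))


results = asyncio.run(generate_all(["astronomy", "cooking"]))
```

Swapping `fake_completion` for a real client call would leave the concurrency logic unchanged, which is the main reason to keep the template, the request, and the scheduling separate.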

Parallel Curation for Machine Translation Capabilities

  • Load/Write Bitext Files
  • Heuristic Filtering (Histogram, Length Ratio)
  • Classifier Filtering (Comet, Cometoid)
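
A length-ratio heuristic for parallel text can be sketched as follows. This is a generic word-count version, assuming whitespace tokenization and a maximum ratio of 3.0; the function names and the threshold are illustrative and do not reflect NeMo Curator's actual filter parameters.

```python
def length_ratio(source: str, target: str) -> float:
    """Ratio of the longer side to the shorter side, measured in words."""
    source_len = len(source.split())
    target_len = len(target.split())
    if min(source_len, target_len) == 0:
        return float("inf")  # an empty side can never be a valid translation
    return max(source_len, target_len) / min(source_len, target_len)


def filter_bitext(pairs: list[tuple[str, str]], max_ratio: float = 3.0) -> list[tuple[str, str]]:
    """Keep sentence pairs whose lengths are plausibly translations of each other."""
    return [(src, tgt) for src, tgt in pairs if length_ratio(src, tgt) <= max_ratio]


# Hypothetical English/German pairs.
pairs = [
    ("Hello world", "Hallo Welt"),
    ("Yes", "Ja, das ist eine sehr lange und unpassende Übersetzung"),
    ("", "leer"),
]
kept_pairs = filter_bitext(pairs)
```

Word counts are a crude proxy for languages without whitespace word boundaries, where character counts or a tokenizer would be a better basis for the ratio.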

The NeMo Curator container supports large-scale data curation for enterprises. It provides data processing pipelines for text and multimodal datasets, including filtering, deduplication, classification, and quality assessment. Beyond traditional data processing, NeMo Curator also generates synthetic training data using language models.

The NeMo Curator container offers a range of techniques to prepare and refine datasets for specialized use cases, including quality filtering, safety classification, domain-specific curation, semantic deduplication, PII removal, and synthetic data augmentation. These options give teams the flexibility to meet varying requirements for AI model training.

Performance & Scale

NeMo Curator delivers exceptional performance at scale:

  • Deduplicated 1.96 trillion tokens in 0.5 hours using 32 NVIDIA H100 GPUs (RedPajama V2 scale)
  • Up to 80% data reduction with significant improvements in downstream model performance
  • Efficient Common Crawl curation: from 2.8TB raw to 0.52TB high-quality data in under 38 hours on 30 CPU nodes
  • 16x faster processing compared to alternative libraries for fuzzy deduplication
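
For a sense of scale, the figures above can be turned into rates with simple arithmetic. The snippet uses only the numbers quoted in the list; the derived values are back-of-the-envelope calculations, not published benchmarks.

```python
# Deduplication run: 1.96 trillion tokens in 0.5 hours on 32 GPUs.
tokens = 1.96e12
hours = 0.5
gpus = 32

tokens_per_hour = tokens / hours                # aggregate throughput
tokens_per_gpu_hour = tokens / (hours * gpus)   # per-GPU throughput

# Common Crawl curation: 2.8 TB raw down to 0.52 TB curated.
raw_tb = 2.8
curated_tb = 0.52
reduction = 1 - curated_tb / raw_tb             # fraction of data removed

print(f"{tokens_per_hour:.3e} tokens/hour across all GPUs")
print(f"{tokens_per_gpu_hour:.3e} tokens per GPU-hour")
print(f"{reduction:.1%} of the raw Common Crawl data removed")
```

The Common Crawl example works out to a reduction of roughly 81%, in line with the "up to 80%" figure quoted above.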

Getting Started With NVIDIA NeMo Curator

Refer to the NVIDIA NeMo Curator documentation for step-by-step instructions on how to get started quickly with the NeMo Curator framework.

Installation Options

PyPI Installation:

pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]"

Source Installation:

git clone https://github.com/NVIDIA-NeMo/Curator.git
pip install --extra-index-url https://pypi.nvidia.com "./Curator[all]"
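
After either installation route, a quick way to confirm that the package landed in the active environment is to query its installed version. The sketch below assumes the distribution is registered under the name used in the pip command, `nemo-curator`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "nemo-curator") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("nemo-curator is not installed in this environment")
else:
    print(f"nemo-curator {version} is installed")
```

Querying package metadata avoids importing the library itself, so the check stays fast and works even on a machine without a GPU.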

NeMo Curator GitHub: You can check out the project directly on GitHub at NVIDIA-NeMo/Curator.

Developer Container

For developers who want access to features that have been implemented but are not yet included in a major release, please use our GitHub repo.

The NeMo Curator framework is actively developed with regular updates. If you encounter problems, please report the issue on GitHub and specify the version and container details.

Questions? See the current discussions and submit a question.

Found a bug? Report it as an issue on GitHub.

Technical Blogs

  • Curating Trillion-Token Datasets: Introducing NVIDIA NeMo Data Curator
  • Scale and Curate High-Quality Datasets for LLM Training with NVIDIA NeMo Curator
  • Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator
  • Curating Custom Datasets for LLM Parameter-Efficient Fine-Tuning with NVIDIA NeMo Curator
  • Streamlining Data Processing for Domain Adaptive Pretraining with NVIDIA NeMo Curator

Documentation

More detailed documentation is available in the NeMo Curator User Guide.

License

NeMo Curator is licensed under the Apache License 2.0. By pulling and using the container, you accept the terms and conditions of this license.

This container may contain additional third-party software components. Please refer to the container documentation for complete license information.