Linux / arm64
Linux / amd64
NVIDIA NeMo™ Curator is a GPU-accelerated, open-source framework for curating data for generative AI models. It is designed for fast, scalable dataset preparation for use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). By using GPUs with Dask and RAPIDS, NeMo Curator shortens curation time and helps prepare high-quality tokens that accelerate model convergence.
For enterprises building AI solutions, NeMo Curator provides a production-grade, scalable, end-to-end data curation platform that includes comprehensive text and image processing capabilities as well as synthetic data generation pipelines and enterprise support to streamline adoption. Organizations can now efficiently curate massive datasets for their AI operations, enhancing data quality, ensuring compliance, and ultimately driving greater value from their AI investments.
At the core of NeMo Curator is the combination of distributed data processing and GPU acceleration. NeMo Curator uses GPU compute and memory across nodes to speed up data curation workflows. Built on Dask and RAPIDS, it supports multi-node and multi-GPU data processing, which significantly reduces curation time. NeMo Curator incorporates curation techniques in the following areas:
Text Curation Capabilities
Image Curation Capabilities
Synthetic Data Generation
Parallel Curation for Machine Translation Capabilities
The NeMo Curator container supports large-scale data curation for enterprise workloads. The platform supports comprehensive data processing pipelines for text and multimodal datasets, including advanced filtering, deduplication, classification, and quality assessment. In addition to traditional data processing, NeMo Curator provides synthetic data generation capabilities that use state-of-the-art language models to create high-quality training data.
The NeMo Curator container offers an array of techniques to prepare and refine datasets for specialized use cases, including quality filtering, safety classification, domain-specific curation, semantic deduplication, PII removal, and synthetic data augmentation. These processing options provide the flexibility needed to meet varying business requirements for AI model training.
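To make a few of the techniques named above concrete, here is a minimal plain-Python sketch of a heuristic quality filter, exact deduplication, and email redaction as a simple stand-in for PII removal. It does not use the NeMo Curator API and runs on the CPU only; the function names, the five-word threshold, and the email pattern are all illustrative assumptions, not part of the framework.

```python
import hashlib
import re

# Illustrative email pattern only; real PII removal covers many more entity types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def passes_quality_filter(text, min_words=5):
    """Heuristic quality filter: keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def normalized_hash(text):
    """Hash the lowercased, whitespace-collapsed text for exact deduplication."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def redact_emails(text, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def curate(documents, min_words=5):
    """Filter, deduplicate, and redact a list of strings (hypothetical helper)."""
    seen = set()
    kept = []
    for text in documents:
        if not passes_quality_filter(text, min_words):
            continue  # dropped by the quality filter
        digest = normalized_hash(text)
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(redact_emails(text))
    return kept
```

Calling curate on a list of strings returns the surviving documents in their original order. The sketch only illustrates the shape of such a pipeline; semantic deduplication and safety classification rely on models and are not represented here.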
NeMo Curator delivers exceptional performance at scale:
Refer to the NVIDIA NeMo Curator documentation for step-by-step instructions on how to get started quickly with the NeMo Curator framework.
PyPI Installation:
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]"
Source Installation:
git clone https://github.com/NVIDIA-NeMo/Curator.git
pip install --extra-index-url https://pypi.nvidia.com "./Curator[all]"
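After either installation method, a quick check like the following can confirm that the package is visible to the current Python environment. This is a sketch that uses only the standard library. The distribution name nemo-curator comes from the pip command above, while the import name nemo_curator is an assumption here, and both helper functions are hypothetical.

```python
from importlib import metadata, util


def installed_version(dist_name="nemo-curator"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module_name="nemo_curator"):
    """Return True if a top-level module can be located without importing it."""
    # The default module name is an assumption, not confirmed by this page.
    return util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("nemo-curator version:", installed_version())
    print("nemo_curator importable:", is_importable())
```

If the version prints as None, the installation did not land in the active environment; checking that pip and python refer to the same interpreter is a reasonable first step.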
NeMo Curator GitHub: You can check out the project directly on GitHub at NVIDIA-NeMo/Curator.
For developers who want access to features that have been implemented but are not yet included in a major release, please use our GitHub repo.
The NeMo Curator framework is actively developed with regular updates. If you encounter problems, please report the issue on GitHub and specify the version and container details.
Questions? See the current discussions and submit a question.
Report a bug? You can report a bug here.
More detailed documentation is available in the NeMo Curator User Guide.
NeMo Curator is licensed under the Apache License 2.0. By pulling and using the container, you accept the terms and conditions of this license.
This container may contain additional third-party software components. Please refer to the container documentation for complete license information.