This resource contains the Consistency Distilled Dataset used to train the Proteina-Atomistica model.
Dataset Description:
The Consistency Distilled Synthetic Protein Database is a curated collection of high-quality, codesignable protein sequence–structure pairs designed to overcome limitations of datasets derived from the AlphaFold Database (AFDB). AFDB contains sequence–structure pairs that state-of-the-art folding models such as ESMFold, AlphaFold2, or Boltz-1 cannot reproduce, which indicates that many of its sequences may not fold into their predicted structures. To address this, the Proteina-Atomistica Consistency Distilled Database was built from scratch: ProteinMPNN was used to generate multiple synthetic sequences for each Foldseek AFDB cluster representative structure, and these sequences were then re-folded to obtain fully atomistic, self-consistent models. The result combines the structural diversity of the original AFDB with the consistency provided by inverse folding and re-folding.
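The inverse-folding and re-folding loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: `distill_structure`, its callable arguments, the number of sequences per structure, the 2.0 Å cutoff, and the rule of keeping the lowest-RMSD design are all hypothetical. The toy stand-ins at the bottom replace ProteinMPNN, the folding model, and the C-alpha RMSD computation so that the sketch runs without any model installed.

```python
from typing import Any, Callable, Optional, Sequence


def distill_structure(
    backbone: Any,
    design_sequences: Callable[[Any, int], Sequence[str]],
    refold: Callable[[str], Any],
    ca_rmsd: Callable[[Any, Any], float],
    num_sequences: int = 8,  # assumed value, not taken from the dataset card
    max_rmsd: float = 2.0,   # assumed cutoff in angstroms, not taken from the dataset card
) -> Optional[dict]:
    """Return the most self-consistent refolded design, or None if none passes."""
    best = None
    for seq in design_sequences(backbone, num_sequences):
        model = refold(seq)              # re-fold the designed sequence
        rmsd = ca_rmsd(backbone, model)  # compare against the original backbone
        if best is None or rmsd < best["rmsd_ca"]:
            best = {"pmpnn_seq": seq, "model": model, "rmsd_ca": rmsd}
    if best is not None and best["rmsd_ca"] <= max_rmsd:
        return best
    return None


# Toy stand-ins (hypothetical): strings play the role of structures.
def toy_design(backbone, n):
    return ["AAAA", "AAAB", "ABBB"][:n]


def toy_refold(seq):
    return seq


def toy_distance(reference, model):
    # Count of mismatched positions, used here in place of a real RMSD.
    return float(sum(a != b for a, b in zip(reference, model)))


result = distill_structure("AAAB", toy_design, toy_refold, toy_distance, num_sequences=3)
print(result)
```

Passing the design, folding, and scoring steps in as callables keeps the sketch independent of any particular model API; real wrappers around an inverse-folding model and a structure predictor would need to be supplied by the user.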
This dataset is ready for commercial/non-commercial use.
Dataset Owner(s):
NVIDIA Corporation
Dataset Creation Date:
5/15/2025
License/Terms of Use:
CC-BY-4.0
Intended Usage:
This dataset is intended for protein designers and researchers who wish to scale their protein AI models for predicting structures, sequences, and properties.
Dataset Quantification:
Record count: 455,473 protein structures
Feature count: 6 metadata features for each structure (id, length, plddt_avg, plddt_std, rmsd_ca, pmpnn_seq)
Measurement of Total Data Storage: 10.96 GB
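As an illustration of how the six metadata features might be used, the sketch below parses them from CSV text and filters records by confidence and self-consistency. It rests on several assumptions: the on-disk format of the metadata is not specified here, so the CSV layout is hypothetical; the sample rows are made up; `plddt_avg` is assumed to be on a 0 to 100 scale and `rmsd_ca` in angstroms; and the default thresholds in the hypothetical `filter_records` helper are illustrative only.

```python
import csv
import io

# Made-up rows using the six documented feature names (not real dataset entries).
SAMPLE_CSV = """id,length,plddt_avg,plddt_std,rmsd_ca,pmpnn_seq
example_0,5,91.2,3.1,0.8,MKTAY
example_1,4,62.5,9.7,3.4,GSHM
example_2,6,85.0,4.0,1.9,MAEELK
"""


def load_metadata(handle):
    """Parse metadata rows and convert the numeric columns to int/float."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "id": row["id"],
            "length": int(row["length"]),
            "plddt_avg": float(row["plddt_avg"]),
            "plddt_std": float(row["plddt_std"]),
            "rmsd_ca": float(row["rmsd_ca"]),
            "pmpnn_seq": row["pmpnn_seq"],
        })
    return records


def filter_records(records, min_plddt=80.0, max_rmsd=2.0, max_length=None):
    """Keep records that meet the (assumed) pLDDT, RMSD, and length limits."""
    kept = []
    for rec in records:
        if rec["plddt_avg"] < min_plddt or rec["rmsd_ca"] > max_rmsd:
            continue
        if max_length is not None and rec["length"] > max_length:
            continue
        kept.append(rec)
    return kept


records = load_metadata(io.StringIO(SAMPLE_CSV))
kept = filter_records(records)
print([rec["id"] for rec in kept])
```

With the real metadata, the `io.StringIO` handle would be replaced by an open file, and the thresholds would be chosen to suit the downstream training task.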
Reference(s):
https://arxiv.org/pdf/2512.01976
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns at https://app.intigriti.com/programs/nvidia/nvidiavdp/detail.