This dataset was generated by following the preparation steps for the UniRef50 and UniRef90 databases used for pre-training ESM2. UniProt Reference Cluster (UniRef) databases provide clustered sets of sequences from UniProtKB. UniRef50 is built by clustering UniRef90 seed sequences that have at least 50% sequence identity to, and 80% overlap with, the longest sequence in the cluster. The April 2021 (04/2021) release was used to prepare the pre-training dataset. The representative sequence for each cluster was selected, resulting in approximately 49M protein sequences. A random subset of 250K sequences was held out for validation after training. The remaining sequences were filtered to remove any training sequences with high sequence similarity to the validation dataset, resulting in 49,425,807 training sequences. These training sequences were randomly split into 3,400 validation sequences, 1M test sequences, and a train set containing the remainder. A corresponding set of UniRef90 cluster members for the train sequences was also curated to enable sampling during training. UniRef90 cluster members were augmented with sequence data where it was available in the UniRef100 representative sequence set.
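The random split and the cluster-member sampling described above can be illustrated with a minimal Python sketch. This is not the preprocessing code used to build the dataset: the function names `split_ids` and `sample_member`, the seed handling, and the assumed mapping from a UniRef50 cluster ID to a list of UniRef90 member IDs are all hypothetical and serve only to show the idea.

```python
import random


def split_ids(ids, n_val, n_test, seed=0):
    """Randomly partition IDs into train/val/test with fixed val and test sizes.

    Hypothetical helper for illustration; not part of BioNeMo.
    """
    if n_val + n_test > len(ids):
        raise ValueError("validation + test sizes exceed the number of sequences")
    rng = random.Random(seed)
    shuffled = list(ids)          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]   # everything not held out
    return train, val, test


def sample_member(cluster_id, members_by_cluster, rng):
    """Pick one UniRef90 member ID for a UniRef50 cluster.

    Assumes members_by_cluster maps a cluster ID to a list of member IDs.
    Falls back to the cluster ID itself when no members are recorded.
    """
    members = members_by_cluster.get(cluster_id)
    return rng.choice(members) if members else cluster_id


if __name__ == "__main__":
    # Toy data; the real split uses 3,400 validation and 1M test sequences.
    ids = [f"UniRef50_{i}" for i in range(1000)]
    train, val, test = split_ids(ids, n_val=34, n_test=100, seed=42)
    print(len(train), len(val), len(test))

    members = {train[0]: ["UniRef90_A", "UniRef90_B"]}
    print(sample_member(train[0], members, random.Random(0)))
```

Fixing the seed keeps the partition reproducible, and sampling a cluster member per training step is one plausible way to use the curated UniRef90 members; the actual sampling scheme may differ.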
You can use the BioNeMo Framework to run ESM-2nv training with this dataset. Read more about the BioNeMo Framework in its documentation.
This dataset is redistributed under the same license as the UniRef databases: Creative Commons Attribution 4.0 International (CC BY 4.0).