ESM-2nv is a protein language model trained on protein sequences. The embeddings from its encoder can be used as features for predictive models.
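As a minimal sketch of this feature-extraction use case, the snippet below pulls per-sequence embeddings from the public HuggingFace ESM-2 650M checkpoint (`facebook/esm2_t33_650M_UR50D`) using the `transformers` library rather than the BioNeMo inference path; the mean-pooling step is an illustrative assumption, not part of the ESM-2nv release.

```python
# Sketch: extract fixed-size sequence embeddings from the public ESM-2 650M
# checkpoint with HuggingFace transformers (not the BioNeMo/NeMo inference API).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t33_650M_UR50D"  # public ESM-2 650M checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSILVTRPSPAGEEL"]
batch = tokenizer(sequences, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Mean-pool residue embeddings (masking padding) to obtain one feature
# vector per sequence for a downstream predictive model.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (2, 1280) for the 650M model
```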
ESM-2nv was developed using the BioNeMo framework by converting the ESM-2 model checkpoints from HuggingFace into the NeMo framework [2]. The underlying ESM-2 model uses the Bidirectional Encoder Representations from Transformers (BERT) architecture and improves upon the ESM-1b model [2], [3] through architectural changes, differences in embeddings, and custom transformations described below.
ESM-2nv models are in principle compatible with ESM-2 checkpoints, meaning that ESM-2 public checkpoints from HuggingFace can be loaded into ESM-2nv architectures of similar size. The 650M model has 33 layers, 20 attention heads, a hidden space dimension of 1280, and contains 650M parameters.
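As a quick sanity check of these dimensions, the configuration of the public 650M checkpoint can be inspected directly from HuggingFace; this uses `transformers` and is independent of the BioNeMo conversion itself.

```python
# Sketch: confirm the 650M architecture hyperparameters from the public
# HuggingFace ESM-2 checkpoint configuration.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("facebook/esm2_t33_650M_UR50D")
print(cfg.num_hidden_layers)    # 33 layers
print(cfg.num_attention_heads)  # 20 attention heads
print(cfg.hidden_size)          # 1280-dimensional hidden space
```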
ESM-2nv models are HuggingFace ESM-2 model checkpoints that have been converted into NeMo-optimized BioNeMo model checkpoints. ESM-2nv achieves the same performance benchmarks as ESM-2 but is optimized to provide faster training and inference on NVIDIA GPUs. ESM-2nv enables customization of pre-training and inference parameters through YAML configuration files at the time of model instantiation. A complete, curated pre-training dataset is provided with the BioNeMo framework release of ESM-2nv to facilitate pre-training from scratch.
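NeMo-based frameworks typically manage such YAML files with Hydra/OmegaConf, so the sketch below shows only the general pattern of loading a config and overriding fields at instantiation time. The file path and keys (`conf/esm2nv_650M.yaml`, `trainer.devices`, `model.micro_batch_size`) are illustrative assumptions, not the exact ESM-2nv configuration schema.

```python
# Sketch of the general NeMo/Hydra-style pattern for customizing a YAML
# config before instantiation. File name and keys are illustrative
# assumptions, not the exact ESM-2nv schema.
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/esm2nv_650M.yaml")   # hypothetical config path
overrides = OmegaConf.from_dotlist([
    "trainer.devices=8",           # hypothetical key: number of GPUs
    "model.micro_batch_size=16",   # hypothetical key: per-GPU batch size
])
cfg = OmegaConf.merge(cfg, overrides)
print(OmegaConf.to_yaml(cfg))
```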
Unlike the ESM-2 pre-training data, the curated pre-training dataset provided with the ESM-2nv release contains hits to de novo proteins, because sequences in UniRef100, UniRef90, and UniRef50 with high sequence similarity to a non-public set of 81 de novo proteins [2] are not filtered out.
ESM-2nv can be trained from scratch using the provided dataset and code. The ESM-2nv 650M checkpoints in the current release have been converted from the models provided by Lin et al. [2] and made available on HuggingFace (650M model).
UniRef50 release 04/2021 was used for training [4]. The representative sequence for each cluster was selected, resulting in approximately 49M protein sequences. The sequences were randomly split, with 250K sequences held out for validation and the remainder used for training. All training sequences that matched a validation sequence at 50% sequence identity were removed from the training set, resulting in 49,425,807 training sequences. A sampling dataset of UniRef90 sequences was created from the UniRef90 representatives and cluster members that had complete sequences available in UniRef90 or UniRef100, filtered to UniRef90 sequences belonging to clusters in the UniRef50 training set. This UniRef90 dataset was combined with the filtered UniRef50 training dataset to create the sampling FASTA file. A mapping file was created to enable rapid replacement of each UniRef50 sequence with a sequence sampled uniformly from the corresponding records in the sampling FASTA file during each training update. The UniRef50 training FASTA was sorted in the order of occurrence of records in column 1 of the mapping file, and the UniRef90+UniRef50 sampling FASTA file was sorted in the order of occurrence of records in column 2 of the mapping file.
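To make the sampling scheme concrete, the sketch below resamples one cluster member per UniRef50 training record at the start of each training pass. The two-column mapping-file layout and the function names are assumptions for illustration, not the BioNeMo data-loading code.

```python
# Sketch of the per-update cluster resampling described above. The mapping
# file is assumed to be tab-separated: column 1 = UniRef50 record ID,
# column 2 = comma-separated IDs of the corresponding sampling-FASTA records.
# This is an illustration, not the BioNeMo data-loading implementation.
import random
from typing import Dict, List


def load_mapping(path: str) -> Dict[str, List[str]]:
    mapping = {}
    with open(path) as fh:
        for line in fh:
            uniref50_id, members = line.rstrip("\n").split("\t")
            mapping[uniref50_id] = members.split(",")
    return mapping


def sample_training_ids(mapping: Dict[str, List[str]], seed: int) -> List[str]:
    # For every UniRef50 training record, draw one record uniformly from the
    # UniRef90+UniRef50 sampling records that belong to the same cluster.
    rng = random.Random(seed)
    return [rng.choice(members) for members in mapping.values()]


# Hypothetical usage: resample once per training pass and look the chosen
# IDs up in the pre-sorted sampling FASTA.
# mapping = load_mapping("uniref50_to_sampling.tsv")
# epoch_ids = sample_training_ids(mapping, seed=epoch)
```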
Protein sequences longer than 1024 amino acids were cropped to 1023 amino acids from the sequence start {cite:p}`devlin2018bert`.
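A minimal sketch of that cropping rule, applied to raw amino-acid strings before tokenization; the thresholds come from the sentence above, and everything else is illustrative.

```python
# Sketch: crop sequences longer than 1024 residues to their first 1023
# residues, as described above, before tokenization.
def crop_sequence(seq: str, max_len: int = 1024, crop_to: int = 1023) -> str:
    return seq[:crop_to] if len(seq) > max_len else seq


print(len(crop_sequence("A" * 2000)))  # 1023
print(len(crop_sequence("A" * 500)))   # 500
```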
A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.