ESM-1nv is a protein language model trained on protein sequences. The embeddings from its encoder can be used as features for downstream predictive models.
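For example, per-sequence embeddings can be fed directly into a standard classifier. The sketch below assumes a hypothetical `embed_sequences` helper that returns one fixed-length embedding per sequence (e.g., the mean of the encoder's per-residue hidden states); the actual extraction API depends on the BioNeMo release, and random placeholder embeddings stand in for real model output here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_sequences(sequences):
    """Hypothetical helper: one fixed-length embedding per sequence.
    A real implementation would average the encoder's per-residue
    hidden states (dimension 768); random values are a placeholder."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sequences), 768))

# Toy dataset: sequences labeled with some binary property of interest.
sequences = ["MKTAYIAKQR", "GAVLILLFWG", "MSDNELQQAA", "PPGFSPFRKK"]
labels = [0, 1, 0, 1]

# Embeddings serve as features for a simple downstream predictive model.
X = embed_sequences(sequences)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(embed_sequences(["MKVLAA"])))
```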
ESM-1nv was developed using the BioNeMo framework. It uses the Bidirectional Encoder Representations from Transformers (BERT) architecture and is based on the ESM-1 model [2][3]. Pre-norm layer normalization and GELU activation are used throughout. The model has six layers, 12 attention heads, a hidden dimension of 768, and 44M parameters.
Input sequence length is limited to 512 amino acids.
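For illustration, a minimal PyTorch sketch of an encoder with the stated dimensions (six layers, 12 heads, hidden size 768, pre-norm, GELU) follows. The feed-forward width (4× hidden), the vocabulary size, and the learned positional embeddings are assumptions; this is not the BioNeMo implementation:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 33   # assumption: amino-acid plus special tokens; actual vocab differs
MAX_LEN = 512     # input sequence length limit from the model card
HIDDEN = 768
N_LAYERS = 6
N_HEADS = 12

class TinyEsmEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.pos = nn.Embedding(MAX_LEN, HIDDEN)  # assumption: learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=HIDDEN,
            nhead=N_HEADS,
            dim_feedforward=4 * HIDDEN,  # assumption: standard 4x expansion
            activation="gelu",           # GELU activation, per the model card
            norm_first=True,             # pre-norm layer normalization
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        return self.encoder(x)  # (batch, seq_len, 768) per-residue embeddings

model = TinyEsmEncoder()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly in the stated 44M range
```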
ESM-1nv was trained with data parallelism on 176 A100 GPUs for 420 epochs (approximately 349,500 iterations) with a micro-batch size of 370 sequences per GPU. The learning rate followed a cosine annealing schedule with a minimum learning rate of 2.0e-05, 500 warmup steps, and 50,000 constant steps. The Fused Adam optimizer was used with β1 = 0.9, β2 = 0.98, and weight decay = 0.01. Dropout was set to 0.1 during training. Training then continued on 144 A100 GPUs for an additional 600 epochs, for a total of 957,610 iterations. The weights of the last 47 checkpoints were averaged to produce the final model.
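The schedule and checkpoint averaging can be sketched as follows. The peak learning rate is not stated in the model card and is assumed here, and plain `torch.optim.Adam` stands in for the Fused Adam implementation used in training:

```python
import math
import torch

PEAK_LR = 2e-4     # assumption: the peak learning rate is not stated in the card
MIN_LR = 2e-5      # minimum learning rate from the model card
WARMUP = 500       # linear warmup steps
CONSTANT = 50_000  # steps held at the peak rate before annealing
TOTAL = 349_500    # approximate iterations in the first training phase

def lr_at(step):
    """Linear warmup -> constant plateau -> cosine annealing down to MIN_LR."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    if step < WARMUP + CONSTANT:
        return PEAK_LR
    progress = min((step - WARMUP - CONSTANT) / (TOTAL - WARMUP - CONSTANT), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder model parameters
optimizer = torch.optim.Adam(params, lr=PEAK_LR, betas=(0.9, 0.98),
                             weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: lr_at(step) / PEAK_LR)

def average_checkpoints(state_dicts):
    """Checkpoint averaging: the final weights are the element-wise mean
    of the saved checkpoints (the last 47, per the model card)."""
    return {k: sum(sd[k].float() for sd in state_dicts) / len(state_dicts)
            for k in state_dicts[0]}
```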
UniRef50 (release 05/2022) was used for training [4]. The reference sequence for each cluster was selected, resulting in approximately 52M protein sequences. The sequences were randomly split, with 5K sequences for validation, 1M for test, and the remainder for training. Protein sequences longer than 1024 amino acids were truncated, and data masking was performed as described previously [3]: input tokens were randomly masked at a rate of 15%, and the model was trained to predict the masked tokens by minimizing a categorical cross-entropy loss [3].
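A minimal sketch of this masked-language-modeling objective is shown below. The mask token id and vocabulary size are assumptions, and the full BERT scheme's 80/10/10 replacement rule is omitted for brevity:

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 33   # assumption: depends on the actual tokenizer
MASK_ID = 32      # assumption: the actual mask token id depends on the tokenizer
MASK_PROB = 0.15  # masking rate from the model card

def mask_tokens(token_ids):
    """Randomly mask 15% of input tokens. Unmasked positions get label -100
    so the loss is computed only over the masked positions."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < MASK_PROB
    labels[~mask] = -100                 # ignored by cross_entropy below
    masked = token_ids.clone()
    masked[mask] = MASK_ID
    return masked, labels

token_ids = torch.randint(0, MASK_ID, (2, 10))   # toy batch of token ids
masked, labels = mask_tokens(token_ids)

# Placeholder for the encoder + LM head output: (batch, seq_len, vocab).
logits = torch.randn(2, 10, VOCAB_SIZE)
loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1),
                       ignore_index=-100)
print(loss.item())
```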
Learn more about ESM-1nv here.
ESM-1nv is provided under a permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.