ESM-1nv

Description
A NeMo Megatron BERT-based model trained on protein sequences.
Publisher
-
Latest Version
1.0
Modified
November 27, 2023
Size
154.35 MB

Model Overview

ESM-1nv is a model that has been trained on protein sequences. The embeddings from its encoder can be used as features for predictive models.

Model Architecture

ESM-1nv was developed using the BioNeMo framework. The model uses the Bidirectional Encoder Representations from Transformers (BERT) architecture and is based on the ESM-1 model [2][3]. Pre-norm layer normalization and GELU activation are used throughout. The model has 6 layers, 12 attention heads, a hidden dimension of 768, and contains 44M parameters.

Input sequence length is limited to 512 amino acids.
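
As an illustration of the hyperparameters above, the sketch below builds a plain BERT encoder with the stated sizes using the Hugging Face transformers library and counts its parameters. This is not the NeMo Megatron implementation (which, among other differences, uses pre-norm layer normalization), and the vocabulary size and feed-forward width are assumed placeholders.

```python
# A minimal sketch, NOT the actual NeMo Megatron ESM-1nv implementation:
# a vanilla BERT encoder with the hyperparameters stated above, used only to
# illustrate the approximate parameter count. The vocabulary size (amino acids
# plus special tokens) and the 4x feed-forward width are assumptions, and
# standard BERT uses post-norm rather than the pre-norm used by ESM-1nv.
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30,                 # assumed protein alphabet + special tokens
    hidden_size=768,               # hidden dimension
    num_hidden_layers=6,           # 6 transformer layers
    num_attention_heads=12,        # 12 attention heads
    intermediate_size=3072,        # 4 x hidden, the usual BERT ratio (assumed)
    hidden_act="gelu",             # GELU activation
    max_position_embeddings=512,   # 512 amino-acid input limit
)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # roughly 43-44M
```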

Training

ESM-1nv was trained with data parallelism on 176 A100 GPUs for 420 epochs (approximately 349,500 iterations) using a micro batch size of 370 sequences per GPU. A cosine annealing learning-rate schedule was used, with a minimum learning rate of 2.0e-05, 500 warmup steps, and 50,000 constant steps. Fused Adam optimization was used with parameters β1 = 0.9, β2 = 0.98, and weight decay = 0.01. Dropout was set to 0.1 during training. Training was then continued on 144 A100 GPUs for an additional 600 epochs, resulting in a total of 957,610 iterations. The weights of the last 47 checkpoints were averaged to produce the final model.
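
For concreteness, the sketch below reproduces the stated optimizer settings and an approximate warmup/cosine/constant learning-rate schedule. The peak learning rate is not stated in this card and is a placeholder, the exact semantics of the constant-step phase may differ from the schedule used in training, and torch.optim.AdamW stands in for the fused Adam implementation.

```python
import math
import torch

# Values taken from this card, except PEAK_LR, which is an assumed placeholder.
PEAK_LR = 2.0e-4
TOTAL_STEPS = 349_500    # first training stage
WARMUP_STEPS = 500
CONSTANT_STEPS = 50_000
MIN_LR = 2.0e-5

def lr_at_step(step: int) -> float:
    """Approximate warmup -> cosine decay -> constant schedule."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    decay_end = TOTAL_STEPS - CONSTANT_STEPS
    if step >= decay_end:
        return MIN_LR  # held constant for the final steps (assumed placement)
    progress = (step - WARMUP_STEPS) / (decay_end - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# AdamW stands in for the fused Adam optimizer used in training.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=PEAK_LR, betas=(0.9, 0.98), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at_step(step) / PEAK_LR
)
```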

Dataset and Processing

UniRef50 (release 05/2022) was used for training [4]. The reference sequence for each cluster was selected, resulting in approximately 52M protein sequences. The sequences were randomly split, with 5K sequences held out for validation, 1M sequences for test, and the remainder used for training. Protein sequences longer than 1024 amino acids were truncated and data masking was performed as described previously [3]. The input tokens were randomly masked at a rate of 15%, and the model was trained to predict the masked tokens by minimizing a categorical cross-entropy loss [3].
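
A minimal sketch of the masked-language-model objective described above: 15% of input tokens are selected, masked following the standard BERT recipe [3], and the loss is cross-entropy over the masked positions only. The token IDs and vocabulary size are assumed placeholders.

```python
import torch
import torch.nn.functional as F

MASK_TOKEN_ID = 1   # assumed placeholder
VOCAB_SIZE = 30     # assumed placeholder

def mask_for_mlm(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (masked_inputs, labels) for BERT-style masked-token prediction."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = -100                 # only masked positions contribute to the loss
    masked_inputs = input_ids.clone()
    # Standard BERT recipe [3]: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    masked_inputs[replace] = MASK_TOKEN_ID
    randomize = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~replace
    masked_inputs[randomize] = torch.randint(VOCAB_SIZE, input_ids.shape)[randomize]
    return masked_inputs, labels

# Given `logits` of shape (batch, seq_len, VOCAB_SIZE) from the encoder + LM head:
# loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100)
```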

How to Use this Model

  • Compute embeddings from input protein sequences. Embeddings are created for each amino acid in the protein sequence. These embeddings can then be used for downstream tasks such as prediction of secondary structure, subcellular localization, or others, as detailed by the FLIP benchmark tasks [1] (see the sketch after this list).
  • The recommended way to consume this model is inside the BioNeMo Framework container. BioNeMo is a framework for training and deploying large biomolecular language models at supercomputing scale for the discovery and development of therapeutics.
  • Find out more about BioNeMo and its applications here.
  • Click here for example tutorials on how to use the ESM-1nv model in the BioNeMo Framework.
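
The sketch below illustrates the embedding-extraction pattern from the first bullet. The `tokenizer` and `encoder` objects are hypothetical stand-ins with a Hugging Face-like interface, not the BioNeMo inference API; in practice the BioNeMo Framework container provides the ESM-1nv inference entry point.

```python
import torch

def embed_sequences(sequences, tokenizer, encoder):
    """Hypothetical sketch: per-residue and mean-pooled per-sequence embeddings.

    `tokenizer` and `encoder` are assumed to follow a Hugging Face-like
    interface; the real entry point is the ESM-1nv inference wrapper shipped
    in the BioNeMo Framework container.
    """
    batch = tokenizer(sequences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # ignore padding tokens
    per_residue = hidden * mask                               # one embedding per amino acid
    per_sequence = per_residue.sum(dim=1) / mask.sum(dim=1)   # mean pooling -> (batch, 768)
    return per_residue, per_sequence

# The mean-pooled per-sequence vectors can then serve as input features for a
# simple downstream classifier or regressor, e.g. on FLIP benchmark tasks [1].
```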

Suggested Reading

Learn more about ESM-1nv here.

References

  1. Dallago C., Mou J., Johnston K. E., Wittmann B. J., Bhattacharya N., Goldman S., Madani A., Yang K. K., "FLIP: Benchmark tasks in fitness landscape inference for proteins", bioRxiv, 2022, doi.
  2. Rives A., Meier J., Sercu T., Goyal S., Lin Z., Liu J., Guo D., Ott M., Zitnick C. L., Ma J., Fergus R., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences", Proc. Natl. Acad. Sci. U.S.A., 2021, doi.
  3. Devlin J., Chang M.-W., Lee K., Toutanova K., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", ArXiv, 2018, arXiv:1810.04805.
  4. UniProt Consortium, "UniProt: the universal protein knowledgebase in 2021", Nucleic Acids Res., 2021, doi.

License

Apache License 2.0

A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.