ESM-2nv 650M

Description: A 650 million parameter BERT model with model weights converted from HuggingFace into the NeMo Framework.
Publisher: NVIDIA
Latest Version: 1.0
Modified: November 27, 2023
Size: 1.12 GB

Model Overview

ESM-2nv is a model that has been trained on protein sequences. The embeddings from its encoder can be used as features for predictive models.

Model Architecture

ESM-2nv was developed using the BioNeMo framework by converting the ESM-2 model checkpoints from HuggingFace into the NeMo framework [2]. The underlying ESM-2 model uses the Bidirectional Encoder Representations from Transformers (BERT) architecture and improves upon the ESM-1b model [2], [3] with various features, including architectural changes, differences in embeddings, and custom transformations described below.

ESM-2nv models are in principle compatible with ESM-2 checkpoints, meaning that public ESM-2 checkpoints from HuggingFace can be loaded into ESM-2nv architectures of the same size. The 650M model has 33 layers, 20 attention heads, a hidden dimension of 1280, and roughly 650M parameters.
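
As a rough sanity check, the reported parameter count can be reproduced from these hyperparameters. The sketch below assumes a standard BERT block layout; the 4x feed-forward expansion and the vocabulary size are assumptions, not values taken from this model card.

```python
# Rough parameter-count check for the ESM-2nv 650M encoder. The 4x feed-forward
# expansion and the vocabulary size are assumptions (standard BERT/ESM-2 layout),
# not values taken from this model card.

num_layers = 33       # encoder blocks (from the model card)
hidden = 1280         # hidden dimension (from the model card)
ffn = 4 * hidden      # assumed feed-forward width
vocab = 33            # assumed amino-acid/token vocabulary size

attn = 4 * (hidden * hidden + hidden)   # Q, K, V, and output projections with biases
mlp = 2 * hidden * ffn + ffn + hidden   # up/down projections with biases
norms = 2 * 2 * hidden                  # two LayerNorms (weight + bias) per block
per_layer = attn + mlp + norms

total = num_layers * per_layer + vocab * hidden  # plus the token embedding table
print(f"~{total / 1e6:.0f}M parameters")         # ~649M, consistent with "650M"
```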

Improvements in ESM-2nv over ESM-2

ESM-2nv models are HuggingFace ESM-2 model checkpoints that have been converted into NeMo-optimized BioNeMo model checkpoints. ESM-2nv achieves the same performance benchmarks as ESM-2 but is optimized for faster training and inference on NVIDIA GPUs. ESM-2nv enables customization of pre-training and inference parameters through YAML configuration files at the time of model instantiation. A complete, curated pre-training dataset is provided with the BioNeMo framework release of ESM-2nv to facilitate pre-training from scratch.
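
To illustrate this configuration-driven workflow, the sketch below loads a YAML config and overrides a few fields with OmegaConf, the library underlying NeMo-style configs. The file name and keys are hypothetical; the actual schema is defined by the ESM-2nv configs shipped with the BioNeMo Framework container.

```python
# Minimal sketch of customizing pre-training parameters via a YAML config before
# instantiation. The file name and keys below are hypothetical; consult the
# ESM-2nv configs in the BioNeMo Framework container for the actual schema.
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/esm2nv_650M_pretrain.yaml")  # hypothetical config path

# Override a few fields before handing the config to the trainer / model builder.
cfg.trainer.max_steps = 100_000
cfg.model.micro_batch_size = 8
cfg.model.data.max_seq_length = 1024

print(OmegaConf.to_yaml(cfg))
```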

Differences in ESM-2nv compared to ESM-2

Unlike the ESM-2 pre-training data, the curated pre-training dataset provided with the ESM-2nv release may contain hits for de novo proteins: sequences in UniRef100, UniRef90, and UniRef50 with high sequence similarity to a non-public set of 81 de novo proteins [2] were not filtered out.

Training

ESM-2nv can be trained from scratch using the provided dataset and code. The ESM-2nv 650M checkpoint in the current release was converted from the model provided by Lin et al. [2] and made available on HuggingFace (650M model).

Dataset and Processing

UniRef50 release 04/2021 was used for training [4]. The representative sequence for each cluster was selected, resulting in approximately 49M protein sequences. The sequences were randomly split, with 250K sequences held out for validation and the remainder used for training. All training sequences that matched a validation sequence at 50% sequence identity were removed from the training set, resulting in 49,425,807 training sequences.

A sampling dataset of UniRef90 sequences was created from all UniRef90 representatives and cluster members with complete sequences available in UniRef90 or UniRef100, filtered to UniRef90 sequences belonging to clusters in the UniRef50 training set. This UniRef90 dataset was combined with the filtered UniRef50 training dataset to create the sampling fasta file. A mapping file was created to enable rapid replacement of each UniRef50 sequence, at every training update, with a sequence sampled uniformly from the corresponding records in the sampling fasta file. The UniRef50 training fasta was sorted in the order of occurrence of records in column 1 of the mapping file, and the UniRef90+UniRef50 sampling fasta file was sorted in the order of occurrence of records in column 2. Protein sequences longer than 1024 amino acids were cropped to 1023 amino acids from the sequence start (Devlin et al., 2018).
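
The sketch below illustrates two of the steps described above, cropping long sequences and uniformly sampling a cluster member at each training update. It uses a simplified in-memory dictionary as a stand-in for the sampling fasta and mapping files, not the actual BioNeMo data loaders.

```python
# Conceptual sketch of two preprocessing steps described above: cropping long
# sequences to 1023 residues, and replacing a UniRef50 cluster representative
# with a uniformly sampled UniRef90/UniRef100 member at each training update.
# The in-memory dict stands in for the sampling fasta + mapping files.
import random

MAX_LEN = 1024

def crop(seq: str) -> str:
    """Crop sequences longer than 1024 amino acids to 1023, keeping the sequence start."""
    return seq[:MAX_LEN - 1] if len(seq) > MAX_LEN else seq

def sample_training_sequence(uniref50_id: str, clusters: dict) -> str:
    """Uniformly sample one member of the UniRef90 cluster mapped to this UniRef50 record."""
    return crop(random.choice(clusters[uniref50_id]))

# Toy example with made-up identifiers and sequences.
clusters = {"UniRef50_EXAMPLE": ["MKTAYIAKQR", "MKTAYIAKQRQISFVKSHFSRQ"]}
print(sample_training_sequence("UniRef50_EXAMPLE", clusters))
```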

How to Use this Model

  • Compute embeddings from input protein sequences. Embeddings are created for each amino acid in the protein sequence and can then be used for downstream tasks such as prediction of secondary structure, subcellular localization, or others, as detailed by the FLIP benchmark tasks [1]; see the sketch after this list.
  • The recommended way to consume this model is inside the BioNeMo Framework container. BioNeMo is a framework for training and deploying large biomolecular language models at supercomputing scale for the discovery and development of therapeutics.
  • Find out more about BioNeMo and its applications here.
  • Click here for example tutorials on how to use the ESM-2nv model in the BioNeMo Framework.
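
Because ESM-2nv is weight-compatible with the public ESM-2 checkpoints, a minimal way to see what the embeddings look like is to run the HuggingFace 650M checkpoint as a stand-in; the BioNeMo Framework container provides its own optimized inference path, covered in the tutorials above.

```python
# Minimal embedding-extraction sketch using the public HuggingFace ESM-2 650M
# checkpoint as a stand-in for ESM-2nv (the two are weight-compatible). The
# BioNeMo Framework container exposes its own optimized inference API instead.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example protein sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, length + 2 special tokens, 1280)

per_residue = hidden[0, 1:-1]                    # one 1280-dim embedding per amino acid
protein_embedding = per_residue.mean(dim=0)      # mean-pooled feature vector for the protein
print(per_residue.shape, protein_embedding.shape)
```

The per-residue embeddings feed residue-level tasks such as secondary structure prediction, while the mean-pooled vector can serve as input features for sequence-level tasks of the kind covered by the FLIP benchmark [1].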

Suggested Reading

  1. Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, and Kevin K. Yang. FLIP: benchmark tasks in fitness landscape inference for proteins. 2022. doi:10.1101/2021.11.09.467890.
  2. Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, and others. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
  3. Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A., April 2021. doi:10.1073/pnas.2016239118.
  4. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res., 49(D1):D480–D489, January 2021. doi:10.1093/nar/gkaa1100.

License

Apache License 2.0

A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.