ProtT5nv

Description: A T5 model developed using the BioNeMo framework, starting from a model pre-trained on NLP data with the NeMo Framework.
Publisher: NVIDIA
Latest Version: 1.0
Modified: November 27, 2023
Size: 704.42 MB

Model Overview

ProtT5nv is a model that has been trained on protein sequences. Its encoder output can be used for predictive models, while sequence translation tasks can utilize the entire encoder-decoder architecture.

Model Architecture

ProtT5nv was developed using the BioNeMo framework, starting from a model pre-trained on NLP data. The model uses the T5 architecture and is based on the original ProtT5 model [1][2]. It has 12 layers, 12 attention heads, a hidden dimension of 768, and contains 192M parameters. The maximum sequence length supported by ProtT5nv is 512 tokens. Pre-norm layer normalization and GELU activation are used throughout.

ProtT5nv has a maximum sequence length of 512 for both the encoder and the decoder. Proteins whose amino acid sequence is longer than this are truncated at 512 amino acids.
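
For concreteness, the hyperparameters above correspond roughly to the following T5 configuration. This is a minimal sketch using the Hugging Face transformers T5Config class as a stand-in; the actual model is defined through NeMo/BioNeMo configuration files, and the feed-forward width (d_ff) here is an assumption rather than a published value.

```python
from transformers import T5Config

# Illustrative T5 configuration mirroring the ProtT5nv hyperparameters
# described above: 12 encoder and 12 decoder layers, 12 attention heads,
# hidden dimension 768, dropout 0.1, GELU activation. The maximum
# sequence length (512 tokens) is enforced at the data/tokenizer level.
config = T5Config(
    num_layers=12,             # encoder layers
    num_decoder_layers=12,     # decoder layers
    num_heads=12,              # attention heads
    d_model=768,               # hidden dimension
    d_kv=64,                   # per-head dimension (768 / 12)
    d_ff=3072,                 # feed-forward width (assumed 4 * d_model)
    dropout_rate=0.1,          # dropout used during training
    feed_forward_proj="gelu",  # GELU activation, as stated above
)
print(config)
```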

Training

Starting from the T5 model trained on NLP data, ProtT5nv was further pre-trained with protein sequences using data parallelism on 224 V100 GPUs for 58 epochs (approximately 970,189 iterations) with a micro batch size of 12 sequences per GPU. The total training time was approximately 120 wall-clock hours. Inverse square root annealing was used, with a minimum learning rate of 0.0 and ~10,000 warmup steps. Fused Adam optimization was used with parameters β1=0.9, β2=0.999, and weight decay=0.01. The model was trained with a categorical cross-entropy loss, and dropout was set to 0.1 during training.
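
The learning-rate schedule described above can be sketched as a simple function. The peak learning rate is not stated in this card, so peak_lr is a placeholder, and the exact NeMo implementation may differ in detail.

```python
def inverse_sqrt_lr(step: int, peak_lr: float, warmup_steps: int = 10000,
                    min_lr: float = 0.0) -> float:
    """Illustrative inverse square root annealing with linear warmup.

    Warmup is ~10000 steps and the minimum learning rate is 0.0, as
    described above; peak_lr is a placeholder for the (unstated) peak.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)            # linear warmup
    return max(min_lr, peak_lr * (warmup_steps / step) ** 0.5)  # 1/sqrt decay
```

This schedule would be paired with (Fused) Adam using β1=0.9, β2=0.999, and weight decay 0.01, as noted above.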

Dataset and Processing

UniRef50 (release 05/2022) was used for training [3]. The reference sequence for each cluster was selected, and sequences longer than the maximum sequence length of 512 were removed, resulting in approximately 46M protein sequences. The sequences were randomly split, with 4.35K sequences for validation, 875K for test, and the remainder for training. Data masking was performed as described previously [2].
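
A rough sketch of this preprocessing is shown below. The FASTA reader, file name, and exact split sizes are illustrative assumptions; selection of the cluster representative sequences is assumed to have happened upstream, and T5-style span masking [2] is applied later, at training time.

```python
import random

MAX_LEN = 512  # maximum sequence length supported by the model

def read_fasta(path):
    """Minimal FASTA reader; assumes a plain UniRef50-style FASTA file."""
    sequences, chunks = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    sequences.append("".join(chunks))
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        sequences.append("".join(chunks))
    return sequences

# Drop sequences longer than the 512-token limit, then split randomly into
# validation (~4.35K), test (~875K), and training (the remainder).
sequences = [s for s in read_fasta("uniref50.fasta") if len(s) <= MAX_LEN]
random.seed(0)
random.shuffle(sequences)
val = sequences[:4_350]
test = sequences[4_350:879_350]
train = sequences[879_350:]
```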

How to Use this Model

  • For each protein sequence, the model can produce an embedding from the encoder that is suitable for representation learning; for sequence translation tasks, both the encoder and decoder are used (see the sketch after this list).
  • The recommended way to consume this model is inside the BioNeMo Framework container. BioNeMo is a framework for training and deploying large biomolecular language models at supercomputing scale for the discovery and development of therapeutics.
  • Find out more about BioNeMo and its applications here.
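
The sketch below illustrates the general pattern of producing a per-sequence embedding from a ProtT5-style encoder. It uses the publicly available Rostlab ProtT5 checkpoint from Hugging Face as a stand-in, since ProtT5nv itself is consumed through the BioNeMo Framework container, whose inference API is documented there and not reproduced here.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Public ProtT5 encoder used only as a stand-in; ProtT5nv ships as a NeMo
# checkpoint inside the BioNeMo Framework container.
model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtT5-style tokenizers expect space-separated residues; rare amino
# acids (U, Z, O, B) are conventionally mapped to X.
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, length + 1, hidden_dim)

# Mean-pool over residues (dropping the trailing special token) to obtain a
# fixed-size embedding suitable for downstream predictive models.
embedding = hidden[0, :-1].mean(dim=0)
```

Per-residue embeddings (the unpooled hidden states) can be used in the same way for residue-level prediction tasks.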

Suggested Reading

Learn more about ProtT5 here.

References

  1. Elnaggar A., Heinzinger M., Dallago C., Rehawi G., Wang Y., Jones L., Gibbs T., Feher T., Angerer C., Steinegger M., Bhowmik D., and Rost B., "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing", arXiv, 2020.
  2. Raffel C., Shazeer N., Roberts A., Lee K., Narang S., Matena M., Zhou Y., Li W., and Liu P. J., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", arXiv, 2019.
  3. UniProt Consortium, "UniProt: the universal protein knowledgebase in 2021", Nucleic Acids Res., 2021.

License

Apache License 2.0

A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.