
OpenFold

Description
OpenFold predicts protein structures from protein sequence inputs and optional multiple sequence alignments (MSAs) and template(s).
Publisher
NVIDIA
Latest Version
finetuned_1.2
Modified
February 29, 2024
Size
330.63 MB

OpenFold

Model Overview

Description:

This is an OpenFold implementation under the BioNeMo framework, derived from the public OpenFold and DeepMind AlphaFold-2 projects. This checkpoint was fine-tuned from the initial-training checkpoint released by the public OpenFold team. OpenFold predicts protein structures from protein sequence inputs and optional multiple sequence alignments (MSAs) and template(s). This implementation supports initial training, fine-tuning, and inference under the BioNeMo framework. Detailed examples can be found under examples/protein/openfold within the BioNeMo framework repository.

Users are advised to read the licensing terms in the public OpenFold and DeepMind AlphaFold-2 repositories, as well as our copyright text.

This model is ready for commercial use.

References:

  1. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization
  2. Highly accurate protein structure prediction with AlphaFold
  3. OpenProteinSet: Training data for structural biology at scale

Model Architecture:

Architecture Type: Pose Estimation
Network Architecture: AlphaFold-2

Input:

Input Type(s): Protein Sequence, (optional) Multiple Sequence Alignment(s), and (optional) Structural Template(s)
Input Format(s): None, a3m (text file), hhr (text file)
Input Parameters: 1D
Other Properties Related to Input: None
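The a3m MSA input is a FASTA-like text format in which lowercase letters mark insertions relative to the query sequence. As a rough illustration only (this is not the BioNeMo loader; the helper names are hypothetical), a minimal a3m parser might look like:

```python
# Minimal a3m parsing sketch. Dropping lowercase insertion characters
# recovers sequences aligned column-for-column with the query.

def parse_a3m(text):
    """Return a list of (header, sequence) pairs from a3m-formatted text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

def remove_insertions(seq):
    """Drop lowercase insertion characters to get query-aligned columns."""
    return "".join(c for c in seq if not c.islower())

example = """>query
MKTAYIAK
>hit_1
MKTaAYI-K
"""
msa = [(h, remove_insertions(s)) for h, s in parse_a3m(example)]
# Each aligned sequence now has the same length as the query.
```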

Output:

Output Type(s): Protein Structure Pose(s), (optional) Confidence Metrics, (optional) Embeddings
Output Format: PDB (text file), Pickle file, Pickle file
Output Parameters: 3D
Other Properties Related to Output: Pose (num_atm x 3), (optional) Confidence Metrics: pLDDT (num_res) and PAE (num_res x num_res), (optional) Embeddings (num_res x emb_dims, or num_res x num_res x emb_dims)
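The output shapes above can be illustrated with a short post-processing sketch. The arrays below are synthetic and the keys of the actual pickle outputs are implementation-specific, so treat this purely as a shape and usage illustration, not as the BioNeMo output schema:

```python
# Sketch of post-processing the optional confidence outputs.
# Shapes follow the card: pLDDT is (num_res,), PAE is (num_res, num_res).
import numpy as np

num_res = 5
plddt = np.array([92.1, 88.4, 70.2, 95.0, 81.3])  # per-residue pLDDT (0-100)
pae = np.abs(                                     # synthetic pairwise error (Angstrom)
    np.random.default_rng(0).normal(3.0, 1.0, (num_res, num_res))
)

mean_plddt = plddt.mean()       # global confidence summary
confident = plddt >= 90.0       # mask of high-confidence residues
max_pae = pae.max()             # worst-case pairwise aligned error
```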

Software Integration:

Runtime Engine(s):

  • NeMo, BioNeMo

Supported Hardware Microarchitecture Compatibility:

  • [Ampere]
  • [Hopper]

Supported Operating System(s):

  • [Linux]

Model Version(s):

OpenFold under the BioNeMo framework

Training & Evaluation:

Training Dataset:

Link: PDB-mmCIF dataset, OpenProteinSet
Data Collection Method by dataset

  • PDB-mmCIF dataset: [Automatic] and [Human]
  • OpenProteinSet: [Automatic]

Labeling Method by dataset

  • [Not Applicable]

Properties: PDB-mmCIF dataset: 200k samples of experimental protein structures. OpenProteinSet: 269k samples of sequence alignments.
Dataset License(s): PDB-mmCIF dataset: CC0 1.0 Universal. OpenProteinSet: CC BY 4.0.

Evaluation Dataset:

Link: PDB-mmCIF dataset, OpenProteinSet
Data Collection Method by dataset

  • PDB-mmCIF dataset: [Automatic] and [Human]
  • OpenProteinSet: [Automatic]

Labeling Method by dataset

  • [Not Applicable]

Properties: PDB-mmCIF dataset: 200k samples of experimental protein structures. OpenProteinSet: 269k samples of sequence alignments.
Dataset License(s): PDB-mmCIF dataset: CC0 1.0 Universal. OpenProteinSet: CC BY 4.0.

Inference:

Engine: NeMo, BioNeMo, Triton
Test Hardware:

  • [Ampere]
  • [Hopper]

Benchmarks

Accuracy benchmark

There are two stages of OpenFold training: initial training and fine-tuning. Four checkpoints are available for download: a pair of initial-training and fine-tuning checkpoints released publicly and converted to .nemo format, and another pair of in-house trained checkpoints. All checkpoints are benchmarked against the CAMEO benchmark with proteins dated from 2021-09-17 to 2021-12-11. This validation set is available through the training data.

Benchmark results (lDDT-cα) for checkpoints trained using BioNeMo framework:

                                  initial-training   fine-tuning*
CAMEO 2021-09-17 to 2021-12-11         89.82             91.0

*This checkpoint was fine-tuned starting from the public initial-training checkpoint and is available for download via NGC (using the download_models.py script).

Training Performance Benchmarks

Training speed was tested on 16 DGX A100 nodes (128 GPUs) with 80 GB of memory per GPU, with a single protein (micro batch size of 1) per GPU.

                         initial-training   fine-tuning
number of steps               80,000           12,000
training step time (s)         6.06            24.91

Note that in the default OpenFold training configuration shipped with BioNeMo, validation runs every 200 steps and takes about 3 minutes. Initial training therefore takes approximately 6.5 days.
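The ~6.5-day estimate follows directly from the benchmark table and the validation note; a quick back-of-envelope check:

```python
# Back-of-envelope check of the initial-training wall-clock estimate:
# 80,000 steps at 6.06 s/step, plus a ~3-minute validation pass every
# 200 steps (numbers taken from the table and note above).
steps = 80_000
step_time_s = 6.06
val_every = 200
val_time_s = 3 * 60

train_s = steps * step_time_s               # pure training time
val_s = (steps // val_every) * val_time_s   # 400 validation passes
total_days = (train_s + val_s) / 86_400
# total_days is roughly 6.44, consistent with "approximately 6.5 days"
```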

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.