This is an OpenFold implementation under the BioNeMo framework, derived from the public OpenFold and DeepMind AlphaFold-2 implementations. This checkpoint was fine-tuned starting from the initial-training checkpoint released by the public OpenFold team. OpenFold predicts protein structures from protein sequence inputs, with optional multiple sequence alignments (MSAs) and structural templates. This implementation supports initial training, fine-tuning, and inference under the BioNeMo framework. Detailed examples can be found under examples/protein/openfold within the BioNeMo framework repository.
Users are advised to read the licensing terms of the public OpenFold and DeepMind AlphaFold-2 repositories, as well as our copyright text.
This model is ready for commercial use.
Architecture Type: Pose Estimation
Network Architecture: AlphaFold-2
Input Type(s): Protein Sequence, (optional) Multiple Sequence Alignment(s), and (optional) Structural Template(s)
Input Format(s): None, a3m (text file), hhr (text file)
Input Parameters: 1D
Other Properties Related to Input: None
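As an illustration of the optional a3m input listed above, here is a minimal sketch of reading an a3m alignment. It is not the BioNeMo data pipeline: the function name `parse_a3m` and the example sequences are hypothetical. It relies only on the a3m convention that records are FASTA-like and that lowercase letters mark insertions relative to the query, so dropping them yields rows of equal length.

```python
from typing import List, Tuple


def parse_a3m(text: str) -> List[Tuple[str, str]]:
    """Return (header, aligned_sequence) pairs from a3m-formatted text.

    Hypothetical helper for illustration only.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # Lowercase letters are insertions relative to the query; removing them
    # leaves every row with the same number of alignment columns.
    return [(h, "".join(c for c in s if not c.islower())) for h, s in records]


# Made-up example alignment: one query and two hits.
EXAMPLE_A3M = """>query
MKTAYIA
>hit_1
MKTaAYIA
>hit_2
MK-AYIA
"""

msa = parse_a3m(EXAMPLE_A3M)
for name, row in msa:
    print(name, row)
```

Gap characters (`-`) are kept, since they occupy alignment columns; only the lowercase insertion states are removed.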
Output Type(s): Protein Structure Pose(s), (optional) Confidence Metrics, (optional) Embeddings
Output Format: PDB (text file), Pickle file, Pickle file
Output Parameters: 3D
Other Properties Related to Output: Pose (num_atm_ x 3), (optional) Confidence Metric: pLDDT (num_res_) and PAE (num_res_ x num_res_), (optional) Embeddings (num_res_ x emb_dims, or num_res_ x num_res_ x emb_dims)
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
OpenFold under BioNeMo framework
Link: PDB-mmCIF dataset, OpenProteinSet
Data Collection Method by dataset
Labeling Method by dataset
Properties: PDB-mmCIF dataset: 200k samples of experimental protein structures. OpenProteinSet: 269k samples of sequence alignments.
Dataset License(s): PDB-mmCIF dataset: CC0 1.0 Universal. OpenProteinSet: CC BY 4.0.
Engine: NeMo, BioNeMo, Triton
Test Hardware:
OpenFold training has two stages: initial training and fine-tuning. Four checkpoints are available for download: a pair of publicly available initial-training and fine-tuning checkpoints converted to .nemo format, and a pair of checkpoints trained in-house. All checkpoints are benchmarked against the CAMEO benchmark on proteins dated from 2021-09-17 to 2021-12-11. This validation set is available through the training data.
Benchmark results (lDDT-Cα) for checkpoints trained using BioNeMo framework:
| | initial-training | fine-tuning* |
|---|---|---|
| CAMEO 2021-09-17 to 2021-12-11 | 89.82 | 91.0 |
*This checkpoint was fine-tuned starting from the public initial-training checkpoint and is available for download via NGC (using the download_models.py script).
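For readers unfamiliar with the metric reported above, the following is a simplified sketch of an lDDT-Cα score, not the evaluation code used by BioNeMo or OpenFold. It assumes the commonly used settings (a 15 Å inclusion radius on the reference structure and tolerance thresholds of 0.5, 1, 2, and 4 Å), takes only Cα coordinates, and ignores details such as masks for missing residues. The function name `lddt_ca` and the toy coordinates are illustrative.

```python
import numpy as np


def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT over C-alpha atoms, on a 0-100 scale.

    pred, ref: arrays of shape (num_res, 3) holding C-alpha coordinates.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Pairwise distance matrices for the reference and the prediction.
    d_ref = np.linalg.norm(ref[:, None, :] - ref[None, :, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None, :] - pred[None, :, :], axis=-1)
    # Score only pairs that are local in the reference, excluding self-pairs.
    pair_mask = (d_ref < cutoff) & ~np.eye(len(ref), dtype=bool)
    if not pair_mask.any():
        return 0.0
    diff = np.abs(d_ref - d_pred)[pair_mask]
    # Fraction of distances preserved at each tolerance, then averaged.
    preserved = np.mean([(diff < t).mean() for t in thresholds])
    return 100.0 * float(preserved)


# Toy example: four C-alpha atoms on a line, 3.8 Angstrom apart.
ref = np.array([[0.0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0]])
shifted = ref + np.array([5.0, -2.0, 1.0])  # rigid translation of the whole chain
perturbed = ref.copy()
perturbed[-1, 0] += 3.0                     # last atom displaced by 3 Angstrom

score_shifted = lddt_ca(shifted, ref)
score_perturbed = lddt_ca(perturbed, ref)
print(score_shifted, score_perturbed)
```

Because the score compares internal distances rather than superimposed coordinates, a rigidly translated or rotated prediction is not penalized, while a local displacement lowers the score.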
Training speed was tested on 16 DGX-A100 nodes (128 GPUs with 80GB of memory each), with a single protein (micro batch size of 1) per GPU.
| | initial-training | fine-tuning |
|---|---|---|
| number of steps | 80,000 | 12,000 |
| training step time (s) | 6.06 | 24.91 |
Note that in the default configuration of OpenFold training shipped in BioNeMo, validation runs every 200 steps and takes about 3 minutes each time. Including this validation overhead, initial training takes approximately 6.5 days.
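The wall-clock estimate can be reproduced from the step counts and step times in the table. The sketch below is a back-of-the-envelope calculation only; the function name is illustrative, and applying the same validation cadence and duration to fine-tuning is an assumption, not a figure stated in this card.

```python
SECONDS_PER_DAY = 86_400


def estimate_training_days(num_steps, step_time_s,
                           val_every_steps=200, val_minutes=3.0):
    """Rough wall-clock estimate: training steps plus periodic validation."""
    train_s = num_steps * step_time_s
    num_validations = num_steps // val_every_steps
    val_s = num_validations * val_minutes * 60
    return (train_s + val_s) / SECONDS_PER_DAY


# Initial training: 80,000 steps at 6.06 s per step.
initial_days = estimate_training_days(80_000, 6.06)

# Fine-tuning: 12,000 steps at 24.91 s per step.
# Assumption: validation runs at the same cadence and takes the same time.
fine_tuning_days = estimate_training_days(12_000, 24.91)

print(f"initial training: {initial_days:.2f} days")
print(f"fine-tuning:      {fine_tuning_days:.2f} days")
```

For initial training this gives about 5.6 days of training steps plus roughly 20 hours of validation (400 runs of about 3 minutes), consistent with the approximately 6.5 days quoted above.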
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.