Linux / amd64
GenMol is a masked diffusion model [1] trained on molecular Sequential Attachment-based Fragment Embedding (SAFE) representations [2] for fragment-based molecule generation. It can serve as a generalist model for various drug discovery tasks, including de novo generation, linker design, motif extension, scaffold decoration/morphing, hit generation, and lead optimization.
This model is ready for commercial use.
This NIM is licensed under the NVIDIA AI Foundation Models Community License Agreement. By using this NIM, you accept the terms and conditions of this license. You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
@misc{sahoo2024simpleeffectivemaskeddiffusion,
title={Simple and Effective Masked Diffusion Language Models},
author={Subham Sekhar Sahoo and Marianne Arriola and Yair Schiff and Aaron Gokaslan and Edgar Marroquin and Justin T Chiu and Alexander Rush and Volodymyr Kuleshov},
year={2024},
eprint={2406.07524},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.07524},
}
@misc{noutahi2023gottasafenewframework,
title={Gotta be SAFE: A New Framework for Molecular Design},
author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
year={2023},
eprint={2310.10773},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2310.10773},
}
Architecture Type: Transformer
Network Architecture: BERT
Input Type(s): Text (Molecular Sequence), Number (Molecules to generate, SoftMax temperature scaling factor, randomness factor, diffusion step-size), Enumeration (Scoring method), Binary (Showing unique molecules only)
Input Format(s): Text: String (Sequential Attachment-based Fragment Embedding (SAFE)); Number: Integer, FP32; Enumeration: String (QED, LogP); Binary: Boolean
Input Parameters: 1D
Other Properties Related to Input: Maximum input length is 512 tokens.
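As a rough illustration of the input types listed above, the sketch below groups them into one request object and applies basic type checks. Every field name, default value, and range check here is an assumption made for illustration and not the NIM's actual API; the 512-token limit is not checked because it depends on the model's tokenizer.

```python
from dataclasses import dataclass, asdict

# The two scoring methods named in this card.
ALLOWED_SCORING = {"QED", "LogP"}


@dataclass
class GenerationRequest:
    # All field names and defaults are hypothetical placeholders.
    sequence: str              # molecular sequence in SAFE format
    num_molecules: int = 10    # number of molecules to generate
    temperature: float = 1.0   # softmax temperature scaling factor
    randomness: float = 1.0    # randomness factor
    step_size: int = 1         # diffusion step size (integer type is assumed)
    scoring: str = "QED"       # scoring method
    unique: bool = False       # return unique molecules only

    def validate(self) -> None:
        """Raise ValueError if a field has an unexpected type or value."""
        if not isinstance(self.sequence, str) or not self.sequence:
            raise ValueError("sequence must be a non-empty string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(self.num_molecules, bool) or not isinstance(self.num_molecules, int):
            raise ValueError("num_molecules must be an integer")
        if self.num_molecules < 1:
            raise ValueError("num_molecules must be at least 1")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if self.scoring not in ALLOWED_SCORING:
            raise ValueError(f"scoring must be one of {sorted(ALLOWED_SCORING)}")
        if not isinstance(self.unique, bool):
            raise ValueError("unique must be a boolean")


# Placeholder input string; a real request would carry a SAFE string.
request = GenerationRequest(sequence="c1ccccc1", num_molecules=5, scoring="LogP")
request.validate()
payload = asdict(request)  # plain dict, ready to serialize as JSON
```

Keeping validation on the client side is a design choice for this sketch only; the service may apply different or stricter rules.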
Output Type(s): Text (List of molecule sequences), Number (List of scores)
Output Format: Text: Array of string (Sequential Attachment-based Fragment Embedding (SAFE)); Number: Array of FP32 (Scores)
Output Parameters: 2D
Other Properties Related to Output: Maximum output length is 512 tokens.
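The output described above is a list of molecule sequences and a list of scores. The sketch below shows one way a client might pair and rank them, assuming the two lists are parallel (same length, same order) and that a higher score is better. Both are assumptions; the helper name `rank_molecules` is hypothetical, and for LogP a higher value is not necessarily preferable.

```python
from typing import List, Tuple


def rank_molecules(
    molecules: List[str],
    scores: List[float],
    unique: bool = False,
) -> List[Tuple[str, float]]:
    """Pair each molecule with its score and sort by score, highest first."""
    if len(molecules) != len(scores):
        raise ValueError("molecules and scores must have the same length")

    pairs = list(zip(molecules, scores))

    if unique:
        # Keep only the first occurrence of each molecule string.
        seen = set()
        deduped = []
        for molecule, score in pairs:
            if molecule not in seen:
                seen.add(molecule)
                deduped.append((molecule, score))
        pairs = deduped

    # sorted() is stable, so ties keep their original order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Placeholder strings stand in for generated SAFE sequences.
ranked = rank_molecules(["mol_a", "mol_b", "mol_a"], [0.4, 0.9, 0.4], unique=True)
```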
Runtime Engine(s):
PyTorch >= 2.5.1
Supported Hardware Microarchitecture Compatibility:
NVIDIA Ampere
NVIDIA Ada Lovelace
NVIDIA Hopper
NVIDIA Grace Hopper
Supported Operating System(s):
Linux
GenMol v1.0
Link: SAFE-GPT GitHub, HuggingFace
Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated
Properties: 1.1B SAFE strings covering various molecule types (drug-like compounds, peptides, multi-fragment molecules, polymers, reagents, and non-small molecules).
Dataset License(s): CC-BY-4.0
Link: SAFE-DRUGS GitHub, HuggingFace
Data Collection Method by dataset: Not Applicable
Labeling Method by dataset: Not Applicable
Properties: SAFE-DRUGS consists of 26 known therapeutic drugs.
Dataset License(s): CC-BY-4.0
Engine: PyTorch
Test Hardware: A6000, A100, L40, L40S, H100
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here.
Please report security vulnerabilities or NVIDIA AI Concerns here.