MolMIM:
MolMIM is a latent variable model developed by NVIDIA [2] that is trained in an unsupervised manner on a large-scale dataset of molecules represented as SMILES strings. MolMIM uses a transformer architecture to learn an informative fixed-size latent space via Mutual Information Machine (MIM) learning [3]. MIM is a learning framework for latent variable models that promotes informative and clustered latent codes. MolMIM can be used to sample novel molecules from the model's latent space.
This model is for research and development only.
[1]: The CMA Evolution Strategy: A Comparing Review
[2]: Improving Small Molecule Generation using Mutual Information Machine
[3]: MIM: Mutual Information Machine
MolMIM is provided under the NVIDIA AI Foundations Model Community License
Architecture Type: Encoder-Decoder
MolMIM utilizes a Perceiver encoder architecture that outputs a fixed-size representation, so molecules of varying length are mapped into a fixed-size latent space. MolMIM's decoder is a Transformer. Both encoder and decoder contain 6 layers with a hidden size of 512, 8 attention heads, and a feed-forward dimension of 2048. The total number of parameters in MolMIM is 65.2M. The model was trained with A-MIM learning.
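For orientation, the sketch below captures these hyperparameters in a hypothetical config object and gives a back-of-the-envelope weight count for the core transformer stack; the field names and the parameter breakdown are illustrative assumptions, not the actual MolMIM implementation.

```python
from dataclasses import dataclass

# Hypothetical configuration mirroring the architecture described above;
# field names are illustrative, not the actual MolMIM config schema.
@dataclass
class MolMIMConfig:
    num_layers: int = 6            # per encoder and decoder
    hidden_size: int = 512
    num_attention_heads: int = 8
    ffn_hidden_size: int = 2048
    max_seq_length: int = 128      # maximum input length in tokens

def rough_param_count(cfg: MolMIMConfig) -> float:
    """Back-of-the-envelope transformer weight count in millions
    (biases, norms, embeddings, and Perceiver latent queries excluded)."""
    attn = 4 * cfg.hidden_size ** 2                  # Q/K/V/output projections
    ffn = 2 * cfg.hidden_size * cfg.ffn_hidden_size  # two feed-forward matrices
    encoder = cfg.num_layers * (attn + ffn)
    decoder = cfg.num_layers * (2 * attn + ffn)      # decoder adds cross-attention
    return (encoder + decoder) / 1e6

print(f"~{rough_param_count(MolMIMConfig()):.1f}M core transformer weights")
# Embeddings, biases, norms, and the Perceiver latent components presumably
# account for the remainder of the reported 65.2M total.
```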
Network Architecture: Perceiver
Input Type(s): Text (Molecular Sequence)
Input Format(s): Comma Separated Values, Simplified Molecular-Input Line Entry System (SMILES)
Input Parameters: 1D
Other Properties Related to Input: Maximum input length is 128 tokens. Pretraining dataset samples were randomly split into train, validation, and test sets (99% / 0.5% / 0.5%).
Output Type(s): Text, Numerical
Output Format: SMILES
Output Parameters: 2D
Other Properties Related to Output: Maximum output length is 512 tokens
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
Link: ZINC-15
The ZINC15 database [Sterling and Irwin, 2015] was used for training. Approximately 1.74 billion molecules (SMILES strings) were selected from the full database, subject to the following constraints: molecular weight <= 500 Daltons, LogP <= 5, number of hydrogen bond donors <= 5, number of hydrogen bond acceptors <= 10, and quantitative estimate of drug-likeness (QED) >= 0.5. The compounds were filtered to ensure a maximum length of 128 characters. Samples were randomly split into train, validation, and test sets (99% / 0.5% / 0.5%).
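A minimal sketch of this selection procedure using RDKit descriptors is shown below; the exact preprocessing pipeline is not reproduced here, so treat this as an illustration of the stated criteria only.

```python
# Illustrative filter implementing the stated ZINC15 selection criteria
# with RDKit; a sketch, not the actual preprocessing pipeline.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, QED

def passes_filters(smiles: str) -> bool:
    if len(smiles) > 128:                      # maximum length constraint
        return False
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                            # unparsable SMILES
        return False
    return (
        Descriptors.MolWt(mol) <= 500          # molecular weight (Daltons)
        and Descriptors.MolLogP(mol) <= 5      # LogP
        and Lipinski.NumHDonors(mol) <= 5      # hydrogen bond donors
        and Lipinski.NumHAcceptors(mol) <= 10  # hydrogen bond acceptors
        and QED.qed(mol) >= 0.5                # drug-likeness estimate
    )

print(passes_filters("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```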
** Data Collection Method by dataset
** Labeling Method by dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)): 1.54B molecules with molecular weight <= 500 Daltons, LogP <= 5, with reactivity levels rated as “reactive” and purchasability “annotated.” The compounds were filtered to ensure a maximum length of 128 characters.
Link: MoleculeNet - Lipophilicity, FreeSolv, ESOL
** Data Collection Method by dataset
** Labeling Method by dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)):
MoleculeNet Physical Chemistry is an aggregation of public molecular datasets. The physical chemistry portion of MoleculeNet used for evaluation comprises ESOL (1,128 compounds), FreeSolv (642 compounds), and Lipophilicity (4,200 compounds).
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande. MoleculeNet: A Benchmark for Molecular Machine Learning. arXiv preprint arXiv:1703.00564, 2017.
Engine: TensorRT
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.
We used two main sets of benchmarks. The first monitors reconstruction accuracy on a 250K-molecule clustered validation set. Here we see that the framework release is significantly better at reconstructing input molecules than the version of MolMIM released in the service. We assess both exact reconstruction accuracy and approximate reconstruction accuracy, the latter computed as the Tanimoto similarity between 2048-bit Extended Connectivity Fingerprints with a radius of 2 (ECFP4); a short sketch of this computation follows the table below.
| Model | Exact (%) | Without Chirality (%) | >=0.9 ECFP4 similarity (%) | >=0.8 ECFP4 similarity (%) | >=0.7 ECFP4 similarity (%) |
|---|---|---|---|---|---|
| Service MolMIM v0.0.3 | 40.12 | 80.74 | 81.11 | 82.44 | 85.87 |
| MolMIM 70M v24.3 | 99.88 | 99.89 | 99.90 | 99.93 | 99.97 |
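The following is a minimal sketch of the approximate-reconstruction check using RDKit, assuming standard Morgan fingerprints as the ECFP4 implementation; it illustrates the metric, not the actual evaluation harness.

```python
# Tanimoto similarity of 2048-bit ECFP4 (Morgan radius-2) fingerprints
# between an input molecule and its reconstruction. Illustrative only.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4_tanimoto(smiles_a: str, smiles_b: str) -> float:
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

# A reconstruction counts toward the ">= 0.9 ECFP4 similarity" column when
# ecfp4_tanimoto(input_smiles, reconstructed_smiles) >= 0.9.
```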
The second set of benchmarks measures sampled-molecule quality; here we see that the framework model matches the sampling quality of the version of MolMIM in the service. The sampling quality metrics are defined as follows:
$$ \text{Validity}\,(\%) = \frac{|V|}{|G|} \times 100 \qquad \text{Uniqueness}\,(\%) = \frac{|U|}{|V|} \times 100 \qquad \text{Novelty}\,(\%) = \frac{|N|}{|U|} \times 100 \\ \text{Non-identicality}\,(\%) = \frac{|\bar{I}|}{|V|} \times 100 \qquad \text{Effective novelty}\,(\%) = \frac{|N \cap \bar{I}|}{|G|} \times 100 $$
where $G$ is the set of all generated molecules, $V$ is the subset of valid molecules in $G$, $U$ is the subset of unique molecules in $V$, $N$ is the subset of novel molecules not present in the training set, and $\bar{I}$ is the subset of molecules that are not identical to the seed molecule. For more details, refer to Section A.6 of the MolMIM paper; a sketch of these metric computations follows the table below.
| Model | Best Sampling Radius (stdev) | Validity (%) | Novelty (%) | Uniqueness (%) | Non-identicality (%) | Effective Novelty (%) |
|---|---|---|---|---|---|---|
| Service MolMIM v0.0.3 | 1.0 | 100.0 | 70.0 | 98.0 | 96.0 | 68.0 |
| MolMIM 70M v24.3 | 2.0 | 99.0 | 69.0 | 100.0 | 100.0 | 68.0 |
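As an illustration of the metric definitions above, the sketch below computes them for a list of generated SMILES with RDKit. The canonicalization step and the `training_set` / `seed_smiles` arguments are assumptions made for the example, not the paper's exact evaluation code.

```python
# Illustrative computation of the sampling-quality metrics defined above.
from rdkit import Chem

def canon(smiles: str):
    """Canonical SMILES, or None if the string fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def sampling_metrics(generated, training_set, seed_smiles):
    # Assumes at least one valid/unique molecule (sketch, no zero guards).
    G = len(generated)
    valid = [c for c in (canon(s) for s in generated) if c is not None]  # V
    unique = set(valid)                                                  # U
    novel = {s for s in unique if s not in training_set}                 # N
    seed = canon(seed_smiles)
    non_identical = [s for s in valid if s != seed]                      # I-bar
    effective = {s for s in novel if s != seed}                          # N ∩ I-bar
    return {
        "validity_pct": 100 * len(valid) / G,
        "uniqueness_pct": 100 * len(unique) / len(valid),
        "novelty_pct": 100 * len(novel) / len(unique),
        "non_identicality_pct": 100 * len(non_identical) / len(valid),
        "effective_novelty_pct": 100 * len(effective) / G,
    }
```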
Training speed was tested on DGX A100 systems with 80 GB GPUs. Training MolMIM to convergence on the ZINC-15 dataset took 1 day and 14 hours on 32 GPUs, requiring just over 1 epoch with a batch size of 2,048 per GPU (2,048 × 32 = 65,536 global batch size per step). Gradient accumulation was not used in training. Note also that we train with 32-bit precision for this release; in future releases we will experiment with other precisions to match our current state-of-the-art accuracy while improving throughput.
When estimating how long training runs will take, note that validation is significantly slower than training. We downsampled our validation set to 250,000 molecules so that it completes in 4 steps (in parallel across 32 GPUs); we found this a sufficient number of molecules to get a reasonable picture of performance on larger sets. Each GPU takes approximately 20 seconds to process a single batch of 2,048 molecules, so across 4 nodes of 8 GPUs (32 total) the validation set takes about 1 minute 20 seconds. Validation is slow because we compute molecular accuracy, which requires full autoregressive sampling from the decoder (128 tokens) to generate a molecule that can be compared to the input. Because of this, we limited validation to every 2,500 steps. Training, on the other hand, is significantly faster, at approximately 3 steps per second. More detailed timing measurements for training step time with different GPU configurations are shown below.
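For reference, the validation-timing arithmetic quoted above works out as follows; all numbers are taken from the text, and the 4-node layout is implied by the 32-GPU total.

```python
# Worked arithmetic for the validation-timing figures quoted above.
val_set = 250_000          # molecules in the downsampled validation set
per_gpu_batch = 2_048      # molecules per GPU per step
gpus = 32                  # 4 nodes x 8 GPUs
global_batch = per_gpu_batch * gpus         # 65,536 molecules/step

steps = -(-val_set // global_batch)         # ceiling division -> 4 steps
seconds = steps * 20                        # ~20 s per validation step
print(f"{steps} validation steps, ~{seconds} s")   # 4 steps, ~80 s (1:20)

# Training runs ~3 steps/s at this scale, i.e. roughly
print(f"~{3 * global_batch:,} molecules/s")        # ~196,608 molecules/s
# which is consistent with the scaling trend in the throughput table below.
```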
The numbers in the following table give a sense of how training scales to more nodes and GPUs.
| Number of A100 GPUs/node | Number of Nodes | Number of A100 GPUs (total) | Training Throughput (molecules/second) |
|---|---|---|---|
| 1 | 1 | 1 | 7360.649 |
| 8 | 1 | 8 | 53579.659 |
| 8 | 2 | 16 | 102541.273 |