MolMIM (NVIDIA)


MolMIM allows users to generate molecules similar to the seed molecule in SMILES format by randomly perturbing the latent space encoded from a seed molecule and decoding that back into SMILES.
Latest Version
March 26, 2024
249.15 MB

Model Overview



  • Generates molecules similar to a seed molecule, in SMILES format, by randomly perturbing the latent representation encoded from the seed (e.g., by adding zero-centered Gaussian noise with a desired variance) and decoding the result back into SMILES.
  • Performs optimization with the CMA-ES algorithm[1] in the model’s latent space and samples molecules with improved values of the desired scoring function.
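
The perturbation step in the first bullet can be sketched as follows. This is a minimal illustration, not the MolMIM API: the latent vector is a zero-filled stand-in for an encoded seed molecule, and the helper `perturb_latent`, the latent size of 512, and the noise scale are assumptions made for the example. In practice the perturbed vectors would be passed to the model's decoder to obtain SMILES strings.

```python
import random


def perturb_latent(z_seed, stdev, num_samples, seed=0):
    """Return num_samples noisy copies of z_seed.

    Each copy adds independent zero-centered Gaussian noise with the
    given standard deviation to every latent dimension.
    """
    rng = random.Random(seed)
    return [
        [z + rng.gauss(0.0, stdev) for z in z_seed]
        for _ in range(num_samples)
    ]


# Stand-in for an encoded seed molecule (assumed latent size: 512).
z_seed = [0.0] * 512

# Ten perturbed latent codes, each a candidate input to a decoder.
samples = perturb_latent(z_seed, stdev=1.0, num_samples=10)
print(len(samples), len(samples[0]))
```

A larger `stdev` moves the samples further from the seed in latent space, which generally trades similarity to the seed for diversity.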

MolMIM is a latent variable model developed by NVIDIA[2] that is trained in an unsupervised manner on a large-scale dataset of molecules represented as SMILES strings. MolMIM uses a transformer architecture to learn an informative fixed-size latent space using Mutual Information Machine (MIM) learning[3]. MIM is a learning framework for latent variable models that promotes informative and clustered latent codes. MolMIM can be used to sample novel molecules from the model’s latent space.

This model is for research and development only.


[1]: The CMA Evolution Strategy: A Comparing Review

[2]: Improving Small Molecule Generation using Mutual Information Machine

[3]: MIM: Mutual Information Machine


MolMIM is provided under the NVIDIA AI Foundations Model Community License

Model Architecture:

Architecture Type: Encoder-Decoder

MolMIM uses a Perceiver encoder architecture, which maps molecules of various lengths into a fixed-size latent representation. MolMIM’s decoder architecture is a Transformer. Both the encoder and the decoder contain 6 layers with a hidden size of 512, 8 attention heads, and a feed-forward dimension of 2048. The total number of parameters in MolMIM is 65.2M. The model was trained with A-MIM learning.

Network Architecture: Perceiver

Input Type(s): Text (Molecular Sequence)
Input Format(s): Comma Separated Values, Simplified Molecular-Input Line Entry System (SMILES)
Input Parameters: 1D
Other Properties Related to Input: Maximum input length is 128 tokens. Pretraining dataset samples were randomly split into train, validation, and test sets ( 99% / 0.5% / 0.5% ).


Output Type(s): Text, Numerical
Output Format: [SMILES]
Output Parameters: [2D]
Other Properties Related to Output: Maximum output length is 512 tokens

Software Integration:

Runtime Engine(s):

  • Triton Inference Server

Supported Hardware Microarchitecture Compatibility:

  • Ampere
  • L40

Supported Operating System(s):

  • Linux

Model Version(s):

  • MolMIM-70M-24.3
    • Trained with log variance sampling loss, but left out the portion of the encoder that used the sampled log variance as an input to the hiddens -> z_mean transformation. We found that this was not needed to achieve performance on the tasks we are currently measuring, and simplified radius sampling.
    • See the molmim model training notebook for more information about how the model was trained and which config was used.

Training & Evaluation:

Training Dataset:

Link: ZINC-15

The ZINC15 database was used for training [Sterling and Irwin, 2015]. Approximately 1.74 billion molecules (SMILES strings) were selected from the full database that met the following constraints: molecular weight <= 500 Daltons, LogP <= 5, number of hydrogen bond donors <= 5, number of hydrogen bond acceptors <= 10, and quantitative estimate of drug-likeness (QED) value >= 0.5. The compounds were filtered to ensure a maximum length of 128 characters. The data were randomly split into train, validation, and test sets (99% / 0.5% / 0.5%).
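
The selection constraints above can be expressed as a simple predicate. The sketch below assumes the descriptors have already been computed with a cheminformatics toolkit; the `MoleculeRecord` class, the `passes_filters` helper, and the descriptor values in the example are hypothetical and are not taken from the MolMIM training code.

```python
from dataclasses import dataclass


@dataclass
class MoleculeRecord:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    smiles: str
    mol_weight: float   # Daltons
    logp: float
    h_donors: int
    h_acceptors: int
    qed: float


def passes_filters(m, max_smiles_len=128):
    """Apply the constraints listed above to one record."""
    return (
        m.mol_weight <= 500
        and m.logp <= 5
        and m.h_donors <= 5
        and m.h_acceptors <= 10
        and m.qed >= 0.5
        and len(m.smiles) <= max_smiles_len
    )


# Illustrative descriptor values only; they are not computed here.
small = MoleculeRecord("CC(=O)Oc1ccccc1C(=O)O", 180.2, 1.3, 1, 3, 0.55)
heavy = MoleculeRecord("C" * 40, 650.0, 7.2, 0, 0, 0.20)

print(passes_filters(small), passes_filters(heavy))
```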

Data Collection Method by dataset:

  • Not Applicable

Labeling Method by dataset:

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): 1.54B molecules with molecular weight <= 500 Daltons, LogP <= 5, with reactivity levels rated as “reactive” and purchasability “annotated.” The compounds were filtered to ensure a maximum length of 128 characters.

Evaluation Dataset:

Link: MoleculeNet - Lipophilicity, FreeSolv, ESOL

Data Collection Method by dataset:

  • Hybrid: Human & Automatic/Sensors

Labeling Method by dataset:

  • Hybrid: Human & Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)):

MoleculeNet Physical Chemistry is an aggregation of public molecular datasets. The physical chemistry portion of MoleculeNet that we used for evaluation is made up of ESOL (1128 compounds), FreeSolv (642 compounds), and Lipophilicity (4200 compounds).

Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, MoleculeNet: A Benchmark for Molecular Machine Learning, arXiv preprint, arXiv: 1703.00564, 2017.

From the MoleculeNet documentation:

  • ESOL is made up of water solubility data (log solubility in mols per litre) for common organic small molecules.
  • FreeSolv is made up of experimental and calculated hydration free energy of small molecules in water.
  • Lipophilicity is composed of experimental results of octanol/water distribution coefficient (logD at pH 7.4).


Engine: Tensor(RT)
Test Hardware:

  • Ampere
  • L40

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.

Accuracy Benchmarks

We used two main sets of benchmarks. The first set monitors reconstruction accuracy on a clustered validation set of 250K molecules. Here we see that our framework release is significantly better at reconstructing input molecules than the version of MolMIM released in the service. We assess both exact reconstruction accuracy and approximate reconstruction accuracy, the latter computed with the Tanimoto similarity of Extended Connectivity FingerPrints with a radius of 2 (ECFP4) and 2048 bits.

| Model | Exact (%) | Without Chirality (%) | >=0.9 ECFP4 similarity (%) | >=0.8 ECFP4 similarity (%) | >=0.7 ECFP4 similarity (%) |
|---|---|---|---|---|---|
| Service MolMIM v0.0.3 | 40.12 | 80.74 | 81.11 | 82.44 | 85.87 |
| MolMIM 70M v24.3 | 99.88 | 99.89 | 99.90 | 99.93 | 99.97 |
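
To illustrate how the approximate reconstruction columns can be computed, the sketch below implements the Tanimoto similarity on fingerprints represented as sets of on-bit indices, and the fraction of pairs at or above a similarity threshold. The fingerprints here are made-up bit sets, not real ECFP4 fingerprints, and the helper names are assumptions for the example. Returning 0.0 when both fingerprints are empty is a choice made in this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| of two on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # sketch convention for two empty fingerprints
    return len(a & b) / len(union)


def percent_at_least(similarities, threshold):
    """Percentage of similarity values that are >= threshold."""
    if not similarities:
        return 0.0
    hits = sum(1 for s in similarities if s >= threshold)
    return 100.0 * hits / len(similarities)


# Made-up on-bit indices within a 2048-bit fingerprint.
fp_input = {1, 5, 9, 200}
fp_reconstructed = {1, 5, 9, 777}

sim = tanimoto(fp_input, fp_reconstructed)  # 3 shared bits / 5 total bits
sims = [1.0, 0.95, 0.85, sim]
print(sim, percent_at_least(sims, 0.8))
```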

The second set of benchmarks measures the quality of sampled molecules. Here we see that the framework model has sampling quality metrics equivalent to those of the version of MolMIM in the service. The sampling quality metrics are defined as follows:

$$ \begin{gathered} \text{Validity (\%)} = \frac{|V|}{|G|}\times100 \quad \text{Uniqueness (\%)} = \frac{|U|}{|V|}\times100 \quad \text{Novelty (\%)} = \frac{|N|}{|U|}\times100 \\ \text{Non-identicality (\%)} = \frac{|\bar{I}|}{|V|}\times100 \quad \text{Effective novelty (\%)} = \frac{|N\cap\bar{I}|}{|G|}\times100 \end{gathered} $$

where $G$ is the set of all generated molecules, $V$ is the subset of valid molecules in $G$, $U$ is the subset of unique molecules in $V$, $N$ is the subset of molecules in $U$ that are novel (not present in the training set), and $\bar{I}$ is the subset of molecules in $V$ that are not identical to the seed molecule. For more details, refer to Section A.6 of the MolMIM paper.
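
The definitions above can be sketched in code, with every metric reported as a percentage to match the results table. This is an illustration under stated assumptions, not the evaluation code used for the benchmarks: `sampling_metrics` is a hypothetical helper, molecules are assumed to be canonical SMILES strings (so string equality means molecular identity), and the validity check is passed in as a function.

```python
def sampling_metrics(generated, is_valid, training_set, seed_molecule):
    """Compute the sampling quality metrics (in percent) for one seed."""
    G = list(generated)
    V = [m for m in G if is_valid(m)]            # valid molecules
    U = set(V)                                   # unique valid molecules
    N = {m for m in U if m not in training_set}  # novel unique molecules
    I_bar = [m for m in V if m != seed_molecule] # valid, not the seed
    novel_non_seed = N & set(I_bar)

    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "validity": pct(len(V), len(G)),
        "uniqueness": pct(len(U), len(V)),
        "novelty": pct(len(N), len(U)),
        "non_identicality": pct(len(I_bar), len(V)),
        "effective_novelty": pct(len(novel_non_seed), len(G)),
    }


# Toy example: "bad" stands for an invalid string; the check is a stub.
metrics = sampling_metrics(
    generated=["CCO", "CCO", "CCN", "c1ccccc1", "bad"],
    is_valid=lambda s: s != "bad",
    training_set={"CCN"},
    seed_molecule="CCO",
)
print(metrics)
```

In the toy example, 4 of 5 generated strings are valid, 3 of those are unique, 2 of the unique ones are novel, and only one molecule is both novel and different from the seed.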

| Model | Best Sampling Radius (stdev) | Validity (%) | Novelty (%) | Uniqueness (%) | Non-identicality (%) | Effective Novelty (%) |
|---|---|---|---|---|---|---|
| Service MolMIM v0.0.3 | 1.0 | 100.0 | 70.0 | 98.0 | 96.0 | 68.0 |
| MolMIM 70M v24.3 | 2.0 | 99.0 | 69.0 | 100.0 | 100.0 | 68.0 |

Training Performance Benchmarks

Training speed was tested on DGX-A100 systems with 80GB GPUs. It took 1 day and 14 hours to train MolMIM to convergence on 32 GPUs on the ZINC-15 dataset, which required just over 1 epoch using a batch size of 2,048 per GPU (2,048*32=65,536 total global batch size per step). Gradient accumulation was not used in training. Note also that this release was trained with 32-bit precision. For future releases we will experiment with other precisions to attempt to match our current state-of-the-art accuracy while improving throughput.

When calculating how long training runs will take, note that validation is significantly slower than training. The reason is that in validation we calculate molecular accuracy, which involves a full auto-regressive sampling of the decoder (128 tokens) to generate a molecule that we can compare to the input. We downsampled our validation set to 250,000 molecules so that it would take 4 steps to complete (in parallel across 32 GPUs). We found that this was a sufficient number of molecules to get a reasonable idea of performance on larger sets. Each GPU takes approximately 20 seconds to process a single batch of 2048 molecules, so in parallel with 2 nodes and 8 GPUs it took about 1 minute 20 seconds to process the validation set. Because of this, we limited our validation check to every 2500 steps. Training, on the other hand, was significantly faster, at approximately 3 steps per second. More detailed throughput measurements for training with different GPU configurations are given below.
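
The validation timing above follows from simple arithmetic. The sketch below reproduces it from the stated figures, assuming the 32-GPU configuration mentioned for the 4-step validation pass; the variable names are ours.

```python
import math

per_gpu_batch = 2048       # molecules per GPU per step
num_gpus = 32              # GPUs used in parallel
validation_size = 250_000  # downsampled validation set
seconds_per_batch = 20     # approximate time for one validation batch

# Molecules processed per step across all GPUs.
global_batch = per_gpu_batch * num_gpus

# Steps needed to cover the validation set (last step is partial).
validation_steps = math.ceil(validation_size / global_batch)

# Wall-clock time for one validation pass.
total_seconds = validation_steps * seconds_per_batch
minutes, seconds = divmod(total_seconds, 60)

print(global_batch, validation_steps, f"{minutes}:{seconds:02d}")
```

With these inputs the global batch is 65,536 molecules, the validation pass takes 4 steps, and the pass lasts 80 seconds, which matches the figure of about 1 minute 20 seconds.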

The numbers in the following table give a sense of how training scales to more nodes and GPUs.

| Number of A100 GPUs/node | Number of Nodes | Number of A100 GPUs (total) | Training Throughput (molecules/second) |
|---|---|---|---|
| 1 | 1 | 1 | 7360.649 |
| 8 | 1 | 8 | 53579.659 |
| 8 | 2 | 16 | 102541.273 |
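
One way to read these numbers is as scaling efficiency relative to a single GPU, that is, measured throughput divided by the throughput that perfect linear scaling would give. The sketch below computes this from the table values; the efficiency metric is our addition and is not reported in the original benchmarks.

```python
# Total GPU count -> training throughput in molecules/second (from the table).
throughput = {1: 7360.649, 8: 53579.659, 16: 102541.273}

single_gpu = throughput[1]

efficiency = {}
for num_gpus, measured in throughput.items():
    speedup = measured / single_gpu
    # 1.0 would mean perfect linear scaling.
    efficiency[num_gpus] = speedup / num_gpus
    print(f"{num_gpus:>2} GPUs: speedup {speedup:.2f}x, "
          f"efficiency {efficiency[num_gpus]:.1%}")
```

By this measure, scaling efficiency is roughly 91% on 8 GPUs and roughly 87% on 16 GPUs.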