MegaMolBART v0.1

A BART transformer language model on molecular SMILES strings



Latest Version



April 4, 2023


153.37 MB

Model Overview

MegaMolBART is a generative chemistry model trained on molecular SMILES strings. It can be used for a variety of cheminformatics applications in drug discovery. The embeddings from its encoder can be used as features for predictive models. Alternatively, the encoder and decoder can be used together to generate novel molecules by sampling the model's latent space.

Model Architecture

The model is a seq2seq transformer called a Bidirectional and Auto-Regressive Transformer (BART) [1]. The version of the model that is specific to molecules is called Chemformer [2]. Pre-norm layer normalization and GELU activation are used throughout. This version of MegaMolBART has 8 layers, 4 attention heads, and a latent space dimension of 256. A dropout rate of 0.1 was used.


MegaMolBART was written in the Megatron framework. It was trained with data parallelism on 32 V100 GPUs (4 nodes x 8 GPUs) for approximately 610,000 iterations (~24 hours) using a batch size of 512 molecules per GPU. The original transformer learning rate schedule was used, with a starting value of 0.0001 and 6100 warmup steps. Adam optimization was used with parameters beta1=0.9 and beta2=0.999. Cross-entropy loss was used to train the model.
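The "original transformer learning rate schedule" usually means a linear warmup followed by inverse-square-root decay. The sketch below illustrates that shape with the numbers quoted above. It is not the training code: the function name `transformer_lr` is hypothetical, and treating the stated value of 0.0001 as a multiplier on the schedule is an assumption, since the text does not say how that value is applied. The arithmetic at the end derives the global batch size from the per-GPU figures.

```python
def transformer_lr(step, base_lr=1e-4, d_model=256, warmup_steps=6100):
    """Warmup followed by inverse-square-root decay.

    Assumption: base_lr (0.0001 in the text) scales the schedule. The model
    card does not specify how the value enters the formula.
    """
    step = max(step, 1)  # avoid dividing by zero at step 0
    scale = d_model ** -0.5
    return base_lr * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# Effective batch size under data parallelism, from the figures in the text.
GPUS = 32
BATCH_PER_GPU = 512
ITERATIONS = 610_000

global_batch = GPUS * BATCH_PER_GPU         # molecules per iteration
molecules_seen = global_batch * ITERATIONS  # molecules processed in total
```

Under these figures each iteration processes 16,384 molecules, so the full run processes about 10 billion molecules, or roughly 20 passes over a training set of about 500M molecules.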


The ZINC-15 database was used for training [3]. Approximately 500M molecules (SMILES strings) were selected from tranches meeting the following constraints: molecular weight <= 500 Daltons, LogP <= 5, a reactivity level of "reactive", and a purchasability of "annotated". The compounds were filtered to ensure a maximum length of 512 characters. The data were randomly split into training, validation, and test sets in a 99%/0.5%/0.5% ratio.
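The length filter and random split can be illustrated as follows. This is a minimal sketch, not the actual preprocessing pipeline: the function name `filter_and_split`, the seed, and the use of an in-memory list are all assumptions made for illustration.

```python
import random

MAX_LEN = 512  # maximum SMILES length in characters, as stated in the text


def filter_and_split(smiles_list, seed=0, val_frac=0.005, test_frac=0.005):
    """Drop over-long SMILES, shuffle, and split into train/validation/test."""
    kept = [s for s in smiles_list if len(s) <= MAX_LEN]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)

    n_val = round(len(kept) * val_frac)
    n_test = round(len(kept) * test_frac)

    val = kept[:n_val]
    test = kept[n_val:n_val + n_test]
    train = kept[n_val + n_test:]  # everything left over, about 99%
    return train, val, test
```

At the scale of ZINC-15 the same logic would have to be applied in a streaming or sharded fashion rather than on one list.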

Data augmentation during training was performed via masking and SMILES randomization as described previously [2].


Model performance was evaluated using the test split from ZINC-15. For each input compound, 10 compounds were generated by sampling the local region from the model's latent space. The results were evaluated according to the following metrics.

  1. Validity: percentage of generated molecules (10 per input molecule) that are valid SMILES
  2. Uniqueness: of the valid molecules, percentage of the sampled group that are unique
  3. Novelty: of the valid molecules, percentage of the sampled group that are not in the training data
  4. Nearest Neighbor Correlation: the rank correlation between pairwise distances calculated from Morgan Fingerprints (Tanimoto distance) and from model embeddings (Euclidean distance)
  5. Modelability: the ratio of mean squared errors from predicting molecular properties (LogP and molecular weight) using model embeddings or Morgan Fingerprints as features. The following models were tested: linear regression, SVM, Random Forest. The higher the ratio, the better the models perform on embeddings relative to fingerprints.
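The first three metrics and the modelability ratio can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluation code. The `is_valid` predicate is supplied by the caller (in practice it would wrap a cheminformatics toolkit's SMILES parser). Molecules are compared as plain strings, which presumes the SMILES are already canonicalized. The orientation of the modelability ratio (fingerprint error divided by embedding error) is inferred from the statement that a higher ratio favors embeddings; the text does not state it explicitly.

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty as fractions between 0 and 1."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    if not valid:
        return {"validity": validity, "uniqueness": 0.0, "novelty": 0.0}

    # Uniqueness: distinct valid molecules over all valid molecules.
    uniqueness = len(set(valid)) / len(valid)
    # Novelty: valid molecules that do not appear in the training data.
    novelty = sum(1 for s in valid if s not in training_set) / len(valid)
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}


def modelability_ratio(mse_fingerprints, mse_embeddings):
    """Assumed orientation: values above 1 mean embeddings gave lower error."""
    return mse_fingerprints / mse_embeddings
```

For example, with four generated strings of which one is invalid, two are duplicates, and one appears in the training set, validity is 0.75, uniqueness is 2/3, and novelty is 1/3.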

Validity: 0.98902

Uniqueness: 0.31911

Novelty: 0.36019

Nearest Neighbor Correlation: 0.17891


Modelability:

Linear Regression: 4.14656

SVM: 3.94546

Random Forest: 1.85953

How to Use this Model

MegaMolBART can be run on any NVIDIA GPU with more than 8 GB of memory. The model can also be used with NVIDIA Clara Discovery through the MegaMolBART gRPC Service. The container runs a gRPC service that can generate one of the following from an input SMILES string or list of SMILES strings:

  1. Embeddings from the latent space
  2. SMILES strings sampled around the region of a single input SMILES molecule
  3. SMILES strings sampled at regularly spaced intervals between two input SMILES molecules
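The two sampling modes above operate on latent vectors before they are decoded back to SMILES. The sketch below shows one plausible way to produce such vectors. It is not the service's implementation: the function names, the use of linear interpolation, and the use of Gaussian noise with a `radius` parameter are all assumptions, and the decoding step (which requires the model itself) is omitted.

```python
import random


def interpolate(z_start, z_end, num_points):
    """Return num_points vectors evenly spaced from z_start to z_end inclusive."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    fractions = [i / (num_points - 1) for i in range(num_points)]
    return [
        [a + t * (b - a) for a, b in zip(z_start, z_end)]
        for t in fractions
    ]


def sample_around(z, num_samples, radius=0.1, seed=None):
    """Return num_samples copies of z perturbed with Gaussian noise.

    Assumption: radius is the standard deviation of the noise. The source
    does not say how the neighborhood of a molecule is sampled.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, radius) for x in z] for _ in range(num_samples)]
```

Each returned vector would then be passed to the decoder to obtain a candidate SMILES string.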

See the file inside the tutorial folder of the Cheminformatics repo for instructions on using the model with Clara Discovery.


Inputs to the model are expected to be valid SMILES strings with a maximum length of 512 characters.


Outputs are either a list of SMILES strings or a 1D vector of embeddings. The first two values in the embedding vector correspond to its dimensions. These values must be extracted and used to reshape the vector if the original form is desired.
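Recovering the original shape can be sketched as follows. This assumes the first two values are the row and column counts, in that order, and that the remaining values are stored in row-major order. Neither detail is confirmed by the text, and the function name `unpack_embedding` is hypothetical.

```python
def unpack_embedding(flat):
    """Split a flat embedding vector into its 2D form.

    Assumptions: flat[0] and flat[1] are the row and column counts, and the
    remaining values are laid out row by row.
    """
    if len(flat) < 2:
        raise ValueError("vector is too short to contain its dimensions")
    rows, cols = int(flat[0]), int(flat[1])
    values = flat[2:]
    if len(values) != rows * cols:
        raise ValueError("payload length does not match the stated dimensions")
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]
```

For example, `unpack_embedding([2, 3, 1, 2, 3, 4, 5, 6])` yields two rows of three values each. With an array library, the same result comes from reading the two leading values and reshaping the remainder.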


This model has no known limitations.


  1. Lewis, M. et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv (2019).
  2. Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. Chemformer: A Pre-Trained Transformer for Computational Chemistry. (n.d.) doi:10.33774/chemrxiv-2021-v2pnn.
  3. Sterling, T. & Irwin, J. J. ZINC 15 - Ligand Discovery for Everyone. J. Chem. Inf. Model. (2015).


Apache License 2.0

A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.