MegaMolBART is a model trained on small-molecule chemistry that can be used for a variety of cheminformatics applications in drug discovery. The embeddings from its encoder can be used as features for predictive models. Alternatively, the encoder and decoder can be used together to generate novel molecules by sampling the model's latent space.
The model is a seq2seq transformer, a Bidirectional and Auto-Regressive Transformer (BART) [1]; the variant adapted for molecules is called Chemformer [2]. Pre-norm layer normalization and GELU activations are used throughout. This version of MegaMolBART has 8 layers, 4 attention heads, and a latent space dimension of 256, with a dropout rate of 0.1.
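For illustration, the reported hyperparameters map onto a standard PyTorch transformer roughly as follows. This is a minimal sketch, not the actual Megatron implementation; the feed-forward width and the use of 8 layers in both encoder and decoder are assumptions, as the card does not state them.

```python
import torch.nn as nn

# Sketch of a seq2seq transformer with the reported hyperparameters.
# norm_first=True gives pre-norm layer normalization; activation="gelu"
# matches the card. The 4x feed-forward width is an assumption.
model = nn.Transformer(
    d_model=256,            # latent space dimension
    nhead=4,                # attention heads
    num_encoder_layers=8,   # assumed: "8 layers" applies to encoder and decoder
    num_decoder_layers=8,
    dim_feedforward=1024,   # assumed: 4 * d_model
    dropout=0.1,
    activation="gelu",
    norm_first=True,        # pre-norm layer normalization
)
```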
MegaMolBART was implemented in the Megatron framework. It was trained with data parallelism on 32 V100 GPUs (4 nodes x 8 GPUs) for approximately 610,000 iterations (~24 hours) with a batch size of 512 molecules per GPU. The learning rate followed the original transformer schedule (linear warmup followed by inverse-square-root decay), with a starting value of 0.0001 and 6,100 warmup steps. The model was optimized with Adam (beta1 = 0.9, beta2 = 0.999) using a cross-entropy loss.
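A minimal sketch of that schedule and optimizer setup follows; anchoring the peak learning rate at the card's starting value of 1e-4 is an assumption, and the parameters passed to the optimizer are placeholders.

```python
import math
import torch

def transformer_lr(step, warmup=6100, base_lr=1e-4):
    """Original transformer schedule: linear warmup to base_lr over
    `warmup` steps, then inverse-square-root decay. Treating the card's
    starting value of 1e-4 as the peak is an assumption."""
    step = max(step, 1)
    return base_lr * min(step / warmup, math.sqrt(warmup / step))

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.Adam(params, lr=1.0, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: transformer_lr(step)
)
```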
The ZINC-15 database was used for training [3]. Approximately 500M molecules (SMILES strings) were selected from tranches meeting the following constraints: molecular weight <= 500 Daltons, LogP <= 5, reactivity level "reactive", and purchasability "annotated". The compounds were then filtered to a maximum length of 512 characters. The data were randomly split into train, validation, and test sets in a 99% / 0.5% / 0.5% ratio.
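A minimal sketch of the length filter and random split (the random seed and function shape are illustrative):

```python
import random

def filter_and_split(smiles_list, max_len=512, seed=0):
    """Drop SMILES over max_len characters, then randomly split
    99% / 0.5% / 0.5% into train / validation / test."""
    kept = [s for s in smiles_list if len(s) <= max_len]
    random.Random(seed).shuffle(kept)
    n_val = n_test = int(0.005 * len(kept))
    test = kept[:n_test]
    val = kept[n_test:n_test + n_val]
    train = kept[n_test + n_val:]
    return train, val, test
```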
Data augmentation during training was performed via masking and SMILES randomization as described previously [2].
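A sketch of both augmentations, using RDKit for SMILES randomization; the masking probability, mask token, and token-level (rather than span-level) masking are assumptions here, so see [2] for the exact scheme.

```python
import random
from rdkit import Chem

def randomize_smiles(smiles):
    """Return an equivalent, non-canonical SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)

def mask_tokens(tokens, mask_token="<MASK>", p=0.15):
    """Randomly replace tokens with a mask token; the 0.15 rate and
    the mask token string are assumptions, not the card's values."""
    return [mask_token if random.random() < p else t for t in tokens]
```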
Model performance was evaluated on the ZINC-15 test split. For each input compound, 10 compounds were generated by sampling the local region of the model's latent space, and the results were scored with the following metrics (a sketch of how such metrics can be computed follows the list).
- Validity: 0.98902
- Uniqueness: 0.31911
- Novelty: 0.36019
- Nearest Neighbor Correlation: 0.17891
- Modelability:
  - Linear Regression: 4.14656
  - SVM: 3.94546
  - Random Forest: 1.85953
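As referenced above, a rough sketch of the validity, uniqueness, and novelty computations with RDKit; the exact definitions behind the reported numbers are not specified here, so treat these as illustrative.

```python
from rdkit import Chem

def sample_metrics(generated, training_set):
    """Illustrative metric definitions (assumptions): validity = fraction
    of samples that parse; uniqueness = fraction of valid samples with a
    distinct canonical SMILES; novelty = fraction of unique samples not
    present in the training set."""
    mols = [Chem.MolFromSmiles(s) for s in generated]
    valid = [m for m in mols if m is not None]
    unique = {Chem.MolToSmiles(m) for m in valid}
    validity = len(valid) / len(generated)
    uniqueness = len(unique) / max(len(valid), 1)
    novelty = len(unique - set(training_set)) / max(len(unique), 1)
    return validity, uniqueness, novelty
```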
MegaMolBART can be run on any NVIDIA GPU with more than 8 GB of memory. The model can also be used with NVIDIA Clara Discovery via the MegaMolBART gRPC Service. The container runs a gRPC service that, given an input SMILES string or list of SMILES strings as the source, can generate either embeddings or novel molecules sampled from the latent space.
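A hypothetical client sketch follows; the address and the generated stub, request, and method names are placeholders rather than the real service definition, which is documented in the tutorial referenced next.

```python
import grpc

# Hypothetical: the address and the generated classes referenced in the
# comments below are placeholders; consult Tutorial.md for the actual
# service interface and message types.
channel = grpc.insecure_channel("localhost:50051")
# stub = megamolbart_pb2_grpc.GenerativeSamplerStub(channel)   # placeholder names
# response = stub.SmilesToEmbedding(                           # placeholder method
#     megamolbart_pb2.SmilesRequest(smiles="CCO"))
```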
See the file Tutorial.md inside the tutorial folder of the Cheminformatics repo for instructions on using the model with Clara Discovery.
Inputs to the model are expected to be valid SMILES strings of at most 512 characters.
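A minimal pre-check matching these constraints, using RDKit to test that a string parses as a molecule:

```python
from rdkit import Chem

def is_valid_input(smiles, max_len=512):
    """True if the SMILES parses and fits the 512-character limit."""
    return len(smiles) <= max_len and Chem.MolFromSmiles(smiles) is not None
```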
Outputs are either a list of SMILES strings or a 1D vector of embeddings. The first two values in the embedding vector encode its original dimensions; extract them and use them to reshape the remainder of the vector if the original 2D form is desired.
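A sketch of that reshape step, assuming the flat layout described above (two leading dimension values followed by the embedding data):

```python
import numpy as np

def reshape_embedding(flat):
    """Recover the 2D embedding: the first two values encode its
    dimensions, and the rest is the flattened embedding itself."""
    flat = np.asarray(flat)
    rows, cols = int(flat[0]), int(flat[1])
    return flat[2:].reshape(rows, cols)
```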
There are no known limitations for this model.
A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.