MoFlow for PyTorch
Description

MoFlow is a model for molecule generation that leverages Normalizing Flows. This implementation is an optimized version of the model in the original paper.

Publisher: NVIDIA Deep Learning Examples

Use Case: Other

Framework: Other

Latest Version: 22.11.1

Modified: January 31, 2023

Compressed Size: 38.21 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance for a given number of GPUs, batch size, and precision, run:

bash scripts/benchmark_training.sh <# GPUs> <batch_size> <precision>

For example, running

./scripts/benchmark_training.sh 8 2048 amp

will measure performance for eight GPUs, a batch size of 2048 per GPU, and mixed precision, while running

./scripts/benchmark_training.sh 1 1024 full

will measure performance for a single GPU, a batch size of 1024, and full precision.

Inference performance benchmark

To benchmark the inference performance for a given batch size and precision, run:

bash scripts/benchmark_inference.sh <batch size> <precision>

For example, running

./scripts/benchmark_inference.sh 2048 amp

will measure performance for a batch size of 2048 and mixed precision, while running

./scripts/benchmark_inference.sh 1024 full

will measure performance for a batch size of 1024 and full precision.

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA A100 (8x A100 80GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 22.11 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. The values presented below were averaged over 20 experiments.

| GPUs | Batch size / GPU | NUV - TF32 | NUV - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
|------|------------------|------------|-----------------------|----------------------|---------------------------------|-------------------------------------------------|
| 1    | 512              | 89.63%     | 87.83%                | 5 h 8 min            | 4 h 0 min                       | 1.28x                                           |
| 8    | 512              | 87.03%     | 87.90%                | 48 min               | 40 min                          | 1.20x                                           |

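The time-to-train speedup in the last column can be reproduced from the two wall-clock columns. A minimal sketch (the times below are copied from the single-GPU row of the table above):

```python
# Time-to-train speedup = TF32 wall-clock time / mixed-precision wall-clock time.
# Times copied from the single-GPU row: 5 h 8 min (TF32) vs. 4 h 0 min (AMP).
tf32_minutes = 5 * 60 + 8
amp_minutes = 4 * 60 + 0

speedup = tf32_minutes / amp_minutes
print(f"{speedup:.2f}x")  # -> 1.28x
```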
Training stability test

The MoFlow model was trained for 300 epochs starting from 20 different initial random seeds. Every five training epochs, the model was evaluated by generating a small sample of molecules (100 molecules per GPU), and validity and uniqueness were calculated. The training was performed in the PyTorch 22.11 Docker container on an NVIDIA DGX A100 with 8x A100 80GB GPUs, with AMP and CUDA graph capture enabled.

The following table displays the validity and uniqueness scores after every 50 epochs, aggregated over the different initial random seeds.

| Epoch | Validity mean | Validity std | Validity min | Validity max | Validity median | Uniqueness mean | Uniqueness std | Uniqueness min | Uniqueness max | Uniqueness median |
|-------|---------------|--------------|--------------|--------------|-----------------|-----------------|----------------|----------------|----------------|-------------------|
| 50    | 68.22         | 5.25         | 57.38        | 74.75        | 69.50           | 93.64           | 8.22           | 62.56          | 99.82          | 95.30             |
| 100   | 76.91         | 4.23         | 69.50        | 84.38        | 77.50           | 99.39           | 0.92           | 96.31          | 100.00         | 99.83             |
| 150   | 80.48         | 3.80         | 73.88        | 88.25        | 81.75           | 99.58           | 0.78           | 96.64          | 100.00         | 99.85             |
| 200   | 83.87         | 3.98         | 77.00        | 90.62        | 84.44           | 99.76           | 0.38           | 98.81          | 100.00         | 100.00            |
| 250   | 86.08         | 4.46         | 77.12        | 93.12        | 86.56           | 99.87           | 0.21           | 99.27          | 100.00         | 100.00            |
| 300   | 87.29         | 3.70         | 77.75        | 93.38        | 87.69           | 99.82           | 0.30           | 98.70          | 100.00         | 99.93             |
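The per-epoch summary statistics (mean, std, min, max, median across seeds) can be reproduced from the per-seed scores with Python's standard `statistics` module. A minimal sketch; the validity values below are purely illustrative, not taken from the actual run:

```python
import statistics

# Hypothetical per-seed validity scores (percent) for one evaluation epoch;
# the real stability test aggregates over 20 seeds.
validity = [68.5, 71.0, 74.2, 69.8, 66.1]

summary = {
    "mean": statistics.mean(validity),
    "std": statistics.stdev(validity),   # sample standard deviation
    "min": min(validity),
    "max": max(validity),
    "median": statistics.median(validity),
}
print(summary)
```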

Training performance results

Training performance: NVIDIA A100 (8x A100 80GB)

Our results were obtained by running the scripts/benchmark_training.sh training script in the PyTorch 22.11 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in molecules per second) were averaged over 190 iterations after 10 warm-up steps.

| GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|------|------------------|-------------------|------------------------------|----------------------------------------------|---------------------|--------------------------------|
| 1    | 512              | 3499.35           | 4524.15                      | 1.29                                         |                     |                                |
| 1    | 1024             | 3883.49           | 5392.78                      | 1.39                                         |                     |                                |
| 1    | 2048             | 4291.29           | 6118.46                      | 1.43                                         |                     |                                |
| 8    | 512              | 24108.04          | 29293.41                     | 1.22                                         | 6.89                | 6.47                           |
| 8    | 1024             | 28104.62          | 37365.05                     | 1.33                                         | 7.24                | 6.93                           |
| 8    | 2048             | 30927.04          | 42078.31                     | 1.36                                         | 7.21                | 6.88                           |
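The weak-scaling columns can be reproduced by dividing the 8-GPU throughput by the single-GPU throughput at the same per-GPU batch size. A minimal sketch using the TF32 throughput numbers from the table above:

```python
# Weak scaling = 8-GPU throughput / 1-GPU throughput at the same per-GPU
# batch size. TF32 throughputs (molecules/s) copied from the table above.
single_gpu = {512: 3499.35, 1024: 3883.49, 2048: 4291.29}
eight_gpu = {512: 24108.04, 1024: 28104.62, 2048: 30927.04}

for bs in single_gpu:
    scaling = eight_gpu[bs] / single_gpu[bs]
    print(f"batch {bs}: {scaling:.2f}")  # 6.89, 7.24, 7.21
```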

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA A100 (1x A100 80GB)

Our results were obtained by running the scripts/benchmark_inference.sh benchmarking script in the PyTorch 22.11 NGC container on an NVIDIA A100 (1x A100 80GB) GPU.

FP16

| Batch size | Throughput Avg (mol/s) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
|------------|------------------------|------------------|------------------|------------------|------------------|
| 512        | 12524.49               | 41               | 41               | 41               | 41               |
| 1024       | 13871.60               | 74               | 74               | 74               | 74               |
| 2048       | 14386.44               | 142              | 144              | 144              | 144              |

TF32

| Batch size | Throughput Avg (mol/s) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
|------------|------------------------|------------------|------------------|------------------|------------------|
| 512        | 9696.35                | 53               | 53               | 53               | 53               |
| 1024       | 10242.98               | 100              | 100              | 100              | 100              |
| 2048       | 11174.75               | 183              | 187              | 187              | 187              |
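The two tables can be combined to see how much FP16 speeds up inference over TF32 at each batch size; the speedup is simply the ratio of average throughputs. A minimal sketch using the numbers above:

```python
# FP16-over-TF32 inference throughput speedup, per batch size.
# Average throughputs (molecules/s) copied from the two tables above.
fp16 = {512: 12524.49, 1024: 13871.60, 2048: 14386.44}
tf32 = {512: 9696.35, 1024: 10242.98, 2048: 11174.75}

for bs in fp16:
    print(f"batch {bs}: {fp16[bs] / tf32[bs]:.2f}x")  # 1.29x, 1.35x, 1.29x
```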

To achieve these same results, follow the steps in the Quick Start Guide.