The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific number of GPUs, batch size and precision, run:
bash scripts/benchmark_training.sh <# GPUs> <batch_size> <precision>
For example, running
./scripts/benchmark_training.sh 8 2048 amp
will measure performance for eight GPUs with a batch size of 2048 per GPU and mixed precision, while running
./scripts/benchmark_training.sh 1 1024 full
will measure performance for a single GPU with a batch size of 1024 and full precision.
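To cover several configurations in one pass, the invocations above can be enumerated in a small sketch like the following (print-only; pipe each line to a shell to actually launch the runs — the GPU count, batch sizes, and precision flags mirror the usage shown above):

```python
# Enumerate training-benchmark invocations for a sweep over batch sizes and
# precisions on 8 GPUs. This only prints the commands; it does not run them.
commands = [
    f"bash scripts/benchmark_training.sh 8 {bs} {prec}"
    for bs in (512, 1024, 2048)
    for prec in ("amp", "full")
]
print("\n".join(commands))
```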
To benchmark the inference performance on a specific batch size and precision, run:
bash scripts/benchmark_inference.sh <batch size> <precision>
For example, running
./scripts/benchmark_inference.sh 2048 amp
will measure performance for a batch size of 2048 and mixed precision, while running
./scripts/benchmark_inference.sh 1024 full
will measure performance for a batch size of 1024 and full precision.
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results were obtained by running the scripts/train.sh training script in the PyTorch 22.11 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. The values presented below were averaged over 20 experiments.
GPUs | Batch size / GPU | NUV - TF32 | NUV - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 512 | 89.63 % | 87.83 % | 5h8min | 4h0min | 1.28x |
8 | 512 | 87.03 % | 87.90 % | 48min | 40min | 1.20x |
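The speedup column is simply the TF32 time to train divided by the mixed-precision time to train. A quick arithmetic check using the values from the table above:

```python
# Recompute the time-to-train speedups from the table values.
def minutes(hours, mins):
    return 60 * hours + mins

speedup_1gpu = minutes(5, 8) / minutes(4, 0)    # 308 min / 240 min
speedup_8gpu = minutes(0, 48) / minutes(0, 40)  # 48 min / 40 min
print(f"{speedup_1gpu:.2f}x, {speedup_8gpu:.2f}x")  # 1.28x, 1.20x
```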
The MoFlow model was trained for 300 epochs starting from 20 different initial random seeds. Every five training epochs, the model was evaluated by generating a small sample of molecules (100 molecules per GPU), and validity and uniqueness were calculated. The training was performed in the PyTorch 22.11 Docker container on an NVIDIA DGX A100 with 8x A100 80GB GPUs, with AMP and CUDA graph capture enabled.
The following table summarizes the results of the stability test, showing the validity and uniqueness scores (in percent) after every 50 epochs across the different initial random seeds.
epoch | validity mean | validity std | validity min | validity max | validity median | uniqueness mean | uniqueness std | uniqueness min | uniqueness max | uniqueness median |
---|---|---|---|---|---|---|---|---|---|---|
50 | 68.22 | 5.25 | 57.38 | 74.75 | 69.50 | 93.64 | 8.22 | 62.56 | 99.82 | 95.30 |
100 | 76.91 | 4.23 | 69.50 | 84.38 | 77.50 | 99.39 | 0.92 | 96.31 | 100.00 | 99.83 |
150 | 80.48 | 3.80 | 73.88 | 88.25 | 81.75 | 99.58 | 0.78 | 96.64 | 100.00 | 99.85 |
200 | 83.87 | 3.98 | 77.00 | 90.62 | 84.44 | 99.76 | 0.38 | 98.81 | 100.00 | 100.00 |
250 | 86.08 | 4.46 | 77.12 | 93.12 | 86.56 | 99.87 | 0.21 | 99.27 | 100.00 | 100.00 |
300 | 87.29 | 3.70 | 77.75 | 93.38 | 87.69 | 99.82 | 0.30 | 98.70 | 100.00 | 99.93 |
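Each row aggregates the 20 per-seed scores into a mean, sample standard deviation, min, max, and median. A minimal sketch of that aggregation with Python's statistics module — the scores below are made-up placeholders for illustration, not the actual per-seed results:

```python
import statistics

# Hypothetical per-seed validity scores (%); the real 20-seed values are not
# listed in this document.
validity = [68.5, 71.0, 69.5, 72.25, 70.0]

summary = {
    "mean": statistics.mean(validity),
    "std": statistics.stdev(validity),   # sample standard deviation
    "min": min(validity),
    "max": max(validity),
    "median": statistics.median(validity),
}
print(summary)
```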
Our results were obtained by running the scripts/benchmark_training.sh training script in the PyTorch 22.11 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in molecules per second) were averaged over 190 iterations after 10 warm-up steps.
GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 512 | 3499.35 | 4524.15 | 1.29 | ||
1 | 1024 | 3883.49 | 5392.78 | 1.39 | ||
1 | 2048 | 4291.29 | 6118.46 | 1.43 | ||
8 | 512 | 24108.04 | 29293.41 | 1.22 | 6.89 | 6.47 |
8 | 1024 | 28104.62 | 37365.05 | 1.33 | 7.24 | 6.93 |
8 | 2048 | 30927.04 | 42078.31 | 1.36 | 7.21 | 6.88 |
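The weak-scaling columns are obtained by dividing the 8-GPU throughput by the single-GPU throughput at the same per-GPU batch size. For example, for batch size 512 per GPU (values taken from the table above):

```python
# Weak scaling = 8-GPU throughput / 1-GPU throughput (same per-GPU batch size).
tf32_1gpu, tf32_8gpu = 3499.35, 24108.04   # molecules/s, batch size 512/GPU
amp_1gpu, amp_8gpu = 4524.15, 29293.41     # molecules/s, batch size 512/GPU

print(f"TF32: {tf32_8gpu / tf32_1gpu:.2f}")           # 6.89
print(f"mixed precision: {amp_8gpu / amp_1gpu:.2f}")  # 6.47
```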
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the scripts/benchmark_inference.sh inference benchmark script in the PyTorch 22.11 NGC container on the NVIDIA A100 (1x A100 80GB) GPU.
FP16
Batch size | Throughput Avg (mol/s) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|
512 | 12524.49 | 41 | 41 | 41 | 41 |
1024 | 13871.60 | 74 | 74 | 74 | 74 |
2048 | 14386.44 | 142 | 144 | 144 | 144 |
TF32
Batch size | Throughput Avg (mol/s) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|
512 | 9696.35 | 53 | 53 | 53 | 53 |
1024 | 10242.98 | 100 | 100 | 100 | 100 |
2048 | 11174.75 | 183 | 187 | 187 | 187 |
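The latency and throughput columns are mutually consistent: the average latency in milliseconds is the batch size divided by the throughput in molecules per second. A quick check against two rows above:

```python
# Average latency (ms) implied by a batch size and a throughput (molecules/s).
def avg_latency_ms(batch_size, throughput):
    return 1000 * batch_size / throughput

print(round(avg_latency_ms(512, 12524.49)))   # FP16, batch 512: ~41 ms
print(round(avg_latency_ms(2048, 11174.75)))  # TF32, batch 2048: ~183 ms
```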
To achieve these same results, follow the steps in the Quick Start Guide.