The performance measurements in this document were conducted at the time of publication and may not reflect the performance achievable with NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific batch size, run:

```
bash scripts/benchmark_train.sh {BATCH_SIZE}
```

for single GPU, and:

```
bash scripts/benchmark_train_multi_gpu.sh {BATCH_SIZE}
```

for multi-GPU.

To benchmark the inference performance on a specific batch size, run:

```
bash scripts/benchmark_inference.sh {BATCH_SIZE}
```
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
GPUs | Batch size / GPU | Absolute error - TF32 | Absolute error - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (mixed precision to TF32) |
---|---|---|---|---|---|---|
1 | 240 | 0.03456 | 0.03460 | 1h23min | 1h03min | 1.32x |
8 | 240 | 0.03417 | 0.03424 | 15min | 12min | 1.25x |
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
GPUs | Batch size / GPU | Absolute error - FP32 | Absolute error - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (mixed precision to FP32) |
---|---|---|---|---|---|---|
1 | 240 | 0.03432 | 0.03439 | 2h25min | 1h33min | 1.56x |
8 | 240 | 0.03380 | 0.03495 | 29min | 20min | 1.45x |
Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
GPUs | Batch size / GPU | Throughput - TF32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision / TF32) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 240 | 2.21 | 2.92 | 1.32x | ||
1 | 120 | 1.81 | 2.04 | 1.13x | ||
8 | 240 | 15.88 | 21.02 | 1.32x | 7.18 | 7.20 |
8 | 120 | 12.68 | 13.99 | 1.10x | 7.00 | 6.86 |
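The speedup and weak-scaling columns are simple ratios of the measured throughputs. As a sanity check, they can be reproduced from the raw numbers in the DGX A100 table above (the benchmark scripts compute throughput internally; small last-digit differences versus the table are expected, since the table was derived from unrounded measurements):

```python
# Values taken from the DGX A100 throughput table above (batch size 240).
tf32_1gpu, amp_1gpu = 2.21, 2.92    # mol/ms, 1 GPU
tf32_8gpu, amp_8gpu = 15.88, 21.02  # mol/ms, 8 GPUs

# Mixed-precision speedup: AMP throughput divided by TF32 throughput.
speedup_1gpu = amp_1gpu / tf32_1gpu
speedup_8gpu = amp_8gpu / tf32_8gpu

# Weak scaling: 8-GPU throughput divided by 1-GPU throughput at the
# same per-GPU batch size (the ideal value is 8.0 for 8 GPUs).
weak_scaling_tf32 = tf32_8gpu / tf32_1gpu
weak_scaling_amp = amp_8gpu / amp_1gpu

print(f"speedup: {speedup_1gpu:.2f}x (1 GPU), {speedup_8gpu:.2f}x (8 GPUs)")
print(f"weak scaling: {weak_scaling_tf32:.2f} (TF32), {weak_scaling_amp:.2f} (AMP)")
```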
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
GPUs | Batch size / GPU | Throughput - FP32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision / FP32) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 240 | 1.25 | 1.88 | 1.50x | ||
1 | 120 | 1.03 | 1.41 | 1.37x | ||
8 | 240 | 8.68 | 12.75 | 1.47x | 6.94 | 6.78 |
8 | 120 | 6.64 | 8.58 | 1.29x | 6.44 | 6.08 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `scripts/benchmark_inference.sh` inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.
FP16
Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|
1600 | 11.60 | 140.94 | 138.29 | 140.12 | 386.40 |
800 | 10.74 | 75.69 | 75.74 | 76.50 | 79.77 |
400 | 8.86 | 45.57 | 46.11 | 46.60 | 49.97 |
TF32
Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|
1600 | 8.58 | 189.20 | 186.39 | 187.71 | 420.28 |
800 | 8.28 | 97.56 | 97.20 | 97.73 | 101.13 |
400 | 7.55 | 53.38 | 53.72 | 54.48 | 56.62 |
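The latency columns report the average and the 90th/95th/99th percentiles over the per-batch inference times. A minimal sketch of that aggregation, using nearest-rank percentiles on synthetic sample data (the exact interpolation scheme used by `scripts/benchmark_inference.sh` may differ):

```python
from statistics import mean

def latency_stats(latencies_ms):
    """Summarize per-batch latencies: average plus 90/95/99th percentiles.

    Uses nearest-rank percentiles on the sorted samples.
    """
    s = sorted(latencies_ms)

    def pct(p):
        # Nearest rank: smallest sample with at least p% of samples <= it.
        idx = -(-len(s) * p // 100) - 1  # ceil(n * p / 100) - 1
        return s[max(0, int(idx))]

    return {"avg": mean(s), "p90": pct(90), "p95": pct(95), "p99": pct(99)}

# Example with synthetic measurements (not the real benchmark data):
stats = latency_stats(
    [45.0, 46.0, 46.5, 47.0, 48.0, 50.0, 46.2, 45.8, 46.9, 60.0]
)
print(stats)
```

Note how a single slow outlier (60.0 ms here) dominates the 99th-percentile column while barely moving the average, which is why the tables above show p99 spiking at the largest batch sizes.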
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `scripts/benchmark_inference.sh` inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.
FP16
Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|
1600 | 6.42 | 254.54 | 247.97 | 249.29 | 721.15 |
800 | 6.13 | 132.07 | 131.90 | 132.70 | 140.15 |
400 | 5.37 | 75.12 | 76.01 | 76.66 | 79.90 |
FP32
Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|
1600 | 3.39 | 475.86 | 473.82 | 475.64 | 891.18 |
800 | 3.36 | 239.17 | 240.64 | 241.65 | 243.70 |
400 | 3.17 | 126.67 | 128.19 | 128.82 | 130.54 |
To achieve these same results, follow the steps in the Quick Start Guide.