A Graph Neural Network that uses a variant of self-attention to process 3D points and graphs.
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved with NVIDIA's latest software release. For the most up-to-date performance measurements, see NVIDIA Data Center Deep Learning Product Performance.
Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Training performance benchmark
To benchmark the training performance on a specific batch size, run bash scripts/benchmark_train.sh {BATCH_SIZE} for a single GPU, or bash scripts/benchmark_train_multi_gpu.sh {BATCH_SIZE} for multiple GPUs.
Inference performance benchmark
To benchmark the inference performance on a specific batch size, run bash scripts/benchmark_inference.sh {BATCH_SIZE}.
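The three commands above can be generated for a sweep of batch sizes. The following is a minimal Python sketch and is not part of the repository: the script paths and the positional batch-size argument come from the text above, while the helper names `benchmark_command` and `all_commands` and the `SCRIPTS` and `BATCH_SIZES` dictionaries are hypothetical. The batch sizes mirror those reported in the results tables of this document. The sketch only prints the commands; it does not execute them.

```python
import shlex

# Script paths as given in this document; everything else here is illustrative.
SCRIPTS = {
    "train": "scripts/benchmark_train.sh",
    "train_multi_gpu": "scripts/benchmark_train_multi_gpu.sh",
    "inference": "scripts/benchmark_inference.sh",
}

# Batch sizes that appear in the results tables (an assumed sweep, adjust as needed).
BATCH_SIZES = {
    "train": (120, 240),
    "train_multi_gpu": (120, 240),
    "inference": (400, 800, 1600),
}


def benchmark_command(mode, batch_size):
    """Return the argv list for one benchmark run, e.g. ['bash', script, '240']."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return ["bash", SCRIPTS[mode], str(batch_size)]


def all_commands():
    """List every command of the sweep as a shell-ready string."""
    return [
        shlex.join(benchmark_command(mode, bs))
        for mode, sizes in BATCH_SIZES.items()
        for bs in sizes
    ]


for line in all_commands():
    print(line)
```

To run a command instead of printing it, the argv list returned by `benchmark_command` could be passed to `subprocess.run(cmd, check=True)` from the repository root inside the container.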
Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Training accuracy results
Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the scripts/train.sh training script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.
| GPUs | Batch size / GPU | Absolute error - TF32 | Absolute error - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (mixed precision to TF32) |
|---|---|---|---|---|---|---|
| 1 | 240 | 0.03456 | 0.03460 | 1h23min | 1h03min | 1.32x |
| 8 | 240 | 0.03417 | 0.03424 | 15min | 12min | 1.25x |
Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the scripts/train.sh training script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.
| GPUs | Batch size / GPU | Absolute error - FP32 | Absolute error - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (mixed precision to FP32) |
|---|---|---|---|---|---|---|
| 1 | 240 | 0.03432 | 0.03439 | 2h25min | 1h33min | 1.56x |
| 8 | 240 | 0.03380 | 0.03495 | 29min | 20min | 1.45x |
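The speedup column in the two training accuracy tables above is the ratio of the baseline (TF32 or FP32) time to train to the mixed precision time to train. The following Python sketch reproduces that column from the reported durations. The helper names `minutes` and `speedup` are hypothetical, and the durations are copied from the tables, so the results are only as precise as the rounded times shown there.

```python
import re


def minutes(duration):
    """Convert strings such as '1h23min' or '15min' to a number of minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)min)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, mins = (int(g) if g else 0 for g in match.groups())
    return 60 * hours + mins


def speedup(baseline, mixed_precision):
    """Baseline time divided by mixed precision time, rounded to two decimals."""
    return round(minutes(baseline) / minutes(mixed_precision), 2)


# (system, GPUs, baseline time, mixed precision time), copied from the tables above.
RUNS = [
    ("DGX A100", 1, "1h23min", "1h03min"),
    ("DGX A100", 8, "15min", "12min"),
    ("DGX-1", 1, "2h25min", "1h33min"),
    ("DGX-1", 8, "29min", "20min"),
]

for system, gpus, baseline, mixed in RUNS:
    print(f"{system}, {gpus} GPU(s): {speedup(baseline, mixed):.2f}x")
```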
Training performance results
Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
| GPUs | Batch size / GPU | Throughput - TF32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision to TF32) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 240 | 2.21 | 2.92 | 1.32x | | |
| 1 | 120 | 1.81 | 2.04 | 1.13x | | |
| 8 | 240 | 15.88 | 21.02 | 1.32x | 7.18 | 7.20 |
| 8 | 120 | 12.68 | 13.99 | 1.10x | 7.00 | 6.86 |
To achieve these same results, follow the steps in the Quick Start Guide.
Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
| GPUs | Batch size / GPU | Throughput - FP32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision to FP32) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 240 | 1.25 | 1.88 | 1.50x | | |
| 1 | 120 | 1.03 | 1.41 | 1.37x | | |
| 8 | 240 | 8.68 | 12.75 | 1.47x | 6.94 | 6.78 |
| 8 | 120 | 6.64 | 8.58 | 1.29x | 6.44 | 6.08 |
To achieve these same results, follow the steps in the Quick Start Guide.
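This document does not define the weak scaling columns in the training performance tables above. The reported values are consistent with the ratio of 8-GPU throughput to single-GPU throughput at the same per-GPU batch size, and that is the assumption used in the Python sketch below. The names `A100`, `V100`, `precision_speedup`, and `weak_scaling` are hypothetical. Because the inputs are the rounded throughputs from the tables, the computed ratios can differ from the reported ones in the last digit.

```python
# Throughput in mol/ms, copied from the tables above:
# {(gpus, batch_size_per_gpu): (baseline_precision, mixed_precision)}
A100 = {
    (1, 240): (2.21, 2.92),
    (1, 120): (1.81, 2.04),
    (8, 240): (15.88, 21.02),
    (8, 120): (12.68, 13.99),
}
V100 = {
    (1, 240): (1.25, 1.88),
    (1, 120): (1.03, 1.41),
    (8, 240): (8.68, 12.75),
    (8, 120): (6.64, 8.58),
}

BASELINE, MIXED = 0, 1  # column indices within each throughput pair


def precision_speedup(table, gpus, batch_size):
    """Mixed precision throughput divided by baseline (TF32 or FP32) throughput."""
    baseline, mixed = table[(gpus, batch_size)]
    return mixed / baseline


def weak_scaling(table, batch_size, column, gpus=8):
    """Multi-GPU throughput divided by single-GPU throughput (assumed definition)."""
    return table[(gpus, batch_size)][column] / table[(1, batch_size)][column]


for name, table in (("DGX A100", A100), ("DGX-1", V100)):
    for batch_size in (240, 120):
        print(
            f"{name}, batch {batch_size}: "
            f"speedup {precision_speedup(table, 8, batch_size):.2f}x, "
            f"weak scaling {weak_scaling(table, batch_size, BASELINE):.2f} (baseline), "
            f"{weak_scaling(table, batch_size, MIXED):.2f} (mixed precision)"
        )
```

A weak scaling value of 8 would mean perfectly linear scaling across the eight GPUs.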
Inference performance results
Inference performance: NVIDIA DGX A100 (1x A100 80GB)
Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.
FP16
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 11.60 | 140.94 | 138.29 | 140.12 | 386.40 |
| 800 | 10.74 | 75.69 | 75.74 | 76.50 | 79.77 |
| 400 | 8.86 | 45.57 | 46.11 | 46.60 | 49.97 |
TF32
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 8.58 | 189.20 | 186.39 | 187.71 | 420.28 |
| 800 | 8.28 | 97.56 | 97.20 | 97.73 | 101.13 |
| 400 | 7.55 | 53.38 | 53.72 | 54.48 | 56.62 |
To achieve these same results, follow the steps in the Quick Start Guide.
Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.
FP16
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 6.42 | 254.54 | 247.97 | 249.29 | 721.15 |
| 800 | 6.13 | 132.07 | 131.90 | 132.70 | 140.15 |
| 400 | 5.37 | 75.12 | 76.01 | 76.66 | 79.90 |
FP32
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 3.39 | 475.86 | 473.82 | 475.64 | 891.18 |
| 800 | 3.36 | 239.17 | 240.64 | 241.65 | 243.70 |
| 400 | 3.17 | 126.67 | 128.19 | 128.82 | 130.54 |
To achieve these same results, follow the steps in the Quick Start Guide.
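In the batch size 1600 rows of the inference tables above, the average latency is higher than the 90th percentile latency while the 99th percentile is far larger. That pattern is what a small number of slow batches would produce. The Python sketch below illustrates the effect with synthetic latencies, which are not measurements. It uses a nearest-rank percentile, which is an assumption: this document does not show how the benchmark script computes its statistics. The names `percentile` and `summarize` are hypothetical.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average and tail latencies, mirroring the columns of the tables above."""
    return {
        "avg": statistics.fmean(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Synthetic data, NOT measurements: 98 steady batches plus 2 slow outliers.
latencies_ms = [138.0] * 98 + [400.0] * 2

stats = summarize(latencies_ms)
for name, value in stats.items():
    print(f"{name}: {value:.2f} ms")
```

With these synthetic values the two outliers pull the average above the 90th and 95th percentiles, while only the 99th percentile lands on a slow batch.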