SE(3)-Transformers for PyTorch

NVIDIA

A graph neural network that uses a variant of self-attention to process 3D points and graphs.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark training performance at a specific batch size, run bash scripts/benchmark_train.sh {BATCH_SIZE} on a single GPU, or bash scripts/benchmark_train_multi_gpu.sh {BATCH_SIZE} on multiple GPUs.

Inference performance benchmark

To benchmark inference performance at a specific batch size, run bash scripts/benchmark_inference.sh {BATCH_SIZE}.
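To sweep several batch sizes, the commands above can be wrapped in a small driver. The sketch below is illustrative only: build_benchmark_command is a hypothetical helper, not part of the repository. It only assembles command lines from the script paths named in this section, and it assumes it would be run from the repository root inside the container.

```python
import shlex

# Script paths as named in this document; the helper itself is hypothetical.
SCRIPTS = {
    ("train", False): "scripts/benchmark_train.sh",
    ("train", True): "scripts/benchmark_train_multi_gpu.sh",
    ("inference", False): "scripts/benchmark_inference.sh",
}


def build_benchmark_command(mode, batch_size, multi_gpu=False):
    """Return the argv list for one benchmark run."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    try:
        script = SCRIPTS[(mode, multi_gpu)]
    except KeyError:
        raise ValueError(
            f"no benchmark script for mode={mode!r}, multi_gpu={multi_gpu}"
        ) from None
    return ["bash", script, str(batch_size)]


if __name__ == "__main__":
    for batch_size in (120, 240):
        cmd = build_benchmark_command("train", batch_size)
        # Printed rather than executed; to launch a run, pass the list to
        # subprocess.run(cmd, check=True) from the repository root.
        print(shlex.join(cmd))
```

Returning an argv list rather than a shell string keeps the batch size as a separate argument, so no quoting is needed when the command is eventually executed.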

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.

GPUs	Batch size / GPU	Absolute error - TF32	Absolute error - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (mixed precision to TF32)
1	240	0.03456	0.03460	1h23min	1h03min	1.32x
8	240	0.03417	0.03424	15min	12min	1.25x

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.

GPUs	Batch size / GPU	Absolute error - FP32	Absolute error - mixed precision	Time to train - FP32	Time to train - mixed precision	Time to train speedup (mixed precision to FP32)
1	240	0.03432	0.03439	2h25min	1h33min	1.56x
8	240	0.03380	0.03495	29min	20min	1.45x
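The speedup column is the baseline time to train divided by the mixed-precision time to train. As a worked check, the sketch below recomputes it from the DGX-1 durations in the table above; the helper names are hypothetical and the parser only handles the "XhYYmin" and "YYmin" forms used in this document.

```python
import re


def to_minutes(duration):
    """Convert a duration such as '2h25min' or '29min' to minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)min)?", duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return 60 * hours + minutes


def speedup(baseline, mixed_precision):
    """Ratio of baseline time to mixed-precision time."""
    return to_minutes(baseline) / to_minutes(mixed_precision)


print(f"1 GPU:  {speedup('2h25min', '1h33min'):.2f}x")  # 145 min / 93 min
print(f"8 GPUs: {speedup('29min', '20min'):.2f}x")      # 29 min / 20 min
```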

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.

GPUs	Batch size / GPU	Throughput - TF32 [mol/ms]	Throughput - mixed precision [mol/ms]	Throughput speedup (mixed precision to TF32)	Weak scaling - TF32	Weak scaling - mixed precision
1	240	2.21	2.92	1.32x
1	120	1.81	2.04	1.13x
8	240	15.88	21.02	1.32x	7.18	7.20
8	120	12.68	13.99	1.10x	7.00	6.86

To achieve these same results, follow the steps in the Quick Start Guide.
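The averaging described for these measurements can be sketched as follows. This is a minimal illustration, not the benchmark scripts' actual code: the function name is hypothetical and the per-epoch molecule counts and timings are made-up values. It assumes throughput is computed per epoch as molecules processed divided by elapsed milliseconds, and that the warmup epoch is discarded before averaging.

```python
def mean_throughput(epochs, warmup_epochs=1):
    """Average molecules per millisecond over epochs, skipping warmup.

    `epochs` is a list of (molecules_processed, elapsed_ms) pairs.
    """
    measured = epochs[warmup_epochs:]
    if not measured:
        raise ValueError("no epochs left after warmup")
    per_epoch = [molecules / elapsed_ms for molecules, elapsed_ms in measured]
    return sum(per_epoch) / len(per_epoch)


# Hypothetical timings: one slow warmup epoch, then five measured epochs.
epochs = [
    (100_000, 80_000.0),  # warmup, excluded from the average
    (100_000, 50_000.0),
    (100_000, 50_000.0),
    (100_000, 40_000.0),
    (100_000, 50_000.0),
    (100_000, 50_000.0),
]
print(f"{mean_throughput(epochs):.2f} mol/ms")
```

Dropping the warmup epoch keeps one-time costs such as memory allocation and kernel autotuning out of the reported number.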

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.

GPUs	Batch size / GPU	Throughput - FP32 [mol/ms]	Throughput - mixed precision [mol/ms]	Throughput speedup (mixed precision to FP32)	Weak scaling - FP32	Weak scaling - mixed precision
1	240	1.25	1.88	1.50x
1	120	1.03	1.41	1.37x
8	240	8.68	12.75	1.47x	6.94	6.78
8	120	6.64	8.58	1.29x	6.44	6.08

To achieve these same results, follow the steps in the Quick Start Guide.
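The weak-scaling columns appear to be the ratio of 8-GPU throughput to single-GPU throughput at the same per-GPU batch size. That reading is an assumption, though it reproduces the DGX-1 values at batch size 240 (6.94 and 6.78). The sketch below recomputes them and also derives a scaling efficiency, which is not reported in the table; the function names are hypothetical.

```python
def weak_scaling(throughput_n_gpus, throughput_1_gpu):
    """Throughput gain from adding GPUs at a fixed per-GPU batch size."""
    return throughput_n_gpus / throughput_1_gpu


def scaling_efficiency(throughput_n_gpus, throughput_1_gpu, num_gpus):
    """Fraction of ideal linear scaling that was achieved."""
    return weak_scaling(throughput_n_gpus, throughput_1_gpu) / num_gpus


# DGX-1, batch size 240 per GPU, throughputs in mol/ms from the table above.
runs = {
    "FP32": (8.68, 1.25),
    "mixed precision": (12.75, 1.88),
}
for name, (eight_gpus, one_gpu) in runs.items():
    scaling = weak_scaling(eight_gpus, one_gpu)
    efficiency = scaling_efficiency(eight_gpus, one_gpu, num_gpus=8)
    print(f"{name}: {scaling:.2f}x scaling, {efficiency:.0%} efficiency")
```

Because the published throughputs are rounded to two decimals, a recomputed ratio can differ from the table in the last digit for some rows.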

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.

FP16

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	11.60	140.94	138.29	140.12	386.40
800	10.74	75.69	75.74	76.50	79.77
400	8.86	45.57	46.11	46.60	49.97

TF32

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	8.58	189.20	186.39	187.71	420.28
800	8.28	97.56	97.20	97.73	101.13
400	7.55	53.38	53.72	54.48	56.62

To achieve these same results, follow the steps in the Quick Start Guide.
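In the batch size 1600 rows, the average latency is higher than the 90th and 95th percentiles while the 99th percentile is far larger, which is the pattern a few slow iterations produce. The sketch below illustrates this with made-up per-batch latencies. The nearest-rank percentile used here is an assumption and may differ from how the benchmark script computes its percentiles; the function names are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for pct in (0, 100]."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return average and tail latencies for a list of per-batch timings."""
    ordered = sorted(latencies_ms)
    return {
        "latency_avg": sum(ordered) / len(ordered),
        "latency_90": percentile(ordered, 90),
        "latency_95": percentile(ordered, 95),
        "latency_99": percentile(ordered, 99),
    }


# Hypothetical run: 98 iterations at 140 ms plus two slow outliers at 400 ms.
latencies = [140.0] * 98 + [400.0] * 2
stats = summarize(latencies)
for name, value in stats.items():
    print(f"{name}: {value:.2f} ms")
```

With these inputs the two outliers pull the mean above the 90th and 95th percentiles, while only the 99th percentile lands on an outlier.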

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.

FP16

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	6.42	254.54	247.97	249.29	721.15
800	6.13	132.07	131.90	132.70	140.15
400	5.37	75.12	76.01	76.66	79.90

FP32

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	3.39	475.86	473.82	475.64	891.18
800	3.36	239.17	240.64	241.65	243.70
400	3.17	126.67	128.19	128.82	130.54

To achieve these same results, follow the steps in the Quick Start Guide.