SE(3)-Transformers for PyTorch

NVIDIA Deep Learning Examples

A graph neural network that uses a variant of self-attention to process 3D points and graphs.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark training performance for a specific batch size, run bash scripts/benchmark_train.sh {BATCH_SIZE} on a single GPU, or bash scripts/benchmark_train_multi_gpu.sh {BATCH_SIZE} on multiple GPUs.

Inference performance benchmark

To benchmark inference performance for a specific batch size, run bash scripts/benchmark_inference.sh {BATCH_SIZE}.
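
When sweeping several batch sizes, it can help to build the benchmark commands programmatically. The following is a minimal sketch: the script paths come from the text above, while the `benchmark_command` helper and the `SCRIPTS` table are hypothetical names introduced only for illustration.

```python
# Script paths are taken from the benchmarking instructions above.
# The lookup table and helper function are hypothetical conveniences.
SCRIPTS = {
    ("train", False): "scripts/benchmark_train.sh",
    ("train", True): "scripts/benchmark_train_multi_gpu.sh",
    ("inference", False): "scripts/benchmark_inference.sh",
}


def benchmark_command(mode, batch_size, multi_gpu=False):
    """Return the argument list for one benchmark run."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    try:
        script = SCRIPTS[(mode, multi_gpu)]
    except KeyError:
        raise ValueError(
            f"no benchmark script for mode={mode!r}, multi_gpu={multi_gpu}"
        ) from None
    return ["bash", script, str(batch_size)]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for batch_size in (120, 240):
        print(" ".join(benchmark_command("train", batch_size)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, assuming the scripts are present in the working directory.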

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/train.sh and scripts/train_multi_gpu.sh training scripts in the PyTorch 23.01 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

GPUs	Batch size / GPU	Absolute error - TF32	Absolute error - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (mixed precision to TF32)
1	240	0.03038	0.02987	1h02min	50min	1.24x
8	240	0.03466	0.03436	13min	10min	1.27x

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/train.sh and scripts/train_multi_gpu.sh training scripts in the PyTorch 23.01 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.

GPUs	Batch size / GPU	Absolute error - FP32	Absolute error - mixed precision	Time to train - FP32	Time to train - mixed precision	Time to train speedup (mixed precision to FP32)
1	240	0.03044	0.03076	2h07min	1h22min	1.55x
8	240	0.03435	0.03495	27min	19min	1.42x
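
The time-to-train speedup column appears to be the baseline-precision training time divided by the mixed-precision training time. The sketch below recomputes it from the DGX-1 rows above; the helper names are hypothetical, and because the published times are rounded to the minute, recomputed values may differ slightly from the table in other rows.

```python
import re


def parse_minutes(text):
    """Convert a duration such as '2h07min' or '19min' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(\d+)min", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)
    return hours * 60 + int(match.group(2))


def time_to_train_speedup(baseline, mixed):
    """Ratio of baseline-precision time to mixed-precision time."""
    return parse_minutes(baseline) / parse_minutes(mixed)


if __name__ == "__main__":
    # (GPUs, FP32 time, mixed-precision time) copied from the DGX-1 table.
    rows = [(1, "2h07min", "1h22min"), (8, "27min", "19min")]
    for gpus, fp32, mixed in rows:
        print(f"{gpus} GPU(s): {time_to_train_speedup(fp32, mixed):.2f}x")
```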

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 23.01 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.

GPUs	Batch size / GPU	Throughput - TF32 [mol/ms]	Throughput - mixed precision [mol/ms]	Throughput speedup (mixed precision - TF32)	Weak scaling - TF32	Weak scaling - mixed precision
1	240	2.59	3.23	1.25x
1	120	1.89	1.89	1.00x
8	240	18.38	21.42	1.17x	7.09	6.63
8	120	13.23	13.23	1.00x	7.00	7.00

To achieve these same results, follow the steps in the Quick Start Guide.
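
The averaging described above (five full epochs after one warmup epoch) can be sketched as follows. This is an illustration only: it assumes throughput is computed per epoch and then averaged, which the benchmark scripts may or may not do, and the function name and sample numbers are hypothetical.

```python
def average_throughput(epochs, warmup_epochs=1):
    """Mean throughput in molecules per millisecond, skipping warmup epochs.

    `epochs` is a list of (molecules_processed, elapsed_ms) pairs, one per epoch.
    """
    measured = epochs[warmup_epochs:]
    if not measured:
        raise ValueError("no epochs left after discarding warmup")
    per_epoch = [molecules / elapsed_ms for molecules, elapsed_ms in measured]
    return sum(per_epoch) / len(per_epoch)


if __name__ == "__main__":
    # Made-up timings: a slow warmup epoch followed by five steady epochs.
    epochs = [(100_000, 80_000.0)] + [(100_000, 40_000.0)] * 5
    print(f"{average_throughput(epochs):.2f} mol/ms")
```

Discarding the first epoch keeps one-time costs such as data caching and kernel autotuning out of the reported average.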

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.

GPUs	Batch size / GPU	Throughput - FP32 [mol/ms]	Throughput - mixed precision [mol/ms]	Throughput speedup (mixed precision - FP32)	Weak scaling - FP32	Weak scaling - mixed precision
1	240	1.23	1.91	1.55x
1	120	1.01	1.23	1.22x
8	240	8.44	11.28	1.34x	6.8	5.90
8	120	6.06	7.36	1.21x	6.00	5.98

To achieve these same results, follow the steps in the Quick Start Guide.
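
The derived columns in the table above can be recomputed from the throughput columns. The sketch below assumes, based on the published numbers, that throughput speedup is mixed-precision throughput divided by FP32 throughput, and that weak scaling is 8-GPU throughput divided by single-GPU throughput at the same per-GPU batch size. The function names are hypothetical, and results computed from rounded table entries may differ in the last digit.

```python
# (GPUs, batch size per GPU) -> (FP32, mixed precision) throughput in mol/ms,
# copied from the DGX-1 training performance table.
THROUGHPUT = {
    (1, 240): (1.23, 1.91),
    (1, 120): (1.01, 1.23),
    (8, 240): (8.44, 11.28),
    (8, 120): (6.06, 7.36),
}


def precision_speedup(gpus, batch_size):
    """Mixed-precision throughput relative to FP32 throughput."""
    fp32, mixed = THROUGHPUT[(gpus, batch_size)]
    return mixed / fp32


def weak_scaling(gpus, batch_size):
    """Multi-GPU throughput relative to one GPU, as (FP32, mixed precision)."""
    multi = THROUGHPUT[(gpus, batch_size)]
    single = THROUGHPUT[(1, batch_size)]
    return tuple(m / s for m, s in zip(multi, single))


if __name__ == "__main__":
    for (gpus, batch_size) in THROUGHPUT:
        line = f"{gpus} GPU(s), batch {batch_size}: speedup {precision_speedup(gpus, batch_size):.2f}x"
        if gpus > 1:
            fp32_scale, mixed_scale = weak_scaling(gpus, batch_size)
            line += f", weak scaling {fp32_scale:.2f} / {mixed_scale:.2f}"
        print(line)
```

A weak scaling value of 8.00 on eight GPUs would indicate perfectly linear scaling; the values here fall below that.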

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 23.01 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.

AMP

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	9.71	175.2	190.2	191.8	432.4
800	7.90	114.5	134.3	135.8	140.2
400	7.18	75.49	108.6	109.6	113.2

TF32

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	8.19	198.2	206.8	208.5	377.0
800	7.56	107.5	119.6	120.5	125.7
400	6.97	59.8	75.1	75.7	81.3

To achieve these same results, follow the steps in the Quick Start Guide.
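
The latency columns above summarize a set of per-batch timings by their mean and by their 90th, 95th, and 99th percentiles. The sketch below shows one way to compute such a summary using the nearest-rank percentile definition. That definition is an assumption (the benchmark script may use a different method, such as interpolation), and the function names and sample latencies are hypothetical.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Average and tail latencies in milliseconds."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Made-up sample: 98 batches at 100 ms and 2 slow batches at 400 ms.
    sample = [100.0] * 98 + [400.0] * 2
    print(latency_summary(sample))
```

In this made-up sample, two slow batches leave the 90% and 95% values untouched but raise the 99% value sharply, a pattern similar to the gap between the 95% and 99% columns at batch size 1600.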

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.

AMP

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	5.39	306.6	321.2	324.9	819.1
800	4.67	179.8	201.5	203.8	213.3
400	4.25	108.2	142.0	143.0	149.0

FP32

Batch size	Throughput Avg [mol/ms]	Latency Avg [ms]	Latency 90% [ms]	Latency 95% [ms]	Latency 99% [ms]
1600	3.14	510.9	518.83	521.1	808.0
800	3.10	258.7	269.4	271.1	278.9
400	2.93	137.3	147.5	148.8	151.7

To achieve these same results, follow the steps in the Quick Start Guide.