SE(3)-Transformers for PyTorch

A graph neural network that uses a variant of self-attention to process 3D points and graphs.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance on a specific batch size, run bash scripts/benchmark_train.sh {BATCH_SIZE} for a single GPU, or bash scripts/benchmark_train_multi_gpu.sh {BATCH_SIZE} for multiple GPUs.

Inference performance benchmark

To benchmark the inference performance on a specific batch size, run bash scripts/benchmark_inference.sh {BATCH_SIZE}.
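
The benchmark commands above take a single batch-size argument, so sweeping several batch sizes is easy to script. The following is a minimal sketch, not part of the repository: the helper names `benchmark_command` and `sweep` are hypothetical, and only the three script paths come from this document. It builds the command lines without launching them.

```python
import shlex

# Script paths as named in this document; the rest of this file is illustrative.
SCRIPTS = {
    "train": "scripts/benchmark_train.sh",
    "train_multi_gpu": "scripts/benchmark_train_multi_gpu.sh",
    "inference": "scripts/benchmark_inference.sh",
}


def benchmark_command(mode: str, batch_size: int) -> list[str]:
    """Return the argv list for one benchmark run (hypothetical helper)."""
    if mode not in SCRIPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    if batch_size <= 0:
        raise ValueError("batch size must be a positive integer")
    return ["bash", SCRIPTS[mode], str(batch_size)]


def sweep(mode: str, batch_sizes: list[int]) -> list[list[str]]:
    """Build one command per batch size, in the order given."""
    return [benchmark_command(mode, size) for size in batch_sizes]


if __name__ == "__main__":
    for command in sweep("train", [120, 240]):
        # Print only; to launch, pass `command` to subprocess.run(command, check=True)
        # from the repository root inside the container.
        print(shlex.join(command))
```

Building the argument list separately from running it keeps the sweep easy to inspect before any GPU time is spent.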

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

| GPUs | Batch size / GPU | Absolute error - TF32 | Absolute error - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (mixed precision to TF32) |
|---|---|---|---|---|---|---|
| 1 | 240 | 0.03456 | 0.03460 | 1h23min | 1h03min | 1.32x |
| 8 | 240 | 0.03417 | 0.03424 | 15min | 12min | 1.25x |

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.

| GPUs | Batch size / GPU | Absolute error - FP32 | Absolute error - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (mixed precision to FP32) |
|---|---|---|---|---|---|---|
| 1 | 240 | 0.03432 | 0.03439 | 2h25min | 1h33min | 1.56x |
| 8 | 240 | 0.03380 | 0.03495 | 29min | 20min | 1.45x |
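
The time-to-train speedup column can be checked from the two duration columns. The sketch below assumes the speedup is the baseline (TF32 or FP32) training time divided by the mixed-precision training time; the helper names `parse_minutes` and `speedup` are illustrative, and the durations are copied from the two training accuracy tables.

```python
import re


def parse_minutes(text: str) -> int:
    """Convert a duration such as '1h23min' or '15min' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(\d+)min", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)
    return hours * 60 + int(match.group(2))


def speedup(baseline: str, mixed_precision: str) -> float:
    """Ratio of baseline training time to mixed-precision training time."""
    return parse_minutes(baseline) / parse_minutes(mixed_precision)


# (label, baseline time, mixed-precision time), taken from the tables above.
ROWS = [
    ("DGX A100, 1 GPU", "1h23min", "1h03min"),
    ("DGX A100, 8 GPUs", "15min", "12min"),
    ("DGX-1 V100, 1 GPU", "2h25min", "1h33min"),
    ("DGX-1 V100, 8 GPUs", "29min", "20min"),
]

if __name__ == "__main__":
    for label, baseline, mixed in ROWS:
        print(f"{label}: {speedup(baseline, mixed):.2f}x")
```

Because the published durations are rounded to the minute, a recomputed ratio is only as precise as those inputs.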

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.

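| GPUs | Batch size / GPU | Throughput - TF32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision to TF32) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 240 | 2.21 | 2.92 | 1.32x | | |
| 1 | 120 | 1.81 | 2.04 | 1.13x | | |
| 8 | 240 | 15.88 | 21.02 | 1.32x | 7.18 | 7.20 |
| 8 | 120 | 12.68 | 13.99 | 1.10x | 7.00 | 6.86 |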
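
The derived columns in this table can be approximated from the throughput columns. The text does not define weak scaling, so the sketch below assumes it is the 8-GPU throughput divided by the 1-GPU throughput at the same per-GPU batch size, and that the throughput speedup is mixed-precision throughput divided by TF32 throughput. The function names are illustrative.

```python
def throughput_speedup(mixed_precision: float, baseline: float) -> float:
    """Mixed-precision throughput relative to the baseline precision."""
    return mixed_precision / baseline


def weak_scaling(multi_gpu: float, single_gpu: float) -> float:
    """Multi-GPU throughput relative to one GPU at the same per-GPU batch size
    (assumed definition; the document does not spell it out)."""
    return multi_gpu / single_gpu


# DGX A100 throughput in mol/ms, copied from the table above:
# {batch size per GPU: {GPU count: (TF32, mixed precision)}}
A100 = {
    240: {1: (2.21, 2.92), 8: (15.88, 21.02)},
    120: {1: (1.81, 2.04), 8: (12.68, 13.99)},
}

if __name__ == "__main__":
    for batch_size, by_gpus in A100.items():
        tf32_1, amp_1 = by_gpus[1]
        tf32_8, amp_8 = by_gpus[8]
        print(
            f"batch {batch_size}: "
            f"speedup 1 GPU {throughput_speedup(amp_1, tf32_1):.2f}x, "
            f"speedup 8 GPUs {throughput_speedup(amp_8, tf32_8):.2f}x, "
            f"weak scaling TF32 {weak_scaling(tf32_8, tf32_1):.2f}, "
            f"weak scaling mixed {weak_scaling(amp_8, amp_1):.2f}"
        )
```

The inputs are already rounded to two decimals, so a recomputed weak-scaling value can differ from the published one in the last digit. Dividing the weak-scaling value by the GPU count gives a scaling efficiency between 0 and 1.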

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/benchmark_train.sh and scripts/benchmark_train_multi_gpu.sh benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.

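| GPUs | Batch size / GPU | Throughput - FP32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision to FP32) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 240 | 1.25 | 1.88 | 1.50x | | |
| 1 | 120 | 1.03 | 1.41 | 1.37x | | |
| 8 | 240 | 8.68 | 12.75 | 1.47x | 6.94 | 6.78 |
| 8 | 120 | 6.64 | 8.58 | 1.29x | 6.44 | 6.08 |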

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.

FP16

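| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 11.60 | 140.94 | 138.29 | 140.12 | 386.40 |
| 800 | 10.74 | 75.69 | 75.74 | 76.50 | 79.77 |
| 400 | 8.86 | 45.57 | 46.11 | 46.60 | 49.97 |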

TF32

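| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 8.58 | 189.20 | 186.39 | 187.71 | 420.28 |
| 800 | 8.28 | 97.56 | 97.20 | 97.73 | 101.13 |
| 400 | 7.55 | 53.38 | 53.72 | 54.48 | 56.62 |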
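
To help read the latency columns, the sketch below shows one way such statistics could be derived from per-batch timings. It is not the benchmark script's implementation: the nearest-rank percentile method, the per-batch averaging of throughput, the helper names, and the synthetic timings are all assumptions made for illustration.

```python
import statistics


def percentile(samples: list[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    rank = (q * len(ordered) + 99) // 100  # integer ceiling of q% of n
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], batch_size: int) -> dict[str, float]:
    """Aggregate per-batch latencies (ms) into table-style statistics."""
    return {
        # Assumed aggregation: mean of per-batch throughput, in mol/ms.
        "throughput_avg": statistics.fmean(batch_size / t for t in latencies_ms),
        "latency_avg_ms": statistics.fmean(latencies_ms),
        "latency_90_ms": percentile(latencies_ms, 90),
        "latency_95_ms": percentile(latencies_ms, 95),
        "latency_99_ms": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Synthetic data: 98 batches at 100 ms and two slow batches at 400 ms.
    timings = [100.0] * 98 + [400.0] * 2
    for name, value in summarize(timings, batch_size=1600).items():
        print(f"{name}: {value:.2f}")
```

In this synthetic run, two slow batches push the 99% latency to the outlier value and lift the average above the 90% and 95% latencies. The batch size 1600 rows in the tables show the same pattern.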

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Our results were obtained by running the scripts/benchmark_inference.sh inference benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.

FP16

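| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 6.42 | 254.54 | 247.97 | 249.29 | 721.15 |
| 800 | 6.13 | 132.07 | 131.90 | 132.70 | 140.15 |
| 400 | 5.37 | 75.12 | 76.01 | 76.66 | 79.90 |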

FP32

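| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|
| 1600 | 3.39 | 475.86 | 473.82 | 475.64 | 891.18 |
| 800 | 3.36 | 239.17 | 240.64 | 241.65 | 243.70 |
| 400 | 3.17 | 126.67 | 128.19 | 128.82 | 130.54 |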

To achieve these same results, follow the steps in the Quick Start Guide.
