BART for PyTorch

Description

BART is a denoising autoencoder for pretraining sequence-to-sequence models.

Publisher

NVIDIA

Use Case

NLP

Framework

PyTorch

Latest Version

21.02.1

Modified

November 12, 2021

Compressed Size

1.06 MB

Benchmarking

The following sections show how to run benchmarks measuring model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance for one epoch at a specific batch size, source length, target length, and dataset, run:

bash scripts/run_training_benchmark.sh <batch size> <max source length> <max target length> <data dir>

The resulting throughput for each NUM_GPU and PRECISION combination is stored in results/bart_pyt_training_benchmark_${DATESTAMP}/training_benchmark.log
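
For example, a run with placeholder values (batch size 24 and the source/target lengths used in the results below; the data directory is hypothetical) would look like:

bash scripts/run_training_benchmark.sh 24 1024 60 data/xsum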

Inference performance benchmark

To benchmark the inference performance at a specific batch size, source length, target length, and dataset, run:

bash scripts/run_inference_benchmark.sh <predict batch size> <eval beams> <max source length> <max target length> <model name or path> <data dir> <config path>

The resulting throughput for each NUM_GPU and PRECISION combination is stored in results/bart_pyt_inference_benchmark_${DATESTAMP}/inference_benchmark.log
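
For example, with placeholder values mirroring the settings used in the results below (the model, data, and config paths are hypothetical):

bash scripts/run_inference_benchmark.sh 8 6 1024 60 facebook/bart-large-xsum data/xsum configs/config.json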

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results for the XSUM dataset were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. The accuracy columns list ROUGE-1, ROUGE-2, and ROUGE-LSum scores.

GPUs | Batch size (TF32, mixed precision) | Accuracy - TF32 | Accuracy - mixed precision | Time to train (hrs) - TF32 | Time to train (hrs) - mixed precision | Time-to-train speedup (TF32 to mixed precision)
1 | 24, 40 | 44.41, 21.02, 35.66 | 44.87, 21.49, 36.17 | 3.10 | 2.43 | 1.27
8 | 192, 320 | 45.34, 21.93, 36.61 | 45.31, 21.83, 36.60 | 0.58 | 0.45 | 1.27

In addition, results for the CNN-DM dataset are:

GPUs | Batch size (TF32, mixed precision) | Accuracy - TF32 | Accuracy - mixed precision | Time to train (hrs) - TF32 | Time to train (hrs) - mixed precision | Time-to-train speedup (TF32 to mixed precision)
1 | 24, 40 | 44.37, 21.36, 41.17 | 44.43, 21.43, 41.22 | 4.88 | 3.61 | 1.35
8 | 192, 320 | 44.49, 21.48, 41.28 | 44.19, 21.26, 40.97 | 0.73 | 0.56 | 1.30
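
ROUGE-1, ROUGE-2, and ROUGE-LSum measure n-gram and longest-common-subsequence overlap between generated and reference summaries. Below is a minimal sketch of how such scores can be computed, assuming the rouge_score package (pip install rouge-score); the evaluation pipeline behind the tables above may differ in tokenization and stemming:

# Sketch: compute ROUGE scores with the rouge_score package (an assumption;
# not necessarily the scorer used for the results above).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
reference = "The cat sat on the mat."
prediction = "A cat was sitting on the mat."
for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: F1 = {score.fmeasure:.4f}")
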
Training accuracy: NVIDIA DGX-1 V100 (8x V100 32GB)

Our results for the XSUM dataset were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX-1 with (8x V100 32GB) GPUs. The accuracy columns list ROUGE-1, ROUGE-2, and ROUGE-LSum scores.

GPUs | Batch size (FP32, mixed precision) | Accuracy - FP32 | Accuracy - mixed precision | Time to train (hrs) - FP32 | Time to train (hrs) - mixed precision | Time-to-train speedup (FP32 to mixed precision)
1 | 8, 14 | 44.16, 20.66, 35.24 | 44.86, 21.41, 36.02 | 17.23 | 6.12 | 2.82
8 | 64, 112 | 45.42, 21.91, 36.62 | 45.58, 22.01, 36.79 | 2.56 | 1.09 | 2.36

In addition, results for the CNN-DM dataset are:

GPUs | Batch size (FP32, mixed precision) | Accuracy - FP32 | Accuracy - mixed precision | Time to train (hrs) - FP32 | Time to train (hrs) - mixed precision | Time-to-train speedup (FP32 to mixed precision)
1 | 8, 14 | 44.49, 21.48, 41.26 | 44.55, 21.47, 41.32 | 26.17 | 9.74 | 2.69
8 | 64, 112 | 44.34, 21.42, 41.12 | 44.27, 21.30, 41.06 | 3.58 | 1.45 | 2.46
Training stability test

Our results for the XSUM dataset were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. The table lists ROUGE-1 scores across five training runs with different seeds.

FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std
ROUGE-1 | 45.34 | 45.34 | 45.21 | 45.33 | 45.34 | 45.31 | 0.055
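
The mean and std columns follow from the per-seed scores; a quick check (the source does not state whether the sample or the population estimate is reported, so both are shown):

# Recompute the stability-test summary statistics from the table above.
import statistics

rouge1_by_seed = [45.34, 45.34, 45.21, 45.33, 45.34]
print(f"mean: {statistics.mean(rouge1_by_seed):.2f}")              # 45.31
print(f"sample std: {statistics.stdev(rouge1_by_seed):.3f}")       # ~0.057
print(f"population std: {statistics.pstdev(rouge1_by_seed):.3f}")  # ~0.051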

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.

GPUs | Batch size / GPU (TF32, mixed precision) | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision
1 | 24, 40 | 31607 | 42076 | 1.33 | 1.00 | 1.00
8 | 24, 40 | 163054 | 217514 | 1.33 | 5.16 | 5.17

To achieve these same results, follow the steps in the Quick Start Guide.

The performance metric is tokens per second, computed by iterating through an entire epoch of the XSum dataset with source length = 1024 and target length = 60.
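
The speedup and weak-scaling columns are simple ratios of the throughput numbers; a short sketch of the arithmetic using the DGX A100 rows above:

# Derive the speedup and weak-scaling columns from the throughput table above.
tf32 = {1: 31607, 8: 163054}  # throughput in tokens/sec, TF32
amp = {1: 42076, 8: 217514}   # throughput in tokens/sec, mixed precision

for gpus in (1, 8):
    speedup = amp[gpus] / tf32[gpus]     # mixed-precision speedup, e.g. 1.33
    scaling_tf32 = tf32[gpus] / tf32[1]  # weak scaling vs. 1 GPU, e.g. 5.16
    scaling_amp = amp[gpus] / amp[1]     # e.g. 5.17
    print(f"{gpus} GPU(s): speedup {speedup:.2f}, "
          f"weak scaling {scaling_tf32:.2f} (TF32) / {scaling_amp:.2f} (AMP)")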

Training performance: NVIDIA DGX-1 V100 (8x V100 32GB)

Our results were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX-1 with (8x V100 32GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.

GPUs | Batch size / GPU (FP32, mixed precision) | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
1 | 8, 14 | 7527 | 19356 | 2.57 | 1.00 | 1.00
8 | 8, 14 | 42024 | 111720 | 2.65 | 5.58 | 5.77

To achieve these same results, follow the steps in the Quick Start Guide.

The performance metric is tokens per second, computed by iterating through an entire epoch of the XSum dataset with source length = 1024 and target length = 60.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the run_eval_summarization.sh inference benchmarking script in the PyTorch 20.11-py3 NGC container on an NVIDIA DGX A100 (1x A100 80GB) GPU.

FP16

Batch size | Latency Avg (s) | Latency 90% (s) | Latency 95% (s) | Latency 99% (s) | Throughput (samples/sec)
1 | 0.43 | 0.53 | 0.57 | 0.67 | 2.34
4 | 0.64 | 0.75 | 0.81 | 0.95 | 6.28
8 | 0.86 | 1.01 | 1.09 | 1.20 | 9.35
16 | 1.29 | 1.56 | 1.65 | 1.76 | 12.44
32 | 2.38 | 3.06 | 3.23 | 3.33 | 13.42
64 | 4.70 | 6.06 | 6.25 | 6.35 | 13.55
128 | 10.10 | 12.22 | 12.32 | 12.96 | 12.61

To achieve these same results, follow the steps in the Quick Start Guide.

The inference performance metrics are per-iteration latency in seconds and throughput in samples per second, computed by iterating through the XSum test data with source length = 1024, target length = 60, and beam size = 6.
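
This is easy to sanity-check: throughput should be roughly batch size divided by average latency. For the DGX A100 rows above:

# Sanity check: throughput (samples/sec) ~= batch size / average latency (sec).
rows = [(1, 0.43, 2.34), (8, 0.86, 9.35), (32, 2.38, 13.42), (128, 10.10, 12.61)]
for batch, avg_latency_s, reported in rows:
    print(f"batch {batch:4d}: implied {batch / avg_latency_s:6.2f} samples/sec, "
          f"reported {reported}")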

Inference performance: NVIDIA DGX-1 V100 (1x V100 32GB)

Our results were obtained by running the run_eval_summarization.sh inference benchmarking script in the PyTorch 20.11-py3 NGC container on NVIDIA DGX-1 with (1x V100 32GB) GPU.

FP16

Batch size | Latency Avg (s) | Latency 90% (s) | Latency 95% (s) | Latency 99% (s) | Throughput (samples/sec)
1 | 0.67 | 0.84 | 0.89 | 1.04 | 1.49
4 | 0.96 | 1.14 | 1.24 | 1.43 | 4.16
8 | 1.33 | 1.59 | 1.72 | 1.90 | 6.01
16 | 1.99 | 2.39 | 2.57 | 2.69 | 8.04
32 | 3.41 | 4.31 | 4.53 | 4.63 | 9.36
64 | 6.66 | 8.61 | 8.75 | 8.92 | 9.55

To achieve these same results, follow the steps in the Quick Start Guide.

The inference performance metrics are per-iteration latency in seconds and throughput in samples per second, computed by iterating through the XSum test data with source length = 1024, target length = 60, and beam size = 6.

Inference performance: NVIDIA T4

Our results were obtained by running the run_eval_summarization.sh inference benchmarking script in the PyTorch 21.02-py3 NGC container on an NVIDIA T4 GPU.

FP16

Batch size | Latency Avg (s) | Latency 90% (s) | Latency 95% (s) | Latency 99% (s) | Throughput (samples/sec)
1 | 0.42 | 0.52 | 0.56 | 0.66 | 2.40
4 | 0.72 | 0.89 | 0.96 | 1.09 | 5.58
8 | 1.13 | 1.60 | 1.73 | 1.96 | 7.08
16 | 2.25 | 3.19 | 3.38 | 3.58 | 7.11
32 | 4.44 | 6.53 | 6.96 | 7.21 | 7.19

To achieve these same results, follow the steps in the Quick Start Guide.

The inference performance metrics are per-iteration latency in seconds and throughput in samples per second, computed by iterating through the XSum test data with source length = 1024, target length = 60, and beam size = 6.
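
For orientation, the generation setting measured throughout (source length = 1024, target length = 60, beam size = 6) corresponds roughly to the following sketch. It uses the public Hugging Face checkpoint facebook/bart-large-xsum as a stand-in; this is an assumption for illustration, not the checkpoint or script behind the numbers above:

# Illustrative only: BART summarization with the benchmark's generation settings,
# using a public checkpoint as a stand-in for this resource's fine-tuned model.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-xsum")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-xsum")
model = model.to(device).eval()

article = "..."  # an XSum-style news article
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():
    summary_ids = model.generate(inputs["input_ids"], num_beams=6, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))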