The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific batch size, source length, target length and dataset for one epoch, run:
```bash
bash scripts/run_training_benchmark.sh <batch size> <max source length> <max target length> <data dir>
```
The resulting NUM_GPU and PRECISION vs Throughput results are stored in results/bart_pyt_training_benchmark_${DATESTAMP}/training_benchmark.log.
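For example, a run matching the XSum training-performance configuration reported below (batch size 24, source length 1024, target length 60) could look like the following sketch; the data directory path is an assumption and should point at wherever the XSum dataset was prepared:

```bash
# Hypothetical invocation: benchmark training with batch size 24,
# source length 1024 and target length 60.
# data/xsum is an assumed location for the XSum dataset, not a path from this repo.
bash scripts/run_training_benchmark.sh 24 1024 60 data/xsum
```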
To benchmark the inference performance on a specific batch size, source length, target length and dataset, run:
```bash
bash scripts/run_inference_benchmark.sh <predict batch size> <eval beams> <max source length> <max target length> <model name or path> <data dir> <config path>
```
The resulting NUM_GPU and PRECISION vs Throughput results are stored in results/bart_pyt_inference_benchmark_${DATESTAMP}/inference_benchmark.log.
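As an illustration, an invocation matching the inference measurements below (batch size 8, 6 beams, source length 1024, target length 60) might look like this; the model identifier, data directory and config path are placeholders for this sketch, not values taken from this repository:

```bash
# Hypothetical invocation: benchmark inference with batch size 8 and 6 beams.
# The model name, data directory and config path below are placeholders.
bash scripts/run_inference_benchmark.sh 8 6 1024 60 facebook/bart-large-xsum data/xsum configs/config.json
```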
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results for the XSUM dataset were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. The Accuracy columns list ROUGE-1, ROUGE-2 and ROUGE-LSum scores.
GPUs | Batch size (TF32, mixed precision) | Accuracy - TF32 | Accuracy - mixed precision | Time to train (hrs) - TF32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 24, 40 | 44.41, 21.02, 35.66 | 44.87, 21.49, 36.17 | 3.10 | 2.43 | 1.27 |
8 | 192, 320 | 45.34, 21.93, 36.61 | 45.31, 21.83, 36.60 | 0.58 | 0.45 | 1.27 |
In addition, results for the CNN-DM dataset are:
GPUs | Batch size (TF32, mixed precision) | Accuracy - TF32 | Accuracy - mixed precision | Time to train (hrs) - TF32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 24, 40 | 44.37, 21.36, 41.17 | 44.43, 21.43, 41.22 | 4.88 | 3.61 | 1.35 |
8 | 192, 320 | 44.49, 21.48, 41.28 | 44.19, 21.26, 40.97 | 0.73 | 0.56 | 1.30 |
Our results were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs. The Accuracy columns list ROUGE-1, ROUGE-2 and ROUGE-LSum scores.
GPUs | Batch size (FP32, mixed precision) | Accuracy - FP32 | Accuracy - mixed precision | Time to train (hrs) - FP32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 8, 14 | 44.16, 20.66, 35.24 | 44.86, 21.41, 36.02 | 17.23 | 6.12 | 2.82 |
8 | 64, 112 | 45.42, 21.91, 36.62 | 45.58, 22.01, 36.79 | 2.56 | 1.09 | 2.36 |
In addition, results for the CNN-DM dataset are:
GPUs | Batch size (FP32, mixed precision) | Accuracy - FP32 | Accuracy - mixed precision | Time to train (hrs) - FP32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 8, 14 | 44.49, 21.48, 41.26 | 44.55, 21.47, 41.32 | 26.17 | 9.74 | 2.69 |
8 | 64, 112 | 44.34, 21.42, 41.12 | 44.27, 21.30, 41.06 | 3.58 | 1.45 | 2.46 |
Our results for the XSUM dataset were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. The table lists ROUGE-1 scores across 5 training runs with different seeds on DGX A100.
FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
ROUGE-1 | 45.34 | 45.34 | 45.21 | 45.33 | 45.34 | 45.31 | 0.055 |
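As a sanity check, the mean and sample standard deviation can be recomputed from the rounded scores in the table with a small shell snippet (the reported std of 0.055 was presumably computed on unrounded scores, so the recomputed value differs slightly):

```bash
# Recompute mean and sample standard deviation of the ROUGE-1 scores listed above.
echo "45.34 45.34 45.21 45.33 45.34" | awk '{
  n = NF
  for (i = 1; i <= n; i++) { sum += $i; sumsq += $i * $i }
  mean = sum / n
  std = sqrt((sumsq - n * mean * mean) / (n - 1))
  printf "mean=%.2f std=%.3f\n", mean, std
}'
# Prints: mean=45.31 std=0.057
```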
Our results were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch. Weak scaling is the multi-GPU throughput divided by the single-GPU throughput at the same per-GPU batch size.
GPUs | Batch size / GPU (TF32, mixed precision) | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 24, 40 | 31607 | 42076 | 1.33 | 1.00 | 1.00 |
8 | 24, 40 | 163054 | 217514 | 1.33 | 5.16 | 5.17 |
To achieve these same results, follow the steps in the Quick Start Guide.
The performance metric used is tokens per second, computed by iterating through an entire epoch of the XSum dataset with source length = 1024 and target length = 60.
Our results were obtained by running the run_summarization.sh training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
GPUs | Batch size / GPU (FP32, mixed precision) | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 8, 14 | 7527 | 19356 | 2.57 | 1.00 | 1.00 |
8 | 8, 14 | 42024 | 111720 | 2.65 | 5.58 | 5.77 |
To achieve these same results, follow the steps in the Quick Start Guide.
The performance metric used is tokens per second, computed by iterating through an entire epoch of the XSum dataset with source length = 1024 and target length = 60.
Our results were obtained by running the run_eval_summarization.sh inference benchmarking script in the PyTorch 20.11-py3 NGC container on an NVIDIA DGX A100 (1x A100 80GB) GPU.
FP16
Batch size | Latency Avg (s) | Latency 90% (s) | Latency 95% (s) | Latency 99% (s) | Throughput (samples/s) |
---|---|---|---|---|---|
1 | 0.43 | 0.53 | 0.57 | 0.67 | 2.34 |
4 | 0.64 | 0.75 | 0.81 | 0.95 | 6.28 |
8 | 0.86 | 1.01 | 1.09 | 1.20 | 9.35 |
16 | 1.29 | 1.56 | 1.65 | 1.76 | 12.44 |
32 | 2.38 | 3.06 | 3.23 | 3.33 | 13.42 |
64 | 4.70 | 6.06 | 6.25 | 6.35 | 13.55 |
128 | 10.10 | 12.22 | 12.32 | 12.96 | 12.61 |
To achieve these same results, follow the steps in the Quick Start Guide.
The latency figures are reported in seconds per iteration and throughput in samples per second. They are computed by iterating through the XSum test data with source length = 1024, target length = 60 and beam search = 6.
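As a quick sanity check (this is an inference about how the columns relate, not a statement from the benchmark script), the throughput values above are approximately the batch size divided by the average latency:

```bash
# Batch size 32 on A100: 32 samples / 2.38 s per iteration ≈ 13.45 samples/s
# (the table above reports 13.42).
awk 'BEGIN { printf "%.2f\n", 32 / 2.38 }'
```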
Our results were obtained by running the run_eval_summarization.sh inference benchmarking script in the PyTorch 20.11-py3 NGC container on an NVIDIA DGX-2 (1x V100 32GB) GPU.
FP16
Batch size | Latency Avg (s) | Latency 90% (s) | Latency 95% (s) | Latency 99% (s) | Throughput (samples/s) |
---|---|---|---|---|---|
1 | 0.67 | 0.84 | 0.89 | 1.04 | 1.49 |
4 | 0.96 | 1.14 | 1.24 | 1.43 | 4.16 |
8 | 1.33 | 1.59 | 1.72 | 1.90 | 6.01 |
16 | 1.99 | 2.39 | 2.57 | 2.69 | 8.04 |
32 | 3.41 | 4.31 | 4.53 | 4.63 | 9.36 |
64 | 6.66 | 8.61 | 8.75 | 8.92 | 9.55 |
To achieve these same results, follow the steps in the Quick Start Guide.
The latency figures are reported in seconds per iteration and throughput in samples per second. They are computed by iterating through the XSum test data with source length = 1024, target length = 60 and beam search = 6.
Our results were obtained by running the run_eval_summarization.sh inference benchmarking script in the PyTorch 21.02-py3 NGC container on an NVIDIA T4 GPU.
FP16
Batch size | Latency Avg (s) | Latency 90% (s) | Latency 95% (s) | Latency 99% (s) | Throughput (samples/s) |
---|---|---|---|---|---|
1 | 0.42 | 0.52 | 0.56 | 0.66 | 2.40 |
4 | 0.72 | 0.89 | 0.96 | 1.09 | 5.58 |
8 | 1.13 | 1.60 | 1.73 | 1.96 | 7.08 |
16 | 2.25 | 3.19 | 3.38 | 3.58 | 7.11 |
32 | 4.44 | 6.53 | 6.96 | 7.21 | 7.19 |
To achieve these same results, follow the steps in the Quick Start Guide.
The latency figures are reported in seconds per iteration and throughput in samples per second. They are computed by iterating through the XSum test data with source length = 1024, target length = 60 and beam search = 6.