The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Training performance benchmarks for both pre-training phases can be obtained by running scripts/benchmark_pretraining.sh. Default parameters are set to run a few training steps of the convergence configuration for an NVIDIA DGX A100 system.
To benchmark training performance with other parameters, run:
bash scripts/benchmark_pretraining.sh <train_batch_size_p1> <amp|tf32|fp32> <xla|no_xla> <num_gpus> <accumulate_gradients=true|false> <gradient_accumulation_steps_p1> <train_batch_size_p2> <gradient_accumulation_steps_p2> <base>
An example call used to generate throughput numbers:
bash scripts/benchmark_pretraining.sh 88 amp xla 8 true 2 12 4 base
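For reference, the positional arguments in the example above map to the usage line as follows (an annotated copy of the same call; the values are illustrative, not a tuned recommendation):
# <train_batch_size_p1>=88, precision=amp, xla=xla, <num_gpus>=8,
# <accumulate_gradients>=true, <gradient_accumulation_steps_p1>=2,
# <train_batch_size_p2>=12, <gradient_accumulation_steps_p2>=4, <base>=base (model size)
bash scripts/benchmark_pretraining.sh 88 amp xla 8 true 2 12 4 base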
Training performance benchmarks for fine-tuning can be obtained by running scripts/benchmark_squad.sh. The required parameters can be passed through the command line as described in Training process. The performance information is printed after 200 training iterations.
To benchmark the training performance on a specific batch size, run:
bash scripts/benchmark_squad.sh train <num_gpus> <batch size> <infer_batch_size> <amp|tf32|fp32> <SQuAD version> <path to SQuAD dataset> <results directory> <checkpoint_to_load> <cache_Dir>
An example call used to generate throughput numbers:
bash scripts/benchmark_squad.sh train 8 16
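The call above relies on the script's defaults for the remaining arguments. A fully specified call could look like the sketch below; the SQuAD version format and the dataset, results, checkpoint, and cache paths are hypothetical placeholders and should be replaced with values for your environment:
# hypothetical paths; the SQuAD version is assumed to be passed as 1.1
bash scripts/benchmark_squad.sh train 8 16 256 amp 1.1 \
    /path/to/squad /path/to/results /path/to/checkpoint /path/to/cache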
Inference performance benchmarks for fine-tuning can be obtained by running scripts/benchmark_squad.sh. The required parameters can be passed through the command line as described in Inference process. This script runs one epoch by default on the SQuAD v1.1 dataset and extracts the average performance for the given configuration.
To benchmark the inference performance on a specific batch size, run:
bash scripts/benchmark_squad.sh eval <num_gpus> <batch size> <infer_batch_size> <amp|fp32> <SQuAD version> <path to SQuAD dataset> <results directory> <checkpoint_to_load> <cache_Dir>
An example call used to generate throughput numbers:
bash scripts/benchmark_squad.sh eval 8 256
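To collect numbers for several inference batch sizes in one pass, the call can be wrapped in a simple loop following the form of the example above (a minimal sketch that relies on the script's defaults for the remaining arguments):
# sweep the batch sizes used in the inference tables below
for bs in 1 256 512; do
    bash scripts/benchmark_squad.sh eval 8 "$bs"
done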
The following sections provide details on how we achieved our performance and accuracy in training and inference. All results are on ELECTRA-base model and on SQuAD v1.1 dataset with a sequence length of 384 unless otherwise mentioned.
Phase 1 is shown by the blue curve and Phase 2 by the grey curve. The y-axis shows the total loss and the x-axis the total number of training steps.
DGX System | GPUs | Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32/FP32 | Final Loss - mixed precision | Time to train (hours) - TF32/FP32 | Time to train (hours) - mixed precision | Time to train speedup (TF32/FP32 to mixed precision) |
---|---|---|---|---|---|---|---|---|
48 x DGX A100 | 8 | 176 and 24 | 1 and 3 | 8.686 | 8.68 | 1.61 | 1.126 | 1.43 |
24 x DGX-2H | 16 | 176 and 24 | 1 and 3 | 8.72 | 8.67 | 5.58 | 1.74 | 3.20 |
1 x DGX A100 | 8 | 176 and 24 | 48 and 144 | - | - | 54.84 | 30.47 | 1.8 |
1 x DGX-1 16G | 8 | 88 and 12 | 96 and 288 | - | - | 241.8 | 65.1 | 3.71 |
1 x DGX-2 32G | 16 | 176 and 24 | 24 and 72 | - | - | 109.97 | 29.08 | 3.78 |
In the table above, the TF32/FP32 runs used half the per-GPU batch size and twice the gradient accumulation steps of the corresponding mixed precision runs in order not to run out of memory.
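As a quick sanity check, this keeps the effective global batch size (batch size per GPU x number of GPUs x accumulation steps) identical across precisions. Assuming the table lists the mixed precision configuration, the single-node DGX A100 Phase 1 row works out as follows:
# mixed precision, Phase 1: 176 per GPU x 8 GPUs x 48 accumulation steps
echo $((176 * 8 * 48))   # 67584
# TF32 (half the batch, twice the accumulation): 88 x 8 x 96
echo $((88 * 8 * 96))    # 67584 -> same global batch size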
The SQuAD fine-tuning scripts by default train from Google's ELECTRA++ base pretrained checkpoint, which was pretrained on roughly 10x more data (the dataset used by the XLNet authors) and for more than 5x more training steps than the recipe in scripts/run_pretraining.sh. The latter trains on, and achieves state-of-the-art accuracy with, the Wikipedia and BookCorpus datasets only.
Our results were obtained by running the scripts/run_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
ELECTRA BASE++
GPUs | Batch size / GPU | Accuracy / F1 - TF32 | Accuracy / F1 - mixed precision | Time to train - TF32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 32 | 87.19 / 92.85 | 87.19 / 92.84 | 1699 | 749 | 2.27 |
8 | 32 | 86.84 / 92.57 | 86.83 / 92.56 | 263 | 201 | 1.30 |
Our results were obtained by running the scripts/run_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
ELECTRA BASE++
GPUs | Batch size / GPU (FP32 : mixed precision) | Accuracy / F1 - FP32 | Accuracy / F1 - mixed precision | Time to train - FP32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 8 : 16 | 87.36 / 92.82 | 87.32 / 92.74 | 5136 | 1378 | 3.73 |
8 | 8 : 16 | 87.02 / 92.73 | 87.02 / 92.72 | 730 | 334 | 2.18 |
ELECTRA BASE checkpoint (Wikipedia and BookCorpus)
GPUs | SQuAD version | Batch size / GPU (FP32 : mixed precision) | Accuracy / F1 - FP32 | Accuracy / F1 - mixed precision | Time to train - FP32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
8 | v1.1 | 8 : 16 | 85.00 / 90.94 | 85.04 / 90.96 | 5136 | 1378 | 3.73 |
8 | v2.0 | 8 : 16 | 80.517 / 83.36 | 80.523 / 83.43 | 730 | 334 | 2.18 |
Our results were obtained by running the scripts/run_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs.
ELECTRA BASE++
GPUs | Batch size / GPU | Accuracy / F1 - FP32 | Accuracy / F1 - mixed precision | Time to train - FP32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 32 | 87.14 / 92.69 | 86.95 / 92.69 | 4478 | 1162 | 3.85 |
16 | 32 | 86.95 / 90.58 | 86.93 / 92.48 | 333 | 229 | 1.45 |
ELECTRA BASE (Wikipedia and BookCorpus)
Training stability with 48 x DGX A100, TF32 computations and loss reported after Phase 2:
Accuracy Metric | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard Deviation |
---|---|---|---|---|---|---|---|
Final Loss | 8.72 | 8.69 | 8.71 | 8.7 | 8.68 | 8.7 | 0.015 |
ELECTRA BASE++
Training stability with 8 GPUs, FP16 computations, batch size of 16 on SQuAD v1.1:
Accuracy Metric | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard Deviation |
---|---|---|---|---|---|---|---|
Exact Match % | 86.99 | 86.81 | 86.95 | 87.10 | 87.26 | 87.02 | 0.17 |
F1 % | 92.7 | 92.66 | 92.65 | 92.61 | 92.97 | 92.72 | 0.14 |
Training stability with 8 GPUs, FP16 computations, batch size of 16 on SQuAD v2.0:
Accuracy Metric | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard Deviation |
---|---|---|---|---|---|---|---|
Exact Match % | 83.00 | 82.84 | 83.11 | 82.70 | 82.94 | 82.91 | 0.15 |
F1 % | 85.63 | 85.48 | 85.69 | 85.31 | 85.57 | 85.54 | 0.15 |
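The mean and standard deviation columns in these stability tables can be reproduced from the per-seed values (a sample standard deviation is assumed); for example, for the SQuAD v1.1 exact match row:
# mean and sample standard deviation over the five seeds
echo "86.99 86.81 86.95 87.10 87.26" | awk '{
  for (i = 1; i <= NF; i++) { s += $i; ss += $i * $i }
  m = s / NF
  printf "mean=%.2f stddev=%.2f\n", m, sqrt((ss - NF * m * m) / (NF - 1))
}'
# prints mean=87.02 stddev=0.17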
Our results were obtained by running the scripts/benchmark_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in sequences per second) were averaged over an entire training epoch.
GPUs | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|---|
1 | 88 and 176 | 768 and 384 | 128 | 533 | 955 | 1.79 | 1.00 | 1.00 |
8 | 88 and 176 | 96 and 48 | 128 | 4202 | 7512 | 1.79 | 7.88 | 7.87 |
1 | 12 and 24 | 2304 and 1152 | 512 | 90 | 171 | 1.90 | 1.00 | 1.00 |
8 | 12 and 24 | 288 and 144 | 512 | 716 | 1347 | 1.88 | 7.96 | 7.88 |
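The speedup and weak scaling columns follow from the throughput columns (weak scaling here appears to be the ratio of multi-GPU to single-GPU throughput at a fixed per-GPU batch size); for the sequence length 128 rows above:
# throughput speedup at 1 GPU: mixed precision / TF32
awk 'BEGIN { printf "%.2f\n", 955 / 533 }'    # 1.79
# weak scaling at 8 GPUs: 8-GPU throughput / 1-GPU throughput
awk 'BEGIN { printf "%.2f\n", 4202 / 533 }'   # 7.88 (TF32)
awk 'BEGIN { printf "%.2f\n", 7512 / 955 }'   # 7.87 (mixed precision)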
GPUs | Batch size / GPU | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|
1 | 32 | 384 | 107 | 317 | 2.96 | 1.00 | 1.00 |
8 | 32 | 384 | 828 | 2221 | 2.68 | 7.74 | 7.00 |
Our results were obtained by running the scripts/benchmark_squad.sh training scripts in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in sequences per second) were averaged over an entire training epoch.
GPUs | Batch size / GPU (FP32 and FP16) | Accumulation steps (FP32 and FP16) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|---|
1 | 40 and 88 | 1689 and 768 | 128 | 116 | 444 | 3.83 | 1.00 | 1.00 |
8 | 40 and 88 | 211 and 96 | 128 | 920 | 3475 | 3.77 | 7.93 | 7.83 |
1 | 6 and 12 | 4608 and 2304 | 512 | 24 | 84 | 3.50 | 1.00 | 1.00 |
8 | 6 and 12 | 576 and 288 | 512 | 190 | 656 | 3.45 | 7.92 | 7.81 |
GPUs | Batch size / GPU (FP32 : mixed precision) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|
1 | 8 : 16 | 384 | 35 | 154 | 4.4 | 1.00 | 1.00 |
8 | 8 : 16 | 384 | 268 | 1051 | 3.92 | 7.66 | 6.82 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the scripts/benchmark_squad.sh training scripts in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs. Performance numbers (in sequences per second) were averaged over an entire training epoch.
GPUs | Batch size / GPU (FP32 and FP16) | Accumulation steps (FP32 and FP16) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|---|
1 | 88 and 176 | 768 and 384 | 128 | 128 | 500 | 3.91 | 1.00 | 1.00 |
8 | 88 and 176 | 96 and 48 | 128 | 1011 | 3916 | 3.87 | 7.90 | 7.83 |
16 | 88 and 176 | 48 and 24 | 128 | 2018 | 7773 | 3.85 | 15.77 | 15.55 |
1 | 12 and 24 | 2304 and 1152 | 512 | 27 | 96 | 3.55 | 1.00 | 1.00 |
8 | 12 and 24 | 288 and 144 | 512 | 213 | 754 | 3.54 | 7.89 | 7.85 |
16 | 12 and 24 | 144 and 72 | 512 | 426 | 1506 | 3.54 | 15.78 | 15.69 |
GPUs | Batch size / GPU | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|
1 | 16 | 384 | 40 | 184 | 4.6 | 1.00 | 1.00 |
8 | 16 | 384 | 311 | 1289 | 4.14 | 7.77 | 7.00 |
16 | 16 | 384 | 626 | 2594 | 4.14 | 15.65 | 14.09 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the scripts/benchmark_squad.sh inference benchmarking script in the tensorflow:20.07-tf2-py3 NGC container on an NVIDIA DGX A100 (1x A100 40GB) GPU.
FP16
Batch size | Sequence length | Throughput Avg (sequences/sec) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 384 | 166 | 6.035 | 5.995 | 6.013 | 6.029 |
256 | 384 | 886 | 276.26 | 274.53 | 275.276 | 275.946 |
512 | 384 | 886 | 526.5 | 525.014 | 525.788 | 525.788 |
TF32
Batch size | Sequence length | Throughput Avg (sequences/sec) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 384 | 122 | 8.228 | 8.171 | 8.198 | 8.221 |
256 | 384 | 342 | 729.293 | 727.990 | 728.505 | 729.027 |
512 | 384 | 350 | 1429.314 | 1427.719 | 1428.550 | 1428.550 |
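As a rough cross-check of the two tables above, at batch size 1 the throughput is approximately the reciprocal of the average latency; the relationship becomes looser at the larger batch sizes:
# batch size 1: throughput ~ 1000 / average latency in ms
awk 'BEGIN { printf "%.0f\n", 1000 / 6.035 }'   # ~166 sequences/sec (FP16)
awk 'BEGIN { printf "%.0f\n", 1000 / 8.228 }'   # ~122 sequences/sec (TF32)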
Our results were obtained by running the scripts/benchmark_squad.sh script in the tensorflow:20.07-tf2-py3 NGC container on an NVIDIA Tesla T4 (1x T4 16GB) GPU.
FP16
Batch size | Sequence length | Throughput Avg (sequences/sec) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 384 | 58 | 17.413 | 17.295 | 17.349 | 17.395 |
128 | 384 | 185 | 677.298 | 675.211 | 675.674 | 676.269 |
256 | 384 | 169 | 1451.396 | 1445.070 | 1447.654 | 1450.141 |
To achieve these same results, follow the steps in the Quick Start Guide.