The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Both of these benchmarking scripts fine-tune the BERT model for a fixed number of steps and extract performance numbers for the given configuration.
Training benchmarking can be performed by running the script:
scripts/finetune_train_benchmark.sh <bert_model> <num_gpu> <batch_size> <precision> <use_xla>
This script runs 800 steps by default on the SQuAD v1.1 dataset and extracts performance numbers for the given configuration. These numbers are saved at /results/squad_train_benchmark_<bert_model>_gpu<num_gpu>_bs<batch_size>.log.
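For example, to benchmark BERT-Large fine-tuning on 8 GPUs with a per-GPU batch size of 24 in mixed precision with XLA enabled (the argument values below are illustrative placeholders, not a tuned configuration):

```bash
# Illustrative invocation; substitute your own <bert_model>, <num_gpu>,
# <batch_size>, <precision>, and <use_xla> values.
bash scripts/finetune_train_benchmark.sh large 8 24 fp16 true
```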
Inference benchmarking can be performed by running the script:
scripts/finetune_inference_benchmark.sh <bert_model> <batch_size> <precision> <use_xla>
This script runs 1000 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for the given configuration. These numbers are saved at /results/squad_inference_benchmark_<bert_model>_<precision>_bs<batch_size>.log.
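For example, to benchmark BERT-Large inference at batch size 8 in mixed precision with XLA enabled (again, illustrative placeholder values; under these assumptions the numbers would land in /results/squad_inference_benchmark_large_fp16_bs8.log):

```bash
# Illustrative invocation; substitute your own <bert_model>, <batch_size>,
# <precision>, and <use_xla> values.
bash scripts/finetune_inference_benchmark.sh large 8 fp16 true
```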
The following sections provide details on how we achieved our performance and accuracy in training and inference for fine-tuning on question answering. All results are for the BERT-Large model unless otherwise mentioned. All fine-tuning results are on SQuAD v1.1 using a sequence length of 384 unless otherwise mentioned.
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-2 and NVIDIA DGX A100.
DGX System | Nodes x GPUs | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
---|---|---|---|---|---|---|
DGX-2H | 32 x 16 | FP16 | 56, 10 | 2, 6 | 2.67 | 1.69 |
DGX-2H | 32 x 16 | FP32 | 32, 4 | 4, 16 | 8.02 | 1.71 |
DGX A100 | 32 x 8 | FP16 | 312, 40 | 1, 3 | 2.02 | 1.68 |
DGX A100 | 32 x 8 | TF32 | 176, 22 | 2, 6 | 3.57 | 1.67 |
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.12-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.
GPUs | Batch size / GPU: TF32, FP16 | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) |
---|---|---|---|---|---|
8 | 38, 76 | 90.88 | 91.12 | 0.16 | 0.11 |
The following tables compare Final Loss scores across 3 different training runs with different seeds, for both FP16 and TF32. The runs showcase consistent convergence on all 3 seeds with very little deviation.
FP16, 256x GPUs | seed 1 | seed 2 | seed 3 | mean | std |
---|---|---|---|---|---|
Final Loss | 1.657 | 1.661 | 1.683 | 1.667 | 0.014 |
TF32, 256x GPUs | seed 1 | seed 2 | seed 3 | mean | std |
---|---|---|---|---|---|
Final Loss | 1.67 | 1.654 | 1.636 | 1.653 | 0.017 |
The following tables compare F1 scores across 5 different training runs with different seeds, for both FP16 and TF32, using the [NVIDIA Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models). The runs showcase consistent convergence on all 5 seeds with very little deviation.
FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 | 91.12 | 90.80 | 90.94 | 90.90 | 90.94 | 90.94 | 0.11 |
TF32, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 | 90.79 | 90.88 | 90.80 | 90.88 | 90.83 | 90.84 | 0.04 |
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.
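In these tables, the global batch size is the per-GPU batch size multiplied by the number of GPUs and the gradient accumulation steps (for example, 60 × 1 × 1024 = 61440 in the first mixed precision row), throughput speedup is the mixed precision throughput divided by the FP32 throughput (206.5 / 49.97 ≈ 4.13), and weak scaling is the throughput on N GPUs divided by the single-GPU throughput.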
GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 60 , 32 | 1024 , 2048 | 61440 , 65536 | 206.5 | 49.97 | 4.13 | 1.00 | 1.00 |
4 | 128 | 60 , 32 | 256 , 512 | 61440 , 65536 | 789.75 | 194.02 | 4.07 | 3.82 | 3.88 |
8 | 128 | 60 , 32 | 128 , 256 | 61440 , 65536 | 1561.77 | 367.9 | 4.25 | 7.56 | 7.36 |
16 | 128 | 60 , 32 | 64 , 128 | 61440 , 65536 | 3077.99 | 762.22 | 4.04 | 14.9 | 15.25 |
1 | 512 | 10 , 6 | 3072 , 5120 | 30720 , 30720 | 40.95 | 11.06 | 3.70 | 1.00 | 1.00 |
4 | 512 | 10 , 6 | 768 , 1280 | 30720 , 30720 | 158.5 | 43.05 | 3.68 | 3.87 | 3.89 |
8 | 512 | 10 , 6 | 384 , 640 | 30720 , 30720 | 312.03 | 85.51 | 3.65 | 7.62 | 7.73 |
16 | 512 | 10 , 4 | 192 , 512 | 30720 , 32768 | 614.94 | 161.38 | 3.81 | 15.02 | 14.59 |
Note: Values for FP32 runs with a per-GPU batch size of 60 (sequence length 128) or 10 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the run.sub training script in the TensorFlow 21.02-py3 NGC container using multiple NVIDIA DGX-2 nodes with 16x V100 32GB GPUs each. Performance (in sentences per second) is the steady state throughput.
Num Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 60 , 32 | 64 , 128 | 61440 , 65536 | 3528.51 | 841.72 | 4.19 | 1.00 | 1.00 |
4 | 128 | 60 , 32 | 16 , 32 | 61440 , 65536 | 13370.21 | 3060.49 | 4.37 | 3.79 | 3.64 |
16 | 128 | 60 , 32 | 4 , 8 | 61440 , 65536 | 42697.42 | 10383.57 | 4.11 | 12.1 | 12.34 |
32 | 128 | 60 , 32 | 2 , 4 | 61440 , 65536 | 84223.16 | 20094.14 | 4.19 | 23.87 | 23.87 |
1 | 512 | 10 , 4 | 192 , 256 | 30720 , 32768 | 678.35 | 180 | 3.77 | 1.00 | 1.00 |
4 | 512 | 10 , 4 | 96 , 64 | 30720 , 32768 | 2678.29 | 646.76 | 4.14 | 3.95 | 3.59 |
16 | 512 | 10 , 4 | 24 , 32 | 30720 , 32768 | 7834.72 | 2204.72 | 3.55 | 11.55 | 12.25 |
32 | 512 | 10 , 4 | 6 , 16 | 30720 , 32768 | 18786.93 | 4196.15 | 4.48 | 27.70 | 23.31 |
Note: Values for FP32 runs with a per-GPU batch size of 60 (sequence length 128) or 10 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the steady state throughput.
GPUs | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 312 , 176 | 256 , 512 | 79872 , 90112 | 485.59 | 282.98 | 1.72 | 1.00 | 1.00 |
8 | 128 | 312 , 176 | 32 , 64 | 79872 , 90112 | 3799.24 | 1944.77 | 1.95 | 7.82 | 6.87 |
1 | 512 | 40 , 22 | 768 , 1536 | 30720 , 33792 | 96.52 | 54.92 | 1.76 | 1.00 | 1.00 |
8 | 512 | 40 , 22 | 96 , 192 | 30720 , 33792 | 649.69 | 427.39 | 1.52 | 6.73 | 7.78 |
Note: Values for TF32 runs with a per-GPU batch size of 312 (sequence length 128) or 40 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container using multiple NVIDIA DGX A100 nodes with 8x A100 40GB GPUs each. Performance (in sentences per second) is the steady state throughput.
Num Nodes | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 312 , 176 | 32 , 64 | 79872 , 90112 | 3803.82 | 2062.98 | 1.84 | 1.00 | 1.00 |
2 | 128 | 312 , 176 | 16 , 32 | 79872 , 90112 | 7551.37 | 4084.76 | 1.85 | 1.99 | 1.98 |
8 | 128 | 312 , 176 | 4 , 8 | 79872 , 90112 | 29711.11 | 16134.02 | 1.84 | 7.81 | 7.82 |
32 | 128 | 312 , 176 | 1 , 2 | 79872 , 90112 | 110280.73 | 59569.77 | 1.85 | 28.99 | 28.88 |
1 | 512 | 40 , 22 | 96 , 192 | 30720 , 33792 | 749.73 | 431.89 | 1.74 | 1.00 | 1.00 |
2 | 512 | 40 , 22 | 48 , 96 | 30720 , 33792 | 1491.87 | 739.14 | 2.02 | 1.99 | 1.71 |
8 | 512 | 40 , 22 | 12 , 24 | 30720 , 33792 | 5870.83 | 2926.58 | 2.01 | 7.83 | 6.78 |
32 | 512 | 40 , 22 | 3 , 6 | 30720 , 33792 | 22506.23 | 11240.5 | 2.00 | 30.02 | 26.03 |
Note: Values for TF32 runs with a per-GPU batch size of 312 (sequence length 128) or 40 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 6,3 | 39.10 | 9.85 | 3.97 | 1.00 | 1.00 |
4 | 6,3 | 128.48 | 36.52 | 3.52 | 3.29 | 3.71 |
8 | 6,3 | 255.36 | 73.03 | 3.5 | 6.53 | 7.41 |
Note: Values for FP32 runs with a batch size of 6 are not available because they run out of memory; a batch size of 6 is only possible with FP16.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 12,8 | 47.06 | 11.11 | 4.24 | 1.00 | 1.00 |
4 | 12,8 | 165.26 | 42.84 | 3.86 | 3.51 | 3.86 |
8 | 12,8 | 330.29 | 85.91 | 3.84 | 7.02 | 7.73 |
Note: Values for FP32 runs with a batch size of 12 are not available because they run out of memory; a batch size of 12 is only possible with FP16.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 76,38 | 134.22 | 43.9 | 3.057 | 1.00 | 1.00 |
8 | 76,38 | 1048.23 | 341.31 | 3.071 | 7.81 | 7.77 |
Note: Values for TF32 runs with a batch size of 76 are not available because they run out of memory; a batch size of 76 is only possible with FP16.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
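The percentile latencies reported below can be recovered from raw per-batch timings; the following is a minimal sketch, assuming a hypothetical file latencies.txt that holds one per-batch latency in milliseconds per line:

```bash
# Sort the per-batch latencies, then print the mean and the 90th/95th/99th
# percentiles (nearest-rank method). latencies.txt is a hypothetical input.
sort -n latencies.txt | awk '{ a[NR] = $1; s += $1 }
END {
  printf "avg %.2f ms | p90 %.2f | p95 %.2f | p99 %.2f\n",
         s / NR, a[int(NR * 0.90)], a[int(NR * 0.95)], a[int(NR * 0.99)]
}'
```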
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 105.04 | 1.277237354 | 9.52 | 9.67 | 9.77 | 10.16 |
128 | 2 | 184.9 | 1.671487977 | 10.82 | 11.15 | 11.27 | 11.8 |
128 | 4 | 301.9 | 2.448102498 | 13.25 | 13.38 | 13.45 | 13.96 |
128 | 8 | 421.98 | 3.149809659 | 18.96 | 19.12 | 19.2 | 19.82 |
384 | 1 | 74.99 | 2.15055922 | 13.34 | 13.5 | 13.58 | 14.53 |
384 | 2 | 109.84 | 2.709422792 | 18.21 | 18.4 | 18.6 | 19.39 |
384 | 4 | 142.58 | 3.313502208 | 28.05 | 28.28 | 28.48 | 28.85 |
384 | 8 | 168.34 | 3.823302294 | 47.52 | 47.74 | 47.86 | 48.52 |
BERT-Large FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 82.24 | 12.16 | 12.28 | 12.33 | 12.92 |
128 | 2 | 110.62 | 18.08 | 18.22 | 18.28 | 18.88 |
128 | 4 | 123.32 | 32.44 | 32.72 | 32.82 | 32.98 |
128 | 8 | 133.97 | 59.71 | 60.29 | 60.49 | 60.69 |
384 | 1 | 34.87 | 28.67 | 28.92 | 29.02 | 29.33 |
384 | 2 | 40.54 | 49.34 | 49.74 | 49.86 | 50.05 |
384 | 4 | 43.03 | 92.97 | 93.59 | 93.75 | 94.57 |
384 | 8 | 44.03 | 181.71 | 182.34 | 182.48 | 183.03 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 236.26 | 1.179589595 | 4.23 | 4.37 | 4.49 | 4.59 |
128 | 2 | 425.1 | 1.441554478 | 4.7 | 4.84 | 4.97 | 5.26 |
128 | 4 | 710.48 | 1.911691107 | 5.63 | 5.78 | 5.93 | 6.4 |
128 | 8 | 1081.17 | 2.523032764 | 7.4 | 7.5 | 7.54 | 7.73 |
384 | 1 | 190.53 | 1.757170525 | 5.25 | 5.35 | 5.42 | 5.8 |
384 | 2 | 289.67 | 2.248292456 | 6.9 | 7.08 | 7.24 | 7.57 |
384 | 4 | 404.03 | 2.946328302 | 9.9 | 10 | 10.03 | 10.13 |
384 | 8 | 504.24 | 3.450153951 | 15.87 | 15.96 | 16.01 | 16.3 |
BERT-Base FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 200.29 | 4.99 | 5.08 | 5.16 | 5.53 |
128 | 2 | 294.89 | 6.78 | 6.89 | 6.93 | 7.37 |
128 | 4 | 371.65 | 10.76 | 10.89 | 10.96 | 11.92 |
128 | 8 | 428.52 | 18.67 | 18.89 | 18.98 | 19.17 |
384 | 1 | 108.43 | 9.22 | 9.26 | 9.31 | 10.24 |
384 | 2 | 128.84 | 15.52 | 15.6 | 15.71 | 16.49 |
384 | 4 | 137.13 | 29.17 | 29.4 | 29.48 | 29.64 |
384 | 8 | 146.15 | 54.74 | 55.19 | 55.3 | 55.54 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 101.58 | 1.242112986 | 9.84 | 9.99 | 10.06 | 10.39 |
128 | 2 | 181.89 | 1.651593571 | 11 | 11.14 | 11.2 | 11.87 |
128 | 4 | 295.86 | 2.348840902 | 13.52 | 13.67 | 13.75 | 14.5 |
128 | 8 | 411.29 | 3.010246652 | 19.45 | 19.62 | 19.69 | 20.4 |
384 | 1 | 72.95 | 2.083690374 | 13.71 | 13.93 | 14.08 | 14.81 |
384 | 2 | 107.02 | 2.583775954 | 18.69 | 18.8 | 18.88 | 19.57 |
384 | 4 | 139.8 | 3.14652262 | 28.61 | 28.75 | 28.88 | 29.6 |
384 | 8 | 163.68 | 3.595782074 | 48.88 | 49.09 | 49.18 | 49.77 |
BERT-Large FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 81.78 | 12.23 | 12.37 | 12.43 | 13.2 |
128 | 2 | 110.13 | 18.16 | 18.29 | 18.37 | 19.27 |
128 | 4 | 125.96 | 31.76 | 32.09 | 32.21 | 32.42 |
128 | 8 | 136.63 | 58.55 | 58.93 | 59.05 | 59.4 |
384 | 1 | 35.01 | 28.56 | 28.81 | 28.94 | 29.16 |
384 | 2 | 41.42 | 48.29 | 48.57 | 48.67 | 49.02 |
384 | 4 | 44.43 | 90.03 | 90.43 | 90.59 | 90.89 |
384 | 8 | 45.52 | 175.76 | 176.66 | 176.89 | 177.33 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 234.85 | 1.217533309 | 4.26 | 4.33 | 4.37 | 4.62 |
128 | 2 | 415.86 | 1.435782351 | 4.81 | 4.92 | 5.06 | 5.55 |
128 | 4 | 680.09 | 1.84912586 | 5.88 | 6.1 | 6.2 | 6.53 |
128 | 8 | 1030.03 | 2.264548752 | 7.77 | 7.87 | 7.95 | 8.53 |
384 | 1 | 183.18 | 1.700993593 | 5.46 | 5.56 | 5.61 | 5.93 |
384 | 2 | 275.77 | 2.175528558 | 7.25 | 7.38 | 7.44 | 7.89 |
384 | 4 | 385.61 | 2.778570399 | 10.37 | 10.56 | 10.63 | 11.1 |
384 | 8 | 488.45 | 3.292329469 | 16.38 | 16.48 | 16.52 | 16.64 |
BERT-Base FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 192.89 | 5.18 | 5.3 | 5.36 | 5.65 |
128 | 2 | 289.64 | 6.91 | 7 | 7.22 | 7.83 |
128 | 4 | 367.79 | 10.88 | 10.98 | 11.02 | 11.59 |
128 | 8 | 454.85 | 17.59 | 17.76 | 17.81 | 17.92 |
384 | 1 | 107.69 | 9.29 | 9.37 | 9.42 | 9.88 |
384 | 2 | 126.76 | 15.78 | 15.89 | 15.97 | 16.72 |
384 | 4 | 138.78 | 28.82 | 28.98 | 29.06 | 29.88 |
384 | 8 | 148.36 | 53.92 | 54.16 | 54.26 | 54.58 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (TF32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 145.21 | 0.9435347628 | 6.89 | 7.14 | 7.4 | 8.35 |
128 | 2 | 272.81 | 1.093953003 | 7.33 | 7.61 | 7.77 | 8.35 |
128 | 4 | 468.98 | 1.273087573 | 8.53 | 8.71 | 8.83 | 9.85 |
128 | 8 | 705.67 | 1.191627687 | 11.34 | 11.64 | 11.9 | 13.1 |
384 | 1 | 118.34 | 1.042459479 | 8.45 | 8.82 | 8.99 | 9.52 |
384 | 2 | 197.8 | 1.231478023 | 10.11 | 10.48 | 10.62 | 11.4 |
384 | 4 | 275.19 | 1.268332027 | 14.54 | 14.73 | 14.8 | 16.8 |
384 | 8 | 342.22 | 1.416004634 | 23.38 | 23.64 | 23.75 | 24.1 |
BERT-Large TF32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 153.9 | 6.5 | 6.76 | 6.86 | 7.4 |
128 | 2 | 249.38 | 8.02 | 8.22 | 8.34 | 9.45 |
128 | 4 | 368.38 | 10.86 | 11.11 | 11.24 | 12.76 |
128 | 8 | 592.19 | 13.51 | 13.64 | 13.77 | 15.85 |
384 | 1 | 113.52 | 8.81 | 9.02 | 9.16 | 10.19 |
384 | 2 | 160.62 | 12.45 | 12.61 | 12.68 | 14.47 |
384 | 4 | 216.97 | 18.44 | 18.6 | 18.7 | 18.84 |
384 | 8 | 241.68 | 33.1 | 33.29 | 33.36 | 33.5 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (TF32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 295.01 | 1.014023992 | 3.39 | 3.59 | 3.65 | 3.73 |
128 | 2 | 594.81 | 1.048455898 | 3.36 | 3.59 | 3.68 | 4.19 |
128 | 4 | 1043.12 | 1.005145599 | 3.83 | 3.97 | 4.2 | 4.44 |
128 | 8 | 1786.25 | 1.198278638 | 4.48 | 4.73 | 4.8 | 5.19 |
384 | 1 | 278.85 | 1.103395062 | 3.59 | 3.67 | 3.99 | 4.15 |
384 | 2 | 464.77 | 1.252006896 | 4.3 | 4.59 | 4.87 | 5.29 |
384 | 4 | 675.82 | 1.264822578 | 5.92 | 6.15 | 6.27 | 6.94 |
384 | 8 | 846.81 | 1.31109494 | 9.45 | 9.65 | 9.74 | 11.03 |
BERT-Base TF32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 290.93 | 3.44 | 3.61 | 3.73 | 4.69 |
128 | 2 | 567.32 | 3.53 | 3.64 | 3.96 | 5.01 |
128 | 4 | 1037.78 | 3.85 | 3.95 | 4.06 | 4.58 |
128 | 8 | 1490.68 | 5.37 | 5.61 | 5.66 | 6.19 |
384 | 1 | 252.72 | 3.96 | 3.96 | 4.52 | 4.66 |
384 | 2 | 371.22 | 5.39 | 5.64 | 5.71 | 6.38 |
384 | 4 | 534.32 | 7.49 | 7.69 | 7.76 | 8.56 |
384 | 8 | 645.88 | 12.39 | 12.61 | 12.67 | 12.77 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 57.6 | 1.364605544 | 17.36 | 18.16 | 19.02 | 21.67 |
128 | 2 | 102.76 | 2.17988969 | 19.46 | 20.68 | 21.27 | 22.2 |
128 | 4 | 151.11 | 3.146813828 | 26.47 | 26.9 | 27.06 | 27.45 |
128 | 8 | 186.99 | 3.733080455 | 42.78 | 43.87 | 44.18 | 44.78 |
384 | 1 | 38.88 | 2.590273151 | 25.72 | 26.06 | 26.16 | 26.38 |
384 | 2 | 50.53 | 3.202154626 | 39.58 | 39.93 | 40.35 | 40.95 |
384 | 4 | 57.69 | 3.721935484 | 69.34 | 70.5 | 70.77 | 71.09 |
384 | 8 | 62.99 | 3.927057357 | 127 | 129.18 | 130.07 | 131.86 |
BERT-Large FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 42.21 | 23.69 | 24.8 | 25.02 | 25.48 |
128 | 2 | 47.14 | 42.42 | 43.48 | 43.63 | 44.32 |
128 | 4 | 48.02 | 83.29 | 84.37 | 84.68 | 85.14 |
128 | 8 | 50.09 | 159.72 | 161.66 | 161.97 | 162.52 |
384 | 1 | 15.01 | 66.63 | 67.76 | 68.08 | 68.66 |
384 | 2 | 15.78 | 126.78 | 128.21 | 128.58 | 129.08 |
384 | 4 | 15.5 | 258.1 | 261.01 | 261.66 | 262.55 |
384 | 8 | 16.04 | 498.61 | 504.29 | 504.74 | 505.55 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 116.56 | 1.039878669 | 8.58 | 9.53 | 10.84 | 11.74 |
128 | 2 | 238.62 | 1.675937632 | 8.38 | 9.09 | 9.27 | 12.33 |
128 | 4 | 402.93 | 2.440964439 | 9.93 | 10.07 | 10.13 | 12.17 |
128 | 8 | 532.56 | 3.052619512 | 15.02 | 15.43 | 15.6 | 16.52 |
384 | 1 | 102.12 | 2.035073735 | 9.79 | 11.06 | 11.18 | 12.07 |
384 | 2 | 149.3 | 2.910898811 | 13.4 | 13.54 | 13.62 | 14.36 |
384 | 4 | 177.78 | 3.563439567 | 22.5 | 23.11 | 23.27 | 23.59 |
384 | 8 | 192.61 | 3.752386519 | 41.53 | 42.67 | 42.81 | 43.31 |
BERT-Base FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 112.09 | 8.92 | 9.12 | 9.49 | 10.93 |
128 | 2 | 142.38 | 14.05 | 14.34 | 14.48 | 15.03 |
128 | 4 | 165.07 | 24.23 | 24.86 | 24.92 | 25.05 |
128 | 8 | 174.46 | 45.86 | 46.71 | 46.8 | 47.2 |
384 | 1 | 50.18 | 19.93 | 20.53 | 21.04 | 21.73 |
384 | 2 | 51.29 | 38.99 | 39.68 | 39.93 | 40.2 |
384 | 4 | 49.89 | 80.18 | 81.54 | 82 | 82.65 |
384 | 8 | 51.33 | 155.85 | 158.11 | 158.5 | 159.17 |
To achieve these same results, follow the Quick Start Guide outlined above.