The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks that measure model performance in training and inference modes. Both benchmarking scripts let you run a number of epochs, extract performance numbers, and run the BERT model for fine-tuning.
Training benchmarking can be performed by running the script:
```bash
scripts/finetune_train_benchmark.sh <bert_model> <use_xla> <num_gpu> squad
```
This script runs 2 epochs by default on the SQuAD v1.1 dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32. These numbers are saved to `/results/squad_train_benchmark_bert_<bert_model>_gpu_<num_gpu>.log`.
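For example, to benchmark BERT-large fine-tuning on 8 GPUs with XLA enabled (the values `large` and `true` are assumed to be accepted for `<bert_model>` and `<use_xla>`; check the script header if in doubt):

```bash
bash scripts/finetune_train_benchmark.sh large true 8 squad
```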
Inference benchmarking can be performed by running the script:
```bash
scripts/finetune_inference_benchmark.sh squad
```
This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32, for base and large models. These numbers are saved to `/results/squad_inference_benchmark_bert_<bert_model>.log`.
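For example, the following runs the benchmark and then inspects the resulting log (the path assumes the default `/results` volume and the large model):

```bash
bash scripts/finetune_inference_benchmark.sh squad
tail /results/squad_inference_benchmark_bert_large.log
```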
The following sections provide details on how we achieved our performance and accuracy in training and inference, for pre-training with the LAMB optimizer as well as fine-tuning for question answering. All results are for the BERT-large model unless otherwise noted. All fine-tuning results are on SQuAD v1.1 with a sequence length of 384 unless otherwise noted.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 20.06-py3 NGC container.
DGX System | Nodes x GPUs | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
---|---|---|---|---|---|---|
DGX2H | 32 x 16 | FP16 | 64, 8 | 2, 8 | 2.63 | 1.59 |
DGX2H | 32 x 16 | FP32 | 32, 8 | 4, 8 | 8.48 | 1.56 |
DGXA100 | 32 x 8 | FP16 | 64, 16 | 4, 8 | 3.24 | 1.56 |
DGXA100 | 32 x 8 | TF32 | 64, 8 | 4, 16 | 4.58 | 1.58 |
Note: Time to train includes up to 16 minutes of startup time for every restart (at least once per phase). Experiments were run on clusters with a maximum wall-clock time of 8 hours.
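As an illustration, a single-node 8-GPU FP16 run could be launched as below. The positional argument order shown (phase 1 and phase 2 batch sizes, eval batch size, phase 1 and phase 2 learning rates, precision, XLA flag, GPU count, warmup steps for both phases, total steps, checkpoint interval, accumulation steps for both phases, and model size) is our reading of the script's usage notes and should be verified against the script header:

```bash
# Hypothetical invocation; verify the argument order against scripts/run_pretraining_lamb.sh.
bash scripts/run_pretraining_lamb.sh 64 8 8 7.5e-4 5e-4 fp16 true 8 \
    2000 200 7820 100 128 384 large
```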
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs.
GPUs | Batch size / GPU: TF32, FP16 | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) |
---|---|---|---|---|---|
8 | 16, 24 | 91.41 | 91.52 | 0.26 | 0.26 |
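For illustration, the mixed-precision row above might be reproduced with an invocation along the following lines; the positional parameters (batch size per GPU, learning rate, precision, XLA flag, GPU count, sequence length, doc stride, model size, SQuAD version) are an assumption to be checked against the script header:

```bash
# Hypothetical invocation; confirm the positional parameters in scripts/run_squad.sh.
bash scripts/run_squad.sh 24 5e-6 fp16 true 8 384 128 large 1.1
```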
Our results were obtained by running the `scripts/run_glue.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs, for 10 different seeds, picking the maximum accuracy on the MRPC dev set.
GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) | Throughput - TF32 | Throughput - mixed precision |
---|---|---|---|---|---|---|---|
8 | 16 | 87.99 | 87.09 | 0.009 | 0.009 | 357.91 | 230.16 |
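A hypothetical invocation for one MRPC seed is sketched below; the parameter order (task name, batch size per GPU, learning rate, precision, XLA flag, GPU count) is an assumption and should be checked against the script header:

```bash
# Hypothetical invocation; confirm the positional parameters in scripts/run_glue.sh.
bash scripts/run_glue.sh MRPC 16 3e-6 fp16 true 8
```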
The following tables compare Final Loss scores across two training runs with different seeds, for both FP16 and TF32. The runs show consistent convergence on both seeds with very little deviation.
FP16, 256x GPUs | seed 1 | seed 2 | mean | std |
---|---|---|---|---|
Final Loss | 1.570 | 1.561 | 1.565 | 0.006 |
TF32, 256x GPUs | seed 1 | seed 2 | mean | std |
---|---|---|---|---|
Final Loss | 1.583 | 1.582 | 1.582 | 0.0007 |
The following tables compare F1 scores across 5 training runs with different seeds, for FP16 and TF32, using [NVIDIA's Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs show consistent convergence on all 5 seeds with very little deviation.
FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 | 91.61 | 91.04 | 91.59 | 91.32 | 91.52 | 91.41 | 0.24 |
TF32, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 | 91.50 | 91.49 | 91.64 | 91.29 | 91.67 | 91.52 | 0.15 |
The following tables compare Eval Accuracy scores across 10 training runs with different seeds, for FP16 and TF32, using [NVIDIA's Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs show consistent convergence on all 10 seeds with very little deviation.
FP16, 8 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | seed 6 | seed 7 | seed 8 | seed 9 | seed 10 | mean | std |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Eval Accuracy | 84.31 | 85.78 | 86.76 | 87.01 | 86.27 | 86.27 | 85.54 | 86.52 | 86.27 | 85.29 | 86.00 | 0.80 |
TF32, 8 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | seed 6 | seed 7 | seed 8 | seed 9 | seed 10 | mean | std |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Eval Accuracy | 87.01 | 86.27 | 87.99 | 86.27 | 86.03 | 87.01 | 86.27 | 86.52 | 87.75 | 86.03 | 86.72 | 0.70 |
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 16 , 8 | 4096, 8192 | 65536 | 134.34 | 39.43 | 3.41 | 1.00 | 1.00 |
4 | 128 | 16 , 8 | 1024, 2048 | 65536 | 449.68 | 152.33 | 2.95 | 3.35 | 3.86 |
8 | 128 | 16 , 8 | 512, 1024 | 65536 | 1001.39 | 285.79 | 3.50 | 7.45 | 7.25 |
1 | 512 | 4 , 2 | 8192, 16384 | 32768 | 28.72 | 9.80 | 2.93 | 1.00 | 1.00 |
4 | 512 | 4 , 2 | 2048, 4096 | 32768 | 109.96 | 35.32 | 3.11 | 3.83 | 3.60 |
8 | 512 | 4 , 2 | 1024, 2048 | 32768 | 190.65 | 69.53 | 2.74 | 6.64 | 7.09 |
Note: FP32 results for a batch size of 16 at sequence length 128 and a batch size of 4 at sequence length 512 are not available because those configurations run out of memory.
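The derived columns follow directly from the raw throughputs. Taking the 8-GPU, sequence-length-128 row as a worked example: throughput speedup = 1001.39 / 285.79 ≈ 3.50, and weak scaling for mixed precision = 1001.39 / 134.34 ≈ 7.45 (the 8-GPU throughput divided by the corresponding 1-GPU throughput). The same arithmetic applies to every throughput table below.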
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 4,2 | 29.74 | 7.36 | 4.04 | 1.00 | 1.00 |
4 | 4,2 | 97.28 | 26.64 | 3.65 | 3.27 | 3.62 |
8 | 4,2 | 189.77 | 52.39 | 3.62 | 6.38 | 7.12 |
Note: FP32 results for a batch size of 4 are not available because that configuration runs out of memory.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 64 , 32 | 1024, 2048 | 65536 | 168.63 | 46.78 | 3.60 | 1.00 | 1.00 |
4 | 128 | 64 , 32 | 256, 512 | 65536 | 730.25 | 179.73 | 4.06 | 4.33 | 3.84 |
8 | 128 | 64 , 32 | 128, 256 | 65536 | 1443.05 | 357.00 | 4.04 | 8.56 | 7.63 |
1 | 512 | 8 , 8 | 4096, 4096 | 32768 | 31.23 | 10.67 | 2.93 | 1.00 | 1.00 |
4 | 512 | 8 , 8 | 1024, 1024 | 32768 | 118.84 | 39.55 | 3.00 | 3.81 | 3.71 |
8 | 512 | 8 , 8 | 512, 512 | 32768 | 255.64 | 81.42 | 3.14 | 8.19 | 7.63 |
Note: FP32 results for a batch size of 64 at sequence length 128 are not available because that configuration runs out of memory.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 24, 10 | 51.02 | 10.42 | 4.90 | 1.00 | 1.00 |
4 | 24, 10 | 181.37 | 39.77 | 4.56 | 3.55 | 3.82 |
8 | 24, 10 | 314.6 | 79.37 | 3.96 | 6.17 | 7.62 |
Note: FP32 results for a batch size of 24 are not available because that configuration runs out of memory.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 64 , 32 | 1024 , 8192 | 65536 | 188.04 | 35.32 | 5.32 | 1.00 | 1.00 |
4 | 128 | 64 , 32 | 256 , 2048 | 65536 | 790.89 | 193.08 | 4.10 | 4.21 | 5.47 |
8 | 128 | 64 , 32 | 128 , 1024 | 65536 | 1556.89 | 386.89 | 4.02 | 8.28 | 10.95 |
16 | 128 | 64 , 32 | 64 , 128 | 65536 | 3081.69 | 761.92 | 4.04 | 16.39 | 21.57 |
1 | 512 | 8 , 8 | 4096 , 4096 | 32768 | 35.32 | 11.67 | 3.03 | 1.00 | 1.00 |
4 | 512 | 8 , 8 | 1024 , 1024 | 32768 | 128.98 | 42.84 | 3.01 | 3.65 | 3.67 |
8 | 512 | 8 , 8 | 512 , 512 | 32768 | 274.04 | 86.78 | 3.16 | 7.76 | 7.44 |
16 | 512 | 8 , 8 | 256 , 256 | 32768 | 513.43 | 173.26 | 2.96 | 14.54 | 14.85 |
Note: FP32 results for a batch size of 64 at sequence length 128 are not available because that configuration runs out of memory.
Our results were obtained by running the `run.sub` training script in the TensorFlow 19.08-py3 NGC container on multiple NVIDIA DGX-2 nodes, each with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady-state throughput.
Num Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 64 , 32 | 64 , 128 | 65536 | 3081.69 | 761.92 | 4.04 | 1.00 | 1.00 |
4 | 128 | 64 , 32 | 16 , 32 | 65536 | 13192.00 | 3389.83 | 3.89 | 4.28 | 4.45 |
16 | 128 | 64 , 32 | 4 , 8 | 65536 | 48223.00 | 13217.78 | 3.65 | 15.65 | 17.35 |
32 | 128 | 64 , 32 | 2 , 4 | 65536 | 86673.64 | 25142.26 | 3.45 | 28.13 | 33.00 |
1 | 512 | 8 , 8 | 256 , 256 | 32768 | 577.79 | 173.26 | 3.33 | 1.00 | 1.00 |
4 | 512 | 8 , 8 | 64 , 64 | 32768 | 2284.23 | 765.04 | 2.99 | 3.95 | 4.42 |
16 | 512 | 8 , 8 | 16 , 16 | 32768 | 8853.00 | 3001.43 | 2.95 | 15.32 | 17.32 |
32 | 512 | 8 , 8 | 8 , 8 | 32768 | 17059.00 | 5893.14 | 2.89 | 29.52 | 34.01 |
Note: FP32 results for a batch size of 64 at sequence length 128 are not available because that configuration runs out of memory.
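Multi-node runs are launched through a batch scheduler. A hypothetical SLURM submission for the 4-node configuration above is sketched below; the resource flags and any environment variables that `run.sub` reads depend on your cluster, so treat this as an assumption to adapt rather than a fixed recipe:

```bash
# Hypothetical SLURM submission for 4 DGX-2 nodes (16 GPUs each);
# run.sub is a batch script whose configuration comes from the environment.
sbatch -N 4 --ntasks-per-node=16 run.sub
```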
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 24, 10 | 55.28 | 11.15 | 4.96 | 1.00 | 1.00 |
4 | 24, 10 | 199.53 | 42.91 | 4.65 | 3.61 | 3.85 |
8 | 24, 10 | 341.55 | 85.08 | 4.01 | 6.18 | 7.63 |
16 | 24, 10 | 683.37 | 156.29 | 4.37 | 12.36 | 14.02 |
Note: FP32 results for a batch size of 24 are not available because that configuration runs out of memory.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 64 , 64 | 1024 , 1024 | 65536 | 356.845 | 238.10 | 1.50 | 1.00 | 1.00 |
4 | 128 | 64 , 64 | 256 , 256 | 65536 | 1422.25 | 952.39 | 1.49 | 3.99 | 4.00 |
8 | 128 | 64 , 64 | 128 , 128 | 65536 | 2871.89 | 1889.71 | 1.52 | 8.05 | 7.94 |
1 | 512 | 16 , 8 | 2048 , 4096 | 32768 | 70.856 | 39.96 | 1.77 | 1.00 | 1.00 |
4 | 512 | 16 , 8 | 512 , 1024 | 32768 | 284.912 | 160.16 | 1.78 | 4.02 | 4.01 |
8 | 512 | 16 , 8 | 256 , 512 | 32768 | 572.112 | 316.51 | 1.81 | 8.07 | 7.92 |
Note: TF32 results for a batch size of 16 at sequence length 512 are not available because that configuration runs out of memory.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 20.06-py3 NGC container on multiple NVIDIA DGX A100 nodes, each with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady-state throughput.
Num Nodes | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 64 , 64 | 128 , 128 | 65536 | 2871.89 | 1889.71 | 1.52 | 1.00 | 1.00 |
4 | 128 | 64 , 64 | 32 , 32 | 65536 | 11159 | 7532.00 | 1.48 | 3.89 | 3.99 |
16 | 128 | 64 , 64 | 8 , 8 | 65536 | 41144 | 28605.62 | 1.44 | 14.33 | 15.14 |
32 | 128 | 64 , 64 | 4 , 4 | 65536 | 77479.87 | 53585.82 | 1.45 | 26.98 | 28.36 |
1 | 512 | 16 , 8 | 256 , 512 | 32768 | 572.112 | 316.51 | 1.81 | 1.00 | 1.00 |
4 | 512 | 16 , 8 | 128 , 128 | 65536 | 2197.44 | 1268.43 | 1.73 | 3.84 | 4.01 |
16 | 512 | 16 , 8 | 32 , 32 | 65536 | 8723.1 | 4903.39 | 1.78 | 15.25 | 15.49 |
32 | 512 | 16 , 8 | 16 , 16 | 65536 | 16705 | 9463.80 | 1.77 | 29.20 | 29.90 |
Note: TF32 results for a batch size of 16 at sequence length 512 are not available because that configuration runs out of memory.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady-state throughput.
GPUs | Batch size / GPU: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 32, 16 | 102.26 | 61.364 | 1.67 | 1.00 | 1.00 |
4 | 32, 16 | 366.353 | 223.187 | 1.64 | 3.64 | 3.58 |
8 | 32, 16 | 767.071 | 440.47 | 1.74 | 7.18 | 7.50 |
Note: TF32 results for a batch size of 32 are not available because that configuration runs out of memory.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).
Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
---|---|---|---|---|---|---|---|---|
base | 128 | 1 | fp16 | 206.82 | 7.96 | 4.98 | 5.04 | 5.23 |
base | 128 | 2 | fp16 | 376.75 | 8.68 | 5.42 | 5.49 | 5.64 |
base | 128 | 4 | fp16 | 635 | 12.31 | 6.46 | 6.55 | 6.83 |
base | 128 | 8 | fp16 | 962.83 | 13.64 | 8.47 | 8.56 | 8.75 |
base | 384 | 1 | fp16 | 167.01 | 12.77 | 6.12 | 6.23 | 6.52 |
base | 384 | 2 | fp16 | 252.12 | 21.05 | 8.03 | 8.09 | 8.61 |
base | 384 | 4 | fp16 | 341.95 | 25.09 | 11.88 | 11.96 | 12.52 |
base | 384 | 8 | fp16 | 421.26 | 33.16 | 19.2 | 19.37 | 19.91 |
base | 128 | 1 | fp32 | 174.48 | 8.17 | 5.89 | 5.95 | 6.12 |
base | 128 | 2 | fp32 | 263.67 | 10.33 | 7.66 | 7.69 | 7.92 |
base | 128 | 4 | fp32 | 349.34 | 16.31 | 11.57 | 11.62 | 11.87 |
base | 128 | 8 | fp32 | 422.88 | 23.27 | 19.23 | 19.38 | 20.38 |
base | 384 | 1 | fp32 | 99.52 | 14.99 | 10.19 | 10.23 | 10.78 |
base | 384 | 2 | fp32 | 118.01 | 25.98 | 17.12 | 17.18 | 17.78 |
base | 384 | 4 | fp32 | 128.1 | 41 | 31.56 | 31.7 | 32.39 |
base | 384 | 8 | fp32 | 136.1 | 69.77 | 59.44 | 59.66 | 60.51 |
large | 128 | 1 | fp16 | 98.63 | 15.86 | 10.27 | 10.31 | 10.46 |
large | 128 | 2 | fp16 | 172.59 | 17.78 | 11.81 | 11.86 | 12.13 |
large | 128 | 4 | fp16 | 272.86 | 25.66 | 14.86 | 14.94 | 15.18 |
large | 128 | 8 | fp16 | 385.64 | 30.74 | 20.98 | 21.1 | 21.68 |
large | 384 | 1 | fp16 | 70.74 | 26.85 | 14.38 | 14.47 | 14.7 |
large | 384 | 2 | fp16 | 99.9 | 45.29 | 20.26 | 20.43 | 21.11 |
large | 384 | 4 | fp16 | 128.42 | 56.94 | 31.44 | 31.71 | 32.45 |
large | 384 | 8 | fp16 | 148.57 | 81.69 | 54.23 | 54.54 | 55.53 |
large | 128 | 1 | fp32 | 76.75 | 17.06 | 13.21 | 13.27 | 13.4 |
large | 128 | 2 | fp32 | 100.82 | 24.34 | 20.05 | 20.13 | 21.13 |
large | 128 | 4 | fp32 | 117.59 | 41.76 | 34.42 | 34.55 | 35.29 |
large | 128 | 8 | fp32 | 130.42 | 68.59 | 62 | 62.23 | 62.98 |
large | 384 | 1 | fp32 | 33.95 | 37.89 | 29.82 | 29.98 | 30.56 |
large | 384 | 2 | fp32 | 38.47 | 68.35 | 52.56 | 52.74 | 53.89 |
large | 384 | 4 | fp32 | 41.11 | 114.27 | 98.19 | 98.54 | 99.54 |
large | 384 | 8 | fp32 | 41.32 | 213.84 | 194.92 | 195.36 | 196.94 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).
Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
---|---|---|---|---|---|---|---|---|
base | 128 | 1 | fp16 | 207.87 | 7.63 | 4.94 | 5.03 | 5.32 |
base | 128 | 2 | fp16 | 376.44 | 8.47 | 5.44 | 5.5 | 5.68 |
base | 128 | 4 | fp16 | 642.55 | 11.63 | 6.3 | 6.36 | 6.68 |
base | 128 | 8 | fp16 | 943.85 | 13.24 | 8.56 | 8.68 | 8.92 |
base | 384 | 1 | fp16 | 162.62 | 12.24 | 6.31 | 6.4 | 6.73 |
base | 384 | 2 | fp16 | 244.15 | 20.05 | 8.34 | 8.41 | 8.93 |
base | 384 | 4 | fp16 | 338.68 | 23.53 | 11.88 | 11.92 | 12.63 |
base | 384 | 8 | fp16 | 407.46 | 32.72 | 19.84 | 20.06 | 20.89 |
base | 128 | 1 | fp32 | 175.16 | 8.31 | 5.85 | 5.89 | 6.04 |
base | 128 | 2 | fp32 | 261.31 | 10.48 | 7.75 | 7.81 | 8.08 |
base | 128 | 4 | fp32 | 339.45 | 16.67 | 11.95 | 12.02 | 12.46 |
base | 128 | 8 | fp32 | 406.67 | 24.12 | 19.86 | 19.97 | 20.41 |
base | 384 | 1 | fp32 | 98.33 | 15.28 | 10.27 | 10.32 | 10.76 |
base | 384 | 2 | fp32 | 114.92 | 26.88 | 17.55 | 17.59 | 18.29 |
base | 384 | 4 | fp32 | 125.76 | 41.74 | 32.06 | 32.23 | 33.72 |
base | 384 | 8 | fp32 | 136.62 | 69.78 | 58.95 | 59.19 | 60 |
large | 128 | 1 | fp16 | 96.46 | 15.56 | 10.56 | 10.66 | 11.02 |
large | 128 | 2 | fp16 | 168.31 | 17.42 | 12.11 | 12.25 | 12.57 |
large | 128 | 4 | fp16 | 267.76 | 24.76 | 15.17 | 15.36 | 16.68 |
large | 128 | 8 | fp16 | 378.28 | 30.34 | 21.39 | 21.54 | 21.97 |
large | 384 | 1 | fp16 | 68.75 | 26.02 | 14.77 | 14.94 | 15.3 |
large | 384 | 2 | fp16 | 95.41 | 44.01 | 21.24 | 21.47 | 22.01 |
large | 384 | 4 | fp16 | 124.43 | 55.14 | 32.53 | 32.83 | 33.58 |
large | 384 | 8 | fp16 | 143.02 | 81.37 | 56.51 | 56.88 | 58.05 |
large | 128 | 1 | fp32 | 75.34 | 17.5 | 13.46 | 13.52 | 13.7 |
large | 128 | 2 | fp32 | 99.73 | 24.7 | 20.27 | 20.38 | 21.45 |
large | 128 | 4 | fp32 | 116.92 | 42.1 | 34.49 | 34.59 | 34.98 |
large | 128 | 8 | fp32 | 130.11 | 68.95 | 62.03 | 62.23 | 63.3 |
large | 384 | 1 | fp32 | 33.84 | 38.15 | 29.75 | 29.89 | 31.23 |
large | 384 | 2 | fp32 | 38.02 | 69.31 | 53.1 | 53.36 | 54.42 |
large | 384 | 4 | fp32 | 41.2 | 114.34 | 97.96 | 98.32 | 99.55 |
large | 384 | 8 | fp32 | 42.37 | 209.16 | 190.18 | 190.66 | 192.77 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).
Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
---|---|---|---|---|---|---|---|---|
base | 128 | 1 | fp16 | 220.35 | 7.82 | 4.7 | 4.83 | 5.15 |
base | 128 | 2 | fp16 | 384.55 | 8.7 | 5.49 | 5.68 | 6.01 |
base | 128 | 4 | fp16 | 650.7 | 36.3 | 6.35 | 6.51 | 6.87 |
base | 128 | 8 | fp16 | 992.41 | 13.59 | 8.22 | 8.37 | 8.96 |
base | 384 | 1 | fp16 | 172.89 | 12.86 | 5.94 | 6.04 | 6.44 |
base | 384 | 2 | fp16 | 258.48 | 20.42 | 7.89 | 8.09 | 9.15 |
base | 384 | 4 | fp16 | 346.34 | 24.93 | 11.97 | 12.12 | 12.76 |
base | 384 | 8 | fp16 | 430.4 | 33.08 | 18.75 | 19.27 | 20.12 |
base | 128 | 1 | fp32 | 183.69 | 7.52 | 5.86 | 5.97 | 6.27 |
base | 128 | 2 | fp32 | 282.95 | 9.51 | 7.31 | 7.49 | 7.83 |
base | 128 | 4 | fp32 | 363.83 | 15.12 | 11.35 | 11.47 | 11.74 |
base | 128 | 8 | fp32 | 449.12 | 21.65 | 18 | 18.1 | 18.6 |
base | 384 | 1 | fp32 | 104.92 | 13.8 | 9.9 | 9.99 | 10.48 |
base | 384 | 2 | fp32 | 123.55 | 24.21 | 16.29 | 16.4 | 17.61 |
base | 384 | 4 | fp32 | 139.38 | 36.69 | 28.89 | 29.04 | 30.01 |
base | 384 | 8 | fp32 | 146.28 | 64.69 | 55.09 | 55.32 | 56.3 |
large | 128 | 1 | fp16 | 98.34 | 15.85 | 10.61 | 10.78 | 11.5 |
large | 128 | 2 | fp16 | 172.95 | 17.8 | 11.91 | 12.08 | 12.42 |
large | 128 | 4 | fp16 | 278.82 | 25.18 | 14.7 | 14.87 | 15.65 |
large | 128 | 8 | fp16 | 402.28 | 30.45 | 20.21 | 20.43 | 21.24 |
large | 384 | 1 | fp16 | 71.1 | 26.55 | 14.44 | 14.61 | 15.32 |
large | 384 | 2 | fp16 | 100.48 | 44.04 | 20.31 | 20.48 | 21.6 |
large | 384 | 4 | fp16 | 131.68 | 56.19 | 30.8 | 31.03 | 32.3 |
large | 384 | 8 | fp16 | 151.81 | 81.53 | 53.22 | 53.87 | 55.34 |
large | 128 | 1 | fp32 | 77.87 | 16.33 | 13.33 | 13.45 | 13.77 |
large | 128 | 2 | fp32 | 105.41 | 22.77 | 19.39 | 19.52 | 19.86 |
large | 128 | 4 | fp32 | 124.16 | 38.61 | 32.69 | 32.88 | 33.9 |
large | 128 | 8 | fp32 | 137.69 | 64.61 | 58.62 | 58.89 | 59.94 |
large | 384 | 1 | fp32 | 36.34 | 34.94 | 27.72 | 27.81 | 28.21 |
large | 384 | 2 | fp32 | 41.11 | 62.54 | 49.14 | 49.32 | 50.25 |
large | 384 | 4 | fp32 | 43.32 | 107.53 | 93.07 | 93.47 | 94.27 |
large | 384 | 8 | fp32 | 44.64 | 196.28 | 180.21 | 180.75 | 182.41 |
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 40GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).
Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
---|---|---|---|---|---|---|---|---|
base | 128 | 1 | fp16 | 231.37 | 6.43 | 4.57 | 4.68 | 4.93 |
base | 128 | 2 | fp16 | 454.54 | 6.77 | 4.66 | 4.77 | 4.96 |
base | 128 | 4 | fp16 | 842.34 | 8.8 | 4.91 | 4.98 | 5.39 |
base | 128 | 8 | fp16 | 1216.43 | 10.39 | 6.77 | 6.86 | 7.28 |
base | 384 | 1 | fp16 | 210.59 | 9.03 | 4.83 | 4.86 | 5.06 |
base | 384 | 2 | fp16 | 290.91 | 14.88 | 7.09 | 7.19 | 7.72 |
base | 384 | 4 | fp16 | 407.13 | 18.04 | 9.93 | 10.05 | 10.74 |
base | 384 | 8 | fp16 | 478.67 | 26.06 | 16.92 | 17.19 | 17.76 |
base | 128 | 1 | tf32 | 223.38 | 6.94 | 4.73 | 4.86 | 5.04 |
base | 128 | 2 | tf32 | 447.57 | 7.2 | 4.68 | 4.82 | 5.07 |
base | 128 | 4 | tf32 | 838.89 | 9.16 | 4.88 | 4.93 | 5.38 |
base | 128 | 8 | tf32 | 1201.05 | 10.81 | 6.88 | 6.99 | 7.21 |
base | 384 | 1 | tf32 | 206.46 | 9.74 | 4.93 | 4.98 | 5.25 |
base | 384 | 2 | tf32 | 287 | 15.57 | 7.18 | 7.27 | 7.87 |
base | 384 | 4 | tf32 | 396.59 | 18.94 | 10.3 | 10.41 | 11.04 |
base | 384 | 8 | tf32 | 479.04 | 26.81 | 16.88 | 17.25 | 17.74 |
base | 128 | 1 | fp32 | 152.92 | 9.13 | 6.76 | 6.91 | 7.06 |
base | 128 | 2 | fp32 | 297.42 | 9.51 | 6.93 | 7.07 | 7.21 |
base | 128 | 4 | fp32 | 448.57 | 11.81 | 9.12 | 9.25 | 9.68 |
base | 128 | 8 | fp32 | 539.94 | 17.49 | 15 | 15.1 | 15.79 |
base | 384 | 1 | fp32 | 115.19 | 13.69 | 8.89 | 8.98 | 9.27 |
base | 384 | 2 | fp32 | 154.66 | 18.49 | 13.06 | 13.14 | 13.89 |
base | 384 | 4 | fp32 | 174.28 | 28.75 | 23.11 | 23.24 | 24 |
base | 384 | 8 | fp32 | 191.97 | 48.05 | 41.85 | 42.25 | 42.8 |
large | 128 | 1 | fp16 | 127.75 | 11.18 | 8.14 | 8.25 | 8.53 |
large | 128 | 2 | fp16 | 219.49 | 12.76 | 9.4 | 9.54 | 9.89 |
large | 128 | 4 | fp16 | 315.83 | 19.01 | 12.87 | 12.98 | 13.37 |
large | 128 | 8 | fp16 | 495.75 | 22.21 | 16.33 | 16.45 | 16.79 |
large | 384 | 1 | fp16 | 96.65 | 17.46 | 10.52 | 10.6 | 11 |
large | 384 | 2 | fp16 | 126.07 | 29.43 | 16.09 | 16.22 | 16.78 |
large | 384 | 4 | fp16 | 165.21 | 38.39 | 24.41 | 24.61 | 25.38 |
large | 384 | 8 | fp16 | 182.13 | 61.04 | 44.32 | 44.61 | 45.23 |
large | 128 | 1 | tf32 | 133.24 | 10.86 | 7.77 | 7.87 | 8.23 |
large | 128 | 2 | tf32 | 218.13 | 12.86 | 9.44 | 9.56 | 9.85 |
large | 128 | 4 | tf32 | 316.25 | 18.98 | 12.91 | 13.01 | 13.57 |
large | 128 | 8 | tf32 | 495.21 | 22.25 | 16.4 | 16.51 | 17.23 |
large | 384 | 1 | tf32 | 95.43 | 17.5 | 10.72 | 10.83 | 11.49 |
large | 384 | 2 | tf32 | 125.99 | 29.47 | 16.06 | 16.15 | 16.67 |
large | 384 | 4 | tf32 | 164.28 | 38.77 | 24.6 | 24.83 | 25.59 |
large | 384 | 8 | tf32 | 182.46 | 61 | 44.2 | 44.46 | 45.15 |
large | 128 | 1 | fp32 | 50.43 | 23.83 | 20.11 | 20.2 | 20.56 |
large | 128 | 2 | fp32 | 94.47 | 25.53 | 21.36 | 21.49 | 21.78 |
large | 128 | 4 | fp32 | 141.52 | 32.51 | 28.44 | 28.57 | 28.99 |
large | 128 | 8 | fp32 | 166.37 | 52.07 | 48.3 | 48.43 | 49.46 |
large | 384 | 1 | fp32 | 44.42 | 30.54 | 22.67 | 22.74 | 23.46 |
large | 384 | 2 | fp32 | 50.29 | 48.74 | 39.95 | 40.06 | 40.59 |
large | 384 | 4 | fp32 | 55.58 | 81.55 | 72.31 | 72.6 | 73.7 |
large | 384 | 8 | fp32 | 58.38 | 147.63 | 137.43 | 137.82 | 138.3 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 20.06-py3 NGC container on an NVIDIA Tesla T4 (1x T4 16GB GPU). Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).
Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 50% (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) | Latency - 100% (ms) |
---|---|---|---|---|---|---|---|---|---|---|
base | 128 | 1 | fp16 | 91.93 | 13.94 | 10.93 | 11.41 | 11.52 | 11.94 | 5491.47 |
base | 128 | 2 | fp16 | 148.08 | 16.91 | 13.65 | 13.95 | 14.06 | 14.74 | 5757.12 |
base | 128 | 4 | fp16 | 215.45 | 24.56 | 18.68 | 18.92 | 19.08 | 19.84 | 5894.82 |
base | 128 | 8 | fp16 | 289.52 | 33.07 | 27.77 | 28.22 | 28.38 | 29.16 | 6074.47 |
base | 384 | 1 | fp16 | 60.75 | 23.18 | 16.6 | 16.93 | 17.03 | 17.45 | 7006.41 |
base | 384 | 2 | fp16 | 82.85 | 37.05 | 24.26 | 24.54 | 24.63 | 25.67 | 7529.94 |
base | 384 | 4 | fp16 | 97.78 | 54.4 | 41.02 | 41.53 | 41.94 | 43.91 | 7995.39 |
base | 384 | 8 | fp16 | 106.78 | 89.6 | 74.98 | 75.5 | 76.13 | 78.02 | 8461.93 |
base | 128 | 1 | fp32 | 54.28 | 20.88 | 18.52 | 18.8 | 18.92 | 19.29 | 4401.4 |
base | 128 | 2 | fp32 | 71.75 | 30.57 | 28.08 | 28.51 | 28.62 | 29.12 | 4573.47 |
base | 128 | 4 | fp32 | 88.01 | 50.37 | 45.61 | 45.94 | 46.14 | 47.04 | 4992.7 |
base | 128 | 8 | fp32 | 98.92 | 85.57 | 80.98 | 81.44 | 81.74 | 82.75 | 5408.97 |
base | 384 | 1 | fp32 | 25.83 | 43.63 | 38.75 | 39.33 | 39.43 | 40.02 | 5148.45 |
base | 384 | 2 | fp32 | 29.08 | 77.68 | 68.89 | 69.26 | 69.55 | 72.08 | 5462.5 |
base | 384 | 4 | fp32 | 30.33 | 141.45 | 131.86 | 132.53 | 133.14 | 136.7 | 5975.63 |
base | 384 | 8 | fp32 | 31.8 | 262.88 | 251.62 | 252.23 | 253.08 | 255.56 | 7124 |
large | 128 | 1 | fp16 | 40.31 | 30.61 | 25.14 | 25.62 | 25.87 | 27.61 | 10395.87 |
large | 128 | 2 | fp16 | 63.79 | 37.43 | 31.66 | 32.31 | 32.66 | 34.36 | 10302.2 |
large | 128 | 4 | fp16 | 87.4 | 56.5 | 45.97 | 46.6 | 47.01 | 48.71 | 10391.17 |
large | 128 | 8 | fp16 | 107.5 | 84.29 | 74.59 | 75.25 | 75.64 | 77.73 | 10945.1 |
large | 384 | 1 | fp16 | 23.05 | 55.73 | 43.72 | 44.28 | 44.74 | 46.8 | 12889.23 |
large | 384 | 2 | fp16 | 29.59 | 91.61 | 67.94 | 68.8 | 69.45 | 71.64 | 13876.35 |
large | 384 | 4 | fp16 | 34.27 | 141.56 | 116.67 | 118.02 | 119.1 | 122.1 | 14570.73 |
large | 384 | 8 | fp16 | 38.29 | 237.85 | 208.95 | 210.08 | 211.33 | 214.61 | 16626.02 |
large | 128 | 1 | fp32 | 21.52 | 50.46 | 46.48 | 47.63 | 47.94 | 49.63 | 7150.38 |
large | 128 | 2 | fp32 | 25.4 | 83.3 | 79.06 | 79.61 | 80.06 | 81.77 | 7763.11 |
large | 128 | 4 | fp32 | 28.19 | 149.49 | 142.15 | 143.1 | 143.65 | 145.43 | 7701.38 |
large | 128 | 8 | fp32 | 30.14 | 272.84 | 265.6 | 266.57 | 267.21 | 269.37 | 8246.3 |
large | 384 | 1 | fp32 | 8.46 | 126.81 | 118.44 | 119.42 | 120.31 | 122.74 | 9007.96 |
large | 384 | 2 | fp32 | 9.29 | 231 | 215.54 | 216.64 | 217.71 | 220.35 | 9755.69 |
large | 384 | 4 | fp32 | 9.55 | 436.5 | 418.71 | 420.05 | 421.27 | 424.3 | 11766.45 |
large | 384 | 8 | fp32 | 9.75 | 840.9 | 820.39 | 822.19 | 823.69 | 827.99 | 12856.99 |
To achieve these same results, follow the Quick Start Guide outlined above.