The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Both of these benchmarking scripts fine-tune the BERT model for a fixed number of steps and extract performance numbers for the given configuration.
Training benchmarking can be performed by running the script:
scripts/finetune_train_benchmark.sh <bert_model> <num_gpu> <batch_size> <precision> <use_xla>
This script runs 800 steps by default on the SQuAD v1.1 dataset and extracts performance numbers for the given configuration. These numbers are saved at /results/squad_train_benchmark_<bert_model>_gpu<num_gpu>_bs<batch_size>.log.
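For example, to benchmark BERT-Large fine-tuning on 8 GPUs with a per-GPU batch size of 24 in mixed precision with XLA enabled (the argument values below are illustrative placeholders, not a tuned configuration):

```bash
# Illustrative invocation; substitute your own <bert_model>, <num_gpu>,
# <batch_size>, <precision>, and <use_xla> values.
bash scripts/finetune_train_benchmark.sh large 8 24 fp16 true
```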
Inference benchmarking can be performed by running the script:
scripts/finetune_inference_benchmark.sh <bert_model> <batch_size> <precision> <use_xla>
This script runs 1000 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for the given configuration. These numbers are saved at /results/squad_inference_benchmark_<bert_model>_<precision>_bs<batch_size>.log.
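For example, to benchmark BERT-Large inference at batch size 8 in mixed precision with XLA enabled (again, illustrative placeholder values; under these assumptions the numbers would land in /results/squad_inference_benchmark_large_fp16_bs8.log):

```bash
# Illustrative invocation; substitute your own <bert_model>, <batch_size>,
# <precision>, and <use_xla> values.
bash scripts/finetune_inference_benchmark.sh large 8 fp16 true
```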
The following sections provide details on how we achieved our performance and accuracy in training and inference for fine-tuning on question answering. All results are for the BERT-Large model unless otherwise mentioned. All fine-tuning results are on SQuAD v1.1 using a sequence length of 384 unless otherwise mentioned.
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-2 and NVIDIA DGX A100.
DGX System | Nodes x GPUs | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
---|---|---|---|---|---|---|
DGX-2H | 32 x 16 | FP16 | 56, 10 | 2, 6 | 2.67 | 1.69 |
DGX-2H | 32 x 16 | FP32 | 32, 4 | 4, 16 | 8.02 | 1.71 |
DGX A100 | 32 x 8 | FP16 | 312, 40 | 1, 3 | 2.02 | 1.68 |
DGX A100 | 32 x 8 | TF32 | 176, 22 | 2, 6 | 3.57 | 1.67 |
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.12-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.
GPUs | Batch size / GPU: TF32, FP16 | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) |
---|---|---|---|---|---|
8 | 38, 76 | 90.88 | 91.12 | 0.16 | 0.11 |
The following tables compare Final Loss scores across 3 different training runs with different seeds, for both FP16 and TF32. The runs showcase consistent convergence on all 3 seeds with very little deviation.
FP16, 256x GPUs | seed 1 | seed 2 | seed 3 | mean | std |
---|---|---|---|---|---|
Final Loss | 1.657 | 1.661 | 1.683 | 1.667 | 0.014 |
TF32, 256x GPUs | seed 1 | seed 2 | seed 3 | mean | std |
---|---|---|---|---|---|
Final Loss | 1.67 | 1.654 | 1.636 | 1.653 | 0.017 |
The following tables compare F1 scores across 5 different training runs with different seeds, for both FP16 and TF32, using the [NVIDIA Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models). The runs showcase consistent convergence on all 5 seeds with very little deviation.
FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 | 91.12 | 90.80 | 90.94 | 90.90 | 90.94 | 90.94 | 0.11 |
TF32, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 | 90.79 | 90.88 | 90.80 | 90.88 | 90.83 | 90.84 | 0.04 |
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.
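In these tables, the global batch size is the per-GPU batch size multiplied by the number of GPUs and the gradient accumulation steps (for example, 60 × 1 × 1024 = 61440 in the first mixed precision row), throughput speedup is the mixed precision throughput divided by the FP32 throughput (206.5 / 49.97 ≈ 4.13), and weak scaling is the throughput on N GPUs divided by the single-GPU throughput.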
GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 60 , 32 | 1024 , 2048 | 61440 , 65536 | 206.5 | 49.97 | 4.13 | 1.00 | 1.00 |
4 | 128 | 60 , 32 | 256 , 512 | 61440 , 65536 | 789.75 | 194.02 | 4.07 | 3.82 | 3.88 |
8 | 128 | 60 , 32 | 128 , 256 | 61440 , 65536 | 1561.77 | 367.9 | 4.25 | 7.56 | 7.36 |
16 | 128 | 60 , 32 | 64 , 128 | 61440 , 65536 | 3077.99 | 762.22 | 4.04 | 14.9 | 15.25 |
1 | 512 | 10 , 6 | 3072 , 5120 | 30720 , 30720 | 40.95 | 11.06 | 3.70 | 1.00 | 1.00 |
4 | 512 | 10 , 6 | 768 , 1280 | 30720 , 30720 | 158.5 | 43.05 | 3.68 | 3.87 | 3.89 |
8 | 512 | 10 , 6 | 384 , 640 | 30720 , 30720 | 312.03 | 85.51 | 3.65 | 7.62 | 7.73 |
16 | 512 | 10 , 4 | 192 , 512 | 30720 , 32768 | 614.94 | 161.38 | 3.81 | 15.02 | 14.59 |
Note: Values for FP32 runs with a per-GPU batch size of 60 (sequence length 128) or 10 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the run.sub training script in the TensorFlow 21.02-py3 NGC container using multiple NVIDIA DGX-2 nodes with 16x V100 32GB GPUs each. Performance (in sentences per second) is the steady state throughput.
Num Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 60 , 32 | 64 , 128 | 61440 , 65536 | 3528.51 | 841.72 | 4.19 | 1.00 | 1.00 |
4 | 128 | 60 , 32 | 16 , 32 | 61440 , 65536 | 13370.21 | 3060.49 | 4.37 | 3.79 | 3.64 |
16 | 128 | 60 , 32 | 4 , 8 | 61440 , 65536 | 42697.42 | 10383.57 | 4.11 | 12.1 | 12.34 |
32 | 128 | 60 , 32 | 2 , 4 | 61440 , 65536 | 84223.16 | 20094.14 | 4.19 | 23.87 | 23.87 |
1 | 512 | 10 , 4 | 192 , 256 | 30720 , 32768 | 678.35 | 180 | 3.77 | 1.00 | 1.00 |
4 | 512 | 10 , 4 | 96 , 64 | 30720 , 32768 | 2678.29 | 646.76 | 4.14 | 3.95 | 3.59 |
16 | 512 | 10 , 4 | 24 , 32 | 30720 , 32768 | 7834.72 | 2204.72 | 3.55 | 11.55 | 12.25 |
32 | 512 | 10 , 4 | 6 , 16 | 30720 , 32768 | 18786.93 | 4196.15 | 4.48 | 27.70 | 23.31 |
Note: Values for FP32 runs with a per-GPU batch size of 60 (sequence length 128) or 10 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the steady state throughput.
GPUs | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 312 , 176 | 256 , 512 | 79872 , 90112 | 485.59 | 282.98 | 1.72 | 1.00 | 1.00 |
8 | 128 | 312 , 176 | 32 , 64 | 79872 , 90112 | 3799.24 | 1944.77 | 1.95 | 7.82 | 6.87 |
1 | 512 | 40 , 22 | 768 , 1536 | 30720 , 33792 | 96.52 | 54.92 | 1.76 | 1.00 | 1.00 |
8 | 512 | 40 , 22 | 96 , 192 | 30720 , 33792 | 649.69 | 427.39 | 1.52 | 6.73 | 7.78 |
Note: Values for TF32 runs with a per-GPU batch size of 312 (sequence length 128) or 40 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container using multiple NVIDIA DGX A100 nodes with 8x A100 40GB GPUs each. Performance (in sentences per second) is the steady state throughput.
Num Nodes | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 312 , 176 | 32 , 64 | 79872 , 90112 | 3803.82 | 2062.98 | 1.84 | 1.00 | 1.00 |
2 | 128 | 312 , 176 | 16 , 32 | 79872 , 90112 | 7551.37 | 4084.76 | 1.85 | 1.99 | 1.98 |
8 | 128 | 312 , 176 | 4 , 8 | 79872 , 90112 | 29711.11 | 16134.02 | 1.84 | 7.81 | 7.82 |
32 | 128 | 312 , 176 | 1 , 2 | 79872 , 90112 | 110280.73 | 59569.77 | 1.85 | 28.99 | 28.88 |
1 | 512 | 40 , 22 | 96 , 192 | 30720 , 33792 | 749.73 | 431.89 | 1.74 | 1.00 | 1.00 |
2 | 512 | 40 , 22 | 48 , 96 | 30720 , 33792 | 1491.87 | 739.14 | 2.02 | 1.99 | 1.71 |
8 | 512 | 40 , 22 | 12 , 24 | 30720 , 33792 | 5870.83 | 2926.58 | 2.01 | 7.83 | 6.78 |
32 | 512 | 40 , 22 | 3 , 6 | 30720 , 33792 | 22506.23 | 11240.5 | 2.00 | 30.02 | 26.03 |
Note: Values for TF32 runs with a per-GPU batch size of 312 (sequence length 128) or 40 (sequence length 512) are not available because those configurations run out of memory.
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 6,3 | 39.10 | 9.85 | 3.97 | 1.00 | 1.00 |
4 | 6,3 | 128.48 | 36.52 | 3.52 | 3.29 | 3.71 |
8 | 6,3 | 255.36 | 73.03 | 3.5 | 6.53 | 7.41 |
Note: Values for FP32 runs with a batch size of 6 are not available because they run out of memory; a batch size of 6 is only possible with FP16.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 12,8 | 47.06 | 11.11 | 4.24 | 1.00 | 1.00 |
4 | 12,8 | 165.26 | 42.84 | 3.86 | 3.51 | 3.86 |
8 | 12,8 | 330.29 | 85.91 | 3.84 | 7.02 | 7.73 |
Note: Values for FP32 runs with a batch size of 12 are not available because they run out of memory; a batch size of 12 is only possible with FP16.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 76,38 | 134.22 | 43.9 | 3.057 | 1.00 | 1.00 |
8 | 76,38 | 1048.23 | 341.31 | 3.071 | 7.81 | 7.77 |
Note: Values for TF32 runs with a batch size of 76 are not available because they run out of memory; a batch size of 76 is only possible with FP16.
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
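The percentile latencies reported below can be recovered from raw per-batch timings; the following is a minimal sketch, assuming a hypothetical file latencies.txt that holds one per-batch latency in milliseconds per line:

```bash
# Sort the per-batch latencies, then print the mean and the 90th/95th/99th
# percentiles (nearest-rank method). latencies.txt is a hypothetical input.
sort -n latencies.txt | awk '{ a[NR] = $1; s += $1 }
END {
  printf "avg %.2f ms | p90 %.2f | p95 %.2f | p99 %.2f\n",
         s / NR, a[int(NR * 0.90)], a[int(NR * 0.95)], a[int(NR * 0.99)]
}'
```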
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 105.04 | 1.277237354 | 9.52 | 9.67 | 9.77 | 10.16 |
128 | 2 | 184.9 | 1.671487977 | 10.82 | 11.15 | 11.27 | 11.8 |
128 | 4 | 301.9 | 2.448102498 | 13.25 | 13.38 | 13.45 | 13.96 |
128 | 8 | 421.98 | 3.149809659 | 18.96 | 19.12 | 19.2 | 19.82 |
384 | 1 | 74.99 | 2.15055922 | 13.34 | 13.5 | 13.58 | 14.53 |
384 | 2 | 109.84 | 2.709422792 | 18.21 | 18.4 | 18.6 | 19.39 |
384 | 4 | 142.58 | 3.313502208 | 28.05 | 28.28 | 28.48 | 28.85 |
384 | 8 | 168.34 | 3.823302294 | 47.52 | 47.74 | 47.86 | 48.52 |
BERT-Large FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 82.24 | 12.16 | 12.28 | 12.33 | 12.92 |
128 | 2 | 110.62 | 18.08 | 18.22 | 18.28 | 18.88 |
128 | 4 | 123.32 | 32.44 | 32.72 | 32.82 | 32.98 |
128 | 8 | 133.97 | 59.71 | 60.29 | 60.49 | 60.69 |
384 | 1 | 34.87 | 28.67 | 28.92 | 29.02 | 29.33 |
384 | 2 | 40.54 | 49.34 | 49.74 | 49.86 | 50.05 |
384 | 4 | 43.03 | 92.97 | 93.59 | 93.75 | 94.57 |
384 | 8 | 44.03 | 181.71 | 182.34 | 182.48 | 183.03 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 236.26 | 1.179589595 | 4.23 | 4.37 | 4.49 | 4.59 |
128 | 2 | 425.1 | 1.441554478 | 4.7 | 4.84 | 4.97 | 5.26 |
128 | 4 | 710.48 | 1.911691107 | 5.63 | 5.78 | 5.93 | 6.4 |
128 | 8 | 1081.17 | 2.523032764 | 7.4 | 7.5 | 7.54 | 7.73 |
384 | 1 | 190.53 | 1.757170525 | 5.25 | 5.35 | 5.42 | 5.8 |
384 | 2 | 289.67 | 2.248292456 | 6.9 | 7.08 | 7.24 | 7.57 |
384 | 4 | 404.03 | 2.946328302 | 9.9 | 10 | 10.03 | 10.13 |
384 | 8 | 504.24 | 3.450153951 | 15.87 | 15.96 | 16.01 | 16.3 |
BERT-Base FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 200.29 | 4.99 | 5.08 | 5.16 | 5.53 |
128 | 2 | 294.89 | 6.78 | 6.89 | 6.93 | 7.37 |
128 | 4 | 371.65 | 10.76 | 10.89 | 10.96 | 11.92 |
128 | 8 | 428.52 | 18.67 | 18.89 | 18.98 | 19.17 |
384 | 1 | 108.43 | 9.22 | 9.26 | 9.31 | 10.24 |
384 | 2 | 128.84 | 15.52 | 15.6 | 15.71 | 16.49 |
384 | 4 | 137.13 | 29.17 | 29.4 | 29.48 | 29.64 |
384 | 8 | 146.15 | 54.74 | 55.19 | 55.3 | 55.54 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 101.58 | 1.242112986 | 9.84 | 9.99 | 10.06 | 10.39 |
128 | 2 | 181.89 | 1.651593571 | 11 | 11.14 | 11.2 | 11.87 |
128 | 4 | 295.86 | 2.348840902 | 13.52 | 13.67 | 13.75 | 14.5 |
128 | 8 | 411.29 | 3.010246652 | 19.45 | 19.62 | 19.69 | 20.4 |
384 | 1 | 72.95 | 2.083690374 | 13.71 | 13.93 | 14.08 | 14.81 |
384 | 2 | 107.02 | 2.583775954 | 18.69 | 18.8 | 18.88 | 19.57 |
384 | 4 | 139.8 | 3.14652262 | 28.61 | 28.75 | 28.88 | 29.6 |
384 | 8 | 163.68 | 3.595782074 | 48.88 | 49.09 | 49.18 | 49.77 |
BERT-Large FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 81.78 | 12.23 | 12.37 | 12.43 | 13.2 |
128 | 2 | 110.13 | 18.16 | 18.29 | 18.37 | 19.27 |
128 | 4 | 125.96 | 31.76 | 32.09 | 32.21 | 32.42 |
128 | 8 | 136.63 | 58.55 | 58.93 | 59.05 | 59.4 |
384 | 1 | 35.01 | 28.56 | 28.81 | 28.94 | 29.16 |
384 | 2 | 41.42 | 48.29 | 48.57 | 48.67 | 49.02 |
384 | 4 | 44.43 | 90.03 | 90.43 | 90.59 | 90.89 |
384 | 8 | 45.52 | 175.76 | 176.66 | 176.89 | 177.33 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 234.85 | 1.217533309 | 4.26 | 4.33 | 4.37 | 4.62 |
128 | 2 | 415.86 | 1.435782351 | 4.81 | 4.92 | 5.06 | 5.55 |
128 | 4 | 680.09 | 1.84912586 | 5.88 | 6.1 | 6.2 | 6.53 |
128 | 8 | 1030.03 | 2.264548752 | 7.77 | 7.87 | 7.95 | 8.53 |
384 | 1 | 183.18 | 1.700993593 | 5.46 | 5.56 | 5.61 | 5.93 |
384 | 2 | 275.77 | 2.175528558 | 7.25 | 7.38 | 7.44 | 7.89 |
384 | 4 | 385.61 | 2.778570399 | 10.37 | 10.56 | 10.63 | 11.1 |
384 | 8 | 488.45 | 3.292329469 | 16.38 | 16.48 | 16.52 | 16.64 |
BERT-Base FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 192.89 | 5.18 | 5.3 | 5.36 | 5.65 |
128 | 2 | 289.64 | 6.91 | 7 | 7.22 | 7.83 |
128 | 4 | 367.79 | 10.88 | 10.98 | 11.02 | 11.59 |
128 | 8 | 454.85 | 17.59 | 17.76 | 17.81 | 17.92 |
384 | 1 | 107.69 | 9.29 | 9.37 | 9.42 | 9.88 |
384 | 2 | 126.76 | 15.78 | 15.89 | 15.97 | 16.72 |
384 | 4 | 138.78 | 28.82 | 28.98 | 29.06 | 29.88 |
384 | 8 | 148.36 | 53.92 | 54.16 | 54.26 | 54.58 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (TF32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 145.21 | 0.9435347628 | 6.89 | 7.14 | 7.4 | 8.35 |
128 | 2 | 272.81 | 1.093953003 | 7.33 | 7.61 | 7.77 | 8.35 |
128 | 4 | 468.98 | 1.273087573 | 8.53 | 8.71 | 8.83 | 9.85 |
128 | 8 | 705.67 | 1.191627687 | 11.34 | 11.64 | 11.9 | 13.1 |
384 | 1 | 118.34 | 1.042459479 | 8.45 | 8.82 | 8.99 | 9.52 |
384 | 2 | 197.8 | 1.231478023 | 10.11 | 10.48 | 10.62 | 11.4 |
384 | 4 | 275.19 | 1.268332027 | 14.54 | 14.73 | 14.8 | 16.8 |
384 | 8 | 342.22 | 1.416004634 | 23.38 | 23.64 | 23.75 | 24.1 |
BERT-Large TF32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 153.9 | 6.5 | 6.76 | 6.86 | 7.4 |
128 | 2 | 249.38 | 8.02 | 8.22 | 8.34 | 9.45 |
128 | 4 | 368.38 | 10.86 | 11.11 | 11.24 | 12.76 |
128 | 8 | 592.19 | 13.51 | 13.64 | 13.77 | 15.85 |
384 | 1 | 113.52 | 8.81 | 9.02 | 9.16 | 10.19 |
384 | 2 | 160.62 | 12.45 | 12.61 | 12.68 | 14.47 |
384 | 4 | 216.97 | 18.44 | 18.6 | 18.7 | 18.84 |
384 | 8 | 241.68 | 33.1 | 33.29 | 33.36 | 33.5 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (TF32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 295.01 | 1.014023992 | 3.39 | 3.59 | 3.65 | 3.73 |
128 | 2 | 594.81 | 1.048455898 | 3.36 | 3.59 | 3.68 | 4.19 |
128 | 4 | 1043.12 | 1.005145599 | 3.83 | 3.97 | 4.2 | 4.44 |
128 | 8 | 1786.25 | 1.198278638 | 4.48 | 4.73 | 4.8 | 5.19 |
384 | 1 | 278.85 | 1.103395062 | 3.59 | 3.67 | 3.99 | 4.15 |
384 | 2 | 464.77 | 1.252006896 | 4.3 | 4.59 | 4.87 | 5.29 |
384 | 4 | 675.82 | 1.264822578 | 5.92 | 6.15 | 6.27 | 6.94 |
384 | 8 | 846.81 | 1.31109494 | 9.45 | 9.65 | 9.74 | 11.03 |
BERT-Base TF32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 290.93 | 3.44 | 3.61 | 3.73 | 4.69 |
128 | 2 | 567.32 | 3.53 | 3.64 | 3.96 | 5.01 |
128 | 4 | 1037.78 | 3.85 | 3.95 | 4.06 | 4.58 |
128 | 8 | 1490.68 | 5.37 | 5.61 | 5.66 | 6.19 |
384 | 1 | 252.72 | 3.96 | 3.96 | 4.52 | 4.66 |
384 | 2 | 371.22 | 5.39 | 5.64 | 5.71 | 6.38 |
384 | 4 | 534.32 | 7.49 | 7.69 | 7.76 | 8.56 |
384 | 8 | 645.88 | 12.39 | 12.61 | 12.67 | 12.77 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmarking script in the TensorFlow 21.02-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process one batch when batches are fed into the model one after another, i.e., with no pipelining.
BERT-Large FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 57.6 | 1.364605544 | 17.36 | 18.16 | 19.02 | 21.67 |
128 | 2 | 102.76 | 2.17988969 | 19.46 | 20.68 | 21.27 | 22.2 |
128 | 4 | 151.11 | 3.146813828 | 26.47 | 26.9 | 27.06 | 27.45 |
128 | 8 | 186.99 | 3.733080455 | 42.78 | 43.87 | 44.18 | 44.78 |
384 | 1 | 38.88 | 2.590273151 | 25.72 | 26.06 | 26.16 | 26.38 |
384 | 2 | 50.53 | 3.202154626 | 39.58 | 39.93 | 40.35 | 40.95 |
384 | 4 | 57.69 | 3.721935484 | 69.34 | 70.5 | 70.77 | 71.09 |
384 | 8 | 62.99 | 3.927057357 | 127 | 129.18 | 130.07 | 131.86 |
BERT-Large FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 42.21 | 23.69 | 24.8 | 25.02 | 25.48 |
128 | 2 | 47.14 | 42.42 | 43.48 | 43.63 | 44.32 |
128 | 4 | 48.02 | 83.29 | 84.37 | 84.68 | 85.14 |
128 | 8 | 50.09 | 159.72 | 161.66 | 161.97 | 162.52 |
384 | 1 | 15.01 | 66.63 | 67.76 | 68.08 | 68.66 |
384 | 2 | 15.78 | 126.78 | 128.21 | 128.58 | 129.08 |
384 | 4 | 15.5 | 258.1 | 261.01 | 261.66 | 262.55 |
384 | 8 | 16.04 | 498.61 | 504.29 | 504.74 | 505.55 |
BERT-Base FP16
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|---|
128 | 1 | 116.56 | 1.039878669 | 8.58 | 9.53 | 10.84 | 11.74 |
128 | 2 | 238.62 | 1.675937632 | 8.38 | 9.09 | 9.27 | 12.33 |
128 | 4 | 402.93 | 2.440964439 | 9.93 | 10.07 | 10.13 | 12.17 |
128 | 8 | 532.56 | 3.052619512 | 15.02 | 15.43 | 15.6 | 16.52 |
384 | 1 | 102.12 | 2.035073735 | 9.79 | 11.06 | 11.18 | 12.07 |
384 | 2 | 149.3 | 2.910898811 | 13.4 | 13.54 | 13.62 | 14.36 |
384 | 4 | 177.78 | 3.563439567 | 22.5 | 23.11 | 23.27 | 23.59 |
384 | 8 | 192.61 | 3.752386519 | 41.53 | 42.67 | 42.81 | 43.31 |
BERT-Base FP32
Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
---|---|---|---|---|---|---|
128 | 1 | 112.09 | 8.92 | 9.12 | 9.49 | 10.93 |
128 | 2 | 142.38 | 14.05 | 14.34 | 14.48 | 15.03 |
128 | 4 | 165.07 | 24.23 | 24.86 | 24.92 | 25.05 |
128 | 8 | 174.46 | 45.86 | 46.71 | 46.8 | 47.2 |
384 | 1 | 50.18 | 19.93 | 20.53 | 21.04 | 21.73 |
384 | 2 | 51.29 | 38.99 | 39.68 | 39.93 | 40.2 |
384 | 4 | 49.89 | 80.18 | 81.54 | 82 | 82.65 |
384 | 8 | 51.33 | 155.85 | 158.11 | 158.5 | 159.17 |
To achieve these same results, follow the Quick Start Guide outlined above.