
BERT for TensorFlow2


Description: BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

Publisher: NVIDIA

Latest Version: 21.02.3

Modified: April 4, 2023

Compressed Size: 1.28 MB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Both benchmarking scripts run the BERT model in fine-tuning mode for a fixed number of steps and extract performance numbers for the given configuration.

Training performance benchmark

Training benchmarking can be performed by running the script:

scripts/finetune_train_benchmark.sh <bert_model> <num_gpu> <batch_size> <precision> <use_xla>

This script runs 800 steps by default on the SQuAD v1.1 dataset and extracts performance numbers for the given configuration. These numbers are saved at /results/squad_train_benchmark_<bert_model>_gpu<num_gpu>_bs<batch_size>.log.
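For example, a benchmark of BERT-Large fine-tuning on 8 GPUs with mixed precision and XLA could look like the following. The argument values shown are illustrative assumptions (not taken from this page); consult the script header in the repository for the exact accepted values.

```bash
# Illustrative invocation; argument values (large, 8, 3, fp16, true) are assumptions.
bash scripts/finetune_train_benchmark.sh large 8 3 fp16 true

# For this configuration the results would be written to:
#   /results/squad_train_benchmark_large_gpu8_bs3.log
```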

Inference performance benchmark

Inference benchmarking can be performed by running the script:

scripts/finetune_inference_benchmark.sh <bert_model> <batch_size> <precision> <use_xla>

This script runs 1000 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for the given configuration. These numbers are saved at /results/squad_inference_benchmark_<bert_model>_<precision>_bs<batch_size>.log.
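As a matching sketch for inference, again with illustrative (assumed) argument values:

```bash
# Illustrative invocation; argument values (large, 8, fp16, true) are assumptions.
bash scripts/finetune_inference_benchmark.sh large 8 fp16 true

# For this configuration the results would be written to:
#   /results/squad_inference_benchmark_large_fp16_bs8.log
```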

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference for fine-tuning on question answering. All results are for the BERT-Large model unless otherwise mentioned. All fine-tuning results are on SQuAD v1.1 with a sequence length of 384 unless otherwise mentioned.

Training accuracy results

Pre-training accuracy

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-2 and NVIDIA DGX A100.
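As a rough sketch of the environment setup, the script is launched from inside the corresponding NGC TensorFlow container. The image tag and mount paths below are assumptions; the repository's Quick Start Guide is authoritative.

```bash
# Assumed image tag for the TensorFlow 21.02-py3 (TF2) NGC container; verify the tag on NGC.
docker pull nvcr.io/nvidia/tensorflow:21.02-tf2-py3

# Launch the container with all GPUs and the BERT repository mounted (paths are illustrative),
# then run scripts/run_pretraining_lamb.sh inside it with the arguments from the Quick Start Guide.
docker run --gpus all -it --rm \
  -v $PWD:/workspace/bert -w /workspace/bert \
  nvcr.io/nvidia/tensorflow:21.02-tf2-py3
```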

| DGX System | Nodes x GPUs | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
|---|---|---|---|---|---|---|
| DGX2H | 32 x 16 | FP16 | 56, 10 | 2, 6 | 2.67 | 1.69 |
| DGX2H | 32 x 16 | FP32 | 32, 4 | 4, 16 | 8.02 | 1.71 |
| DGXA100 | 32 x 8 | FP16 | 312, 40 | 1, 3 | 2.02 | 1.68 |
| DGXA100 | 32 x 8 | TF32 | 176, 22 | 2, 6 | 3.57 | 1.67 |
Fine-tuning accuracy for SQuAD v1.1: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.12-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.

| GPUs | Batch size / GPU: TF32, FP16 | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) |
|---|---|---|---|---|---|
| 8 | 38, 76 | 90.88 | 91.12 | 0.16 | 0.11 |
Pre-training stability test: NVIDIA DGX A100 (256x A100 80GB)

The following tables compare Final Loss scores across 3 different training runs with different seeds, for both FP16 and TF32. The runs showcase consistent convergence on all 3 seeds with very little deviation.

| FP16, 256x GPUs | seed 1 | seed 2 | seed 3 | mean | std |
|---|---|---|---|---|---|
| Final Loss | 1.657 | 1.661 | 1.683 | 1.667 | 0.014 |

| TF32, 256x GPUs | seed 1 | seed 2 | seed 3 | mean | std |
|---|---|---|---|---|---|
| Final Loss | 1.67 | 1.654 | 1.636 | 1.653 | 0.017 |
Fine-tuning SQuAD v1.1 stability test: NVIDIA DGX A100 (8x A100 80GB)

The following tables compare F1 scores across 5 different training runs with different seeds, for both FP16 and TF32, using the [NVIDIA Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models). The runs show consistent convergence on all 5 seeds with very little deviation.

| FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 | 91.12 | 90.80 | 90.94 | 90.90 | 90.94 | 90.94 | 0.11 |

| TF32, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 | 90.79 | 90.88 | 90.80 | 90.88 | 90.83 | 90.84 | 0.04 |
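As a sanity check on how the summary columns are derived, the mean and (population) standard deviation can be recomputed from the per-seed scores. Small differences from the table are expected because the published F1 values are already rounded and the std convention is not stated.

```bash
# Recompute mean/std from the FP16 per-seed F1 scores above (illustrative only).
echo "91.12 90.80 90.94 90.90 90.94" | \
  awk '{for(i=1;i<=NF;i++){s+=$i; ss+=$i*$i} m=s/NF; printf "mean=%.2f std=%.2f\n", m, sqrt(ss/NF - m*m)}'
# Prints: mean=90.94 std=0.10 (the table reports 0.11, likely from unrounded scores or a sample-std convention)
```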

Training performance results

Pre-training training performance: Single-node on NVIDIA DGX-2 V100 (16x V100 32GB)

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 60, 32 | 1024, 2048 | 61440, 65536 | 206.5 | 49.97 | 4.13 | 1.00 | 1.00 |
| 4 | 128 | 60, 32 | 256, 512 | 61440, 65536 | 789.75 | 194.02 | 4.07 | 3.82 | 3.88 |
| 8 | 128 | 60, 32 | 128, 256 | 61440, 65536 | 1561.77 | 367.9 | 4.25 | 7.56 | 7.36 |
| 16 | 128 | 60, 32 | 64, 128 | 61440, 65536 | 3077.99 | 762.22 | 4.04 | 14.9 | 15.25 |
| 1 | 512 | 10, 6 | 3072, 5120 | 30720, 30720 | 40.95 | 11.06 | 3.70 | 1.00 | 1.00 |
| 4 | 512 | 10, 6 | 768, 1280 | 30720, 30720 | 158.5 | 43.05 | 3.68 | 3.87 | 3.89 |
| 8 | 512 | 10, 6 | 384, 640 | 30720, 30720 | 312.03 | 85.51 | 3.65 | 7.62 | 7.73 |
| 16 | 512 | 10, 4 | 192, 512 | 30720, 32768 | 614.94 | 161.38 | 3.81 | 15.02 | 14.59 |

Note: FP32 results for per-GPU batch sizes of 60 (sequence length 128) and 10 (sequence length 512) are not available because those configurations run out of memory.
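The columns are related in a simple way: the global batch size is the number of GPUs times the per-GPU batch size times the gradient accumulation steps, and the speedup column is the ratio of the two throughput columns. A quick check against the first row of the table above (mixed precision, sequence length 128):

```bash
# Global batch size = GPUs x per-GPU batch size x gradient accumulation steps
echo $(( 1 * 60 * 1024 ))                       # 61440, matching the table

# Throughput speedup = mixed-precision throughput / FP32 throughput
awk 'BEGIN { printf "%.2f\n", 206.5 / 49.97 }'  # 4.13
```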

Pre-training training performance: Multi-node on NVIDIA DGX-2H V100 (16x V100 32GB)

Our results were obtained by running the run.sub training script in the TensorFlow 21.02-py3 NGC container using multiple NVIDIA DGX-2H nodes, each with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady-state throughput.

| Num Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 60, 32 | 64, 128 | 61440, 65536 | 3528.51 | 841.72 | 4.19 | 1.00 | 1.00 |
| 4 | 128 | 60, 32 | 16, 32 | 61440, 65536 | 13370.21 | 3060.49 | 4.37 | 3.79 | 3.64 |
| 16 | 128 | 60, 32 | 4, 8 | 61440, 65536 | 42697.42 | 10383.57 | 4.11 | 12.1 | 12.34 |
| 32 | 128 | 60, 32 | 2, 4 | 61440, 65536 | 84223.16 | 20094.14 | 4.19 | 23.87 | 23.87 |
| 1 | 512 | 10, 4 | 192, 256 | 30720, 32768 | 678.35 | 180 | 3.77 | 1.00 | 1.00 |
| 4 | 512 | 10, 4 | 96, 64 | 30720, 32768 | 2678.29 | 646.76 | 4.14 | 3.95 | 3.59 |
| 16 | 512 | 10, 4 | 24, 32 | 30720, 32768 | 7834.72 | 2204.72 | 3.55 | 11.55 | 12.25 |
| 32 | 512 | 10, 4 | 6, 16 | 30720, 32768 | 18786.93 | 4196.15 | 4.48 | 27.70 | 23.31 |

Note: FP32 results for per-GPU batch sizes of 60 (sequence length 128) and 10 (sequence length 512) are not available because those configurations run out of memory.

Pre-training training performance: Single-node on NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 312, 176 | 256, 512 | 79872, 90112 | 485.59 | 282.98 | 1.72 | 1.00 | 1.00 |
| 8 | 128 | 312, 176 | 32, 64 | 79872, 90112 | 3799.24 | 1944.77 | 1.95 | 7.82 | 6.87 |
| 1 | 512 | 40, 22 | 768, 1536 | 30720, 33792 | 96.52 | 54.92 | 1.76 | 1.00 | 1.00 |
| 8 | 512 | 40, 22 | 96, 192 | 30720, 33792 | 649.69 | 427.39 | 1.52 | 6.73 | 7.78 |

Note: TF32 results for per-GPU batch sizes of 312 (sequence length 128) and 40 (sequence length 512) are not available because those configurations run out of memory.

Pre-training training performance: Multi-node on NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the steady-state throughput.

| Num Nodes | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 312, 176 | 32, 64 | 79872, 90112 | 3803.82 | 2062.98 | 1.84 | 1.00 | 1.00 |
| 2 | 128 | 312, 176 | 16, 32 | 79872, 90112 | 7551.37 | 4084.76 | 1.85 | 1.99 | 1.98 |
| 8 | 128 | 312, 176 | 4, 8 | 79872, 90112 | 29711.11 | 16134.02 | 1.84 | 7.81 | 7.82 |
| 32 | 128 | 312, 176 | 1, 2 | 79872, 90112 | 110280.73 | 59569.77 | 1.85 | 28.99 | 28.88 |
| 1 | 512 | 40, 22 | 96, 192 | 30720, 33792 | 749.73 | 431.89 | 1.74 | 1.00 | 1.00 |
| 2 | 512 | 40, 22 | 48, 96 | 30720, 33792 | 1491.87 | 739.14 | 2.02 | 1.99 | 1.71 |
| 8 | 512 | 40, 22 | 12, 24 | 30720, 33792 | 5870.83 | 2926.58 | 2.01 | 7.83 | 6.78 |
| 32 | 512 | 40, 22 | 3, 6 | 30720, 33792 | 22506.23 | 11240.5 | 2.00 | 30.02 | 26.03 |

Note: TF32 results for per-GPU batch sizes of 312 (sequence length 128) and 40 (sequence length 512) are not available because those configurations run out of memory.

Fine-tuning training performance for SQuAD v1.1 on NVIDIA DGX-1 V100 (8x V100 16GB)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 6, 3 | 39.10 | 9.85 | 3.97 | 1.00 | 1.00 |
| 4 | 6, 3 | 128.48 | 36.52 | 3.52 | 3.29 | 3.71 |
| 8 | 6, 3 | 255.36 | 73.03 | 3.5 | 6.53 | 7.41 |

Note: FP32 results for a per-GPU batch size of 6 are not available because that configuration runs out of memory; a batch size of 6 is only possible with FP16.

To achieve these same results, follow the Quick Start Guide outlined above.

Fine-tuning training performance for SQuAD v1.1 on NVIDIA DGX-1 V100 (8x V100 32GB)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 12, 8 | 47.06 | 11.11 | 4.24 | 1.00 | 1.00 |
| 4 | 12, 8 | 165.26 | 42.84 | 3.86 | 3.51 | 3.86 |
| 8 | 12, 8 | 330.29 | 85.91 | 3.84 | 7.02 | 7.73 |

Note: FP32 results for a per-GPU batch size of 12 are not available because that configuration runs out of memory; a batch size of 12 is only possible with FP16.

To achieve these same results, follow the Quick Start Guide outlined above.

Fine-tuning training performance for SQuAD v1.1 on NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 76, 38 | 134.22 | 43.9 | 3.057 | 1.00 | 1.00 |
| 8 | 76, 38 | 1048.23 | 341.31 | 3.071 | 7.81 | 7.77 |

Note: TF32 results for a per-GPU batch size of 76 are not available because that configuration runs out of memory; a batch size of 76 is only possible with FP16.

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance results

Fine-tuning inference performance for SQuAD v1.1 on NVIDIA DGX-1 V100 (1x V100 16GB)

Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmark script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).
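A minimal sketch of how such average and percentile latencies can be summarized, assuming a hypothetical file with one per-batch latency (in milliseconds) per line; this is not the benchmark script's actual log format.

```bash
# Hypothetical input: latencies_ms.txt, one batch latency in milliseconds per line.
# Sort the latencies, then report the mean and the 90th/95th/99th percentiles.
sort -n latencies_ms.txt | awk '
  function pct(p,  i) { i = int(NR * p); if (i < 1) i = 1; return v[i] }
  { v[NR] = $1; sum += $1 }
  END { printf "avg=%.2f p90=%.2f p95=%.2f p99=%.2f\n", sum/NR, pct(0.90), pct(0.95), pct(0.99) }'
```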

BERT-Large FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 105.04 | 1.277237354 | 9.52 | 9.67 | 9.77 | 10.16 |
| 128 | 2 | 184.9 | 1.671487977 | 10.82 | 11.15 | 11.27 | 11.8 |
| 128 | 4 | 301.9 | 2.448102498 | 13.25 | 13.38 | 13.45 | 13.96 |
| 128 | 8 | 421.98 | 3.149809659 | 18.96 | 19.12 | 19.2 | 19.82 |
| 384 | 1 | 74.99 | 2.15055922 | 13.34 | 13.5 | 13.58 | 14.53 |
| 384 | 2 | 109.84 | 2.709422792 | 18.21 | 18.4 | 18.6 | 19.39 |
| 384 | 4 | 142.58 | 3.313502208 | 28.05 | 28.28 | 28.48 | 28.85 |
| 384 | 8 | 168.34 | 3.823302294 | 47.52 | 47.74 | 47.86 | 48.52 |

BERT-Large FP32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 82.24 | 12.16 | 12.28 | 12.33 | 12.92 |
| 128 | 2 | 110.62 | 18.08 | 18.22 | 18.28 | 18.88 |
| 128 | 4 | 123.32 | 32.44 | 32.72 | 32.82 | 32.98 |
| 128 | 8 | 133.97 | 59.71 | 60.29 | 60.49 | 60.69 |
| 384 | 1 | 34.87 | 28.67 | 28.92 | 29.02 | 29.33 |
| 384 | 2 | 40.54 | 49.34 | 49.74 | 49.86 | 50.05 |
| 384 | 4 | 43.03 | 92.97 | 93.59 | 93.75 | 94.57 |
| 384 | 8 | 44.03 | 181.71 | 182.34 | 182.48 | 183.03 |
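The speedup column in the FP16 table above is simply the FP16 throughput divided by the FP32 throughput at the same sequence length and batch size, for example:

```bash
# Sequence length 128, batch size 1: FP16 105.04 sent/sec vs FP32 82.24 sent/sec
awk 'BEGIN { printf "%.4f\n", 105.04 / 82.24 }'   # 1.2772, matching the speedup column
```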

BERT-Base FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 236.26 | 1.179589595 | 4.23 | 4.37 | 4.49 | 4.59 |
| 128 | 2 | 425.1 | 1.441554478 | 4.7 | 4.84 | 4.97 | 5.26 |
| 128 | 4 | 710.48 | 1.911691107 | 5.63 | 5.78 | 5.93 | 6.4 |
| 128 | 8 | 1081.17 | 2.523032764 | 7.4 | 7.5 | 7.54 | 7.73 |
| 384 | 1 | 190.53 | 1.757170525 | 5.25 | 5.35 | 5.42 | 5.8 |
| 384 | 2 | 289.67 | 2.248292456 | 6.9 | 7.08 | 7.24 | 7.57 |
| 384 | 4 | 404.03 | 2.946328302 | 9.9 | 10 | 10.03 | 10.13 |
| 384 | 8 | 504.24 | 3.450153951 | 15.87 | 15.96 | 16.01 | 16.3 |

BERT-Base FP32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 200.29 | 4.99 | 5.08 | 5.16 | 5.53 |
| 128 | 2 | 294.89 | 6.78 | 6.89 | 6.93 | 7.37 |
| 128 | 4 | 371.65 | 10.76 | 10.89 | 10.96 | 11.92 |
| 128 | 8 | 428.52 | 18.67 | 18.89 | 18.98 | 19.17 |
| 384 | 1 | 108.43 | 9.22 | 9.26 | 9.31 | 10.24 |
| 384 | 2 | 128.84 | 15.52 | 15.6 | 15.71 | 16.49 |
| 384 | 4 | 137.13 | 29.17 | 29.4 | 29.48 | 29.64 |
| 384 | 8 | 146.15 | 54.74 | 55.19 | 55.3 | 55.54 |

To achieve these same results, follow the Quick Start Guide outlined above.

Fine-tuning inference performance for SQuAD v1.1 on NVIDIA DGX-1 V100 (1x V100 32GB)

Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmark script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).

BERT-Large FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 101.58 | 1.242112986 | 9.84 | 9.99 | 10.06 | 10.39 |
| 128 | 2 | 181.89 | 1.651593571 | 11 | 11.14 | 11.2 | 11.87 |
| 128 | 4 | 295.86 | 2.348840902 | 13.52 | 13.67 | 13.75 | 14.5 |
| 128 | 8 | 411.29 | 3.010246652 | 19.45 | 19.62 | 19.69 | 20.4 |
| 384 | 1 | 72.95 | 2.083690374 | 13.71 | 13.93 | 14.08 | 14.81 |
| 384 | 2 | 107.02 | 2.583775954 | 18.69 | 18.8 | 18.88 | 19.57 |
| 384 | 4 | 139.8 | 3.14652262 | 28.61 | 28.75 | 28.88 | 29.6 |
| 384 | 8 | 163.68 | 3.595782074 | 48.88 | 49.09 | 49.18 | 49.77 |

BERT-Large FP32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 81.78 | 12.23 | 12.37 | 12.43 | 13.2 |
| 128 | 2 | 110.13 | 18.16 | 18.29 | 18.37 | 19.27 |
| 128 | 4 | 125.96 | 31.76 | 32.09 | 32.21 | 32.42 |
| 128 | 8 | 136.63 | 58.55 | 58.93 | 59.05 | 59.4 |
| 384 | 1 | 35.01 | 28.56 | 28.81 | 28.94 | 29.16 |
| 384 | 2 | 41.42 | 48.29 | 48.57 | 48.67 | 49.02 |
| 384 | 4 | 44.43 | 90.03 | 90.43 | 90.59 | 90.89 |
| 384 | 8 | 45.52 | 175.76 | 176.66 | 176.89 | 177.33 |

BERT-Base FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 234.85 | 1.217533309 | 4.26 | 4.33 | 4.37 | 4.62 |
| 128 | 2 | 415.86 | 1.435782351 | 4.81 | 4.92 | 5.06 | 5.55 |
| 128 | 4 | 680.09 | 1.84912586 | 5.88 | 6.1 | 6.2 | 6.53 |
| 128 | 8 | 1030.03 | 2.264548752 | 7.77 | 7.87 | 7.95 | 8.53 |
| 384 | 1 | 183.18 | 1.700993593 | 5.46 | 5.56 | 5.61 | 5.93 |
| 384 | 2 | 275.77 | 2.175528558 | 7.25 | 7.38 | 7.44 | 7.89 |
| 384 | 4 | 385.61 | 2.778570399 | 10.37 | 10.56 | 10.63 | 11.1 |
| 384 | 8 | 488.45 | 3.292329469 | 16.38 | 16.48 | 16.52 | 16.64 |

BERT-Base FP32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 192.89 | 5.18 | 5.3 | 5.36 | 5.65 |
| 128 | 2 | 289.64 | 6.91 | 7 | 7.22 | 7.83 |
| 128 | 4 | 367.79 | 10.88 | 10.98 | 11.02 | 11.59 |
| 128 | 8 | 454.85 | 17.59 | 17.76 | 17.81 | 17.92 |
| 384 | 1 | 107.69 | 9.29 | 9.37 | 9.42 | 9.88 |
| 384 | 2 | 126.76 | 15.78 | 15.89 | 15.97 | 16.72 |
| 384 | 4 | 138.78 | 28.82 | 28.98 | 29.06 | 29.88 |
| 384 | 8 | 148.36 | 53.92 | 54.16 | 54.26 | 54.58 |

To achieve these same results, follow the Quick Start Guide outlined above.

Fine-tuning inference performance for SQuAD v1.1 on NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmark script in the TensorFlow 21.02-py3 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).

BERT-Large FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (TF32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 145.21 | 0.9435347628 | 6.89 | 7.14 | 7.4 | 8.35 |
| 128 | 2 | 272.81 | 1.093953003 | 7.33 | 7.61 | 7.77 | 8.35 |
| 128 | 4 | 468.98 | 1.273087573 | 8.53 | 8.71 | 8.83 | 9.85 |
| 128 | 8 | 705.67 | 1.191627687 | 11.34 | 11.64 | 11.9 | 13.1 |
| 384 | 1 | 118.34 | 1.042459479 | 8.45 | 8.82 | 8.99 | 9.52 |
| 384 | 2 | 197.8 | 1.231478023 | 10.11 | 10.48 | 10.62 | 11.4 |
| 384 | 4 | 275.19 | 1.268332027 | 14.54 | 14.73 | 14.8 | 16.8 |
| 384 | 8 | 342.22 | 1.416004634 | 23.38 | 23.64 | 23.75 | 24.1 |

BERT-Large TF32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 153.9 | 6.5 | 6.76 | 6.86 | 7.4 |
| 128 | 2 | 249.38 | 8.02 | 8.22 | 8.34 | 9.45 |
| 128 | 4 | 368.38 | 10.86 | 11.11 | 11.24 | 12.76 |
| 128 | 8 | 592.19 | 13.51 | 13.64 | 13.77 | 15.85 |
| 384 | 1 | 113.52 | 8.81 | 9.02 | 9.16 | 10.19 |
| 384 | 2 | 160.62 | 12.45 | 12.61 | 12.68 | 14.47 |
| 384 | 4 | 216.97 | 18.44 | 18.6 | 18.7 | 18.84 |
| 384 | 8 | 241.68 | 33.1 | 33.29 | 33.36 | 33.5 |

BERT-Base FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (TF32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 295.01 | 1.014023992 | 3.39 | 3.59 | 3.65 | 3.73 |
| 128 | 2 | 594.81 | 1.048455898 | 3.36 | 3.59 | 3.68 | 4.19 |
| 128 | 4 | 1043.12 | 1.005145599 | 3.83 | 3.97 | 4.2 | 4.44 |
| 128 | 8 | 1786.25 | 1.198278638 | 4.48 | 4.73 | 4.8 | 5.19 |
| 384 | 1 | 278.85 | 1.103395062 | 3.59 | 3.67 | 3.99 | 4.15 |
| 384 | 2 | 464.77 | 1.252006896 | 4.3 | 4.59 | 4.87 | 5.29 |
| 384 | 4 | 675.82 | 1.264822578 | 5.92 | 6.15 | 6.27 | 6.94 |
| 384 | 8 | 846.81 | 1.31109494 | 9.45 | 9.65 | 9.74 | 11.03 |

BERT-Base TF32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 290.93 | 3.44 | 3.61 | 3.73 | 4.69 |
| 128 | 2 | 567.32 | 3.53 | 3.64 | 3.96 | 5.01 |
| 128 | 4 | 1037.78 | 3.85 | 3.95 | 4.06 | 4.58 |
| 128 | 8 | 1490.68 | 5.37 | 5.61 | 5.66 | 6.19 |
| 384 | 1 | 252.72 | 3.96 | 3.96 | 4.52 | 4.66 |
| 384 | 2 | 371.22 | 5.39 | 5.64 | 5.71 | 6.38 |
| 384 | 4 | 534.32 | 7.49 | 7.69 | 7.76 | 8.56 |
| 384 | 8 | 645.88 | 12.39 | 12.61 | 12.67 | 12.77 |

To achieve these same results, follow the Quick Start Guide outlined above.

Fine-tuning inference performance for SQuAD v1.1 on NVIDIA Tesla T4 (1x T4 16GB)

Our results were obtained by running the scripts/finetune_inference_benchmark.sh benchmark script in the TensorFlow 21.02-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1000 iterations. Latency is computed as the time taken to process a single batch, with batches fed to the model one after another (i.e., no pipelining).

BERT-Large FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 57.6 | 1.364605544 | 17.36 | 18.16 | 19.02 | 21.67 |
| 128 | 2 | 102.76 | 2.17988969 | 19.46 | 20.68 | 21.27 | 22.2 |
| 128 | 4 | 151.11 | 3.146813828 | 26.47 | 26.9 | 27.06 | 27.45 |
| 128 | 8 | 186.99 | 3.733080455 | 42.78 | 43.87 | 44.18 | 44.78 |
| 384 | 1 | 38.88 | 2.590273151 | 25.72 | 26.06 | 26.16 | 26.38 |
| 384 | 2 | 50.53 | 3.202154626 | 39.58 | 39.93 | 40.35 | 40.95 |
| 384 | 4 | 57.69 | 3.721935484 | 69.34 | 70.5 | 70.77 | 71.09 |
| 384 | 8 | 62.99 | 3.927057357 | 127 | 129.18 | 130.07 | 131.86 |

BERT-Large FP32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 42.21 | 23.69 | 24.8 | 25.02 | 25.48 |
| 128 | 2 | 47.14 | 42.42 | 43.48 | 43.63 | 44.32 |
| 128 | 4 | 48.02 | 83.29 | 84.37 | 84.68 | 85.14 |
| 128 | 8 | 50.09 | 159.72 | 161.66 | 161.97 | 162.52 |
| 384 | 1 | 15.01 | 66.63 | 67.76 | 68.08 | 68.66 |
| 384 | 2 | 15.78 | 126.78 | 128.21 | 128.58 | 129.08 |
| 384 | 4 | 15.5 | 258.1 | 261.01 | 261.66 | 262.55 |
| 384 | 8 | 16.04 | 498.61 | 504.29 | 504.74 | 505.55 |

BERT-Base FP16

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 1 | 116.56 | 1.039878669 | 8.58 | 9.53 | 10.84 | 11.74 |
| 128 | 2 | 238.62 | 1.675937632 | 8.38 | 9.09 | 9.27 | 12.33 |
| 128 | 4 | 402.93 | 2.440964439 | 9.93 | 10.07 | 10.13 | 12.17 |
| 128 | 8 | 532.56 | 3.052619512 | 15.02 | 15.43 | 15.6 | 16.52 |
| 384 | 1 | 102.12 | 2.035073735 | 9.79 | 11.06 | 11.18 | 12.07 |
| 384 | 2 | 149.3 | 2.910898811 | 13.4 | 13.54 | 13.62 | 14.36 |
| 384 | 4 | 177.78 | 3.563439567 | 22.5 | 23.11 | 23.27 | 23.59 |
| 384 | 8 | 192.61 | 3.752386519 | 41.53 | 42.67 | 42.81 | 43.31 |

BERT-Base FP32

| Sequence Length | Batch Size | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|
| 128 | 1 | 112.09 | 8.92 | 9.12 | 9.49 | 10.93 |
| 128 | 2 | 142.38 | 14.05 | 14.34 | 14.48 | 15.03 |
| 128 | 4 | 165.07 | 24.23 | 24.86 | 24.92 | 25.05 |
| 128 | 8 | 174.46 | 45.86 | 46.71 | 46.8 | 47.2 |
| 384 | 1 | 50.18 | 19.93 | 20.53 | 21.04 | 21.73 |
| 384 | 2 | 51.29 | 38.99 | 39.68 | 39.93 | 40.2 |
| 384 | 4 | 49.89 | 80.18 | 81.54 | 82 | 82.65 |
| 384 | 8 | 51.33 | 155.85 | 158.11 | 158.5 | 159.17 |

To achieve these same results, follow the Quick Start Guide outlined above.