
BERT for TensorFlow

Description

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

Publisher: NVIDIA
Latest Version: 20.06.17
Modified: April 4, 2023
Compressed Size: 1.37 MB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Both benchmarking scripts fine-tune the BERT model, run a set number of epochs or iterations, and extract performance numbers.

Training performance benchmark

Training benchmarking can be performed by running the script:

scripts/finetune_train_benchmark.sh <bert_model> <use_xla> <num_gpu> squad

This script runs 2 epochs by default on the SQuAD v1.1 dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32. These numbers are saved at /results/squad_train_benchmark_bert_<bert_model>_gpu_<num_gpu>.log.
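
For illustration, a minimal sketch of driving the benchmark for both model sizes is shown below. The argument values (`base`/`large` for `<bert_model>`, `true` for `<use_xla>`, `8` for `<num_gpu>`) are example choices, not settings prescribed by this document; the log path follows the pattern described above.

```python
# Hypothetical convenience wrapper around the benchmark script; the argument values
# for <bert_model>, <use_xla>, and <num_gpu> are illustrative examples only.
import subprocess

num_gpu = 8        # number of GPUs to benchmark on (example value)
use_xla = "true"   # whether to enable XLA (example value)

for bert_model in ("base", "large"):
    subprocess.run(
        ["bash", "scripts/finetune_train_benchmark.sh",
         bert_model, use_xla, str(num_gpu), "squad"],
        check=True,
    )
    print(f"Results written to /results/squad_train_benchmark_bert_{bert_model}_gpu_{num_gpu}.log")
```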

Inference performance benchmark

Inference benchmarking can be performed by running the script:

scripts/finetune_inference_benchmark.sh squad

This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32, for base and large models. These numbers are saved at /results/squad_inference_benchmark_bert_<bert_model>.log.
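
As a small illustration, one might wrap the script and then confirm that the expected per-model log files were produced; the log path pattern and the base/large model names come from the description above, while the wrapper itself is hypothetical.

```python
# Minimal sketch: run the inference benchmark, then check that the expected
# per-model log files exist (path pattern taken from the description above).
import pathlib
import subprocess

subprocess.run(["bash", "scripts/finetune_inference_benchmark.sh", "squad"], check=True)

for bert_model in ("base", "large"):
    log = pathlib.Path(f"/results/squad_inference_benchmark_bert_{bert_model}.log")
    print(log, "found" if log.exists() else "missing")
```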

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference, covering pre-training with the LAMB optimizer as well as fine-tuning for question answering. All results are for the BERT-large model unless otherwise noted. All fine-tuning results are on SQuAD v1.1 with a sequence length of 384 unless otherwise noted.

Training accuracy results

Training accuracy
Pre-training accuracy

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container.

| DGX System | Nodes x GPUs | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
|---|---|---|---|---|---|---|
| DGX2H | 32 x 16 | FP16 | 64, 8 | 2, 8 | 2.63 | 1.59 |
| DGX2H | 32 x 16 | FP32 | 32, 8 | 4, 8 | 8.48 | 1.56 |
| DGXA100 | 32 x 8 | FP16 | 64, 16 | 4, 8 | 3.24 | 1.56 |
| DGXA100 | 32 x 8 | TF32 | 64, 8 | 4, 16 | 4.58 | 1.58 |

Note: Time to train includes up to 16 minutes of start-up time for every restart (at least once per phase). Experiments were run on clusters with a maximum wall clock time of 8 hours.
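
As a cross-check (a reading of the table above, not part of the published scripts), the effective global batch size in each phase is the per-GPU batch size times the gradient-accumulation steps times the total number of GPUs:

```python
# Sanity-check the effective global batch size implied by the table above:
# global batch = nodes * gpus_per_node * batch_per_gpu * accumulation_steps.
configs = [
    # (system, nodes, GPUs/node, batch/GPU (phase1, phase2), accumulation (phase1, phase2))
    ("DGX2H FP16",   32, 16, (64, 8),  (2, 8)),
    ("DGX2H FP32",   32, 16, (32, 8),  (4, 8)),
    ("DGXA100 FP16", 32, 8,  (64, 16), (4, 8)),
    ("DGXA100 TF32", 32, 8,  (64, 8),  (4, 16)),
]
for name, nodes, gpus, batch, accum in configs:
    phase1 = nodes * gpus * batch[0] * accum[0]
    phase2 = nodes * gpus * batch[1] * accum[1]
    print(f"{name}: phase1 global batch = {phase1}, phase2 global batch = {phase2}")
# Every configuration works out to 65536 for phase 1 and 32768 for phase 2, matching
# the Global Batch Size column reported in the performance tables below.
```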

Fine-tuning accuracy for SQuAD v1.1: NVIDIA DGX A100 (8x A100 40G)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs.

| GPUs | Batch size / GPU: TF32, FP16 | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) |
|---|---|---|---|---|---|
| 8 | 16, 24 | 91.41 | 91.52 | 0.26 | 0.26 |

Fine-tuning accuracy for GLUE MRPC: NVIDIA DGX A100 (8x A100 40G)

Our results were obtained by running the scripts/run_glue.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs for 10 different seeds, picking the maximum accuracy on the MRPC dev set.

| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) | Throughput - TF32 | Throughput - mixed precision |
|---|---|---|---|---|---|---|---|
| 8 | 16 | 87.99 | 87.09 | 0.009 | 0.009 | 357.91 | 230.16 |

Training stability test
Pre-training stability test: NVIDIA DGX A100 (256x A100 40GB)

The following tables compare Final Loss scores across 2 different training runs with different seeds, for both FP16 and TF32. The runs show consistent convergence across both seeds with very little deviation.

| FP16, 256x GPUs | seed 1 | seed 2 | mean | std |
|---|---|---|---|---|
| Final Loss | 1.570 | 1.561 | 1.565 | 0.006 |

| TF32, 256x GPUs | seed 1 | seed 2 | mean | std |
|---|---|---|---|---|
| Final Loss | 1.583 | 1.582 | 1.582 | 0.0007 |

Fine-tuning SQuAD v1.1 stability test: NVIDIA DGX A100 (8x A100 40GB)

The following tables compare F1 scores across 5 different training runs with different seeds, for FP16 and TF32, using [NVIDIA's pretrained checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs show consistent convergence across all 5 seeds with very little deviation.

| FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 | 91.61 | 91.04 | 91.59 | 91.32 | 91.52 | 91.41 | 0.24 |

| TF32, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 | 91.50 | 91.49 | 91.64 | 91.29 | 91.67 | 91.52 | 0.15 |
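
The mean and std columns can be reproduced directly from the per-seed scores; a minimal sketch for the TF32 row above, which matches the table when the sample standard deviation is used:

```python
# Reproduce the mean and std columns from the per-seed F1 scores (TF32, 8x GPUs row).
import statistics

f1_scores = [91.50, 91.49, 91.64, 91.29, 91.67]
print(f"mean = {statistics.mean(f1_scores):.2f}")   # 91.52
print(f"std  = {statistics.stdev(f1_scores):.2f}")  # 0.15 (sample standard deviation)
```
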
Fine-tuning GLUE MRPC stability test: NVIDIA DGX A100 (8x A100 40GB)

The following tables compare evaluation accuracy scores across 10 different training runs with different seeds, for FP16 and TF32, using [NVIDIA's pretrained checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs show consistent convergence across all 10 seeds with very little deviation.

| FP16, 8 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | seed 6 | seed 7 | seed 8 | seed 9 | seed 10 | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eval Accuracy | 84.31372643 | 85.78431606 | 86.76471114 | 87.00980544 | 86.27451062 | 86.27451062 | 85.5392158 | 86.51961088 | 86.27451062 | 85.2941215 | 86.00490391 | 0.795887906 |

| TF32, 8 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | seed 6 | seed 7 | seed 8 | seed 9 | seed 10 | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eval Accuracy | 87.00980544 | 86.27451062 | 87.99020052 | 86.27451062 | 86.02941632 | 87.00980544 | 86.27451062 | 86.51961088 | 87.74510026 | 86.02941632 | 86.7156887 | 0.7009024515 |

Training performance results

Training performance: NVIDIA DGX-1 (8x V100 16GB)
Pre-training training performance: single-node on DGX-1 16GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 16, 8 | 4096, 8192 | 65536 | 134.34 | 39.43 | 3.41 | 1.00 | 1.00 |
| 4 | 128 | 16, 8 | 1024, 2048 | 65536 | 449.68 | 152.33 | 2.95 | 3.35 | 3.86 |
| 8 | 128 | 16, 8 | 512, 1024 | 65536 | 1001.39 | 285.79 | 3.50 | 7.45 | 7.25 |
| 1 | 512 | 4, 2 | 8192, 16384 | 32768 | 28.72 | 9.80 | 2.93 | 1.00 | 1.00 |
| 4 | 512 | 4, 2 | 2048, 4096 | 32768 | 109.96 | 35.32 | 3.11 | 3.83 | 3.60 |
| 8 | 512 | 4, 2 | 1024, 2048 | 32768 | 190.65 | 69.53 | 2.74 | 6.64 | 7.09 |

Note: FP32 results for batch sizes of 16 (sequence length 128) and 4 (sequence length 512) are not available because those configurations run out of memory.
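
The derived columns in these tables follow directly from the throughput values: speedup divides mixed-precision throughput by FP32 throughput at the same GPU count, and weak scaling divides N-GPU throughput by single-GPU throughput at the same precision. A quick check against the sequence-length-128 rows above:

```python
# Reproduce the derived columns from the sequence-length-128 rows of the table above.
mixed = {1: 134.34, 4: 449.68, 8: 1001.39}   # throughput (sentences/sec), mixed precision
fp32  = {1: 39.43,  4: 152.33, 8: 285.79}    # throughput (sentences/sec), FP32

for gpus in (1, 4, 8):
    speedup = mixed[gpus] / fp32[gpus]            # e.g. 1001.39 / 285.79 = 3.50 for 8 GPUs
    weak_scaling_mixed = mixed[gpus] / mixed[1]   # e.g. 1001.39 / 134.34 = 7.45 for 8 GPUs
    weak_scaling_fp32 = fp32[gpus] / fp32[1]      # e.g. 285.79 / 39.43 = 7.25 for 8 GPUs
    print(f"{gpus} GPUs: speedup {speedup:.2f}, "
          f"weak scaling {weak_scaling_mixed:.2f} (mixed) / {weak_scaling_fp32:.2f} (FP32)")
```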

Fine-tuning training performance for SQuAD v1.1 on DGX-1 16GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 4, 2 | 29.74 | 7.36 | 4.04 | 1.00 | 1.00 |
| 4 | 4, 2 | 97.28 | 26.64 | 3.65 | 3.27 | 3.62 |
| 8 | 4, 2 | 189.77 | 52.39 | 3.62 | 6.38 | 7.12 |

Note: FP32 results for a batch size of 4 are not available because that configuration runs out of memory.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 32GB)
Pre-training training performance: single-node on DGX-1 32GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 1024, 2048 | 65536 | 168.63 | 46.78 | 3.60 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 256, 512 | 65536 | 730.25 | 179.73 | 4.06 | 4.33 | 3.84 |
| 8 | 128 | 64, 32 | 128, 256 | 65536 | 1443.05 | 357.00 | 4.04 | 8.56 | 7.63 |
| 1 | 512 | 8, 8 | 4096, 4096 | 32768 | 31.23 | 10.67 | 2.93 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 1024, 1024 | 32768 | 118.84 | 39.55 | 3.00 | 3.81 | 3.71 |
| 8 | 512 | 8, 8 | 512, 512 | 32768 | 255.64 | 81.42 | 3.14 | 8.19 | 7.63 |

Note: FP32 results for a batch size of 64 at sequence length 128 are not available because that configuration runs out of memory.

Fine-tuning training performance for SQuAD v1.1 on DGX-1 32GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 24, 10 | 51.02 | 10.42 | 4.90 | 1.00 | 1.00 |
| 4 | 24, 10 | 181.37 | 39.77 | 4.56 | 3.55 | 3.82 |
| 8 | 24, 10 | 314.6 | 79.37 | 3.96 | 6.17 | 7.62 |

Note: FP32 results for a batch size of 24 are not available because that configuration runs out of memory.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-2 (16x V100 32GB)
Pre-training training performance: single-node on DGX-2 32GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 1024, 8192 | 65536 | 188.04 | 35.32 | 5.32 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 256, 2048 | 65536 | 790.89 | 193.08 | 4.10 | 4.21 | 5.47 |
| 8 | 128 | 64, 32 | 128, 1024 | 65536 | 1556.89 | 386.89 | 4.02 | 8.28 | 10.95 |
| 16 | 128 | 64, 32 | 64, 128 | 65536 | 3081.69 | 761.92 | 4.04 | 16.39 | 21.57 |
| 1 | 512 | 8, 8 | 4096, 4096 | 32768 | 35.32 | 11.67 | 3.03 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 1024, 1024 | 32768 | 128.98 | 42.84 | 3.01 | 3.65 | 3.67 |
| 8 | 512 | 8, 8 | 512, 512 | 32768 | 274.04 | 86.78 | 3.16 | 7.76 | 7.44 |
| 16 | 512 | 8, 8 | 256, 256 | 32768 | 513.43 | 173.26 | 2.96 | 14.54 | 14.85 |

Note: FP32 results for a batch size of 64 at sequence length 128 are not available because that configuration runs out of memory.

Pre-training training performance: multi-node on DGX-2H 32GB

Our results were obtained by running the run.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H nodes, each with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| Num Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 64, 128 | 65536 | 3081.69 | 761.92 | 4.04 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 16, 32 | 65536 | 13192.00 | 3389.83 | 3.89 | 4.28 | 4.45 |
| 16 | 128 | 64, 32 | 4, 8 | 65536 | 48223.00 | 13217.78 | 3.65 | 15.65 | 17.35 |
| 32 | 128 | 64, 32 | 2, 4 | 65536 | 86673.64 | 25142.26 | 3.45 | 28.13 | 33.00 |
| 1 | 512 | 8, 8 | 256, 256 | 32768 | 577.79 | 173.26 | 3.33 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 64, 64 | 32768 | 2284.23 | 765.04 | 2.99 | 3.95 | 4.42 |
| 16 | 512 | 8, 8 | 16, 16 | 32768 | 8853.00 | 3001.43 | 2.95 | 15.32 | 17.32 |
| 32 | 512 | 8, 8 | 8, 8 | 32768 | 17059.00 | 5893.14 | 2.89 | 29.52 | 34.01 |

Note: FP32 results for a batch size of 64 at sequence length 128 are not available because that configuration runs out of memory.

Fine-tuning training performance for SQuAD v1.1 on DGX-2 32GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 24, 10 | 55.28 | 11.15 | 4.96 | 1.00 | 1.00 |
| 4 | 24, 10 | 199.53 | 42.91 | 4.65 | 3.61 | 3.85 |
| 8 | 24, 10 | 341.55 | 85.08 | 4.01 | 6.18 | 7.63 |
| 16 | 24, 10 | 683.37 | 156.29 | 4.37 | 12.36 | 14.02 |

Note: FP32 results for a batch size of 24 are not available because that configuration runs out of memory.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX A100 (8x A100 40GB)
Pre-training training performance: single-node on DGX A100 40GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 64 | 1024, 1024 | 65536 | 356.845 | 238.10 | 1.50 | 1.00 | 1.00 |
| 4 | 128 | 64, 64 | 256, 256 | 65536 | 1422.25 | 952.39 | 1.49 | 3.99 | 4.00 |
| 8 | 128 | 64, 64 | 128, 128 | 65536 | 2871.89 | 1889.71 | 1.52 | 8.05 | 7.94 |
| 1 | 512 | 16, 8 | 2048, 4096 | 32768 | 70.856 | 39.96 | 1.77 | 1.00 | 1.00 |
| 4 | 512 | 16, 8 | 512, 1024 | 32768 | 284.912 | 160.16 | 1.78 | 4.02 | 4.01 |
| 8 | 512 | 16, 8 | 256, 512 | 32768 | 572.112 | 316.51 | 1.81 | 8.07 | 7.92 |

Note: TF32 results for a batch size of 16 at sequence length 512 are not available because that configuration runs out of memory.

Pre-training training performance: multi-node on DGX A100 40GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container using multiple NVIDIA DGX A100 nodes, each with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

| Num Nodes | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 64 | 128, 128 | 65536 | 2871.89 | 1889.71 | 1.52 | 1.00 | 1.00 |
| 4 | 128 | 64, 64 | 32, 32 | 65536 | 11159 | 7532.00 | 1.48 | 3.89 | 3.99 |
| 16 | 128 | 64, 64 | 8, 8 | 65536 | 41144 | 28605.62 | 1.44 | 14.33 | 15.14 |
| 32 | 128 | 64, 64 | 4, 4 | 65536 | 77479.87 | 53585.82 | 1.45 | 26.98 | 28.36 |
| 1 | 512 | 16, 8 | 256, 512 | 32768 | 572.112 | 316.51 | 1.81 | 1.00 | 1.00 |
| 4 | 512 | 16, 8 | 128, 128 | 65536 | 2197.44 | 1268.43 | 1.73 | 3.84 | 4.01 |
| 16 | 512 | 16, 8 | 32, 32 | 65536 | 8723.1 | 4903.39 | 1.78 | 15.25 | 15.49 |
| 32 | 512 | 16, 8 | 16, 16 | 65536 | 16705 | 9463.80 | 1.77 | 29.20 | 29.90 |

Note: TF32 results for a batch size of 16 at sequence length 512 are not available because that configuration runs out of memory.

Fine-tuning training performance for SQuAD v1.1 on DGX A100 40GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 32, 16 | 102.26 | 61.364 | 1.67 | 1.00 | 1.00 |
| 4 | 32, 16 | 366.353 | 223.187 | 1.64 | 3.64 | 3.58 |
| 8 | 32, 16 | 767.071 | 440.47 | 1.74 | 7.18 | 7.50 |

Note: TF32 results for a batch size of 32 are not available because that configuration runs out of memory.

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance results

Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Fine-tuning inference performance for SQuAD v1.1 on 16GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a single batch, with batches fed to the model one after another (no pipelining).
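
A minimal sketch of this measurement scheme is shown below, assuming a placeholder `infer(batch)` callable in place of the fine-tuned model; this illustrates the bookkeeping described above rather than reproducing the benchmark script itself.

```python
# Sketch of the latency measurement described above: batches are processed one at a
# time with no pipelining, and the average plus 90/95/99th-percentile latencies are
# reported. `infer` and `batches` are placeholders for the model and the input data.
import time
import numpy as np

def measure_latency(infer, batches):
    latencies_ms = []
    for batch in batches:                              # fed one after another, no pipelining
        start = time.perf_counter()
        infer(batch)                                   # process a single batch
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms = np.asarray(latencies_ms)
    return {
        "latency_avg_ms": latencies_ms.mean(),
        "latency_p90_ms": np.percentile(latencies_ms, 90),
        "latency_p95_ms": np.percentile(latencies_ms, 95),
        "latency_p99_ms": np.percentile(latencies_ms, 99),
    }
```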

| Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 206.82 | 7.96 | 4.98 | 5.04 | 5.23 |
| base | 128 | 2 | fp16 | 376.75 | 8.68 | 5.42 | 5.49 | 5.64 |
| base | 128 | 4 | fp16 | 635 | 12.31 | 6.46 | 6.55 | 6.83 |
| base | 128 | 8 | fp16 | 962.83 | 13.64 | 8.47 | 8.56 | 8.75 |
| base | 384 | 1 | fp16 | 167.01 | 12.77 | 6.12 | 6.23 | 6.52 |
| base | 384 | 2 | fp16 | 252.12 | 21.05 | 8.03 | 8.09 | 8.61 |
| base | 384 | 4 | fp16 | 341.95 | 25.09 | 11.88 | 11.96 | 12.52 |
| base | 384 | 8 | fp16 | 421.26 | 33.16 | 19.2 | 19.37 | 19.91 |
| base | 128 | 1 | fp32 | 174.48 | 8.17 | 5.89 | 5.95 | 6.12 |
| base | 128 | 2 | fp32 | 263.67 | 10.33 | 7.66 | 7.69 | 7.92 |
| base | 128 | 4 | fp32 | 349.34 | 16.31 | 11.57 | 11.62 | 11.87 |
| base | 128 | 8 | fp32 | 422.88 | 23.27 | 19.23 | 19.38 | 20.38 |
| base | 384 | 1 | fp32 | 99.52 | 14.99 | 10.19 | 10.23 | 10.78 |
| base | 384 | 2 | fp32 | 118.01 | 25.98 | 17.12 | 17.18 | 17.78 |
| base | 384 | 4 | fp32 | 128.1 | 41 | 31.56 | 31.7 | 32.39 |
| base | 384 | 8 | fp32 | 136.1 | 69.77 | 59.44 | 59.66 | 60.51 |
| large | 128 | 1 | fp16 | 98.63 | 15.86 | 10.27 | 10.31 | 10.46 |
| large | 128 | 2 | fp16 | 172.59 | 17.78 | 11.81 | 11.86 | 12.13 |
| large | 128 | 4 | fp16 | 272.86 | 25.66 | 14.86 | 14.94 | 15.18 |
| large | 128 | 8 | fp16 | 385.64 | 30.74 | 20.98 | 21.1 | 21.68 |
| large | 384 | 1 | fp16 | 70.74 | 26.85 | 14.38 | 14.47 | 14.7 |
| large | 384 | 2 | fp16 | 99.9 | 45.29 | 20.26 | 20.43 | 21.11 |
| large | 384 | 4 | fp16 | 128.42 | 56.94 | 31.44 | 31.71 | 32.45 |
| large | 384 | 8 | fp16 | 148.57 | 81.69 | 54.23 | 54.54 | 55.53 |
| large | 128 | 1 | fp32 | 76.75 | 17.06 | 13.21 | 13.27 | 13.4 |
| large | 128 | 2 | fp32 | 100.82 | 24.34 | 20.05 | 20.13 | 21.13 |
| large | 128 | 4 | fp32 | 117.59 | 41.76 | 34.42 | 34.55 | 35.29 |
| large | 128 | 8 | fp32 | 130.42 | 68.59 | 62 | 62.23 | 62.98 |
| large | 384 | 1 | fp32 | 33.95 | 37.89 | 29.82 | 29.98 | 30.56 |
| large | 384 | 2 | fp32 | 38.47 | 68.35 | 52.56 | 52.74 | 53.89 |
| large | 384 | 4 | fp32 | 41.11 | 114.27 | 98.19 | 98.54 | 99.54 |
| large | 384 | 8 | fp32 | 41.32 | 213.84 | 194.92 | 195.36 | 196.94 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-1 (1x V100 32GB)
Fine-tuning inference performance for SQuAD v1.1 on 32GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a single batch, with batches fed to the model one after another (no pipelining).

| Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 207.87 | 7.63 | 4.94 | 5.03 | 5.32 |
| base | 128 | 2 | fp16 | 376.44 | 8.47 | 5.44 | 5.5 | 5.68 |
| base | 128 | 4 | fp16 | 642.55 | 11.63 | 6.3 | 6.36 | 6.68 |
| base | 128 | 8 | fp16 | 943.85 | 13.24 | 8.56 | 8.68 | 8.92 |
| base | 384 | 1 | fp16 | 162.62 | 12.24 | 6.31 | 6.4 | 6.73 |
| base | 384 | 2 | fp16 | 244.15 | 20.05 | 8.34 | 8.41 | 8.93 |
| base | 384 | 4 | fp16 | 338.68 | 23.53 | 11.88 | 11.92 | 12.63 |
| base | 384 | 8 | fp16 | 407.46 | 32.72 | 19.84 | 20.06 | 20.89 |
| base | 128 | 1 | fp32 | 175.16 | 8.31 | 5.85 | 5.89 | 6.04 |
| base | 128 | 2 | fp32 | 261.31 | 10.48 | 7.75 | 7.81 | 8.08 |
| base | 128 | 4 | fp32 | 339.45 | 16.67 | 11.95 | 12.02 | 12.46 |
| base | 128 | 8 | fp32 | 406.67 | 24.12 | 19.86 | 19.97 | 20.41 |
| base | 384 | 1 | fp32 | 98.33 | 15.28 | 10.27 | 10.32 | 10.76 |
| base | 384 | 2 | fp32 | 114.92 | 26.88 | 17.55 | 17.59 | 18.29 |
| base | 384 | 4 | fp32 | 125.76 | 41.74 | 32.06 | 32.23 | 33.72 |
| base | 384 | 8 | fp32 | 136.62 | 69.78 | 58.95 | 59.19 | 60 |
| large | 128 | 1 | fp16 | 96.46 | 15.56 | 10.56 | 10.66 | 11.02 |
| large | 128 | 2 | fp16 | 168.31 | 17.42 | 12.11 | 12.25 | 12.57 |
| large | 128 | 4 | fp16 | 267.76 | 24.76 | 15.17 | 15.36 | 16.68 |
| large | 128 | 8 | fp16 | 378.28 | 30.34 | 21.39 | 21.54 | 21.97 |
| large | 384 | 1 | fp16 | 68.75 | 26.02 | 14.77 | 14.94 | 15.3 |
| large | 384 | 2 | fp16 | 95.41 | 44.01 | 21.24 | 21.47 | 22.01 |
| large | 384 | 4 | fp16 | 124.43 | 55.14 | 32.53 | 32.83 | 33.58 |
| large | 384 | 8 | fp16 | 143.02 | 81.37 | 56.51 | 56.88 | 58.05 |
| large | 128 | 1 | fp32 | 75.34 | 17.5 | 13.46 | 13.52 | 13.7 |
| large | 128 | 2 | fp32 | 99.73 | 24.7 | 20.27 | 20.38 | 21.45 |
| large | 128 | 4 | fp32 | 116.92 | 42.1 | 34.49 | 34.59 | 34.98 |
| large | 128 | 8 | fp32 | 130.11 | 68.95 | 62.03 | 62.23 | 63.3 |
| large | 384 | 1 | fp32 | 33.84 | 38.15 | 29.75 | 29.89 | 31.23 |
| large | 384 | 2 | fp32 | 38.02 | 69.31 | 53.1 | 53.36 | 54.42 |
| large | 384 | 4 | fp32 | 41.2 | 114.34 | 97.96 | 98.32 | 99.55 |
| large | 384 | 8 | fp32 | 42.37 | 209.16 | 190.18 | 190.66 | 192.77 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-2 (1x V100 32GB)
Fine-tuning inference performance for SQuAD v1.1 on DGX-2 32GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a single batch, with batches fed to the model one after another (no pipelining).

| Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 220.35 | 7.82 | 4.7 | 4.83 | 5.15 |
| base | 128 | 2 | fp16 | 384.55 | 8.7 | 5.49 | 5.68 | 6.01 |
| base | 128 | 4 | fp16 | 650.7 | 36.3 | 6.35 | 6.51 | 6.87 |
| base | 128 | 8 | fp16 | 992.41 | 13.59 | 8.22 | 8.37 | 8.96 |
| base | 384 | 1 | fp16 | 172.89 | 12.86 | 5.94 | 6.04 | 6.44 |
| base | 384 | 2 | fp16 | 258.48 | 20.42 | 7.89 | 8.09 | 9.15 |
| base | 384 | 4 | fp16 | 346.34 | 24.93 | 11.97 | 12.12 | 12.76 |
| base | 384 | 8 | fp16 | 430.4 | 33.08 | 18.75 | 19.27 | 20.12 |
| base | 128 | 1 | fp32 | 183.69 | 7.52 | 5.86 | 5.97 | 6.27 |
| base | 128 | 2 | fp32 | 282.95 | 9.51 | 7.31 | 7.49 | 7.83 |
| base | 128 | 4 | fp32 | 363.83 | 15.12 | 11.35 | 11.47 | 11.74 |
| base | 128 | 8 | fp32 | 449.12 | 21.65 | 18 | 18.1 | 18.6 |
| base | 384 | 1 | fp32 | 104.92 | 13.8 | 9.9 | 9.99 | 10.48 |
| base | 384 | 2 | fp32 | 123.55 | 24.21 | 16.29 | 16.4 | 17.61 |
| base | 384 | 4 | fp32 | 139.38 | 36.69 | 28.89 | 29.04 | 30.01 |
| base | 384 | 8 | fp32 | 146.28 | 64.69 | 55.09 | 55.32 | 56.3 |
| large | 128 | 1 | fp16 | 98.34 | 15.85 | 10.61 | 10.78 | 11.5 |
| large | 128 | 2 | fp16 | 172.95 | 17.8 | 11.91 | 12.08 | 12.42 |
| large | 128 | 4 | fp16 | 278.82 | 25.18 | 14.7 | 14.87 | 15.65 |
| large | 128 | 8 | fp16 | 402.28 | 30.45 | 20.21 | 20.43 | 21.24 |
| large | 384 | 1 | fp16 | 71.1 | 26.55 | 14.44 | 14.61 | 15.32 |
| large | 384 | 2 | fp16 | 100.48 | 44.04 | 20.31 | 20.48 | 21.6 |
| large | 384 | 4 | fp16 | 131.68 | 56.19 | 30.8 | 31.03 | 32.3 |
| large | 384 | 8 | fp16 | 151.81 | 81.53 | 53.22 | 53.87 | 55.34 |
| large | 128 | 1 | fp32 | 77.87 | 16.33 | 13.33 | 13.45 | 13.77 |
| large | 128 | 2 | fp32 | 105.41 | 22.77 | 19.39 | 19.52 | 19.86 |
| large | 128 | 4 | fp32 | 124.16 | 38.61 | 32.69 | 32.88 | 33.9 |
| large | 128 | 8 | fp32 | 137.69 | 64.61 | 58.62 | 58.89 | 59.94 |
| large | 384 | 1 | fp32 | 36.34 | 34.94 | 27.72 | 27.81 | 28.21 |
| large | 384 | 2 | fp32 | 41.11 | 62.54 | 49.14 | 49.32 | 50.25 |
| large | 384 | 4 | fp32 | 43.32 | 107.53 | 93.07 | 93.47 | 94.27 |
| large | 384 | 8 | fp32 | 44.64 | 196.28 | 180.21 | 180.75 | 182.41 |

Inference performance: NVIDIA DGX A100 (1x A100 40GB)
Fine-tuning inference performance for SQuAD v1.1 on DGX A100 40GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 40GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a single batch, with batches fed to the model one after another (no pipelining).

| Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 231.37 | 6.43 | 4.57 | 4.68 | 4.93 |
| base | 128 | 2 | fp16 | 454.54 | 6.77 | 4.66 | 4.77 | 4.96 |
| base | 128 | 4 | fp16 | 842.34 | 8.8 | 4.91 | 4.98 | 5.39 |
| base | 128 | 8 | fp16 | 1216.43 | 10.39 | 6.77 | 6.86 | 7.28 |
| base | 384 | 1 | fp16 | 210.59 | 9.03 | 4.83 | 4.86 | 5.06 |
| base | 384 | 2 | fp16 | 290.91 | 14.88 | 7.09 | 7.19 | 7.72 |
| base | 384 | 4 | fp16 | 407.13 | 18.04 | 9.93 | 10.05 | 10.74 |
| base | 384 | 8 | fp16 | 478.67 | 26.06 | 16.92 | 17.19 | 17.76 |
| base | 128 | 1 | tf32 | 223.38 | 6.94 | 4.73 | 4.86 | 5.04 |
| base | 128 | 2 | tf32 | 447.57 | 7.2 | 4.68 | 4.82 | 5.07 |
| base | 128 | 4 | tf32 | 838.89 | 9.16 | 4.88 | 4.93 | 5.38 |
| base | 128 | 8 | tf32 | 1201.05 | 10.81 | 6.88 | 6.99 | 7.21 |
| base | 384 | 1 | tf32 | 206.46 | 9.74 | 4.93 | 4.98 | 5.25 |
| base | 384 | 2 | tf32 | 287 | 15.57 | 7.18 | 7.27 | 7.87 |
| base | 384 | 4 | tf32 | 396.59 | 18.94 | 10.3 | 10.41 | 11.04 |
| base | 384 | 8 | tf32 | 479.04 | 26.81 | 16.88 | 17.25 | 17.74 |
| base | 128 | 1 | fp32 | 152.92 | 9.13 | 6.76 | 6.91 | 7.06 |
| base | 128 | 2 | fp32 | 297.42 | 9.51 | 6.93 | 7.07 | 7.21 |
| base | 128 | 4 | fp32 | 448.57 | 11.81 | 9.12 | 9.25 | 9.68 |
| base | 128 | 8 | fp32 | 539.94 | 17.49 | 15 | 15.1 | 15.79 |
| base | 384 | 1 | fp32 | 115.19 | 13.69 | 8.89 | 8.98 | 9.27 |
| base | 384 | 2 | fp32 | 154.66 | 18.49 | 13.06 | 13.14 | 13.89 |
| base | 384 | 4 | fp32 | 174.28 | 28.75 | 23.11 | 23.24 | 24 |
| base | 384 | 8 | fp32 | 191.97 | 48.05 | 41.85 | 42.25 | 42.8 |
| large | 128 | 1 | fp16 | 127.75 | 11.18 | 8.14 | 8.25 | 8.53 |
| large | 128 | 2 | fp16 | 219.49 | 12.76 | 9.4 | 9.54 | 9.89 |
| large | 128 | 4 | fp16 | 315.83 | 19.01 | 12.87 | 12.98 | 13.37 |
| large | 128 | 8 | fp16 | 495.75 | 22.21 | 16.33 | 16.45 | 16.79 |
| large | 384 | 1 | fp16 | 96.65 | 17.46 | 10.52 | 10.6 | 11 |
| large | 384 | 2 | fp16 | 126.07 | 29.43 | 16.09 | 16.22 | 16.78 |
| large | 384 | 4 | fp16 | 165.21 | 38.39 | 24.41 | 24.61 | 25.38 |
| large | 384 | 8 | fp16 | 182.13 | 61.04 | 44.32 | 44.61 | 45.23 |
| large | 128 | 1 | tf32 | 133.24 | 10.86 | 7.77 | 7.87 | 8.23 |
| large | 128 | 2 | tf32 | 218.13 | 12.86 | 9.44 | 9.56 | 9.85 |
| large | 128 | 4 | tf32 | 316.25 | 18.98 | 12.91 | 13.01 | 13.57 |
| large | 128 | 8 | tf32 | 495.21 | 22.25 | 16.4 | 16.51 | 17.23 |
| large | 384 | 1 | tf32 | 95.43 | 17.5 | 10.72 | 10.83 | 11.49 |
| large | 384 | 2 | tf32 | 125.99 | 29.47 | 16.06 | 16.15 | 16.67 |
| large | 384 | 4 | tf32 | 164.28 | 38.77 | 24.6 | 24.83 | 25.59 |
| large | 384 | 8 | tf32 | 182.46 | 61 | 44.2 | 44.46 | 45.15 |
| large | 128 | 1 | fp32 | 50.43 | 23.83 | 20.11 | 20.2 | 20.56 |
| large | 128 | 2 | fp32 | 94.47 | 25.53 | 21.36 | 21.49 | 21.78 |
| large | 128 | 4 | fp32 | 141.52 | 32.51 | 28.44 | 28.57 | 28.99 |
| large | 128 | 8 | fp32 | 166.37 | 52.07 | 48.3 | 48.43 | 49.46 |
| large | 384 | 1 | fp32 | 44.42 | 30.54 | 22.67 | 22.74 | 23.46 |
| large | 384 | 2 | fp32 | 50.29 | 48.74 | 39.95 | 40.06 | 40.59 |
| large | 384 | 4 | fp32 | 55.58 | 81.55 | 72.31 | 72.6 | 73.7 |
| large | 384 | 8 | fp32 | 58.38 | 147.63 | 137.43 | 137.82 | 138.3 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA Tesla T4 (1x T4 16GB)
Fine-tuning inference performance for SQuAD v1.1 on Tesla T4 16GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on an NVIDIA Tesla T4 with 1x T4 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a single batch, with batches fed to the model one after another (no pipelining).

| Model | Sequence Length | Batch Size | Precision | Throughput - Average (sent/sec) | Latency - Average (ms) | Latency - 50% (ms) | Latency - 90% (ms) | Latency - 95% (ms) | Latency - 99% (ms) | Latency - 100% (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 91.93 | 13.94 | 10.93 | 11.41 | 11.52 | 11.94 | 5491.47 |
| base | 128 | 2 | fp16 | 148.08 | 16.91 | 13.65 | 13.95 | 14.06 | 14.74 | 5757.12 |
| base | 128 | 4 | fp16 | 215.45 | 24.56 | 18.68 | 18.92 | 19.08 | 19.84 | 5894.82 |
| base | 128 | 8 | fp16 | 289.52 | 33.07 | 27.77 | 28.22 | 28.38 | 29.16 | 6074.47 |
| base | 384 | 1 | fp16 | 60.75 | 23.18 | 16.6 | 16.93 | 17.03 | 17.45 | 7006.41 |
| base | 384 | 2 | fp16 | 82.85 | 37.05 | 24.26 | 24.54 | 24.63 | 25.67 | 7529.94 |
| base | 384 | 4 | fp16 | 97.78 | 54.4 | 41.02 | 41.53 | 41.94 | 43.91 | 7995.39 |
| base | 384 | 8 | fp16 | 106.78 | 89.6 | 74.98 | 75.5 | 76.13 | 78.02 | 8461.93 |
| base | 128 | 1 | fp32 | 54.28 | 20.88 | 18.52 | 18.8 | 18.92 | 19.29 | 4401.4 |
| base | 128 | 2 | fp32 | 71.75 | 30.57 | 28.08 | 28.51 | 28.62 | 29.12 | 4573.47 |
| base | 128 | 4 | fp32 | 88.01 | 50.37 | 45.61 | 45.94 | 46.14 | 47.04 | 4992.7 |
| base | 128 | 8 | fp32 | 98.92 | 85.57 | 80.98 | 81.44 | 81.74 | 82.75 | 5408.97 |
| base | 384 | 1 | fp32 | 25.83 | 43.63 | 38.75 | 39.33 | 39.43 | 40.02 | 5148.45 |
| base | 384 | 2 | fp32 | 29.08 | 77.68 | 68.89 | 69.26 | 69.55 | 72.08 | 5462.5 |
| base | 384 | 4 | fp32 | 30.33 | 141.45 | 131.86 | 132.53 | 133.14 | 136.7 | 5975.63 |
| base | 384 | 8 | fp32 | 31.8 | 262.88 | 251.62 | 252.23 | 253.08 | 255.56 | 7124 |
| large | 128 | 1 | fp16 | 40.31 | 30.61 | 25.14 | 25.62 | 25.87 | 27.61 | 10395.87 |
| large | 128 | 2 | fp16 | 63.79 | 37.43 | 31.66 | 32.31 | 32.66 | 34.36 | 10302.2 |
| large | 128 | 4 | fp16 | 87.4 | 56.5 | 45.97 | 46.6 | 47.01 | 48.71 | 10391.17 |
| large | 128 | 8 | fp16 | 107.5 | 84.29 | 74.59 | 75.25 | 75.64 | 77.73 | 10945.1 |
| large | 384 | 1 | fp16 | 23.05 | 55.73 | 43.72 | 44.28 | 44.74 | 46.8 | 12889.23 |
| large | 384 | 2 | fp16 | 29.59 | 91.61 | 67.94 | 68.8 | 69.45 | 71.64 | 13876.35 |
| large | 384 | 4 | fp16 | 34.27 | 141.56 | 116.67 | 118.02 | 119.1 | 122.1 | 14570.73 |
| large | 384 | 8 | fp16 | 38.29 | 237.85 | 208.95 | 210.08 | 211.33 | 214.61 | 16626.02 |
| large | 128 | 1 | fp32 | 21.52 | 50.46 | 46.48 | 47.63 | 47.94 | 49.63 | 7150.38 |
| large | 128 | 2 | fp32 | 25.4 | 83.3 | 79.06 | 79.61 | 80.06 | 81.77 | 7763.11 |
| large | 128 | 4 | fp32 | 28.19 | 149.49 | 142.15 | 143.1 | 143.65 | 145.43 | 7701.38 |
| large | 128 | 8 | fp32 | 30.14 | 272.84 | 265.6 | 266.57 | 267.21 | 269.37 | 8246.3 |
| large | 384 | 1 | fp32 | 8.46 | 126.81 | 118.44 | 119.42 | 120.31 | 122.74 | 9007.96 |
| large | 384 | 2 | fp32 | 9.29 | 231 | 215.54 | 216.64 | 217.71 | 220.35 | 9755.69 |
| large | 384 | 4 | fp32 | 9.55 | 436.5 | 418.71 | 420.05 | 421.27 | 424.3 | 11766.45 |
| large | 384 | 8 | fp32 | 9.75 | 840.9 | 820.39 | 822.19 | 823.69 | 827.99 | 12856.99 |

To achieve these same results, follow the Quick Start Guide outlined above.