BERT for TensorFlow

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Both of these benchmarking scripts enable you to run a number of epochs, extract performance numbers, and run the BERT model for fine tuning.

Training performance benchmark

Training benchmarking can be performed by running the script:

scripts/finetune_train_benchmark.sh <bert_model> <use_xla> <num_gpu> squad

This script runs 2 epochs by default on the SQuAD v1.1 dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32. These numbers are saved at /results/squad_train_benchmark_bert_<bert_model>_gpu_<num_gpu>.log.
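As a minimal sketch, the argument order and log-file naming described above can be captured in two small helpers. The helper names are hypothetical, and the example values `"large"` and `"true"` for `<bert_model>` and `<use_xla>` are assumptions about the accepted spellings, not confirmed by this document.

```python
def train_benchmark_command(bert_model, use_xla, num_gpu, task="squad"):
    """Build the argument list for the training benchmark script.

    The positional order follows the usage line in the text:
    <bert_model> <use_xla> <num_gpu> squad
    """
    return [
        "scripts/finetune_train_benchmark.sh",
        str(bert_model),
        str(use_xla),
        str(num_gpu),
        task,
    ]


def train_benchmark_log(bert_model, num_gpu):
    """Return the log path pattern stated in the text, filled in."""
    return f"/results/squad_train_benchmark_bert_{bert_model}_gpu_{num_gpu}.log"


# Assumed example values; check the script for the spellings it accepts.
cmd = train_benchmark_command("large", "true", 8)
log_path = train_benchmark_log("large", 8)
print(" ".join(cmd))
print(log_path)
```

The command is only assembled here, not executed, so the sketch can be adapted to whatever launcher is used inside the container.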

Inference performance benchmark

Inference benchmarking can be performed by running the script:

scripts/finetune_inference_benchmark.sh squad

This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32, for base and large models. These numbers are saved at /results/squad_inference_benchmark_bert_<bert_model>.log.

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference, both for pre-training using the LAMB optimizer and for fine-tuning for Question Answering. All results are on the BERT-large model unless otherwise mentioned. All fine-tuning results are on SQuAD v1.1 using a sequence length of 384 unless otherwise mentioned.

Training accuracy results

Training accuracy
Pre-training accuracy

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container.

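| DGX System | Nodes x GPUs | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
|---|---|---|---|---|---|---|
| DGX2H | 32 x 16 | FP16 | 64, 8 | 2, 8 | 2.63 | 1.59 |
| DGX2H | 32 x 16 | FP32 | 32, 8 | 4, 8 | 8.48 | 1.56 |
| DGXA100 | 32 x 8 | FP16 | 64, 16 | 4, 8 | 3.24 | 1.56 |
| DGXA100 | 32 x 8 | TF32 | 64, 8 | 4, 16 | 4.58 | 1.58 |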

Note: Time to train includes up to 16 minutes of start-up time for every restart (at least once for each phase). Experiments were run on clusters with a maximum wall clock time of 8 hours.

Fine-tuning accuracy for SQuAD v1.1: NVIDIA DGX A100 (8x A100 40G)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs.

| GPUs | Batch size / GPU: TF32, FP16 | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) |
|---|---|---|---|---|---|
| 8 | 16, 24 | 91.41 | 91.52 | 0.26 | 0.26 |

Fine-tuning accuracy for GLUE MRPC: NVIDIA DGX A100 (8x A100 40G)

Our results were obtained by running the scripts/run_glue.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs for 10 different seeds and picking the maximum accuracy on the MRPC dev set.

| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to Train - TF32 (Hrs) | Time to Train - mixed precision (Hrs) | Throughput - TF32 | Throughput - mixed precision |
|---|---|---|---|---|---|---|---|
| 8 | 16 | 87.99 | 87.09 | 0.009 | 0.009 | 357.91 | 230.16 |

Training stability test
Pre-training SQuAD v1.1 stability test: NVIDIA DGX A100 (256x A100 40GB)

The following tables compare the final loss across 2 training runs with different seeds, for both FP16 and TF32. Both runs converge consistently, with very little deviation.

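| FP16, 256x GPUs | seed 1 | seed 2 | mean | std |
|---|---|---|---|---|
| Final Loss | 1.570 | 1.561 | 1.565 | 0.006 |

| TF32, 256x GPUs | seed 1 | seed 2 | mean | std |
|---|---|---|---|---|
| Final Loss | 1.583 | 1.582 | 1.582 | 0.0007 |

Fine-tuning SQuAD v1.1 stability test: NVIDIA DGX A100 (8x A100 40GB)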

The following tables compare F1 scores across 5 training runs with different seeds, for both FP16 and TF32, using [NVIDIA's Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs converge consistently on all 5 seeds, with very little deviation.

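| FP16, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 | 91.61 | 91.04 | 91.59 | 91.32 | 91.52 | 91.41 | 0.24 |

| TF32, 8x GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 | 91.50 | 91.49 | 91.64 | 91.29 | 91.67 | 91.52 | 0.15 |

Fine-tuning GLUE MRPC stability test: NVIDIA DGX A100 (8x A100 40GB)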

The following tables compare evaluation accuracy across 10 training runs with different seeds, for both FP16 and TF32, using [NVIDIA's Pretrained Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs converge consistently on all 10 seeds, with very little deviation.

| FP16, 8 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | seed 6 | seed 7 | seed 8 | seed 9 | seed 10 | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eval Accuracy | 84.31372643 | 85.78431606 | 86.76471114 | 87.00980544 | 86.27451062 | 86.27451062 | 85.5392158 | 86.51961088 | 86.27451062 | 85.2941215 | 86.00490391 | 0.795887906 |

| TF32, 8 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | seed 6 | seed 7 | seed 8 | seed 9 | seed 10 | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eval Accuracy | 87.00980544 | 86.27451062 | 87.99020052 | 86.27451062 | 86.02941632 | 87.00980544 | 86.27451062 | 86.51961088 | 87.74510026 | 86.02941632 | 86.7156887 | 0.7009024515 |
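The mean and std columns in the stability tables summarize the per-seed scores. The sketch below recomputes them for the SQuAD v1.1 F1 scores listed above, assuming the std column is the sample standard deviation (n - 1 in the denominator). That assumption is inferred from the numbers rather than stated in the source, and the results agree with the published columns to within rounding.

```python
import statistics

# Per-seed F1 scores copied from the SQuAD v1.1 stability tables.
f1_by_precision = {
    "FP16": [91.61, 91.04, 91.59, 91.32, 91.52],
    "TF32": [91.50, 91.49, 91.64, 91.29, 91.67],
}


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)


summary = {name: summarize(scores) for name, scores in f1_by_precision.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.3f} std={std:.3f}")
```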

Training performance results

Training performance: NVIDIA DGX-1 (8x V100 16GB)
Pre-training training performance: single-node on DGX-1 16GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 16, 8 | 4096, 8192 | 65536 | 134.34 | 39.43 | 3.41 | 1.00 | 1.00 |
| 4 | 128 | 16, 8 | 1024, 2048 | 65536 | 449.68 | 152.33 | 2.95 | 3.35 | 3.86 |
| 8 | 128 | 16, 8 | 512, 1024 | 65536 | 1001.39 | 285.79 | 3.50 | 7.45 | 7.25 |
| 1 | 512 | 4, 2 | 8192, 16384 | 32768 | 28.72 | 9.80 | 2.93 | 1.00 | 1.00 |
| 4 | 512 | 4, 2 | 2048, 4096 | 32768 | 109.96 | 35.32 | 3.11 | 3.83 | 3.60 |
| 8 | 512 | 4, 2 | 1024, 2048 | 32768 | 190.65 | 69.53 | 2.74 | 6.64 | 7.09 |

Note: FP32 results at the mixed-precision batch sizes (16 for sequence length 128 and 4 for sequence length 512) are not available because those runs fail with out-of-memory errors.
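The Global Batch Size column is consistent with the usual relationship global batch = GPUs x batch size per GPU x gradient accumulation steps. The source does not spell this formula out, so treat it as an assumption; the sketch below simply checks it against the DGX-1 16GB pre-training rows above.

```python
# (gpus, seq_len, batch_mixed, accum_mixed, batch_fp32, accum_fp32, global_batch)
# Values copied from the DGX-1 16GB pre-training table.
rows = [
    (1, 128, 16, 4096, 8, 8192, 65536),
    (4, 128, 16, 1024, 8, 2048, 65536),
    (8, 128, 16, 512, 8, 1024, 65536),
    (1, 512, 4, 8192, 2, 16384, 32768),
    (4, 512, 4, 2048, 2, 4096, 32768),
    (8, 512, 4, 1024, 2, 2048, 32768),
]


def global_batch_size(gpus, batch_per_gpu, accumulation_steps):
    """Samples contributing to one optimizer step across all GPUs."""
    return gpus * batch_per_gpu * accumulation_steps


# True for each row when both precisions reproduce the table's global batch size.
checks = [
    global_batch_size(g, bm, am) == gb and global_batch_size(g, bf, af) == gb
    for g, _seq, bm, am, bf, af, gb in rows
]
print(all(checks))
```

Halving the per-GPU batch size for FP32 is offset by doubling the accumulation steps, which is why both precisions reach the same global batch size in every row.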

Fine-tuning training performance for SQuAD v1.1 on DGX-1 16GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|
| 1 | 4, 2 | 29.74 | 7.36 | 4.04 | 1.00 | 1.00 |
| 4 | 4, 2 | 97.28 | 26.64 | 3.65 | 3.27 | 3.62 |
| 8 | 4, 2 | 189.77 | 52.39 | 3.62 | 6.38 | 7.12 |

Note: FP32 results at a batch size of 4 are not available because those runs fail with out-of-memory errors.
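The derived columns in the training performance tables can be recomputed from the throughput columns. The sketch below assumes that throughput speedup is mixed-precision throughput divided by FP32 throughput, and that weak scaling is the throughput at N GPUs divided by the single-GPU throughput at the same precision. Both definitions are inferred from the numbers, and the input values are the DGX-1 16GB SQuAD fine-tuning throughputs above.

```python
# gpus -> (mixed-precision throughput, FP32 throughput), in sentences per second.
throughput = {
    1: (29.74, 7.36),
    4: (97.28, 26.64),
    8: (189.77, 52.39),
}

base_mixed, base_fp32 = throughput[1]

derived = {}
for gpus, (mixed, fp32) in throughput.items():
    derived[gpus] = {
        "speedup": round(mixed / fp32, 2),                  # mixed vs. FP32
        "scaling_mixed": round(mixed / base_mixed, 2),      # vs. 1 GPU, mixed
        "scaling_fp32": round(fp32 / base_fp32, 2),         # vs. 1 GPU, FP32
    }

for gpus, values in derived.items():
    print(gpus, values)
```

Because each scaling ratio uses its own precision's single-GPU throughput as the baseline, it shows directly which throughput column a weak-scaling figure was derived from.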

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 32GB)
Pre-training training performance: single-node on DGX-1 32GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 1024, 2048 | 65536 | 168.63 | 46.78 | 3.60 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 256, 512 | 65536 | 730.25 | 179.73 | 4.06 | 4.33 | 3.84 |
| 8 | 128 | 64, 32 | 128, 256 | 65536 | 1443.05 | 357.00 | 4.04 | 8.56 | 7.63 |
| 1 | 512 | 8, 8 | 4096, 4096 | 32768 | 31.23 | 10.67 | 2.93 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 1024, 1024 | 32768 | 118.84 | 39.55 | 3.00 | 3.81 | 3.71 |
| 8 | 512 | 8, 8 | 512, 512 | 32768 | 255.64 | 81.42 | 3.14 | 8.19 | 7.63 |

Note: FP32 results at a batch size of 64 for sequence length 128 are not available because those runs fail with out-of-memory errors.

Fine-tuning training performance for SQuAD v1.1 on DGX-1 32GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|
| 1 | 24, 10 | 51.02 | 10.42 | 4.90 | 1.00 | 1.00 |
| 4 | 24, 10 | 181.37 | 39.77 | 4.56 | 3.55 | 3.82 |
| 8 | 24, 10 | 314.6 | 79.37 | 3.96 | 6.17 | 7.62 |

Note: FP32 results at a batch size of 24 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-2 (16x V100 32GB)
Pre-training training performance: single-node on DGX-2 32GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 1024, 8192 | 65536 | 188.04 | 35.32 | 5.32 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 256, 2048 | 65536 | 790.89 | 193.08 | 4.10 | 4.21 | 5.47 |
| 8 | 128 | 64, 32 | 128, 1024 | 65536 | 1556.89 | 386.89 | 4.02 | 8.28 | 10.95 |
| 16 | 128 | 64, 32 | 64, 128 | 65536 | 3081.69 | 761.92 | 4.04 | 16.39 | 21.57 |
| 1 | 512 | 8, 8 | 4096, 4096 | 32768 | 35.32 | 11.67 | 3.03 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 1024, 1024 | 32768 | 128.98 | 42.84 | 3.01 | 3.65 | 3.67 |
| 8 | 512 | 8, 8 | 512, 512 | 32768 | 274.04 | 86.78 | 3.16 | 7.76 | 7.44 |
| 16 | 512 | 8, 8 | 256, 256 | 32768 | 513.43 | 173.26 | 2.96 | 14.54 | 14.85 |

Note: FP32 results at a batch size of 64 for sequence length 128 are not available because those runs fail with out-of-memory errors.

Pre-training training performance: multi-node on DGX-2H 32GB

Our results were obtained by running the run.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| Num Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Gradient Accumulation: mixed precision, FP32 | Global Batch Size | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 64, 128 | 65536 | 3081.69 | 761.92 | 4.04 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 16, 32 | 65536 | 13192.00 | 3389.83 | 3.89 | 4.28 | 4.45 |
| 16 | 128 | 64, 32 | 4, 8 | 65536 | 48223.00 | 13217.78 | 3.65 | 15.65 | 17.35 |
| 32 | 128 | 64, 32 | 2, 4 | 65536 | 86673.64 | 25142.26 | 3.45 | 28.13 | 33.00 |
| 1 | 512 | 8, 8 | 256, 256 | 32768 | 577.79 | 173.26 | 3.33 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 64, 64 | 32768 | 2284.23 | 765.04 | 2.99 | 3.95 | 4.42 |
| 16 | 512 | 8, 8 | 16, 16 | 32768 | 8853.00 | 3001.43 | 2.95 | 15.32 | 17.32 |
| 32 | 512 | 8, 8 | 8, 8 | 32768 | 17059.00 | 5893.14 | 2.89 | 29.52 | 34.01 |

Note: FP32 results at a batch size of 64 for sequence length 128 are not available because those runs fail with out-of-memory errors.

Fine-tuning training performance for SQuAD v1.1 on DGX-2 32GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|
| 1 | 24, 10 | 55.28 | 11.15 | 4.96 | 1.00 | 1.00 |
| 4 | 24, 10 | 199.53 | 42.91 | 4.65 | 3.61 | 3.85 |
| 8 | 24, 10 | 341.55 | 85.08 | 4.01 | 6.18 | 7.63 |
| 16 | 24, 10 | 683.37 | 156.29 | 4.37 | 12.36 | 14.02 |

Note: FP32 results at a batch size of 24 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX A100 (8x A100 40GB)
Pre-training training performance: single-node on DGX A100 40GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 64 | 1024, 1024 | 65536 | 356.845 | 238.10 | 1.50 | 1.00 | 1.00 |
| 4 | 128 | 64, 64 | 256, 256 | 65536 | 1422.25 | 952.39 | 1.49 | 3.99 | 4.00 |
| 8 | 128 | 64, 64 | 128, 128 | 65536 | 2871.89 | 1889.71 | 1.52 | 8.05 | 7.94 |
| 1 | 512 | 16, 8 | 2048, 4096 | 32768 | 70.856 | 39.96 | 1.77 | 1.00 | 1.00 |
| 4 | 512 | 16, 8 | 512, 1024 | 32768 | 284.912 | 160.16 | 1.78 | 4.02 | 4.01 |
| 8 | 512 | 16, 8 | 256, 512 | 32768 | 572.112 | 316.51 | 1.81 | 8.07 | 7.92 |

Note: TF32 results at a batch size of 16 for sequence length 512 are not available because those runs fail with out-of-memory errors.

Pre-training training performance: multi-node on DGX A100 40GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

| Num Nodes | Sequence Length | Batch size / GPU: mixed precision, TF32 | Gradient Accumulation: mixed precision, TF32 | Global Batch Size | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 64 | 128, 128 | 65536 | 2871.89 | 1889.71 | 1.52 | 1.00 | 1.00 |
| 4 | 128 | 64, 64 | 32, 32 | 65536 | 11159 | 7532.00 | 1.48 | 3.89 | 3.99 |
| 16 | 128 | 64, 64 | 8, 8 | 65536 | 41144 | 28605.62 | 1.44 | 14.33 | 15.14 |
| 32 | 128 | 64, 64 | 4, 4 | 65536 | 77479.87 | 53585.82 | 1.45 | 26.98 | 28.36 |
| 1 | 512 | 16, 8 | 256, 512 | 32768 | 572.112 | 316.51 | 1.81 | 1.00 | 1.00 |
| 4 | 512 | 16, 8 | 128, 128 | 65536 | 2197.44 | 1268.43 | 1.73 | 3.84 | 4.01 |
| 16 | 512 | 16, 8 | 32, 32 | 65536 | 8723.1 | 4903.39 | 1.78 | 15.25 | 15.49 |
| 32 | 512 | 16, 8 | 16, 16 | 65536 | 16705 | 9463.80 | 1.77 | 29.20 | 29.90 |

Note: TF32 results at a batch size of 16 for sequence length 512 are not available because those runs fail with out-of-memory errors.

Fine-tuning training performance for SQuAD v1.1 on DGX A100 40GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

| GPUs | Batch size / GPU: mixed precision, TF32 | Throughput - mixed precision | Throughput - TF32 | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 32, 16 | 102.26 | 61.364 | 1.67 | 1.00 | 1.00 |
| 4 | 32, 16 | 366.353 | 223.187 | 1.64 | 3.64 | 3.58 |
| 8 | 32, 16 | 767.071 | 440.47 | 1.74 | 7.18 | 7.50 |

Note: TF32 results at a batch size of 32 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance results

Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Fine-tuning inference performance for SQuAD v1.1 on 16GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is the time taken to process one batch when batches are fed into the model one after another, i.e., without pipelining.

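| Model | Sequence Length | Batch Size | Precision | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 206.82 | 7.96 | 4.98 | 5.04 | 5.23 |
| base | 128 | 2 | fp16 | 376.75 | 8.68 | 5.42 | 5.49 | 5.64 |
| base | 128 | 4 | fp16 | 635 | 12.31 | 6.46 | 6.55 | 6.83 |
| base | 128 | 8 | fp16 | 962.83 | 13.64 | 8.47 | 8.56 | 8.75 |
| base | 384 | 1 | fp16 | 167.01 | 12.77 | 6.12 | 6.23 | 6.52 |
| base | 384 | 2 | fp16 | 252.12 | 21.05 | 8.03 | 8.09 | 8.61 |
| base | 384 | 4 | fp16 | 341.95 | 25.09 | 11.88 | 11.96 | 12.52 |
| base | 384 | 8 | fp16 | 421.26 | 33.16 | 19.2 | 19.37 | 19.91 |
| base | 128 | 1 | fp32 | 174.48 | 8.17 | 5.89 | 5.95 | 6.12 |
| base | 128 | 2 | fp32 | 263.67 | 10.33 | 7.66 | 7.69 | 7.92 |
| base | 128 | 4 | fp32 | 349.34 | 16.31 | 11.57 | 11.62 | 11.87 |
| base | 128 | 8 | fp32 | 422.88 | 23.27 | 19.23 | 19.38 | 20.38 |
| base | 384 | 1 | fp32 | 99.52 | 14.99 | 10.19 | 10.23 | 10.78 |
| base | 384 | 2 | fp32 | 118.01 | 25.98 | 17.12 | 17.18 | 17.78 |
| base | 384 | 4 | fp32 | 128.1 | 41 | 31.56 | 31.7 | 32.39 |
| base | 384 | 8 | fp32 | 136.1 | 69.77 | 59.44 | 59.66 | 60.51 |
| large | 128 | 1 | fp16 | 98.63 | 15.86 | 10.27 | 10.31 | 10.46 |
| large | 128 | 2 | fp16 | 172.59 | 17.78 | 11.81 | 11.86 | 12.13 |
| large | 128 | 4 | fp16 | 272.86 | 25.66 | 14.86 | 14.94 | 15.18 |
| large | 128 | 8 | fp16 | 385.64 | 30.74 | 20.98 | 21.1 | 21.68 |
| large | 384 | 1 | fp16 | 70.74 | 26.85 | 14.38 | 14.47 | 14.7 |
| large | 384 | 2 | fp16 | 99.9 | 45.29 | 20.26 | 20.43 | 21.11 |
| large | 384 | 4 | fp16 | 128.42 | 56.94 | 31.44 | 31.71 | 32.45 |
| large | 384 | 8 | fp16 | 148.57 | 81.69 | 54.23 | 54.54 | 55.53 |
| large | 128 | 1 | fp32 | 76.75 | 17.06 | 13.21 | 13.27 | 13.4 |
| large | 128 | 2 | fp32 | 100.82 | 24.34 | 20.05 | 20.13 | 21.13 |
| large | 128 | 4 | fp32 | 117.59 | 41.76 | 34.42 | 34.55 | 35.29 |
| large | 128 | 8 | fp32 | 130.42 | 68.59 | 62 | 62.23 | 62.98 |
| large | 384 | 1 | fp32 | 33.95 | 37.89 | 29.82 | 29.98 | 30.56 |
| large | 384 | 2 | fp32 | 38.47 | 68.35 | 52.56 | 52.74 | 53.89 |
| large | 384 | 4 | fp32 | 41.11 | 114.27 | 98.19 | 98.54 | 99.54 |
| large | 384 | 8 | fp32 | 41.32 | 213.84 | 194.92 | 195.36 | 196.94 |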
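The latency columns above are summary statistics over per-batch timings. The following is a minimal sketch of how an average and 90th/95th/99th percentile latencies could be derived from a list of per-batch latencies. It uses a nearest-rank percentile and synthetic data, both of which are assumptions; the benchmark script may compute these differently. The synthetic data includes one slow first batch purely to show how a single outlier can pull the average above the high percentiles.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms):
    """Return average and tail latencies (ms) for per-batch timings."""
    ordered = sorted(latencies_ms)
    return {
        "latency_avg_ms": statistics.mean(ordered),
        "latency_90_ms": percentile(ordered, 90),
        "latency_95_ms": percentile(ordered, 95),
        "latency_99_ms": percentile(ordered, 99),
    }


# Synthetic timings (assumed): one slow first batch, then 1023 steady batches.
latencies = [5000.0] + [5.0] * 1023
stats = summarize_latencies(latencies)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```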

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-1 (1x V100 32GB)
Fine-tuning inference performance for SQuAD v1.1 on 32GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is the time taken to process one batch when batches are fed into the model one after another, i.e., without pipelining.

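| Model | Sequence Length | Batch Size | Precision | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 207.87 | 7.63 | 4.94 | 5.03 | 5.32 |
| base | 128 | 2 | fp16 | 376.44 | 8.47 | 5.44 | 5.5 | 5.68 |
| base | 128 | 4 | fp16 | 642.55 | 11.63 | 6.3 | 6.36 | 6.68 |
| base | 128 | 8 | fp16 | 943.85 | 13.24 | 8.56 | 8.68 | 8.92 |
| base | 384 | 1 | fp16 | 162.62 | 12.24 | 6.31 | 6.4 | 6.73 |
| base | 384 | 2 | fp16 | 244.15 | 20.05 | 8.34 | 8.41 | 8.93 |
| base | 384 | 4 | fp16 | 338.68 | 23.53 | 11.88 | 11.92 | 12.63 |
| base | 384 | 8 | fp16 | 407.46 | 32.72 | 19.84 | 20.06 | 20.89 |
| base | 128 | 1 | fp32 | 175.16 | 8.31 | 5.85 | 5.89 | 6.04 |
| base | 128 | 2 | fp32 | 261.31 | 10.48 | 7.75 | 7.81 | 8.08 |
| base | 128 | 4 | fp32 | 339.45 | 16.67 | 11.95 | 12.02 | 12.46 |
| base | 128 | 8 | fp32 | 406.67 | 24.12 | 19.86 | 19.97 | 20.41 |
| base | 384 | 1 | fp32 | 98.33 | 15.28 | 10.27 | 10.32 | 10.76 |
| base | 384 | 2 | fp32 | 114.92 | 26.88 | 17.55 | 17.59 | 18.29 |
| base | 384 | 4 | fp32 | 125.76 | 41.74 | 32.06 | 32.23 | 33.72 |
| base | 384 | 8 | fp32 | 136.62 | 69.78 | 58.95 | 59.19 | 60 |
| large | 128 | 1 | fp16 | 96.46 | 15.56 | 10.56 | 10.66 | 11.02 |
| large | 128 | 2 | fp16 | 168.31 | 17.42 | 12.11 | 12.25 | 12.57 |
| large | 128 | 4 | fp16 | 267.76 | 24.76 | 15.17 | 15.36 | 16.68 |
| large | 128 | 8 | fp16 | 378.28 | 30.34 | 21.39 | 21.54 | 21.97 |
| large | 384 | 1 | fp16 | 68.75 | 26.02 | 14.77 | 14.94 | 15.3 |
| large | 384 | 2 | fp16 | 95.41 | 44.01 | 21.24 | 21.47 | 22.01 |
| large | 384 | 4 | fp16 | 124.43 | 55.14 | 32.53 | 32.83 | 33.58 |
| large | 384 | 8 | fp16 | 143.02 | 81.37 | 56.51 | 56.88 | 58.05 |
| large | 128 | 1 | fp32 | 75.34 | 17.5 | 13.46 | 13.52 | 13.7 |
| large | 128 | 2 | fp32 | 99.73 | 24.7 | 20.27 | 20.38 | 21.45 |
| large | 128 | 4 | fp32 | 116.92 | 42.1 | 34.49 | 34.59 | 34.98 |
| large | 128 | 8 | fp32 | 130.11 | 68.95 | 62.03 | 62.23 | 63.3 |
| large | 384 | 1 | fp32 | 33.84 | 38.15 | 29.75 | 29.89 | 31.23 |
| large | 384 | 2 | fp32 | 38.02 | 69.31 | 53.1 | 53.36 | 54.42 |
| large | 384 | 4 | fp32 | 41.2 | 114.34 | 97.96 | 98.32 | 99.55 |
| large | 384 | 8 | fp32 | 42.37 | 209.16 | 190.18 | 190.66 | 192.77 |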

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-2 (1x V100 32GB)
Fine-tuning inference performance for SQuAD v1.1 on DGX-2 32GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is the time taken to process one batch when batches are fed into the model one after another, i.e., without pipelining.

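| Model | Sequence Length | Batch Size | Precision | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 220.35 | 7.82 | 4.7 | 4.83 | 5.15 |
| base | 128 | 2 | fp16 | 384.55 | 8.7 | 5.49 | 5.68 | 6.01 |
| base | 128 | 4 | fp16 | 650.7 | 36.3 | 6.35 | 6.51 | 6.87 |
| base | 128 | 8 | fp16 | 992.41 | 13.59 | 8.22 | 8.37 | 8.96 |
| base | 384 | 1 | fp16 | 172.89 | 12.86 | 5.94 | 6.04 | 6.44 |
| base | 384 | 2 | fp16 | 258.48 | 20.42 | 7.89 | 8.09 | 9.15 |
| base | 384 | 4 | fp16 | 346.34 | 24.93 | 11.97 | 12.12 | 12.76 |
| base | 384 | 8 | fp16 | 430.4 | 33.08 | 18.75 | 19.27 | 20.12 |
| base | 128 | 1 | fp32 | 183.69 | 7.52 | 5.86 | 5.97 | 6.27 |
| base | 128 | 2 | fp32 | 282.95 | 9.51 | 7.31 | 7.49 | 7.83 |
| base | 128 | 4 | fp32 | 363.83 | 15.12 | 11.35 | 11.47 | 11.74 |
| base | 128 | 8 | fp32 | 449.12 | 21.65 | 18 | 18.1 | 18.6 |
| base | 384 | 1 | fp32 | 104.92 | 13.8 | 9.9 | 9.99 | 10.48 |
| base | 384 | 2 | fp32 | 123.55 | 24.21 | 16.29 | 16.4 | 17.61 |
| base | 384 | 4 | fp32 | 139.38 | 36.69 | 28.89 | 29.04 | 30.01 |
| base | 384 | 8 | fp32 | 146.28 | 64.69 | 55.09 | 55.32 | 56.3 |
| large | 128 | 1 | fp16 | 98.34 | 15.85 | 10.61 | 10.78 | 11.5 |
| large | 128 | 2 | fp16 | 172.95 | 17.8 | 11.91 | 12.08 | 12.42 |
| large | 128 | 4 | fp16 | 278.82 | 25.18 | 14.7 | 14.87 | 15.65 |
| large | 128 | 8 | fp16 | 402.28 | 30.45 | 20.21 | 20.43 | 21.24 |
| large | 384 | 1 | fp16 | 71.1 | 26.55 | 14.44 | 14.61 | 15.32 |
| large | 384 | 2 | fp16 | 100.48 | 44.04 | 20.31 | 20.48 | 21.6 |
| large | 384 | 4 | fp16 | 131.68 | 56.19 | 30.8 | 31.03 | 32.3 |
| large | 384 | 8 | fp16 | 151.81 | 81.53 | 53.22 | 53.87 | 55.34 |
| large | 128 | 1 | fp32 | 77.87 | 16.33 | 13.33 | 13.45 | 13.77 |
| large | 128 | 2 | fp32 | 105.41 | 22.77 | 19.39 | 19.52 | 19.86 |
| large | 128 | 4 | fp32 | 124.16 | 38.61 | 32.69 | 32.88 | 33.9 |
| large | 128 | 8 | fp32 | 137.69 | 64.61 | 58.62 | 58.89 | 59.94 |
| large | 384 | 1 | fp32 | 36.34 | 34.94 | 27.72 | 27.81 | 28.21 |
| large | 384 | 2 | fp32 | 41.11 | 62.54 | 49.14 | 49.32 | 50.25 |
| large | 384 | 4 | fp32 | 43.32 | 107.53 | 93.07 | 93.47 | 94.27 |
| large | 384 | 8 | fp32 | 44.64 | 196.28 | 180.21 | 180.75 | 182.41 |

Inference performance: NVIDIA DGX A100 (1x A100 40GB)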
Fine-tuning inference performance for SQuAD v1.1 on DGX A100 40GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 40GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is the time taken to process one batch when batches are fed into the model one after another, i.e., without pipelining.

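| Model | Sequence Length | Batch Size | Precision | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 231.37 | 6.43 | 4.57 | 4.68 | 4.93 |
| base | 128 | 2 | fp16 | 454.54 | 6.77 | 4.66 | 4.77 | 4.96 |
| base | 128 | 4 | fp16 | 842.34 | 8.8 | 4.91 | 4.98 | 5.39 |
| base | 128 | 8 | fp16 | 1216.43 | 10.39 | 6.77 | 6.86 | 7.28 |
| base | 384 | 1 | fp16 | 210.59 | 9.03 | 4.83 | 4.86 | 5.06 |
| base | 384 | 2 | fp16 | 290.91 | 14.88 | 7.09 | 7.19 | 7.72 |
| base | 384 | 4 | fp16 | 407.13 | 18.04 | 9.93 | 10.05 | 10.74 |
| base | 384 | 8 | fp16 | 478.67 | 26.06 | 16.92 | 17.19 | 17.76 |
| base | 128 | 1 | tf32 | 223.38 | 6.94 | 4.73 | 4.86 | 5.04 |
| base | 128 | 2 | tf32 | 447.57 | 7.2 | 4.68 | 4.82 | 5.07 |
| base | 128 | 4 | tf32 | 838.89 | 9.16 | 4.88 | 4.93 | 5.38 |
| base | 128 | 8 | tf32 | 1201.05 | 10.81 | 6.88 | 6.99 | 7.21 |
| base | 384 | 1 | tf32 | 206.46 | 9.74 | 4.93 | 4.98 | 5.25 |
| base | 384 | 2 | tf32 | 287 | 15.57 | 7.18 | 7.27 | 7.87 |
| base | 384 | 4 | tf32 | 396.59 | 18.94 | 10.3 | 10.41 | 11.04 |
| base | 384 | 8 | tf32 | 479.04 | 26.81 | 16.88 | 17.25 | 17.74 |
| base | 128 | 1 | fp32 | 152.92 | 9.13 | 6.76 | 6.91 | 7.06 |
| base | 128 | 2 | fp32 | 297.42 | 9.51 | 6.93 | 7.07 | 7.21 |
| base | 128 | 4 | fp32 | 448.57 | 11.81 | 9.12 | 9.25 | 9.68 |
| base | 128 | 8 | fp32 | 539.94 | 17.49 | 15 | 15.1 | 15.79 |
| base | 384 | 1 | fp32 | 115.19 | 13.69 | 8.89 | 8.98 | 9.27 |
| base | 384 | 2 | fp32 | 154.66 | 18.49 | 13.06 | 13.14 | 13.89 |
| base | 384 | 4 | fp32 | 174.28 | 28.75 | 23.11 | 23.24 | 24 |
| base | 384 | 8 | fp32 | 191.97 | 48.05 | 41.85 | 42.25 | 42.8 |
| large | 128 | 1 | fp16 | 127.75 | 11.18 | 8.14 | 8.25 | 8.53 |
| large | 128 | 2 | fp16 | 219.49 | 12.76 | 9.4 | 9.54 | 9.89 |
| large | 128 | 4 | fp16 | 315.83 | 19.01 | 12.87 | 12.98 | 13.37 |
| large | 128 | 8 | fp16 | 495.75 | 22.21 | 16.33 | 16.45 | 16.79 |
| large | 384 | 1 | fp16 | 96.65 | 17.46 | 10.52 | 10.6 | 11 |
| large | 384 | 2 | fp16 | 126.07 | 29.43 | 16.09 | 16.22 | 16.78 |
| large | 384 | 4 | fp16 | 165.21 | 38.39 | 24.41 | 24.61 | 25.38 |
| large | 384 | 8 | fp16 | 182.13 | 61.04 | 44.32 | 44.61 | 45.23 |
| large | 128 | 1 | tf32 | 133.24 | 10.86 | 7.77 | 7.87 | 8.23 |
| large | 128 | 2 | tf32 | 218.13 | 12.86 | 9.44 | 9.56 | 9.85 |
| large | 128 | 4 | tf32 | 316.25 | 18.98 | 12.91 | 13.01 | 13.57 |
| large | 128 | 8 | tf32 | 495.21 | 22.25 | 16.4 | 16.51 | 17.23 |
| large | 384 | 1 | tf32 | 95.43 | 17.5 | 10.72 | 10.83 | 11.49 |
| large | 384 | 2 | tf32 | 125.99 | 29.47 | 16.06 | 16.15 | 16.67 |
| large | 384 | 4 | tf32 | 164.28 | 38.77 | 24.6 | 24.83 | 25.59 |
| large | 384 | 8 | tf32 | 182.46 | 61 | 44.2 | 44.46 | 45.15 |
| large | 128 | 1 | fp32 | 50.43 | 23.83 | 20.11 | 20.2 | 20.56 |
| large | 128 | 2 | fp32 | 94.47 | 25.53 | 21.36 | 21.49 | 21.78 |
| large | 128 | 4 | fp32 | 141.52 | 32.51 | 28.44 | 28.57 | 28.99 |
| large | 128 | 8 | fp32 | 166.37 | 52.07 | 48.3 | 48.43 | 49.46 |
| large | 384 | 1 | fp32 | 44.42 | 30.54 | 22.67 | 22.74 | 23.46 |
| large | 384 | 2 | fp32 | 50.29 | 48.74 | 39.95 | 40.06 | 40.59 |
| large | 384 | 4 | fp32 | 55.58 | 81.55 | 72.31 | 72.6 | 73.7 |
| large | 384 | 8 | fp32 | 58.38 | 147.63 | 137.43 | 137.82 | 138.3 |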

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA Tesla T4 (1x T4 16GB)
Fine-tuning inference performance for SQuAD v1.1 on Tesla T4 16GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is the time taken to process one batch when batches are fed into the model one after another, i.e., without pipelining.

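| Model | Sequence Length | Batch Size | Precision | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-50%(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) | Latency-100%(ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 91.93 | 13.94 | 10.93 | 11.41 | 11.52 | 11.94 | 5491.47 |
| base | 128 | 2 | fp16 | 148.08 | 16.91 | 13.65 | 13.95 | 14.06 | 14.74 | 5757.12 |
| base | 128 | 4 | fp16 | 215.45 | 24.56 | 18.68 | 18.92 | 19.08 | 19.84 | 5894.82 |
| base | 128 | 8 | fp16 | 289.52 | 33.07 | 27.77 | 28.22 | 28.38 | 29.16 | 6074.47 |
| base | 384 | 1 | fp16 | 60.75 | 23.18 | 16.6 | 16.93 | 17.03 | 17.45 | 7006.41 |
| base | 384 | 2 | fp16 | 82.85 | 37.05 | 24.26 | 24.54 | 24.63 | 25.67 | 7529.94 |
| base | 384 | 4 | fp16 | 97.78 | 54.4 | 41.02 | 41.53 | 41.94 | 43.91 | 7995.39 |
| base | 384 | 8 | fp16 | 106.78 | 89.6 | 74.98 | 75.5 | 76.13 | 78.02 | 8461.93 |
| base | 128 | 1 | fp32 | 54.28 | 20.88 | 18.52 | 18.8 | 18.92 | 19.29 | 4401.4 |
| base | 128 | 2 | fp32 | 71.75 | 30.57 | 28.08 | 28.51 | 28.62 | 29.12 | 4573.47 |
| base | 128 | 4 | fp32 | 88.01 | 50.37 | 45.61 | 45.94 | 46.14 | 47.04 | 4992.7 |
| base | 128 | 8 | fp32 | 98.92 | 85.57 | 80.98 | 81.44 | 81.74 | 82.75 | 5408.97 |
| base | 384 | 1 | fp32 | 25.83 | 43.63 | 38.75 | 39.33 | 39.43 | 40.02 | 5148.45 |
| base | 384 | 2 | fp32 | 29.08 | 77.68 | 68.89 | 69.26 | 69.55 | 72.08 | 5462.5 |
| base | 384 | 4 | fp32 | 30.33 | 141.45 | 131.86 | 132.53 | 133.14 | 136.7 | 5975.63 |
| base | 384 | 8 | fp32 | 31.8 | 262.88 | 251.62 | 252.23 | 253.08 | 255.56 | 7124 |
| large | 128 | 1 | fp16 | 40.31 | 30.61 | 25.14 | 25.62 | 25.87 | 27.61 | 10395.87 |
| large | 128 | 2 | fp16 | 63.79 | 37.43 | 31.66 | 32.31 | 32.66 | 34.36 | 10302.2 |
| large | 128 | 4 | fp16 | 87.4 | 56.5 | 45.97 | 46.6 | 47.01 | 48.71 | 10391.17 |
| large | 128 | 8 | fp16 | 107.5 | 84.29 | 74.59 | 75.25 | 75.64 | 77.73 | 10945.1 |
| large | 384 | 1 | fp16 | 23.05 | 55.73 | 43.72 | 44.28 | 44.74 | 46.8 | 12889.23 |
| large | 384 | 2 | fp16 | 29.59 | 91.61 | 67.94 | 68.8 | 69.45 | 71.64 | 13876.35 |
| large | 384 | 4 | fp16 | 34.27 | 141.56 | 116.67 | 118.02 | 119.1 | 122.1 | 14570.73 |
| large | 384 | 8 | fp16 | 38.29 | 237.85 | 208.95 | 210.08 | 211.33 | 214.61 | 16626.02 |
| large | 128 | 1 | fp32 | 21.52 | 50.46 | 46.48 | 47.63 | 47.94 | 49.63 | 7150.38 |
| large | 128 | 2 | fp32 | 25.4 | 83.3 | 79.06 | 79.61 | 80.06 | 81.77 | 7763.11 |
| large | 128 | 4 | fp32 | 28.19 | 149.49 | 142.15 | 143.1 | 143.65 | 145.43 | 7701.38 |
| large | 128 | 8 | fp32 | 30.14 | 272.84 | 265.6 | 266.57 | 267.21 | 269.37 | 8246.3 |
| large | 384 | 1 | fp32 | 8.46 | 126.81 | 118.44 | 119.42 | 120.31 | 122.74 | 9007.96 |
| large | 384 | 2 | fp32 | 9.29 | 231 | 215.54 | 216.64 | 217.71 | 220.35 | 9755.69 |
| large | 384 | 4 | fp32 | 9.55 | 436.5 | 418.71 | 420.05 | 421.27 | 424.3 | 11766.45 |
| large | 384 | 8 | fp32 | 9.75 | 840.9 | 820.39 | 822.19 | 823.69 | 827.99 | 12856.99 |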

To achieve these same results, follow the Quick Start Guide outlined above.
