BERT for TensorFlow1

NVIDIA Deep Learning Examples

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Both benchmarking scripts let you run a number of epochs, extract performance numbers, and run the BERT model for fine-tuning.

Training performance benchmark

Training benchmarking can be performed by running the script:

scripts/finetune_train_benchmark.sh <bert_model> <use_xla> <num_gpu> squad

This script runs 2 epochs by default on the SQuAD v1.1 dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32. These numbers are saved at /results/squad_train_benchmark_bert_<bert_model>_gpu_<num_gpu>.log.
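As a rough illustration, the invocation and its log location can be assembled programmatically. This is a sketch, not part of the repository: the helper name `train_benchmark_command` and the example argument values (`large`, `true`, `8`) are assumptions. Only the script path, the argument order, and the log path pattern come from the text above.

```python
import shlex


def train_benchmark_command(bert_model: str, use_xla: str, num_gpu: int):
    """Return (command, log_path) for the fine-tuning training benchmark.

    Hypothetical helper: argument values are passed through unchecked.
    """
    argv = [
        "scripts/finetune_train_benchmark.sh",
        bert_model,
        use_xla,
        str(num_gpu),
        "squad",
    ]
    # Log path pattern as documented above.
    log_path = f"/results/squad_train_benchmark_bert_{bert_model}_gpu_{num_gpu}.log"
    return shlex.join(argv), log_path


# Example values are assumptions, not defaults taken from the script.
command, log_path = train_benchmark_command("large", "true", 8)
print(command)
print(log_path)
```

Keeping the command and the log path in one place makes it harder for the two to drift apart when the model or GPU count changes.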

Inference performance benchmark

Inference benchmarking can be performed by running the script:

scripts/finetune_inference_benchmark.sh squad

This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 and FP32/TF32, for base and large models. These numbers are saved at /results/squad_inference_benchmark_bert_<bert_model>.log.
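The sweep described above can be pictured as a grid of configurations. The sketch below enumerates one such grid; the specific values are taken from the V100 inference tables later in this document, and the benchmark script's own loop is not shown here, so treat the ordering and contents as assumptions.

```python
from itertools import product

# Values as they appear in the V100 inference tables of this document;
# the benchmark script's own loop may differ.
MODELS = ("base", "large")
PRECISIONS = ("fp16", "fp32")
SEQ_LENGTHS = (128, 384)
BATCH_SIZES = (1, 2, 4, 8)

# One tuple per benchmarked configuration, in table order.
configs = list(product(MODELS, PRECISIONS, SEQ_LENGTHS, BATCH_SIZES))

print(len(configs))
for model, precision, seq_len, batch_size in configs[:3]:
    print(model, precision, seq_len, batch_size)
```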

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference, for pre-training using the LAMB optimizer as well as fine-tuning for Question Answering. All results are on the BERT-large model unless otherwise mentioned. All fine-tuning results are on SQuAD v1.1 using a sequence length of 384 unless otherwise mentioned.

Training accuracy results

Training accuracy

Pre-training accuracy

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container.

DGX System	Nodes x GPUs	Precision	Batch Size/GPU: Phase1, Phase2	Accumulation Steps: Phase1, Phase2	Time to Train (Hrs)	Final Loss
DGX2H	32 x 16	FP16	64, 8	2, 8	2.63	1.59
DGX2H	32 x 16	FP32	32, 8	4, 8	8.48	1.56
DGXA100	32 x 8	FP16	64, 16	4, 8	3.24	1.56
DGXA100	32 x 8	TF32	64, 8	4, 16	4.58	1.58

Note: Time to train includes up to 16 minutes of start-up time for every restart (at least once for each phase). Experiments were run on clusters with a maximum wall clock time of 8 hours.

Fine-tuning accuracy for SQuAD v1.1: NVIDIA DGX A100 (8x A100 40G)

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs.

GPUs	Batch size / GPU: TF32, FP16	Accuracy - TF32	Accuracy - mixed precision	Time to Train - TF32 (Hrs)	Time to Train - mixed precision (Hrs)
8	16, 24	91.41	91.52	0.26	0.26

Fine-tuning accuracy for GLUE MRPC: NVIDIA DGX A100 (8x A100 40G)

Our results were obtained by running the scripts/run_glue.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs for 10 different seeds and picking the maximum accuracy on MRPC dev set.

GPUs	Batch size / GPU	Accuracy - TF32	Accuracy - mixed precision	Time to Train - TF32 (Hrs)	Time to Train - mixed precision (Hrs)	Throughput - TF32	Throughput - mixed precision
8	16	87.99	87.09	0.009	0.009	357.91	230.16

Training stability test

Pre-training stability test: NVIDIA DGX A100 (256x A100 40GB)

The following tables compare the final loss across 2 training runs with different seeds, for both FP16 and TF32. The runs converge consistently on both seeds with very little deviation.

FP16, 256x GPUs	seed 1	seed 2	mean	std
Final Loss	1.570	1.561	1.565	0.006

TF32, 256x GPUs	seed 1	seed 2	mean	std
Final Loss	1.583	1.582	1.582	0.0007

Fine-tuning SQuAD v1.1 stability test: NVIDIA DGX A100 (8x A100 40GB)

The following tables compare F1 scores across 5 training runs with different seeds, for both FP16 and TF32, using NVIDIA's pretrained checkpoint (https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs converge consistently on all 5 seeds with very little deviation.

FP16, 8x GPUs	seed 1	seed 2	seed 3	seed 4	seed 5	mean	std
F1	91.61	91.04	91.59	91.32	91.52	91.41	0.24

TF32, 8x GPUs	seed 1	seed 2	seed 3	seed 4	seed 5	mean	std
F1	91.50	91.49	91.64	91.29	91.67	91.52	0.15
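The mean and std columns can be recomputed from the per-seed scores. The sketch below assumes the std column is the sample standard deviation (dividing by n - 1), which is consistent with the reported values; the variable names are illustrative only.

```python
from statistics import mean, stdev

# F1 scores copied from the two tables above.
f1_by_precision = {
    "FP16": [91.61, 91.04, 91.59, 91.32, 91.52],
    "TF32": [91.50, 91.49, 91.64, 91.29, 91.67],
}

summary = {
    precision: (mean(scores), stdev(scores))  # stdev is the sample std (n - 1)
    for precision, scores in f1_by_precision.items()
}

for precision, (avg, std) in summary.items():
    print(f"{precision}: mean={avg:.3f} std={std:.3f}")
```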

Fine-tuning GLUE MRPC stability test: NVIDIA DGX A100 (8x A100 40GB)

The following tables compare evaluation accuracy across 10 training runs with different seeds, for both FP16 and TF32, using NVIDIA's pretrained checkpoint (https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_pretraining_lamb_16n). The runs converge consistently on all 10 seeds with very little deviation.

FP16, 8 GPUs	seed 1	seed 2	seed 3	seed 4	seed 5	seed 6	seed 7	seed 8	seed 9	seed 10	Mean	Std
Eval Accuracy	84.31372643	85.78431606	86.76471114	87.00980544	86.27451062	86.27451062	85.5392158	86.51961088	86.27451062	85.2941215	86.00490391	0.795887906

TF32, 8 GPUs	seed 1	seed 2	seed 3	seed 4	seed 5	seed 6	seed 7	seed 8	seed 9	seed 10	Mean	Std
Eval Accuracy	87.00980544	86.27451062	87.99020052	86.27451062	86.02941632	87.00980544	86.27451062	86.51961088	87.74510026	86.02941632	86.7156887	0.7009024515

Training performance results

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Pre-training training performance: single-node on DGX-1 16GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Sequence Length	Batch size / GPU: mixed precision, FP32	Gradient Accumulation: mixed precision, FP32	Global Batch Size	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	128	16 , 8	4096, 8192	65536	134.34	39.43	3.41	1.00	1.00
4	128	16 , 8	1024, 2048	65536	449.68	152.33	2.95	3.35	3.86
8	128	16 , 8	512, 1024	65536	1001.39	285.79	3.50	7.45	7.25
1	512	4 , 2	8192, 16384	32768	28.72	9.80	2.93	1.00	1.00
4	512	4 , 2	2048, 4096	32768	109.96	35.32	3.11	3.83	3.60
8	512	4 , 2	1024, 2048	32768	190.65	69.53	2.74	6.64	7.09

Note: FP32 results at the mixed precision batch sizes (16 for sequence length 128 and 4 for sequence length 512) are not available because those runs fail with out-of-memory errors.
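The Global Batch Size column in the table above is consistent with multiplying the per-GPU batch size by the number of GPUs and the gradient accumulation steps. The sketch below checks that relation against the mixed precision rows; the function name `global_batch_size` is illustrative, not taken from the training scripts.

```python
def global_batch_size(batch_per_gpu, num_gpus, accumulation_steps):
    """Sequences contributing to one optimizer step across all GPUs."""
    return batch_per_gpu * num_gpus * accumulation_steps


# (GPUs, sequence length, batch size / GPU, accumulation steps, reported global batch)
# Mixed precision rows from the table above.
rows = [
    (1, 128, 16, 4096, 65536),
    (4, 128, 16, 1024, 65536),
    (8, 128, 16, 512, 65536),
    (1, 512, 4, 8192, 32768),
    (4, 512, 4, 2048, 32768),
    (8, 512, 4, 1024, 32768),
]

computed = [global_batch_size(batch, gpus, accum) for gpus, _, batch, accum, _ in rows]

for (gpus, seq_len, _, _, reported), value in zip(rows, computed):
    print(f"{gpus} GPU(s), seq {seq_len}: computed={value} reported={reported}")
```

Because the accumulation steps shrink as the GPU count grows, the global batch size stays fixed and the rows are comparable across GPU counts.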

Fine-tuning training performance for SQuAD v1.1 on DGX-1 16GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Batch size / GPU: mixed precision, FP32	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	4,2	29.74	7.36	4.04	1.00	1.00
4	4,2	97.28	26.64	3.65	3.27	3.62
8	4,2	189.77	52.39	3.62	6.38	7.12

Note: FP32 results at a batch size of 4 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 32GB)

Pre-training training performance: single-node on DGX-1 32GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Sequence Length	Batch size / GPU: mixed precision, FP32	Gradient Accumulation: mixed precision, FP32	Global Batch Size	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	128	64 , 32	1024, 2048	65536	168.63	46.78	3.60	1.00	1.00
4	128	64 , 32	256, 512	65536	730.25	179.73	4.06	4.33	3.84
8	128	64 , 32	128, 256	65536	1443.05	357.00	4.04	8.56	7.63
1	512	8 , 8	4096, 4096	32768	31.23	10.67	2.93	1.00	1.00
4	512	8 , 8	1024, 1024	32768	118.84	39.55	3.00	3.81	3.71
8	512	8 , 8	512, 512	32768	255.64	81.42	3.14	8.19	7.63

Note: FP32 results at a batch size of 64 for sequence length 128 are not available because those runs fail with out-of-memory errors.
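The derived columns in the table above can be reproduced from the throughput columns. The sketch below assumes that the speedup is mixed precision throughput divided by FP32 throughput, and that weak scaling is throughput at N GPUs divided by single-GPU throughput at the same precision; both assumptions match the sequence length 128 rows to two decimal places.

```python
# GPUs -> (mixed precision, FP32) throughput in sentences/sec,
# sequence length 128 rows from the table above.
throughput = {
    1: (168.63, 46.78),
    4: (730.25, 179.73),
    8: (1443.05, 357.00),
}

base_mixed, base_fp32 = throughput[1]

derived = {}
for gpus, (mixed, fp32) in throughput.items():
    derived[gpus] = {
        "speedup": mixed / fp32,           # FP32 to mixed precision
        "weak_mixed": mixed / base_mixed,  # relative to 1 GPU, mixed precision
        "weak_fp32": fp32 / base_fp32,     # relative to 1 GPU, FP32
    }

for gpus, d in derived.items():
    print(
        f"{gpus} GPU(s): speedup={d['speedup']:.2f} "
        f"weak_mixed={d['weak_mixed']:.2f} weak_fp32={d['weak_fp32']:.2f}"
    )
```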

Fine-tuning training performance for SQuAD v1.1 on DGX-1 32GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Batch size / GPU: mixed precision, FP32	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	24, 10	51.02	10.42	4.90	1.00	1.00
4	24, 10	181.37	39.77	4.56	3.55	3.82
8	24, 10	314.6	79.37	3.96	6.17	7.62

Note: FP32 results at a batch size of 24 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-2 (16x V100 32GB)

Pre-training training performance: single-node on DGX-2 32GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Sequence Length	Batch size / GPU: mixed precision, FP32	Gradient Accumulation: mixed precision, FP32	Global Batch Size	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	128	64 , 32	1024 , 8192	65536	188.04	35.32	5.32	1.00	1.00
4	128	64 , 32	256 , 2048	65536	790.89	193.08	4.10	4.21	5.47
8	128	64 , 32	128 , 1024	65536	1556.89	386.89	4.02	8.28	10.95
16	128	64 , 32	64 , 128	65536	3081.69	761.92	4.04	16.39	21.57
1	512	8 , 8	4096 , 4096	32768	35.32	11.67	3.03	1.00	1.00
4	512	8 , 8	1024 , 1024	32768	128.98	42.84	3.01	3.65	3.67
8	512	8 , 8	512 , 512	32768	274.04	86.78	3.16	7.76	7.44
16	512	8 , 8	256 , 256	32768	513.43	173.26	2.96	14.54	14.85

Note: FP32 results at a batch size of 64 for sequence length 128 are not available because those runs fail with out-of-memory errors.

Pre-training training performance: multi-node on DGX-2H 32GB

Our results were obtained by running the run.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

Num Nodes	Sequence Length	Batch size / GPU: mixed precision, FP32	Gradient Accumulation: mixed precision, FP32	Global Batch Size	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	128	64 , 32	64 , 128	65536	3081.69	761.92	4.04	1.00	1.00
4	128	64 , 32	16 , 32	65536	13192.00	3389.83	3.89	4.28	4.45
16	128	64 , 32	4 , 8	65536	48223.00	13217.78	3.65	15.65	17.35
32	128	64 , 32	2 , 4	65536	86673.64	25142.26	3.45	28.13	33.00
1	512	8 , 8	256 , 256	32768	577.79	173.26	3.33	1.00	1.00
4	512	8 , 8	64 , 64	32768	2284.23	765.04	2.99	3.95	4.42
16	512	8 , 8	16 , 16	32768	8853.00	3001.43	2.95	15.32	17.32
32	512	8 , 8	8 , 8	32768	17059.00	5893.14	2.89	29.52	34.01

Note: FP32 results at a batch size of 64 for sequence length 128 are not available because those runs fail with out-of-memory errors.
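One way to read the multi-node weak scaling numbers is as a per-node efficiency: the weak scaling factor divided by the node count. This metric is not reported in the table; the sketch below derives it from the mixed precision, sequence length 128 throughput as an illustration only.

```python
# Nodes -> mixed precision throughput (sentences/sec), sequence length 128,
# copied from the table above.
throughput_by_nodes = {
    1: 3081.69,
    4: 13192.00,
    16: 48223.00,
    32: 86673.64,
}

single_node = throughput_by_nodes[1]

# Efficiency = (throughput at N nodes / single-node throughput) / N.
# A value of 1.0 would mean perfectly linear scaling.
efficiency = {
    nodes: (value / single_node) / nodes
    for nodes, value in throughput_by_nodes.items()
}

for nodes, eff in efficiency.items():
    print(f"{nodes} node(s): efficiency={eff:.3f}")
```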

Fine-tuning training performance for SQuAD v1.1 on DGX-2 32GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Batch size / GPU: mixed precision, FP32	Throughput - mixed precision	Throughput - FP32	Throughput speedup (FP32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - FP32
1	24, 10	55.28	11.15	4.96	1.00	1.00
4	24, 10	199.53	42.91	4.65	3.61	3.85
8	24, 10	341.55	85.08	4.01	6.18	7.63
16	24, 10	683.37	156.29	4.37	12.36	14.02

Note: FP32 results at a batch size of 24 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX A100 (8x A100 40GB)

Pre-training training performance: single-node on DGX A100 40GB

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Sequence Length	Batch size / GPU: mixed precision, TF32	Gradient Accumulation: mixed precision, TF32	Global Batch Size	Throughput - mixed precision	Throughput - TF32	Throughput speedup (TF32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - TF32
1	128	64 , 64	1024 , 1024	65536	356.845	238.10	1.50	1.00	1.00
4	128	64 , 64	256 , 256	65536	1422.25	952.39	1.49	3.99	4.00
8	128	64 , 64	128 , 128	65536	2871.89	1889.71	1.52	8.05	7.94
1	512	16 , 8	2048 , 4096	32768	70.856	39.96	1.77	1.00	1.00
4	512	16 , 8	512 , 1024	32768	284.912	160.16	1.78	4.02	4.01
8	512	16 , 8	256 , 512	32768	572.112	316.51	1.81	8.07	7.92

Note: TF32 results at a batch size of 16 for sequence length 512 are not available because those runs fail with out-of-memory errors.

Pre-training training performance: multi-node on DGX A100 40GB

Num Nodes	Sequence Length	Batch size / GPU: mixed precision, TF32	Gradient Accumulation: mixed precision, TF32	Global Batch Size	Throughput - mixed precision	Throughput - TF32	Throughput speedup (TF32 to mixed precision)	Weak scaling - mixed precision	Weak scaling - TF32
1	128	64 , 64	128 , 128	65536	2871.89	1889.71	1.52	1.00	1.00
4	128	64 , 64	32 , 32	65536	11159	7532.00	1.48	3.89	3.99
16	128	64 , 64	8 , 8	65536	41144	28605.62	1.44	14.33	15.14
32	128	64 , 64	4 , 4	65536	77479.87	53585.82	1.45	26.98	28.36
1	512	16 , 8	256 , 512	32768	572.112	316.51	1.81	1.00	1.00
4	512	16 , 8	128 , 128	65536	2197.44	1268.43	1.73	3.84	4.01
16	512	16 , 8	32 , 32	65536	8723.1	4903.39	1.78	15.25	15.49
32	512	16 , 8	16 , 16	65536	16705	9463.80	1.77	29.20	29.90

Note: TF32 results at a batch size of 16 for sequence length 512 are not available because those runs fail with out-of-memory errors.

Fine-tuning training performance for SQuAD v1.1 on DGX A100 40GB

Our results were obtained by running the scripts/run_squad.sh training script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs. Performance (in sentences per second) is the steady state throughput.

GPUs	Batch size / GPU: mixed precision, TF32	Throughput - mixed precision	Throughput - TF32	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
1	32, 16	102.26	61.364	1.67	1.00	1.00
4	32, 16	366.353	223.187	1.64	3.64	3.58
8	32, 16	767.071	440.47	1.74	7.18	7.50

Note: TF32 results at a batch size of 32 are not available because those runs fail with out-of-memory errors.

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance results

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Fine-tuning inference performance for SQuAD v1.1 on 16GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a batch when batches are fed to the model one after another, i.e., with no pipelining.

Model	Sequence Length	Batch Size	Precision	Throughput-Average(sent/sec)	Latency-Average(ms)	Latency-90%(ms)	Latency-95%(ms)	Latency-99%(ms)
base	128	1	fp16	206.82	7.96	4.98	5.04	5.23
base	128	2	fp16	376.75	8.68	5.42	5.49	5.64
base	128	4	fp16	635	12.31	6.46	6.55	6.83
base	128	8	fp16	962.83	13.64	8.47	8.56	8.75
base	384	1	fp16	167.01	12.77	6.12	6.23	6.52
base	384	2	fp16	252.12	21.05	8.03	8.09	8.61
base	384	4	fp16	341.95	25.09	11.88	11.96	12.52
base	384	8	fp16	421.26	33.16	19.2	19.37	19.91

base	128	1	fp32	174.48	8.17	5.89	5.95	6.12
base	128	2	fp32	263.67	10.33	7.66	7.69	7.92
base	128	4	fp32	349.34	16.31	11.57	11.62	11.87
base	128	8	fp32	422.88	23.27	19.23	19.38	20.38
base	384	1	fp32	99.52	14.99	10.19	10.23	10.78
base	384	2	fp32	118.01	25.98	17.12	17.18	17.78
base	384	4	fp32	128.1	41	31.56	31.7	32.39
base	384	8	fp32	136.1	69.77	59.44	59.66	60.51

large	128	1	fp16	98.63	15.86	10.27	10.31	10.46
large	128	2	fp16	172.59	17.78	11.81	11.86	12.13
large	128	4	fp16	272.86	25.66	14.86	14.94	15.18
large	128	8	fp16	385.64	30.74	20.98	21.1	21.68
large	384	1	fp16	70.74	26.85	14.38	14.47	14.7
large	384	2	fp16	99.9	45.29	20.26	20.43	21.11
large	384	4	fp16	128.42	56.94	31.44	31.71	32.45
large	384	8	fp16	148.57	81.69	54.23	54.54	55.53

large	128	1	fp32	76.75	17.06	13.21	13.27	13.4
large	128	2	fp32	100.82	24.34	20.05	20.13	21.13
large	128	4	fp32	117.59	41.76	34.42	34.55	35.29
large	128	8	fp32	130.42	68.59	62	62.23	62.98
large	384	1	fp32	33.95	37.89	29.82	29.98	30.56
large	384	2	fp32	38.47	68.35	52.56	52.74	53.89
large	384	4	fp32	41.11	114.27	98.19	98.54	99.54
large	384	8	fp32	41.32	213.84	194.92	195.36	196.94
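The latency columns above are summary statistics over per-batch timings. The sketch below shows one way such statistics could be collected when batches are run strictly one after another; it is not the benchmark script's code. The nearest-rank percentile, the `benchmark` helper, and the throughput formula (batch size divided by mean latency) are all assumptions, and the published throughput figures may be computed differently.

```python
import math
import time
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(run_batch, batch_size, iterations=1024):
    """Time `run_batch` sequentially, one batch after another (no pipelining)."""
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_batch()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    ordered = sorted(latencies_ms)
    avg_ms = mean(latencies_ms)
    return {
        # Assumed definition: sentences per second from the mean latency.
        "throughput_avg": batch_size / (avg_ms / 1000.0),
        "latency_avg_ms": avg_ms,
        "latency_90_ms": percentile(ordered, 90),
        "latency_95_ms": percentile(ordered, 95),
        "latency_99_ms": percentile(ordered, 99),
    }


# Stand-in workload; replace with a real model call.
stats = benchmark(lambda: sum(range(1000)), batch_size=8, iterations=64)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Reporting percentiles alongside the average is useful because a few slow iterations can raise the mean well above the typical latency.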

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

Fine-tuning inference performance for SQuAD v1.1 on 32GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a batch when batches are fed to the model one after another, i.e., with no pipelining.

Model	Sequence Length	Batch Size	Precision	Throughput-Average(sent/sec)	Latency-Average(ms)	Latency-90%(ms)	Latency-95%(ms)	Latency-99%(ms)
base	128	1	fp16	207.87	7.63	4.94	5.03	5.32
base	128	2	fp16	376.44	8.47	5.44	5.5	5.68
base	128	4	fp16	642.55	11.63	6.3	6.36	6.68
base	128	8	fp16	943.85	13.24	8.56	8.68	8.92
base	384	1	fp16	162.62	12.24	6.31	6.4	6.73
base	384	2	fp16	244.15	20.05	8.34	8.41	8.93
base	384	4	fp16	338.68	23.53	11.88	11.92	12.63
base	384	8	fp16	407.46	32.72	19.84	20.06	20.89

base	128	1	fp32	175.16	8.31	5.85	5.89	6.04
base	128	2	fp32	261.31	10.48	7.75	7.81	8.08
base	128	4	fp32	339.45	16.67	11.95	12.02	12.46
base	128	8	fp32	406.67	24.12	19.86	19.97	20.41
base	384	1	fp32	98.33	15.28	10.27	10.32	10.76
base	384	2	fp32	114.92	26.88	17.55	17.59	18.29
base	384	4	fp32	125.76	41.74	32.06	32.23	33.72
base	384	8	fp32	136.62	69.78	58.95	59.19	60

large	128	1	fp16	96.46	15.56	10.56	10.66	11.02
large	128	2	fp16	168.31	17.42	12.11	12.25	12.57
large	128	4	fp16	267.76	24.76	15.17	15.36	16.68
large	128	8	fp16	378.28	30.34	21.39	21.54	21.97
large	384	1	fp16	68.75	26.02	14.77	14.94	15.3
large	384	2	fp16	95.41	44.01	21.24	21.47	22.01
large	384	4	fp16	124.43	55.14	32.53	32.83	33.58
large	384	8	fp16	143.02	81.37	56.51	56.88	58.05

large	128	1	fp32	75.34	17.5	13.46	13.52	13.7
large	128	2	fp32	99.73	24.7	20.27	20.38	21.45
large	128	4	fp32	116.92	42.1	34.49	34.59	34.98
large	128	8	fp32	130.11	68.95	62.03	62.23	63.3
large	384	1	fp32	33.84	38.15	29.75	29.89	31.23
large	384	2	fp32	38.02	69.31	53.1	53.36	54.42
large	384	4	fp32	41.2	114.34	97.96	98.32	99.55
large	384	8	fp32	42.37	209.16	190.18	190.66	192.77

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-2 (1x V100 32GB)

Fine-tuning inference performance for SQuAD v1.1 on DGX-2 32GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a batch when batches are fed to the model one after another, i.e., with no pipelining.

Model	Sequence Length	Batch Size	Precision	Throughput-Average(sent/sec)	Latency-Average(ms)	Latency-90%(ms)	Latency-95%(ms)	Latency-99%(ms)
base	128	1	fp16	220.35	7.82	4.7	4.83	5.15
base	128	2	fp16	384.55	8.7	5.49	5.68	6.01
base	128	4	fp16	650.7	36.3	6.35	6.51	6.87
base	128	8	fp16	992.41	13.59	8.22	8.37	8.96
base	384	1	fp16	172.89	12.86	5.94	6.04	6.44
base	384	2	fp16	258.48	20.42	7.89	8.09	9.15
base	384	4	fp16	346.34	24.93	11.97	12.12	12.76
base	384	8	fp16	430.4	33.08	18.75	19.27	20.12

base	128	1	fp32	183.69	7.52	5.86	5.97	6.27
base	128	2	fp32	282.95	9.51	7.31	7.49	7.83
base	128	4	fp32	363.83	15.12	11.35	11.47	11.74
base	128	8	fp32	449.12	21.65	18	18.1	18.6
base	384	1	fp32	104.92	13.8	9.9	9.99	10.48
base	384	2	fp32	123.55	24.21	16.29	16.4	17.61
base	384	4	fp32	139.38	36.69	28.89	29.04	30.01
base	384	8	fp32	146.28	64.69	55.09	55.32	56.3

large	128	1	fp16	98.34	15.85	10.61	10.78	11.5
large	128	2	fp16	172.95	17.8	11.91	12.08	12.42
large	128	4	fp16	278.82	25.18	14.7	14.87	15.65
large	128	8	fp16	402.28	30.45	20.21	20.43	21.24
large	384	1	fp16	71.1	26.55	14.44	14.61	15.32
large	384	2	fp16	100.48	44.04	20.31	20.48	21.6
large	384	4	fp16	131.68	56.19	30.8	31.03	32.3
large	384	8	fp16	151.81	81.53	53.22	53.87	55.34

large	128	1	fp32	77.87	16.33	13.33	13.45	13.77
large	128	2	fp32	105.41	22.77	19.39	19.52	19.86
large	128	4	fp32	124.16	38.61	32.69	32.88	33.9
large	128	8	fp32	137.69	64.61	58.62	58.89	59.94
large	384	1	fp32	36.34	34.94	27.72	27.81	28.21
large	384	2	fp32	41.11	62.54	49.14	49.32	50.25
large	384	4	fp32	43.32	107.53	93.07	93.47	94.27
large	384	8	fp32	44.64	196.28	180.21	180.75	182.41

Inference performance: NVIDIA DGX A100 (1x A100 40GB)

Fine-tuning inference performance for SQuAD v1.1 on DGX A100 40GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 40GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a batch when batches are fed to the model one after another, i.e., with no pipelining.

Model	Sequence Length	Batch Size	Precision	Throughput-Average(sent/sec)	Latency-Average(ms)	Latency-90%(ms)	Latency-95%(ms)	Latency-99%(ms)
base	128	1	fp16	231.37	6.43	4.57	4.68	4.93
base	128	2	fp16	454.54	6.77	4.66	4.77	4.96
base	128	4	fp16	842.34	8.8	4.91	4.98	5.39
base	128	8	fp16	1216.43	10.39	6.77	6.86	7.28
base	384	1	fp16	210.59	9.03	4.83	4.86	5.06
base	384	2	fp16	290.91	14.88	7.09	7.19	7.72
base	384	4	fp16	407.13	18.04	9.93	10.05	10.74
base	384	8	fp16	478.67	26.06	16.92	17.19	17.76

base	128	1	tf32	223.38	6.94	4.73	4.86	5.04
base	128	2	tf32	447.57	7.2	4.68	4.82	5.07
base	128	4	tf32	838.89	9.16	4.88	4.93	5.38
base	128	8	tf32	1201.05	10.81	6.88	6.99	7.21
base	384	1	tf32	206.46	9.74	4.93	4.98	5.25
base	384	2	tf32	287	15.57	7.18	7.27	7.87
base	384	4	tf32	396.59	18.94	10.3	10.41	11.04
base	384	8	tf32	479.04	26.81	16.88	17.25	17.74

base	128	1	fp32	152.92	9.13	6.76	6.91	7.06
base	128	2	fp32	297.42	9.51	6.93	7.07	7.21
base	128	4	fp32	448.57	11.81	9.12	9.25	9.68
base	128	8	fp32	539.94	17.49	15	15.1	15.79
base	384	1	fp32	115.19	13.69	8.89	8.98	9.27
base	384	2	fp32	154.66	18.49	13.06	13.14	13.89
base	384	4	fp32	174.28	28.75	23.11	23.24	24
base	384	8	fp32	191.97	48.05	41.85	42.25	42.8

large	128	1	fp16	127.75	11.18	8.14	8.25	8.53
large	128	2	fp16	219.49	12.76	9.4	9.54	9.89
large	128	4	fp16	315.83	19.01	12.87	12.98	13.37
large	128	8	fp16	495.75	22.21	16.33	16.45	16.79
large	384	1	fp16	96.65	17.46	10.52	10.6	11
large	384	2	fp16	126.07	29.43	16.09	16.22	16.78
large	384	4	fp16	165.21	38.39	24.41	24.61	25.38
large	384	8	fp16	182.13	61.04	44.32	44.61	45.23

large	128	1	tf32	133.24	10.86	7.77	7.87	8.23
large	128	2	tf32	218.13	12.86	9.44	9.56	9.85
large	128	4	tf32	316.25	18.98	12.91	13.01	13.57
large	128	8	tf32	495.21	22.25	16.4	16.51	17.23
large	384	1	tf32	95.43	17.5	10.72	10.83	11.49
large	384	2	tf32	125.99	29.47	16.06	16.15	16.67
large	384	4	tf32	164.28	38.77	24.6	24.83	25.59
large	384	8	tf32	182.46	61	44.2	44.46	45.15

large	128	1	fp32	50.43	23.83	20.11	20.2	20.56
large	128	2	fp32	94.47	25.53	21.36	21.49	21.78
large	128	4	fp32	141.52	32.51	28.44	28.57	28.99
large	128	8	fp32	166.37	52.07	48.3	48.43	49.46
large	384	1	fp32	44.42	30.54	22.67	22.74	23.46
large	384	2	fp32	50.29	48.74	39.95	40.06	40.59
large	384	4	fp32	55.58	81.55	72.31	72.6	73.7
large	384	8	fp32	58.38	147.63	137.43	137.82	138.3

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA Tesla T4 (1x T4 16GB)

Fine-tuning inference performance for SQuAD v1.1 on Tesla T4 16GB

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 20.06-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16GB GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is the time taken to process a batch when batches are fed to the model one after another, i.e., with no pipelining.

Model	Sequence Length	Batch Size	Precision	Throughput-Average(sent/sec)	Latency-Average(ms)	Latency-50%(ms)	Latency-90%(ms)	Latency-95%(ms)	Latency-99%(ms)	Latency-100%(ms)
base	128	1	fp16	91.93	13.94	10.93	11.41	11.52	11.94	5491.47
base	128	2	fp16	148.08	16.91	13.65	13.95	14.06	14.74	5757.12
base	128	4	fp16	215.45	24.56	18.68	18.92	19.08	19.84	5894.82
base	128	8	fp16	289.52	33.07	27.77	28.22	28.38	29.16	6074.47
base	384	1	fp16	60.75	23.18	16.6	16.93	17.03	17.45	7006.41
base	384	2	fp16	82.85	37.05	24.26	24.54	24.63	25.67	7529.94
base	384	4	fp16	97.78	54.4	41.02	41.53	41.94	43.91	7995.39
base	384	8	fp16	106.78	89.6	74.98	75.5	76.13	78.02	8461.93

base	128	1	fp32	54.28	20.88	18.52	18.8	18.92	19.29	4401.4
base	128	2	fp32	71.75	30.57	28.08	28.51	28.62	29.12	4573.47
base	128	4	fp32	88.01	50.37	45.61	45.94	46.14	47.04	4992.7
base	128	8	fp32	98.92	85.57	80.98	81.44	81.74	82.75	5408.97
base	384	1	fp32	25.83	43.63	38.75	39.33	39.43	40.02	5148.45
base	384	2	fp32	29.08	77.68	68.89	69.26	69.55	72.08	5462.5
base	384	4	fp32	30.33	141.45	131.86	132.53	133.14	136.7	5975.63
base	384	8	fp32	31.8	262.88	251.62	252.23	253.08	255.56	7124

large	128	1	fp16	40.31	30.61	25.14	25.62	25.87	27.61	10395.87
large	128	2	fp16	63.79	37.43	31.66	32.31	32.66	34.36	10302.2
large	128	4	fp16	87.4	56.5	45.97	46.6	47.01	48.71	10391.17
large	128	8	fp16	107.5	84.29	74.59	75.25	75.64	77.73	10945.1
large	384	1	fp16	23.05	55.73	43.72	44.28	44.74	46.8	12889.23
large	384	2	fp16	29.59	91.61	67.94	68.8	69.45	71.64	13876.35
large	384	4	fp16	34.27	141.56	116.67	118.02	119.1	122.1	14570.73
large	384	8	fp16	38.29	237.85	208.95	210.08	211.33	214.61	16626.02

large	128	1	fp32	21.52	50.46	46.48	47.63	47.94	49.63	7150.38
large	128	2	fp32	25.4	83.3	79.06	79.61	80.06	81.77	7763.11
large	128	4	fp32	28.19	149.49	142.15	143.1	143.65	145.43	7701.38
large	128	8	fp32	30.14	272.84	265.6	266.57	267.21	269.37	8246.3
large	384	1	fp32	8.46	126.81	118.44	119.42	120.31	122.74	9007.96
large	384	2	fp32	9.29	231	215.54	216.64	217.71	220.35	9755.69
large	384	4	fp32	9.55	436.5	418.71	420.05	421.27	424.3	11766.45
large	384	8	fp32	9.75	840.9	820.39	822.19	823.69	827.99	12856.99

To achieve these same results, follow the Quick Start Guide outlined above.