NVIDIA Deep Learning Examples
BERT for PyTorch

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

Training performance benchmarks can be obtained by running scripts/run_pretraining.sh for pre-training, and scripts/run_squad.sh or scripts/run_glue.sh for fine-tuning on SQuAD or GLUE, respectively. The required parameters are passed on the command line, as described in Training process.

As an example, to benchmark the training performance on a specific batch size for SQuAD, run:

bash scripts/run_squad.sh <pre-trained checkpoint path> <epochs> <batch size> <learning rate> <fp16|fp32> <num_gpus> <seed> <path to SQuAD dataset> <path to vocab set> <results directory> train <BERT config path> <max steps>

An example call used to generate throughput numbers:

bash scripts/run_squad.sh /workspace/bert/bert_large_uncased.pt 2.0 4 3e-5 fp16 8 42 /workspace/bert/squad_data /workspace/bert/scripts/vocab/vocab /results/SQuAD train /workspace/bert/bert_config.json -1
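The script takes thirteen positional arguments, which are easy to mix up. As a minimal sketch, the call can be assembled from named values. The argument order is taken from the usage line above; the helper name `build_squad_command` and its parameter names are hypothetical and are not part of the repository.

```python
import shlex


def build_squad_command(checkpoint, epochs, batch_size, learning_rate, precision,
                        num_gpus, seed, squad_dir, vocab_file, results_dir,
                        mode, config_file, max_steps):
    """Return the run_squad.sh call as a single shell-quoted string.

    Hypothetical helper: only the positional order comes from the usage line.
    """
    if precision not in ("fp16", "fp32"):
        raise ValueError("precision must be 'fp16' or 'fp32'")
    args = [checkpoint, epochs, batch_size, learning_rate, precision, num_gpus,
            seed, squad_dir, vocab_file, results_dir, mode, config_file, max_steps]
    # str() each value, then quote anything the shell would otherwise split.
    return shlex.join(["bash", "scripts/run_squad.sh", *map(str, args)])


# epochs and learning rate are passed as strings so they are emitted exactly
# as written (str(3e-5) would otherwise produce '3e-05').
cmd = build_squad_command(
    checkpoint="/workspace/bert/bert_large_uncased.pt",
    epochs="2.0",
    batch_size=4,
    learning_rate="3e-5",
    precision="fp16",
    num_gpus=8,
    seed=42,
    squad_dir="/workspace/bert/squad_data",
    vocab_file="/workspace/bert/scripts/vocab/vocab",
    results_dir="/results/SQuAD",
    mode="train",
    config_file="/workspace/bert/bert_config.json",
    max_steps=-1,
)
print(cmd)
```

With the values shown, the printed string reproduces the example call above; changing `mode` to `eval` gives the corresponding inference call.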

Inference performance benchmark

Inference performance benchmarks for fine-tuning can be obtained by running scripts/run_squad.sh or scripts/run_glue.sh for SQuAD or GLUE, respectively. The required parameters are passed on the command line, as described in Inference process.

As an example, to benchmark the inference performance on a specific batch size for SQuAD, run:

bash scripts/run_squad.sh <pre-trained checkpoint path> <epochs> <batch size> <learning rate> <fp16|fp32> <num_gpus> <seed> <path to SQuAD dataset> <path to vocab set> <results directory> eval <BERT config path> <max steps>

An example call used to generate throughput numbers:

bash scripts/run_squad.sh /workspace/bert/bert_large_uncased.pt 2.0 4 3e-5 fp16 8 42 /workspace/bert/squad_data /workspace/bert/scripts/vocab/vocab /results/SQuAD eval /workspace/bert/bert_config.json -1

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Our results were obtained by running the scripts/run_squad.sh and scripts/run_pretraining.sh training scripts in the pytorch:21.11-py3 NGC container unless otherwise specified.

Pre-training loss results: NVIDIA DGX A100 (8x A100 80GB)
| DGX System | GPUs / Node | Batch size / GPU (Phase 1 and Phase 2) | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32 | Final Loss - mixed precision | Time to train (hours) - TF32 | Time to train (hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|---|---|---|
| 32 x DGX A100 80GB | 8 | 256 and 32 | 256 and 128 | 1 and 4 | --- | 1.2437 | --- | 1.2 | 1.9 |
| 32 x DGX A100 80GB | 8 | 128 and 16 | 256 and 128 | 2 and 8 | 1.2465 | --- | 2.4 | --- | --- |
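The batch-size columns in this table appear to be related by a simple product: the accumulated batch size per GPU equals the per-step batch size per GPU times the number of accumulation steps. The sketch below checks that reading against both rows. The relationship is inferred from the numbers rather than stated in the text, and the function name is hypothetical.

```python
def accumulated_batch_size(batch_per_gpu: int, accumulation_steps: int) -> int:
    """Samples per GPU that contribute to one optimizer step (assumed definition)."""
    return batch_per_gpu * accumulation_steps


# (batch size / GPU, accumulation steps) for Phase 1 and Phase 2, read from the table.
mixed_precision_row = [(256, 1), (32, 4)]
tf32_row = [(128, 2), (16, 8)]

mixed_accumulated = [accumulated_batch_size(b, s) for b, s in mixed_precision_row]
tf32_accumulated = [accumulated_batch_size(b, s) for b, s in tf32_row]

# Both rows give 256 (Phase 1) and 128 (Phase 2), matching the table.
print(mixed_accumulated, tf32_accumulated)
```

Under this reading, the TF32 run uses half the per-step batch and twice the accumulation steps, so both precisions see the same accumulated batch per GPU.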
Pre-training loss curves

[Figure: Pre-training loss curves]

Fine-tuning accuracy results: NVIDIA DGX A100 (8x A100 80GB)
  • SQuAD
| GPUs | Batch size / GPU (TF32 and FP16) | Accuracy - TF32 (% F1) | Accuracy - mixed precision (% F1) | Time to train (hours) - TF32 | Time to train (hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 8 | 32 | 90.93 | 90.96 | 0.102 | 0.0574 | 1.78 |
  • MRPC
| GPUs | Batch size / GPU (TF32 and FP16) | Accuracy - TF32 (%) | Accuracy - mixed precision (%) | Time to train (seconds) - TF32 | Time to train (seconds) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 8 | 16 | 87.25 | 88.24 | 17.26 | 7.31 | 2.36 |
  • SST-2
| GPUs | Batch size / GPU (TF32 and FP16) | Accuracy - TF32 (%) | Accuracy - mixed precision (%) | Time to train (seconds) - TF32 | Time to train (seconds) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 8 | 128 | 91.97 | 92.78 | 119.28 | 62.59 | 1.91 |
Training stability test
Pre-training stability test
| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|
| Final Loss | 1.260 | 1.265 | 1.304 | 1.256 | 1.242 | 1.265 | 0.023 |
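The summary columns can be recomputed from the per-seed values. The sketch below uses the five final losses from the table. It assumes the reported standard deviation is the sample standard deviation (dividing by n - 1), which is the estimator that reproduces 0.023 for this row; the source does not say which estimator was used.

```python
import statistics

# Final loss per seed, copied from the pre-training stability table.
final_losses = [1.260, 1.265, 1.304, 1.256, 1.242]

mean_loss = statistics.mean(final_losses)
# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
stdev_loss = statistics.stdev(final_losses)

print(f"mean={mean_loss:.3f} stdev={stdev_loss:.3f}")
```

The same two calls apply to the per-seed rows of the fine-tuning stability tables.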
Fine-tuning stability test
  • SQuAD

Training stability with 8 GPUs, FP16 computations, batch size of 4:

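| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 | Seed 7 | Seed 8 | Seed 9 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact Match % | 83.64 | 84.05 | 84.51 | 83.69 | 83.87 | 83.94 | 84.27 | 83.97 | 83.75 | 83.92 | 83.96 | 0.266 |
| f1 % | 90.60 | 90.65 | 90.96 | 90.44 | 90.58 | 90.78 | 90.81 | 90.82 | 90.51 | 90.68 | 90.68 | 0.160 |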
  • MRPC

Training stability with 8 A100 GPUs, FP16 computations, batch size of 16 per GPU:

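| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 | Seed 7 | Seed 8 | Seed 9 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact Match % | 85.78 | 85.54 | 84.56 | 86.27 | 84.07 | 86.76 | 87.01 | 85.29 | 88.24 | 86.52 | 86.00 | 1.225 |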

Note: MRPC is a very small dataset on which overfitting often occurs, so the resulting validation accuracy can have high variance. Repeating the above experiments for 100 seeds gives a maximum accuracy of 88.73 and an average accuracy of 82.56, with a standard deviation of 6.01.

  • SST-2

Training stability with 8 A100 GPUs, FP16 computations, batch size of 128 per GPU:

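| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 | Seed 7 | Seed 8 | Seed 9 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact Match % | 91.86 | 91.28 | 91.86 | 91.74 | 91.28 | 91.86 | 91.40 | 91.97 | 91.40 | 92.78 | 91.74 | 0.449 |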

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/run_pretraining.sh training script in the pytorch:21.11-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in sequences per second) were averaged over a few training iterations.

Pre-training NVIDIA DGX A100 (8x A100 80GB)
| GPUs | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 317 | 580 | 1.83 | 1.00 | 1.00 |
| 8 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 2505 | 4591 | 1.83 | 7.90 | 7.91 |
| 1 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 110 | 210 | 1.90 | 1.00 | 1.00 |
| 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 860 | 1657 | 1.92 | 7.81 | 7.89 |
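The derived columns follow from the throughput columns. As a sketch, the speedup is taken to be mixed-precision throughput divided by TF32 throughput, and weak scaling to be throughput on N GPUs divided by throughput on 1 GPU at the same per-GPU batch size. These definitions are inferred from the table values, and the function names are hypothetical.

```python
def speedup(baseline: float, improved: float) -> float:
    """How many times faster `improved` is than `baseline`."""
    return improved / baseline


def weak_scaling(throughput_n_gpus: float, throughput_1_gpu: float) -> float:
    """Throughput ratio when the per-GPU workload is held fixed."""
    return throughput_n_gpus / throughput_1_gpu


# Sequence length 128 rows from the table: {number of GPUs: sequences/sec}.
tf32 = {1: 317, 8: 2505}
mixed = {1: 580, 8: 4591}

mixed_speedup_1gpu = speedup(tf32[1], mixed[1])
tf32_scaling = weak_scaling(tf32[8], tf32[1])
mixed_scaling = weak_scaling(mixed[8], mixed[1])
# Fraction of ideal (linear) scaling achieved on 8 GPUs.
tf32_efficiency = tf32_scaling / 8

print(f"speedup={mixed_speedup_1gpu:.2f}")
print(f"weak scaling: TF32={tf32_scaling:.2f}, mixed={mixed_scaling:.2f}")
print(f"TF32 scaling efficiency={tf32_efficiency:.1%}")
```

Because the published throughputs are rounded, recomputed ratios can differ from the table in the last digit.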
Pre-training NVIDIA DGX A100 (8x A100 80GB) Multi-node Scaling
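| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed Precision Throughput | Mixed Precision Strong Scaling | TF32 Throughput | TF32 Strong Scaling | Speedup (Mixed Precision to TF32) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 126 and 256 | 8192 and 8192 | 64 and 32 | 128 | 4553 | 1 | 2486 | 1 | 1.83 |
| 2 | 8 | 126 and 256 | 4096 and 4096 | 32 and 16 | 128 | 9191 | 2.02 | 4979 | 2.00 | 1.85 |
| 4 | 8 | 126 and 256 | 2048 and 2048 | 16 and 18 | 128 | 18119 | 3.98 | 9859 | 3.97 | 1.84 |
| 8 | 8 | 126 and 256 | 1024 and 1024 | 8 and 4 | 128 | 35774 | 7.86 | 19815 | 7.97 | 1.81 |
| 16 | 8 | 126 and 256 | 512 and 512 | 4 and 2 | 128 | 70555 | 15.50 | 38866 | 15.63 | 1.82 |
| 32 | 8 | 126 and 256 | 256 and 256 | 2 and 1 | 128 | 138294 | 30.37 | 75706 | 30.45 | 1.83 |
| 1 | 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 1648 | 1 | 854 | 1 | 1.93 |
| 2 | 8 | 16 and 32 | 2048 and 2048 | 128 and 64 | 512 | 3291 | 2.00 | 1684 | 1.97 | 1.95 |
| 4 | 8 | 16 and 32 | 1024 and 1024 | 64 and 32 | 512 | 6464 | 3.92 | 3293 | 3.86 | 1.96 |
| 8 | 8 | 16 and 32 | 512 and 512 | 32 and 16 | 512 | 13005 | 7.89 | 6515 | 7.63 | 2.00 |
| 16 | 8 | 16 and 32 | 256 and 256 | 16 and 8 | 512 | 25570 | 15.51 | 12131 | 14.21 | 2.11 |
| 32 | 8 | 16 and 32 | 128 and 128 | 8 and 4 | 512 | 49663 | 30.13 | 21298 | 24.95 | 2.33 |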
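In the multi-node runs, the accumulated batch size per GPU halves each time the node count doubles, which suggests the total batch per optimizer step is held constant (strong scaling). The sketch below checks that reading for the sequence length 128, mixed-precision rows and recomputes the strong-scaling column. The "global batch" quantity is derived here and is not stated in the source; all names are illustrative.

```python
GPUS_PER_NODE = 8

# (nodes, accumulated batch size / GPU, mixed precision throughput),
# sequence length 128 rows of the multi-node table.
rows = [
    (1, 8192, 4553),
    (2, 4096, 9191),
    (4, 2048, 18119),
    (8, 1024, 35774),
    (16, 512, 70555),
    (32, 256, 138294),
]

single_node_throughput = rows[0][2]
global_batches = []
strong_scaling = {}
for nodes, accumulated_per_gpu, throughput in rows:
    # Sequences consumed per optimizer step across the whole cluster (derived).
    global_batches.append(nodes * GPUS_PER_NODE * accumulated_per_gpu)
    # Throughput relative to one node at a fixed total problem size.
    strong_scaling[nodes] = throughput / single_node_throughput

for nodes, scaling in strong_scaling.items():
    print(f"{nodes:>2} nodes: scaling={scaling:.2f}, efficiency={scaling / nodes:.1%}")
```

Every row yields the same derived global batch of 65536, so the throughput ratios can be read as strong scaling against a fixed optimizer-step size.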
Fine-tuning NVIDIA DGX A100 (8x A100 80GB)
  • SQuAD
| GPUs | Batch size / GPU (TF32 and FP16) | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 32 and 32 | 61.5 | 110.5 | 1.79 | 1.00 | 1.00 |
| 8 | 32 and 32 | 469.8 | 846.7 | 1.80 | 7.63 | 7.66 |
Training performance: NVIDIA DGX-1 (8x V100 32G)

Our results were obtained by running the scripts/run_pretraining.sh and scripts/run_squad.sh training scripts in the pytorch:21.11-py3 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs. Performance numbers (in sequences per second) were averaged over a few training iterations.

Pre-training NVIDIA DGX-1 With 32G
| GPUs | Batch size / GPU (FP32 and FP16) | Accumulation steps (FP32 and FP16) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|
| 1 | 4096 and 4096 | 128 and 64 | 128 | 50 | 224 | 4.48 | 1.00 | 1.00 |
| 8 | 4096 and 4096 | 128 and 64 | 128 | 387 | 1746 | 4.51 | 7.79 | 7.79 |
| 1 | 2048 and 2048 | 512 and 256 | 512 | 19 | 75 | 3.94 | 1.00 | 1.00 |
| 8 | 2048 and 2048 | 512 and 256 | 512 | 149.6 | 586 | 3.92 | 7.87 | 7.81 |
Fine-tuning NVIDIA DGX-1 With 32G
  • SQuAD
| GPUs | Batch size / GPU (FP32 and FP16) | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 8 and 16 | 12 | 52 | 4.33 | 1.00 | 1.00 |
| 8 | 8 and 16 | 85.5 | 382 | 4.47 | 7.12 | 7.34 |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running scripts/run_squad.sh in the pytorch:21.11-py3 NGC container on NVIDIA DGX A100 with (1x A100 80G) GPUs.

Fine-tuning inference on NVIDIA DGX A100 (1x A100 80GB)
  • SQuAD
| GPUs | Batch Size (TF32/FP16) | Sequence Length | Throughput - TF32 (sequences/sec) | Throughput - Mixed Precision (sequences/sec) |
|---|---|---|---|---|
| 1 | 32/32 | 384 | 216 | 312 |

To achieve these same results, follow the steps in the Quick Start Guide.

The inference performance metrics used were sequences/second.
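For readers who want to measure sequences/second on their own workload, a generic timing loop is sketched below. It is not the measurement code used by scripts/run_squad.sh. The function `measure_throughput` and the stand-in workload `fake_batch` are hypothetical, and the warm-up count is an arbitrary choice.

```python
import time


def measure_throughput(run_batch, batch_size: int, num_batches: int, warmup: int = 2) -> float:
    """Return sequences processed per second over `num_batches` timed calls."""
    # Untimed warm-up calls so one-off startup costs do not skew the result.
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed


def fake_batch():
    # Stand-in for one forward pass over a batch; replace with real inference.
    sum(i * i for i in range(10_000))


throughput = measure_throughput(fake_batch, batch_size=32, num_batches=20)
print(f"{throughput:.1f} sequences/sec")
```

When timing GPU work in PyTorch, kernels run asynchronously, so the device should be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock; otherwise the measured time may cover only kernel launches.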
