
BERT for PyTorch

Description

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

Publisher: NVIDIA Deep Learning Examples
Latest Version: 21.11.0
Modified: April 4, 2023
Compressed Size: 66.59 KB

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

Training performance benchmarks for pre-training can be obtained by running scripts/run_pretraining.sh, and for fine-tuning can be obtained by running scripts/run_squad.sh or scripts/run_glue.sh for SQuAD or GLUE, respectively. The required parameters can be passed through the command-line as described in Training process.

As an example, to benchmark the training performance on a specific batch size for SQuAD, run: bash scripts/run_squad.sh <pre-trained checkpoint path> <epochs> <batch size> <learning rate> <fp16|fp32> <num_gpus> <seed> <path to SQuAD dataset> <path to vocab set> <results directory> train <BERT config path> <max steps>

An example call used to generate throughput numbers: bash scripts/run_squad.sh /workspace/bert/bert_large_uncased.pt 2.0 4 3e-5 fp16 8 42 /workspace/bert/squad_data /workspace/bert/scripts/vocab/vocab /results/SQuAD train /workspace/bert/bert_config.json -1
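
To collect throughput numbers for several configurations in one go, the same interface can be wrapped in a small loop. The sketch below is a minimal, hypothetical example that assumes the /workspace/bert container layout from the example call above; the batch sizes and result directories are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical sweep over precision and batch size for the SQuAD training benchmark.
# Paths assume the /workspace/bert layout from the example call above; adjust as needed.
set -euo pipefail

CHECKPOINT=/workspace/bert/bert_large_uncased.pt
SQUAD_DIR=/workspace/bert/squad_data
VOCAB=/workspace/bert/scripts/vocab/vocab
CONFIG=/workspace/bert/bert_config.json

for precision in fp16 fp32; do
  for bs in 4 8 16; do  # illustrative batch sizes
    bash scripts/run_squad.sh "$CHECKPOINT" 2.0 "$bs" 3e-5 "$precision" 8 42 \
      "$SQUAD_DIR" "$VOCAB" "/results/SQuAD_${precision}_bs${bs}" train "$CONFIG" -1
  done
done
```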

Inference performance benchmark

Inference performance benchmarks for fine-tuning can be obtained by running scripts/run_squad.sh and scripts/run_glue.sh for SQuAD and GLUE, respectively. The required parameters can be passed through the command-line as described in Inference process.

As an example, to benchmark the inference performance on a specific batch size for SQuAD, run: bash scripts/run_squad.sh <pre-trained checkpoint path> <epochs> <batch size> <learning rate> <fp16|fp32> <num_gpus> <seed> <path to SQuAD dataset> <path to vocab set> <results directory> eval <BERT config path> <max steps>

An example call used to generate throughput numbers: bash scripts/run_squad.sh /workspace/bert/bert_large_uncased.pt 2.0 4 3e-5 fp16 8 42 /workspace/bert/squad_data /workspace/bert/scripts/vocab/vocab /results/SQuAD eval /workspace/bert/bert_config.json -1
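
As with training, the benchmark can be repeated over several batch sizes. A minimal, hypothetical sketch, assuming the same paths as the example call above; batch sizes and result directories are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical inference-benchmark sweep over batch size (eval mode, mixed precision).
# Paths assume the /workspace/bert layout from the example call above; adjust as needed.
set -euo pipefail

CHECKPOINT=/workspace/bert/bert_large_uncased.pt
SQUAD_DIR=/workspace/bert/squad_data
VOCAB=/workspace/bert/scripts/vocab/vocab
CONFIG=/workspace/bert/bert_config.json

for bs in 1 4 8 32; do  # illustrative batch sizes
  bash scripts/run_squad.sh "$CHECKPOINT" 2.0 "$bs" 3e-5 fp16 8 42 \
    "$SQUAD_DIR" "$VOCAB" "/results/SQuAD_eval_bs${bs}" eval "$CONFIG" -1
done
```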

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Our results were obtained by running the scripts/run_squad.sh and scripts/run_pretraining.sh training scripts in the pytorch:21.11-py3 NGC container unless otherwise specified.

Pre-training loss results: NVIDIA DGX A100 (8x A100 80GB)
| DGX System | GPUs / Node | Batch size / GPU (Phase 1 and Phase 2) | Accumulated batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final loss - TF32 | Final loss - mixed precision | Time to train (hours) - TF32 | Time to train (hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|---|---|---|
| 32 x DGX A100 80GB | 8 | 256 and 32 | 256 and 128 | 1 and 4 | --- | 1.2437 | --- | 1.2 | 1.9 |
| 32 x DGX A100 80GB | 8 | 128 and 16 | 256 and 128 | 2 and 8 | 1.2465 | --- | 2.4 | --- | --- |
Pre-training loss curves

[Figure: pre-training loss curves]

Fine-tuning accuracy results: NVIDIA DGX A100 (8x A100 80GB)
  • SQuAD
| GPUs | Batch size / GPU (TF32 and FP16) | Accuracy - TF32 (% F1) | Accuracy - mixed precision (% F1) | Time to train (hours) - TF32 | Time to train (hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 8 | 32 | 90.93 | 90.96 | 0.102 | 0.0574 | 1.78 |
  • MRPC
| GPUs | Batch size / GPU (TF32 and FP16) | Accuracy - TF32 (%) | Accuracy - mixed precision (%) | Time to train (seconds) - TF32 | Time to train (seconds) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 8 | 16 | 87.25 | 88.24 | 17.26 | 7.31 | 2.36 |
  • SST-2
| GPUs | Batch size / GPU (TF32 and FP16) | Accuracy - TF32 (%) | Accuracy - mixed precision (%) | Time to train (seconds) - TF32 | Time to train (seconds) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 8 | 128 | 91.97 | 92.78 | 119.28 | 62.59 | 1.91 |
Training stability test
Pre-training stability test
| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|
| Final Loss | 1.260 | 1.265 | 1.304 | 1.256 | 1.242 | 1.265 | 0.023 |
Fine-tuning stability test
  • SQuAD

Training stability with 8 GPUs, FP16 computations, batch size of 4:

| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 | Seed 7 | Seed 8 | Seed 9 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact Match % | 83.64 | 84.05 | 84.51 | 83.69 | 83.87 | 83.94 | 84.27 | 83.97 | 83.75 | 83.92 | 83.96 | 0.266 |
| F1 % | 90.60 | 90.65 | 90.96 | 90.44 | 90.58 | 90.78 | 90.81 | 90.82 | 90.51 | 90.68 | 90.68 | 0.160 |
  • MRPC

Training stability with 8 A100 GPUs, FP16 computations, batch size of 16 per GPU:

| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 | Seed 7 | Seed 8 | Seed 9 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy % | 85.78 | 85.54 | 84.56 | 86.27 | 84.07 | 86.76 | 87.01 | 85.29 | 88.24 | 86.52 | 86.00 | 1.225 |

Note: MRPC is a very small dataset, so overfitting occurs easily and the validation accuracy can show high variance. Repeating the above experiment for 100 seeds gives a maximum accuracy of 88.73 and an average accuracy of 82.56 with a standard deviation of 6.01.

  • SST-2

Training stability with 8 A100 GPUs, FP16 computations, batch size of 128 per GPU:

| Accuracy Metric | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 | Seed 7 | Seed 8 | Seed 9 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy % | 91.86 | 91.28 | 91.86 | 91.74 | 91.28 | 91.86 | 91.40 | 91.97 | 91.40 | 92.78 | 91.74 | 0.449 |
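
The Mean and Standard Deviation columns in the stability tables above are simple per-seed aggregates; the reported values are consistent with the sample standard deviation (n - 1). A minimal sketch of that reduction, using the SQuAD F1 row from the table above as input:

```bash
# Reduce per-seed scores to their mean and sample standard deviation (n - 1).
# The values are the SQuAD F1 numbers from the stability table above.
printf '%s\n' 90.60 90.65 90.96 90.44 90.58 90.78 90.81 90.82 90.51 90.68 |
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "mean=%.2f stddev=%.3f\n", m, sqrt((ss - n * m * m) / (n - 1)) }'
```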

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/run_pretraining.sh training script in the pytorch:21.11-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in sequences per second) were averaged over a few training iterations.

Pre-training NVIDIA DGX A100 (8x A100 80GB)
| GPUs | Batch size / GPU (TF32 and FP16) | Accumulated batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 317 | 580 | 1.83 | 1.00 | 1.00 |
| 8 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 2505 | 4591 | 1.83 | 7.90 | 7.91 |
| 1 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 110 | 210 | 1.90 | 1.00 | 1.00 |
| 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 860 | 1657 | 1.92 | 7.81 | 7.89 |
Pre-training NVIDIA DGX A100 (8x A100 80GB) Multi-node Scaling
| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulated batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed precision throughput | Mixed precision strong scaling | TF32 throughput | TF32 strong scaling | Speedup (mixed precision over TF32) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 4553 | 1 | 2486 | 1 | 1.83 |
| 2 | 8 | 128 and 256 | 4096 and 4096 | 32 and 16 | 128 | 9191 | 2.02 | 4979 | 2.00 | 1.85 |
| 4 | 8 | 128 and 256 | 2048 and 2048 | 16 and 8 | 128 | 18119 | 3.98 | 9859 | 3.97 | 1.84 |
| 8 | 8 | 128 and 256 | 1024 and 1024 | 8 and 4 | 128 | 35774 | 7.86 | 19815 | 7.97 | 1.81 |
| 16 | 8 | 128 and 256 | 512 and 512 | 4 and 2 | 128 | 70555 | 15.50 | 38866 | 15.63 | 1.82 |
| 32 | 8 | 128 and 256 | 256 and 256 | 2 and 1 | 128 | 138294 | 30.37 | 75706 | 30.45 | 1.83 |
| 1 | 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 1648 | 1 | 854 | 1 | 1.93 |
| 2 | 8 | 16 and 32 | 2048 and 2048 | 128 and 64 | 512 | 3291 | 2.00 | 1684 | 1.97 | 1.95 |
| 4 | 8 | 16 and 32 | 1024 and 1024 | 64 and 32 | 512 | 6464 | 3.92 | 3293 | 3.86 | 1.96 |
| 8 | 8 | 16 and 32 | 512 and 512 | 32 and 16 | 512 | 13005 | 7.89 | 6515 | 7.63 | 2.00 |
| 16 | 8 | 16 and 32 | 256 and 256 | 16 and 8 | 512 | 25570 | 15.51 | 12131 | 14.21 | 2.11 |
| 32 | 8 | 16 and 32 | 128 and 128 | 8 and 4 | 512 | 49663 | 30.13 | 21298 | 24.95 | 2.33 |
Fine-tuning NVIDIA DGX A100 (8x A100 80GB)
  • SQuAD
| GPUs | Batch size / GPU (TF32 and FP16) | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 32 and 32 | 61.5 | 110.5 | 1.79 | 1.00 | 1.00 |
| 8 | 32 and 32 | 469.8 | 846.7 | 1.80 | 7.63 | 7.66 |
Training performance: NVIDIA DGX-1 (8x V100 32G)

Our results were obtained by running the scripts/run_pretraining.sh and scripts/run_squad.sh training scripts in the pytorch:21.11-py3 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs. Performance numbers (in sequences per second) were averaged over a few training iterations.

Pre-training NVIDIA DGX-1 With 32G
| GPUs | Accumulated batch size / GPU (FP32 and FP16) | Accumulation steps (FP32 and FP16) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|
| 1 | 4096 and 4096 | 128 and 64 | 128 | 50 | 224 | 4.48 | 1.00 | 1.00 |
| 8 | 4096 and 4096 | 128 and 64 | 128 | 387 | 1746 | 4.51 | 7.79 | 7.79 |
| 1 | 2048 and 2048 | 512 and 256 | 512 | 19 | 75 | 3.94 | 1.00 | 1.00 |
| 8 | 2048 and 2048 | 512 and 256 | 512 | 149.6 | 586 | 3.92 | 7.87 | 7.81 |
Fine-tuning NVIDIA DGX-1 With 32G
  • SQuAD
| GPUs | Batch size / GPU (FP32 and FP16) | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 8 and 16 | 12 | 52 | 4.33 | 1.00 | 1.00 |
| 8 | 8 and 16 | 85.5 | 382 | 4.47 | 7.12 | 7.34 |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running scripts/run_squad.sh in the pytorch:21.11-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB).

Fine-tuning inference on NVIDIA DGX A100 (1x A100 80GB)
  • SQuAD
| GPUs | Batch size (TF32/FP16) | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) |
|---|---|---|---|---|
| 1 | 32/32 | 384 | 216 | 312 |

To achieve these same results, follow the steps in the Quick Start Guide.

The inference performance metrics used were sequences/second.