ELECTRA for TensorFlow2

Description

ELECTRA is a method of pre-training language representations which outperforms existing techniques on a wide array of NLP tasks.

Publisher

NVIDIA

Use Case

NLP

Framework

TensorFlow2

Latest Version

20.07.4

Modified

November 12, 2021

Compressed Size

1.03 MB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring model performance in training and inference modes.

Training performance benchmark

Training performance benchmarks for both pre-training phases can be obtained by running scripts/benchmark_pretraining.sh. The default parameters run a few training steps using the convergence configuration for an NVIDIA DGX A100 system.

To benchmark training performance with other parameters, run:

bash scripts/benchmark_pretraining.sh <train_batch_size_p1> <amp|tf32|fp32> <xla|no_xla> <num_gpus> <accumulate_gradients=true|false> <gradient_accumulation_steps_p1> <train_batch_size_p2> <gradient_accumulation_steps_p2> <base> 

An example call used to generate throughput numbers:

bash scripts/benchmark_pretraining.sh 88 amp xla 8 true 2 12 4 base
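
The positional arguments in this call map onto the signature above as follows (a commented restatement; the values are unchanged):

# train_batch_size_p1=88, precision=amp, xla=xla, num_gpus=8,
# accumulate_gradients=true, gradient_accumulation_steps_p1=2,
# train_batch_size_p2=12, gradient_accumulation_steps_p2=4, model=base
bash scripts/benchmark_pretraining.sh 88 amp xla 8 true 2 12 4 base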

Training performance benchmarks for fine-tuning can be obtained by running scripts/benchmark_squad.sh. The required parameters can be passed through the command line as described in the Training process section. The performance information is printed after 200 training iterations.

To benchmark the training performance on a specific batch size, run:

bash scripts/benchmark_squad.sh train <num_gpus> <batch size> <infer_batch_size> <amp|tf32|fp32> <SQuAD version> <path to SQuAD dataset> <results directory> <checkpoint_to_load> <cache_dir>

An example call used to generate throughput numbers:

bash scripts/benchmark_squad.sh train 8 16
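
A fuller call that supplies every parameter in the signature above might look like the following sketch; the dataset, results, checkpoint, and cache paths are placeholders, and the SQuAD version value is illustrative:

bash scripts/benchmark_squad.sh train 8 16 16 amp 1.1 /path/to/squad /path/to/results /path/to/checkpoint /path/to/cache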

Inference performance benchmark

Inference performance benchmarks for fine-tuning can be obtained by running scripts/benchmark_squad.sh. The required parameters can be passed through the command line as described in the Inference process section. By default, this script runs one epoch on the SQuAD v1.1 dataset and reports the average performance for the given configuration.

To benchmark the inference performance on a specific batch size, run:

bash scripts/benchmark_squad.sh eval <num_gpus> <batch size> <infer_batch_size> <amp|fp32> <SQuAD version> <path to SQuAD dataset> <results directory> <checkpoint_to_load> <cache_dir>

An example call used to generate throughput numbers:

bash scripts/benchmark_squad.sh eval 8 256
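
Assuming eval mode takes the same positional arguments as train (the paths are again placeholders), a fully specified call would look like:

bash scripts/benchmark_squad.sh eval 8 256 256 amp 1.1 /path/to/squad /path/to/results /path/to/checkpoint /path/to/cache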

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference. All results are for the ELECTRA-base model on the SQuAD v1.1 dataset with a sequence length of 384 unless otherwise noted.

Training accuracy results

Pre-training loss curves

[Figure: pre-training loss curves. Phase 1 is shown by the blue curve and Phase 2 by the grey; the y-axis shows the total loss and the x-axis the total steps trained.]

Pre-training loss results

| DGX System | GPUs | Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32/FP32 | Final Loss - mixed precision | Time to train (hours) - TF32/FP32 | Time to train (hours) - mixed precision | Time to train speedup (TF32/FP32 to mixed precision) |
|---|---|---|---|---|---|---|---|---|
| 48 x DGX A100 | 8 | 176 and 24 | 1 and 3 | 8.686 | 8.68 | 1.61 | 1.126 | 1.43 |
| 24 x DGX-2H | 16 | 176 and 24 | 1 and 3 | 8.72 | 8.67 | 5.58 | 1.74 | 3.20 |
| 1 x DGX A100 | 8 | 176 and 24 | 48 and 144 | - | - | 54.84 | 30.47 | 1.8 |
| 1 x DGX-1 16G | 8 | 88 and 12 | 96 and 288 | - | - | 241.8 | 65.1 | 3.71 |
| 1 x DGX-2 32G | 16 | 176 and 24 | 24 and 72 | - | - | 109.97 | 29.08 | 3.78 |

In the table above, FP32 and TF32 runs used half the per-GPU batch size and twice the gradient accumulation steps of the corresponding mixed-precision run in order not to run out of memory.
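
The effective (global) batch size is therefore the same for both precisions, since it is the product of per-GPU batch size, number of GPUs, and accumulation steps. A quick check for the single-node DGX A100 Phase 1 row, taking the table's values as the mixed-precision configuration:

# effective batch = per-GPU batch x number of GPUs x accumulation steps
echo $((176 * 8 * 48))   # mixed precision: 67584
echo $((88 * 8 * 96))    # TF32 at half the batch, twice the accumulation: 67584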

The SQuAD fine-tuning scripts train by default on Google's ELECTRA++ base pre-trained checkpoint, which uses roughly 10x as much training data (the dataset used by the XLNet authors) and more than 5x as many training steps as the recipe in scripts/run_pretraining.sh. The latter trains on the Wikipedia and BookCorpus datasets only and still achieves state-of-the-art accuracy.

Fine-tuning accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the scripts/run_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
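
As a sketch of how such a run is launched (the repository mount path is illustrative; the container tag is the one named above):

docker run --gpus all -it --rm \
  -v /path/to/electra/repo:/workspace/electra \
  nvcr.io/nvidia/tensorflow:20.07-tf2-py3
# inside the container, with parameters as described in the Quick Start Guide:
bash scripts/run_squad.sh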

ELECTRA BASE++

| GPUs | Batch size / GPU | Accuracy / F1 - TF32 | Accuracy / F1 - mixed precision | Time to train - TF32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|
| 1 | 32 | 87.19 / 92.85 | 87.19 / 92.84 | 1699 | 749 | 2.27 |
| 8 | 32 | 86.84 / 92.57 | 86.83 / 92.56 | 263 | 201 | 1.30 |

Fine-tuning accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/run_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.

ELECTRA BASE++

| GPUs | Batch size / GPU (FP32 : mixed precision) | Accuracy / F1 - FP32 | Accuracy / F1 - mixed precision | Time to train - FP32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|
| 1 | 8 : 16 | 87.36 / 92.82 | 87.32 / 92.74 | 5136 | 1378 | 3.73 |
| 8 | 8 : 16 | 87.02 / 92.73 | 87.02 / 92.72 | 730 | 334 | 2.18 |

ELECTRA BASE checkpoint pre-trained on Wikipedia and BookCorpus

| GPUs | SQuAD version | Batch size / GPU (FP32 : mixed precision) | Accuracy / F1 - FP32 | Accuracy / F1 - mixed precision | Time to train - FP32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| 8 | v1.1 | 8 : 16 | 85.00 / 90.94 | 85.04 / 90.96 | 5136 | 1378 | 3.73 |
| 8 | v2.0 | 8 : 16 | 80.517 / 83.36 | 80.523 / 83.43 | 730 | 334 | 2.18 |

Fine-tuning accuracy: NVIDIA DGX-2 (16x V100 32GB)

Our results were obtained by running the scripts/run_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs.

ELECTRA BASE++

| GPUs | Batch size / GPU | Accuracy / F1 - FP32 | Accuracy / F1 - mixed precision | Time to train - FP32 (sec) | Time to train - mixed precision (sec) | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|
| 1 | 32 | 87.14 / 92.69 | 86.95 / 92.69 | 4478 | 1162 | 3.85 |
| 16 | 32 | 86.95 / 90.58 | 86.93 / 92.48 | 333 | 229 | 1.45 |

Training stability test
Pre-training stability test: NVIDIA DGX A100 (8x A100 40GB)

ELECTRA BASE pre-trained on Wikipedia and BookCorpus

Training stability with 48 x DGX A100, TF32 computations and loss reported after Phase 2:

| Accuracy Metric | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|
| Final Loss | 8.72 | 8.69 | 8.71 | 8.7 | 8.68 | 8.7 | 0.015 |

Fine-tuning stability test: NVIDIA DGX-1 (8x V100 16GB)

ELECTRA BASE++

Training stability with 8 GPUs, FP16 computations, batch size of 16 on SQuAD v1.1:

| Accuracy Metric | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|
| Exact Match % | 86.99 | 86.81 | 86.95 | 87.10 | 87.26 | 87.02 | 0.17 |
| F1 % | 92.7 | 92.66 | 92.65 | 92.61 | 92.97 | 92.72 | 0.14 |

Training stability with 8 GPUs, FP16 computations, batch size of 16 on SQuAD v2.0:

| Accuracy Metric | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard Deviation |
|---|---|---|---|---|---|---|---|
| Exact Match % | 83.00 | 82.84 | 83.11 | 82.70 | 82.94 | 82.91 | 0.15 |
| F1 % | 85.63 | 85.48 | 85.69 | 85.31 | 85.57 | 85.54 | 0.15 |

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the scripts/benchmark_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in sequences per second) were averaged over an entire training epoch.

Pre-training NVIDIA DGX A100 (8x A100 40GB)

| GPUs | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|
| 1 | 88 and 176 | 768 and 384 | 128 | 533 | 955 | 1.79 | 1.00 | 1.00 |
| 8 | 88 and 176 | 96 and 48 | 128 | 4202 | 7512 | 1.79 | 7.88 | 7.87 |
| 1 | 12 and 24 | 2304 and 1152 | 512 | 90 | 171 | 1.90 | 1.00 | 1.00 |
| 8 | 12 and 24 | 288 and 144 | 512 | 716 | 1347 | 1.88 | 7.96 | 7.88 |

Fine-tuning NVIDIA DGX A100 (8x A100 40GB)

| GPUs | Batch size / GPU | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|
| 1 | 32 | 384 | 107 | 317 | 2.96 | 1.00 | 1.00 |
| 8 | 32 | 384 | 828 | 2221 | 2.68 | 7.74 | 7.00 |
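
The weak-scaling columns are simply the ratio of multi-GPU to single-GPU throughput; for example, reproducing the 8-GPU TF32 figure from the fine-tuning table above:

awk 'BEGIN { printf "%.2f\n", 828 / 107 }'   # prints 7.74
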
Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the scripts/benchmark_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in sequences per second) were averaged over an entire training epoch.

Pre-training NVIDIA DGX-1 (8x V100 16GB)

| GPUs | Batch size / GPU (FP32 and FP16) | Accumulation steps (FP32 and FP16) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|
| 1 | 40 and 88 | 1689 and 768 | 128 | 116 | 444 | 3.83 | 1.00 | 1.00 |
| 8 | 40 and 88 | 211 and 96 | 128 | 920 | 3475 | 3.77 | 7.93 | 7.83 |
| 1 | 6 and 12 | 4608 and 2304 | 512 | 24 | 84 | 3.50 | 1.00 | 1.00 |
| 8 | 6 and 12 | 576 and 288 | 512 | 190 | 656 | 3.45 | 7.92 | 7.81 |

Fine-tuning NVIDIA DGX-1 (8x V100 16GB)

| GPUs | Batch size / GPU (FP32 : mixed precision) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|
| 1 | 8 : 16 | 384 | 35 | 154 | 4.4 | 1.00 | 1.00 |
| 8 | 8 : 16 | 384 | 268 | 1051 | 3.92 | 7.66 | 6.82 |

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-2 (16x V100 32GB)

Our results were obtained by running the scripts/benchmark_squad.sh training script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs. Performance numbers (in sequences per second) were averaged over an entire training epoch.

Pre-training NVIDIA DGX-2 (16x V100 32GB)

| GPUs | Batch size / GPU (FP32 and FP16) | Accumulation steps (FP32 and FP16) | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|---|
| 1 | 88 and 176 | 768 and 384 | 128 | 128 | 500 | 3.91 | 1.00 | 1.00 |
| 8 | 88 and 176 | 96 and 48 | 128 | 1011 | 3916 | 3.87 | 7.90 | 7.83 |
| 16 | 88 and 176 | 48 and 24 | 128 | 2018 | 7773 | 3.85 | 15.77 | 15.55 |
| 1 | 12 and 24 | 2304 and 1152 | 512 | 27 | 96 | 3.55 | 1.00 | 1.00 |
| 8 | 12 and 24 | 288 and 144 | 512 | 213 | 754 | 3.54 | 7.89 | 7.85 |
| 16 | 12 and 24 | 144 and 72 | 512 | 426 | 1506 | 3.54 | 15.78 | 15.69 |

Fine-tuning NVIDIA DGX-2 (16x V100 32GB)

| GPUs | Batch size / GPU | Sequence length | Throughput - FP32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|
| 1 | 16 | 384 | 40 | 184 | 4.6 | 1.00 | 1.00 |
| 8 | 16 | 384 | 311 | 1289 | 4.14 | 7.77 | 7.00 |
| 16 | 16 | 384 | 626 | 2594 | 4.14 | 15.65 | 14.09 |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 40GB)

Our results were obtained by running the scripts/benchmark_squad.sh inference benchmarking script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.

Fine-tuning inference on NVIDIA DGX A100 (1x A100 40GB)

FP16

| Batch size | Sequence length | Throughput Avg (sequences/sec) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
|---|---|---|---|---|---|---|
| 1 | 384 | 166 | 6.035 | 5.995 | 6.013 | 6.029 |
| 256 | 384 | 886 | 276.26 | 274.53 | 275.276 | 275.946 |
| 512 | 384 | 886 | 526.5 | 525.014 | 525.788 | 525.788 |

TF32

| Batch size | Sequence length | Throughput Avg (sequences/sec) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
|---|---|---|---|---|---|---|
| 1 | 384 | 122 | 8.228 | 8.171 | 8.198 | 8.221 |
| 256 | 384 | 342 | 729.293 | 727.990 | 728.505 | 729.027 |
| 512 | 384 | 350 | 1429.314 | 1427.719 | 1428.550 | 1428.550 |

Inference performance: NVIDIA T4

Our results were obtained by running the scripts/benchmark_squad.sh script in the tensorflow:20.07-tf2-py3 NGC container on NVIDIA Tesla T4 (1x T4 16GB) GPU.

Fine-tuning inference on NVIDIA T4

FP16

| Batch size | Sequence length | Throughput Avg (sequences/sec) | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
|---|---|---|---|---|---|---|
| 1 | 384 | 58 | 17.413 | 17.295 | 17.349 | 17.395 |
| 128 | 384 | 185 | 677.298 | 675.211 | 675.674 | 676.269 |
| 256 | 384 | 169 | 1451.396 | 1445.070 | 1447.654 | 1450.141 |

To achieve these same results, follow the steps in the Quick Start Guide.