The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring model performance in training and inference modes.
Both benchmarking scripts run the BERT fine-tuning workload for a number of epochs and extract performance numbers.
Training benchmarking can be performed by running the script:
```
biobert/scripts/biobert_finetune_training_benchmark.sh <task> <num_gpu> <bert_model> <cased>
```
This script runs 2 epochs by default on the NER BC5CDR dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved to `/results/tf_bert_biobert_<task>_training_benchmark__<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>`.
Inference benchmarking can be performed by running the script:
```
biobert/scripts/biobert_finetune_inference_benchmark.sh <task> <bert_model> <cased>
```
This script runs inference on the test and dev sets and extracts performance and latency numbers for various batch sizes and sequence lengths, in FP16 with XLA and FP32 without XLA. These numbers are saved to `/results/tf_bert_biobert_<task>_inference_benchmark__<bert_model>_<cased/uncased>_<DATESTAMP>`.
The following sections provide detailed results for the downstream fine-tuning tasks on the NER and RE benchmarks.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container.
DGX System | Nodes | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
---|---|---|---|---|---|---|
DGX-2H | 4 | FP16 | 128, 16 | 8, 32 | 19.14 | 0.88 |
DGX-2H | 16 | FP16 | 128, 16 | 2, 8 | 4.81 | 0.86 |
DGX-2H | 32 | FP16 | 128, 16 | 1, 4 | 2.65 | 0.87 |
DGX-1 | 1 | FP16 | 64, 8 | 128, 512 | 174.58 | 0.87 |
DGX-1 | 4 | FP16 | 64, 8 | 32, 128 | 57.71 | 0.85 |
DGX-1 | 16 | FP16 | 64, 8 | 8, 32 | 12.62 | 0.87 |
DGX-1 | 32 | FP16 | 64, 8 | 4, 16 | 6.97 | 0.87 |
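Scaling efficiency for these pretraining runs follows directly from the time-to-train column: speedup is the single-node time divided by the N-node time, and efficiency is that speedup divided by N. A quick sketch using the DGX-1 FP16 figures copied from the table above:

```python
# Node count -> time to train in hours, from the DGX-1 FP16 rows above.
times = {1: 174.58, 4: 57.71, 16: 12.62, 32: 6.97}

for nodes, hours in times.items():
    speedup = times[1] / hours        # relative to the single-node run
    efficiency = speedup / nodes      # 1.0 would be perfectly linear scaling
    print(f"{nodes:2d} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

At 32 nodes the run is roughly 25x faster than a single node, i.e. about 78% scaling efficiency.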
Task | F1 | Precision | Recall |
---|---|---|---|
NER BC5CDR-chemical | 93.47 | 93.03 | 93.91 |
NER BC5CDR-disease | 86.22 | 85.05 | 87.43 |
RE Chemprot | 76.27 | 77.62 | 74.98 |
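The F1 column above is the harmonic mean of the precision and recall columns, so the rows can be cross-checked with a few lines of Python (scores copied from the table; small last-digit differences can arise from rounding in the reported precision/recall):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# NER BC5CDR-chemical row from the table above
print(round(f1(93.03, 93.91), 2))  # -> 93.47
```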
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container.
DGX System | Batch size / GPU | F1 - FP32 | F1- mixed precision | Time to Train - FP32 (Minutes) | Time to Train - mixed precision (Minutes) |
---|---|---|---|---|---|
DGX-1 16G | 64 | 93.33 | 93.40 | 23.95 | 14.13 |
DGX-1 32G | 64 | 93.31 | 93.36 | 24.35 | 12.63 |
DGX-2 32G | 64 | 93.66 | 93.47 | 12.26 | 8.16 |
The following table compares F1 scores across 5 training runs on the NER Chemical task with different seeds, for both FP16 and FP32. The runs show consistent convergence on all 5 seeds with very little deviation.
16 x V100 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 Score (FP16) | 93.13 | 92.92 | 93.34 | 93.66 | 93.47 | 93.3 | 0.29 |
F1 Score (FP32) | 93.1 | 93.28 | 93.33 | 93.45 | 93.17 | 93.27 | 0.14 |
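The mean and std columns follow directly from the five per-seed scores; the std column matches the sample standard deviation (ddof = 1). A quick check with the standard library:

```python
import statistics

# Per-seed F1 scores copied from the table above
fp16 = [93.13, 92.92, 93.34, 93.66, 93.47]
fp32 = [93.10, 93.28, 93.33, 93.45, 93.17]

for name, scores in [("FP16", fp16), ("FP32", fp32)]:
    # statistics.stdev is the sample standard deviation (divides by n - 1)
    print(name, round(statistics.mean(scores), 2), round(statistics.stdev(scores), 2))
```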
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 systems with 8x V100 16G GPUs. Performance (in sentences per second) is the steady-state throughput.
Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|
1 | 128 | 64,32 | 2762.06 | 744.48 | 3.71 | 1.00 | 1.00 |
4 | 128 | 64,32 | 10283.08 | 2762.88 | 3.72 | 3.72 | 3.71 |
16 | 128 | 64,32 | 39051.69 | 10715.14 | 3.64 | 14.14 | 14.39 |
32 | 128 | 64,32 | 76077.39 | 21104.87 | 3.60 | 27.54 | 28.35 |
1 | 512 | 8,8 | 432.33 | 160.38 | 2.70 | 1.00 | 1.00 |
4 | 512 | 8,8 | 1593.00 | 604.36 | 2.64 | 3.68 | 3.77 |
16 | 512 | 8,8 | 5941.82 | 2356.44 | 2.52 | 13.74 | 14.69 |
32 | 512 | 8,8 | 11483.73 | 4631.29 | 2.48 | 26.56 | 28.88 |
Note: Values for FP32 runs with batch sizes of 16 and 2 (for sequence lengths 128 and 512, respectively) are not available due to out-of-memory errors.
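The speedup and weak-scaling columns in the throughput tables are plain ratios: speedup is mixed-precision throughput over FP32 throughput at the same node count, and weak scaling is throughput at N nodes over throughput at 1 node at the same precision. A sketch using the sequence-length-128 figures copied from the table above:

```python
# Node count -> (mixed-precision, FP32) throughput in sentences/s,
# from the sequence-length-128 rows of the table above.
throughput = {1: (2762.06, 744.48), 4: (10283.08, 2762.88),
              16: (39051.69, 10715.14), 32: (76077.39, 21104.87)}

mp_1, fp32_1 = throughput[1]
for nodes, (mp, fp32) in throughput.items():
    print(f"{nodes:2d} nodes: speedup {mp / fp32:.2f}, "
          f"weak scaling {mp / mp_1:.2f} (mixed), {fp32 / fp32_1:.2f} (FP32)")
```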
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 64 | 147.71 | 348.84 | 2.36 | 1.00 | 1.00 |
4 | 64 | 583.78 | 1145.46 | 1.96 | 3.95 | 3.28 |
8 | 64 | 981.22 | 1964.85 | 2.00 | 6.64 | 5.63 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 64 | 144.1 | 417.39 | 2.89 | 1.00 | 1.00 |
4 | 64 | 525.15 | 1354.14 | 2.57 | 3.64 | 3.24 |
8 | 64 | 969.4 | 2341.39 | 2.41 | 6.73 | 5.61 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H systems with 16x V100 32G GPUs. Performance (in sentences per second) is the steady-state throughput.
Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|
1 | 128 | 128,128 | 7772.18 | 2165.04 | 3.59 | 1.00 | 1.00 |
4 | 128 | 128,128 | 29785.31 | 8516.90 | 3.50 | 3.83 | 3.93 |
16 | 128 | 128,128 | 115581.29 | 33699.15 | 3.43 | 14.87 | 15.57 |
32 | 128 | 128,128 | 226156.53 | 66996.73 | 3.38 | 29.10 | 30.94 |
64 | 128 | 128,128 | 444955.74 | 133424.95 | 3.33 | 57.25 | 61.63 |
1 | 512 | 16,16 | 1260.06 | 416.92 | 3.02 | 1.00 | 1.00 |
4 | 512 | 16,16 | 4781.19 | 1626.76 | 2.94 | 3.79 | 3.90 |
16 | 512 | 16,16 | 18405.65 | 6418.09 | 2.87 | 14.61 | 15.39 |
32 | 512 | 16,16 | 36071.06 | 12713.67 | 2.84 | 28.63 | 30.49 |
64 | 512 | 16,16 | 69950.86 | 25245.96 | 2.77 | 55.51 | 60.55 |
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 64 | 139.59 | 475.54 | 3.4 | 1.00 | 1.00 |
4 | 64 | 517.08 | 1544.01 | 2.98 | 3.70 | 3.25 |
8 | 64 | 1009.84 | 2695.34 | 2.66 | 7.23 | 5.67 |
16 | 64 | 1997.73 | 4268.81 | 2.13 | 14.31 | 8.98 |
To achieve these same results, follow the Quick Start Guide outlined above.