
BioBERT for TensorFlow1


Description: BERT for biomedical text-mining.
Publisher: NVIDIA Deep Learning Examples
Use Case: Language Modeling
Framework: Other
Latest Version: 20.06.0
Modified: November 4, 2022
Compressed Size: 33.71 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Both benchmarking scripts fine-tune the BERT model for a set number of epochs and extract the resulting performance numbers.

Training performance benchmark

Training benchmarking can be performed by running the script:

biobert/scripts/biobert_finetune_training_benchmark.sh <task> <num_gpu> <bert_model> <cased>

This script runs 2 epochs by default on the NER BC5CDR dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved at /results/tf_bert_biobert_<task>_training_benchmark__<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>
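
For example, an illustrative run of the training benchmark on the NER BC5CDR-chemical task with 8 GPUs and the uncased base model might look like the line below; the argument values shown are assumptions, and the accepted values for <task>, <bert_model>, and <cased> are documented in the repository.

bash biobert/scripts/biobert_finetune_training_benchmark.sh ner_bc5cdr-chem 8 base false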

Inference performance benchmark

Inference benchmarking can be performed by running the script:

biobert/scripts/biobert_finetune_inference_benchmark.sh <task> <bert_model> <cased>

This script runs inference on the test and dev sets and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 with XLA and FP32 without XLA. These numbers are saved at /results/tf_bert_biobert_<task>_inference_benchmark_<bert_model>_<cased/uncased>_<DATESTAMP>
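
As above, an illustrative invocation of the inference benchmark on the NER BC5CDR-chemical task with the uncased base model (argument values are assumptions; consult the repository for the accepted options):

bash biobert/scripts/biobert_finetune_inference_benchmark.sh ner_bc5cdr-chem base false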

Results

The following sections provide detailed results for pre-training and for downstream fine-tuning on the NER and RE benchmark tasks.

Training accuracy results

Pre-training accuracy

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 19.08-py3 NGC container.

| DGX System | Nodes | Precision | Batch Size/GPU (Phase1, Phase2) | Accumulation Steps (Phase1, Phase2) | Time to Train (Hrs) | Final Loss |
|---|---|---|---|---|---|---|
| DGX2H | 4 | FP16 | 128, 16 | 8, 32 | 19.14 | 0.88 |
| DGX2H | 16 | FP16 | 128, 16 | 2, 8 | 4.81 | 0.86 |
| DGX2H | 32 | FP16 | 128, 16 | 1, 4 | 2.65 | 0.87 |
| DGX1 | 1 | FP16 | 64, 8 | 128, 512 | 174.58 | 0.87 |
| DGX1 | 4 | FP16 | 64, 8 | 32, 128 | 57.71 | 0.85 |
| DGX1 | 16 | FP16 | 64, 8 | 8, 32 | 12.62 | 0.87 |
| DGX1 | 32 | FP16 | 64, 8 | 4, 16 | 6.97 | 0.87 |
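
These pre-training results assume the TensorFlow 19.08-py3 NGC container. The sketch below shows one way to start that container and launch the pre-training script; the mount paths are placeholders, and the script's arguments (phase batch sizes, learning rates, step counts) are taken from the repository documentation rather than shown here.

# Pull and start the NGC container used for these results (mount paths are placeholders).
# Older Docker versions may need --runtime=nvidia instead of --gpus all.
docker pull nvcr.io/nvidia/tensorflow:19.08-py3
docker run --gpus all -it --rm \
    -v $PWD:/workspace/bert \
    -v /path/to/results:/results \
    nvcr.io/nvidia/tensorflow:19.08-py3
# Inside the container, from the repository root:
bash scripts/run_pretraining_lamb.sh   # arguments as documented in the repository
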
Fine-tuning accuracy

| Task | F1 | Precision | Recall |
|---|---|---|---|
| NER BC5CDR-chemical | 93.47 | 93.03 | 93.91 |
| NER BC5CDR-disease | 86.22 | 85.05 | 87.43 |
| RE Chemprot | 76.27 | 77.62 | 74.98 |

Fine-tuning accuracy for NER Chem

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container.

| DGX System | Batch size / GPU | F1 - FP32 | F1 - mixed precision | Time to Train - FP32 (Minutes) | Time to Train - mixed precision (Minutes) |
|---|---|---|---|---|---|
| DGX-1 16G | 64 | 93.33 | 93.40 | 23.95 | 14.13 |
| DGX-1 32G | 64 | 93.31 | 93.36 | 24.35 | 12.63 |
| DGX-2 32G | 64 | 93.66 | 93.47 | 12.26 | 8.16 |

Training stability test

Fine-tuning stability test:

The following table compares F1 scores across 5 training runs on the NER Chemical task with different seeds, for both FP16 and FP32. The runs show consistent convergence across all 5 seeds with very little deviation.

| 16 x V100 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 Score (FP16) | 93.13 | 92.92 | 93.34 | 93.66 | 93.47 | 93.3 | 0.29 |
| F1 Score (FP32) | 93.1 | 93.28 | 93.33 | 93.45 | 93.17 | 93.27 | 0.14 |
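
The mean and std columns follow directly from the five per-seed scores (std here is the sample standard deviation, with n-1 in the denominator). For instance, the FP16 row can be checked with a small shell one-liner:

printf '%s\n' 93.13 92.92 93.34 93.66 93.47 | awk '
  {x[NR]=$1; s+=$1}
  END {m=s/NR; for(i=1;i<=NR;i++) d+=(x[i]-m)^2; printf "mean=%.2f std=%.2f\n", m, sqrt(d/(NR-1))}'
# prints: mean=93.30 std=0.29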

Training performance results

Training performance: NVIDIA DGX-1 (8x V100 16G)
Pre-training training performance: multi-node on DGX-1 16G

Our results were obtained by running the biobert/scripts/run_biobert.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.

| Nodes | Sequence Length | Batch size / GPU (mixed precision, FP32) | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 2762.06 | 744.48 | 3.71 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 10283.08 | 2762.88 | 3.72 | 3.72 | 3.71 |
| 16 | 128 | 64, 32 | 39051.69 | 10715.14 | 3.64 | 14.14 | 14.39 |
| 32 | 128 | 64, 32 | 76077.39 | 21104.87 | 3.60 | 27.54 | 28.35 |
| 1 | 512 | 8, 8 | 432.33 | 160.38 | 2.70 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 1593.00 | 604.36 | 2.64 | 3.68 | 3.77 |
| 16 | 512 | 8, 8 | 5941.82 | 2356.44 | 2.52 | 13.74 | 14.69 |
| 32 | 512 | 8, 8 | 11483.73 | 4631.29 | 2.48 | 26.56 | 28.88 |

Note: Values for FP32 runs with a batch size of 16 at sequence length 128 and a batch size of 2 at sequence length 512 are not available due to out-of-memory errors.
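
The derived columns in these throughput tables are simple ratios: throughput speedup divides the mixed-precision throughput by the FP32 throughput at the same scale, and weak scaling divides the throughput at a given node count by the single-node throughput at the same precision. As a check, the 4-node, sequence-length-128 row above works out as follows:

# Speedup and weak scaling for the 4-node, sequence length 128 row above.
awk 'BEGIN {
  mixed_1=2762.06; mixed_4=10283.08; fp32_4=2762.88
  printf "speedup=%.2f weak_scaling=%.2f\n", mixed_4/fp32_4, mixed_4/mixed_1
}'
# prints: speedup=3.72 weak_scaling=3.72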

Fine-tuning training performance for NER on DGX-1 16G

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 64 | 147.71 | 348.84 | 2.36 | 1.00 | 1.00 |
| 4 | 64 | 583.78 | 1145.46 | 1.96 | 3.95 | 3.28 |
| 8 | 64 | 981.22 | 1964.85 | 2.00 | 6.64 | 5.63 |

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 32G)
Fine-tuning training performance for NER on DGX-1 32G

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 64 | 144.1 | 417.39 | 2.89 | 1.00 | 1.00 |
| 4 | 64 | 525.15 | 1354.14 | 2.57 | 3.64 | 3.24 |
| 8 | 64 | 969.4 | 2341.39 | 2.41 | 6.73 | 5.61 |

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-2 (16x V100 32G)
Pre-training training performance: multi-node on DGX-2H 32G

Our results were obtained by running the biobert/scripts/run_biobert.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.

| Nodes | Sequence Length | Batch size / GPU (mixed precision, FP32) | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 128, 128 | 7772.18 | 2165.04 | 3.59 | 1.00 | 1.00 |
| 4 | 128 | 128, 128 | 29785.31 | 8516.90 | 3.50 | 3.83 | 3.93 |
| 16 | 128 | 128, 128 | 115581.29 | 33699.15 | 3.43 | 14.87 | 15.57 |
| 32 | 128 | 128, 128 | 226156.53 | 66996.73 | 3.38 | 29.10 | 30.94 |
| 64 | 128 | 128, 128 | 444955.74 | 133424.95 | 3.33 | 57.25 | 61.63 |
| 1 | 512 | 16, 16 | 1260.06 | 416.92 | 3.02 | 1.00 | 1.00 |
| 4 | 512 | 16, 16 | 4781.19 | 1626.76 | 2.94 | 3.79 | 3.90 |
| 16 | 512 | 16, 16 | 18405.65 | 6418.09 | 2.87 | 14.61 | 15.39 |
| 32 | 512 | 16, 16 | 36071.06 | 12713.67 | 2.84 | 28.63 | 30.49 |
| 64 | 512 | 16, 16 | 69950.86 | 25245.96 | 2.77 | 55.51 | 60.55 |

Fine-tuning training performance for NER on DGX-2 32G

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 64 | 139.59 | 475.54 | 3.4 | 1.00 | 1.00 |
| 4 | 64 | 517.08 | 1544.01 | 2.98 | 3.70 | 3.25 |
| 8 | 64 | 1009.84 | 2695.34 | 2.66 | 7.23 | 5.67 |
| 16 | 64 | 1997.73 | 4268.81 | 2.13 | 14.31 | 8.98 |

To achieve these same results, follow the Quick Start Guide outlined above.