The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring model performance in training and inference modes.
Both benchmarking scripts run the BERT fine-tuning workload for a number of epochs and extract performance numbers.
Training benchmarking can be performed by running the script:
```
biobert/scripts/biobert_finetune_training_benchmark.sh <task> <num_gpu> <bert_model> <cased>
```
This script runs 2 epochs by default on the NER BC5CDR dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved to `/results/tf_bert_biobert_<task>_training_benchmark__<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>`.
Inference benchmarking can be performed by running the script:
```
biobert/scripts/biobert_finetune_inference_benchmark.sh <task> <bert_model> <cased>
```
This script runs inference on the test and dev sets and extracts performance and latency numbers for various batch sizes and sequence lengths, in FP16 with XLA and FP32 without XLA. These numbers are saved to `/results/tf_bert_biobert_<task>_inference_benchmark__<bert_model>_<cased/uncased>_<DATESTAMP>`.
The following sections provide detailed results for the downstream fine-tuning tasks on the NER and RE benchmarks.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container.
DGX System | Nodes | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
---|---|---|---|---|---|---|
DGX-2H | 4 | FP16 | 128, 16 | 8, 32 | 19.14 | 0.88 |
DGX-2H | 16 | FP16 | 128, 16 | 2, 8 | 4.81 | 0.86 |
DGX-2H | 32 | FP16 | 128, 16 | 1, 4 | 2.65 | 0.87 |
DGX-1 | 1 | FP16 | 64, 8 | 128, 512 | 174.58 | 0.87 |
DGX-1 | 4 | FP16 | 64, 8 | 32, 128 | 57.71 | 0.85 |
DGX-1 | 16 | FP16 | 64, 8 | 8, 32 | 12.62 | 0.87 |
DGX-1 | 32 | FP16 | 64, 8 | 4, 16 | 6.97 | 0.87 |
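Scaling efficiency for these pretraining runs follows directly from the time-to-train column: speedup is the single-node time divided by the N-node time, and efficiency is that speedup divided by N. A quick sketch using the DGX-1 FP16 figures copied from the table above:

```python
# Node count -> time to train in hours, from the DGX-1 FP16 rows above.
times = {1: 174.58, 4: 57.71, 16: 12.62, 32: 6.97}

for nodes, hours in times.items():
    speedup = times[1] / hours        # relative to the single-node run
    efficiency = speedup / nodes      # 1.0 would be perfectly linear scaling
    print(f"{nodes:2d} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

At 32 nodes the run is roughly 25x faster than a single node, i.e. about 78% scaling efficiency.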
Task | F1 | Precision | Recall |
---|---|---|---|
NER BC5CDR-chemical | 93.47 | 93.03 | 93.91 |
NER BC5CDR-disease | 86.22 | 85.05 | 87.43 |
RE Chemprot | 76.27 | 77.62 | 74.98 |
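The F1 column above is the harmonic mean of the precision and recall columns, so the rows can be cross-checked with a few lines of Python (scores copied from the table; small last-digit differences can arise from rounding in the reported precision/recall):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# NER BC5CDR-chemical row from the table above
print(round(f1(93.03, 93.91), 2))  # -> 93.47
```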
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container.
DGX System | Batch size / GPU | F1 - FP32 | F1- mixed precision | Time to Train - FP32 (Minutes) | Time to Train - mixed precision (Minutes) |
---|---|---|---|---|---|
DGX-1 16G | 64 | 93.33 | 93.40 | 23.95 | 14.13 |
DGX-1 32G | 64 | 93.31 | 93.36 | 24.35 | 12.63 |
DGX-2 32G | 64 | 93.66 | 93.47 | 12.26 | 8.16 |
The following table compares F1 scores across 5 training runs on the NER Chemical task with different seeds, for both FP16 and FP32. The runs show consistent convergence on all 5 seeds with very little deviation.
16 x V100 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
---|---|---|---|---|---|---|---|
F1 Score (FP16) | 93.13 | 92.92 | 93.34 | 93.66 | 93.47 | 93.3 | 0.29 |
F1 Score (FP32) | 93.1 | 93.28 | 93.33 | 93.45 | 93.17 | 93.27 | 0.14 |
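The mean and std columns follow directly from the five per-seed scores; the std column matches the sample standard deviation (ddof = 1). A quick check with the standard library:

```python
import statistics

# Per-seed F1 scores copied from the table above
fp16 = [93.13, 92.92, 93.34, 93.66, 93.47]
fp32 = [93.10, 93.28, 93.33, 93.45, 93.17]

for name, scores in [("FP16", fp16), ("FP32", fp32)]:
    # statistics.stdev is the sample standard deviation (divides by n - 1)
    print(name, round(statistics.mean(scores), 2), round(statistics.stdev(scores), 2))
```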
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 systems with 8x V100 16G GPUs. Performance (in sentences per second) is the steady-state throughput.
Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|
1 | 128 | 64,32 | 2762.06 | 744.48 | 3.71 | 1.00 | 1.00 |
4 | 128 | 64,32 | 10283.08 | 2762.88 | 3.72 | 3.72 | 3.71 |
16 | 128 | 64,32 | 39051.69 | 10715.14 | 3.64 | 14.14 | 14.39 |
32 | 128 | 64,32 | 76077.39 | 21104.87 | 3.60 | 27.54 | 28.35 |
1 | 512 | 8,8 | 432.33 | 160.38 | 2.70 | 1.00 | 1.00 |
4 | 512 | 8,8 | 1593.00 | 604.36 | 2.64 | 3.68 | 3.77 |
16 | 512 | 8,8 | 5941.82 | 2356.44 | 2.52 | 13.74 | 14.69 |
32 | 512 | 8,8 | 11483.73 | 4631.29 | 2.48 | 26.56 | 28.88 |
Note: Values for FP32 runs with batch sizes of 16 and 2 (for sequence lengths 128 and 512, respectively) are not available due to out-of-memory errors.
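The speedup and weak-scaling columns in the throughput tables are plain ratios: speedup is mixed-precision throughput over FP32 throughput at the same node count, and weak scaling is throughput at N nodes over throughput at 1 node at the same precision. A sketch using the sequence-length-128 figures copied from the table above:

```python
# Node count -> (mixed-precision, FP32) throughput in sentences/s,
# from the sequence-length-128 rows of the table above.
throughput = {1: (2762.06, 744.48), 4: (10283.08, 2762.88),
              16: (39051.69, 10715.14), 32: (76077.39, 21104.87)}

mp_1, fp32_1 = throughput[1]
for nodes, (mp, fp32) in throughput.items():
    print(f"{nodes:2d} nodes: speedup {mp / fp32:.2f}, "
          f"weak scaling {mp / mp_1:.2f} (mixed), {fp32 / fp32_1:.2f} (FP32)")
```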
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 64 | 147.71 | 348.84 | 2.36 | 1.00 | 1.00 |
4 | 64 | 583.78 | 1145.46 | 1.96 | 3.95 | 3.28 |
8 | 64 | 981.22 | 1964.85 | 2.00 | 6.64 | 5.63 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 64 | 144.1 | 417.39 | 2.89 | 1.00 | 1.00 |
4 | 64 | 525.15 | 1354.14 | 2.57 | 3.64 | 3.24 |
8 | 64 | 969.4 | 2341.39 | 2.41 | 6.73 | 5.61 |
To achieve these same results, follow the Quick Start Guide outlined above.
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H systems with 16x V100 32G GPUs. Performance (in sentences per second) is the steady-state throughput.
Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|
1 | 128 | 128,128 | 7772.18 | 2165.04 | 3.59 | 1.00 | 1.00 |
4 | 128 | 128,128 | 29785.31 | 8516.90 | 3.50 | 3.83 | 3.93 |
16 | 128 | 128,128 | 115581.29 | 33699.15 | 3.43 | 14.87 | 15.57 |
32 | 128 | 128,128 | 226156.53 | 66996.73 | 3.38 | 29.10 | 30.94 |
64 | 128 | 128,128 | 444955.74 | 133424.95 | 3.33 | 57.25 | 61.63 |
1 | 512 | 16,16 | 1260.06 | 416.92 | 3.02 | 1.00 | 1.00 |
4 | 512 | 16,16 | 4781.19 | 1626.76 | 2.94 | 3.79 | 3.90 |
16 | 512 | 16,16 | 18405.65 | 6418.09 | 2.87 | 14.61 | 15.39 |
32 | 512 | 16,16 | 36071.06 | 12713.67 | 2.84 | 28.63 | 30.49 |
64 | 512 | 16,16 | 69950.86 | 25245.96 | 2.77 | 55.51 | 60.55 |
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 64 | 139.59 | 475.54 | 3.4 | 1.00 | 1.00 |
4 | 64 | 517.08 | 1544.01 | 2.98 | 3.70 | 3.25 |
8 | 64 | 1009.84 | 2695.34 | 2.66 | 7.23 | 5.67 |
16 | 64 | 1997.73 | 4268.81 | 2.13 | 14.31 | 8.98 |
To achieve these same results, follow the Quick Start Guide outlined above.