This is a checkpoint for BioBERT v1.1 Base cased https://academic.oup.com/bioinformatics/article/36/4/1234/5566506, finetuned for question answering in NeMo https://github.com/NVIDIA/NeMo on the SQuAD v1.1 dataset https://rajpurkar.github.io/SQuAD-explorer/. The model was trained for 2 epochs on a DGX-1 with 8 V100 GPUs using Apex/AMP optimization level O2. On the SQuAD v1.1 development set it achieves an exact match (EM) score of 80.52 and an F1 score of 88.02. The underlying BioBERT model was pretrained for 1 million steps on PubMed https://catalog.data.gov/dataset/pubmed, a biomedical domain corpus; the pretrained weights were downloaded from https://github.com/dmis-lab/biobert#download and converted into a NeMo-compatible format.
The model achieves weighted strict accuracy (SAcc), mean reciprocal rank (MRR), and lenient accuracy (LAcc) of 39/59.86/47.03 on the BioASQ 7b factoid test set.
Please be sure to download the latest version in order to ensure compatibility with the latest NeMo release.
BERT, or Bidirectional Encoder Representations from Transformers, is a neural approach to pretraining language representations that obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the SQuAD question answering dataset. This model has the same network architecture as the original BERT, but is pretrained on a different dataset, PubMed, a large biomedical text corpus, which yields better performance on biomedical downstream tasks such as question answering (QA). The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. Apart from the BERT encoder architecture, this model also includes a question answering head stacked on top of BERT. This question answering head is a token classifier; more specifically, it is a single fully connected layer.
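To make the architecture concrete, the following is a minimal PyTorch sketch of such a question answering head. It is an illustration only, not the NeMo implementation: a single fully connected layer maps each token embedding produced by the BERT encoder to two logits, one for the answer-span start and one for the answer-span end.

```python
# Minimal sketch of a QA head: a single fully connected layer acting as a
# per-token classifier on top of BERT encoder outputs (illustrative only).
import torch
import torch.nn as nn

class QAHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # One linear layer producing two logits per token (start, end).
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [batch, seq_len, hidden_size] from the BERT encoder.
        logits = self.classifier(hidden_states)             # [batch, seq_len, 2]
        start_logits, end_logits = logits.split(1, dim=-1)  # two [batch, seq_len, 1]
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Example with a random stand-in for the encoder output
# (batch of 2, sequence length 384, hidden size 768).
encoder_output = torch.randn(2, 384, 768)
start_logits, end_logits = QAHead()(encoder_output)
print(start_logits.shape, end_logits.shape)  # torch.Size([2, 384]) twice
```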
Training works as follows: the user provides training and evaluation data as text in JSON format. This data is parsed by NeMo scripts and converted into model input. The input sequence is a concatenation of a tokenized query and its corresponding reading passage. For each token in the reading passage (context), the question answering head predicts whether it is the start or end of the answer span. The model is trained using cross entropy loss.
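The sketch below illustrates this training objective under simple assumptions; it is not NeMo code. Given the start/end logits produced by the QA head over the concatenated query-plus-passage sequence, the loss is the cross entropy against the gold start and end token positions.

```python
# Illustrative sketch of the span-prediction loss (not the NeMo implementation).
import torch
import torch.nn.functional as F

batch, seq_len = 2, 384
# Logits from the QA head for each token of the concatenated
# "[CLS] query [SEP] passage [SEP]" input sequence.
start_logits = torch.randn(batch, seq_len)
end_logits = torch.randn(batch, seq_len)

# Gold answer span, given as token indices into the input sequence.
start_positions = torch.tensor([17, 52])
end_positions = torch.tensor([19, 55])

# Cross entropy over all sequence positions, averaged over start and end.
loss = (F.cross_entropy(start_logits, start_positions)
        + F.cross_entropy(end_logits, end_positions)) / 2
print(loss.item())
```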
For more information about BERT or BioBERT please visit https://ngc.nvidia.com/catalog/models/nvidia:bertbaseuncasedfornemo or https://ngc.nvidia.com/catalog/models/nvidia:biobertbasecasedfornemo.
Source code and a developer guide are available at https://github.com/NVIDIA/NeMo. Refer to the documentation at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html. Code to pretrain and reproduce this model checkpoint is available at https://github.com/NVIDIA/NeMo.
This model checkpoint can be used either for inference or for finetuning on biomedical question answering datasets, as long as the data is provided in the required format (see the sketch below). More details at https://github.com/NVIDIA/NeMo.
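The required format follows the SQuAD v1.1 JSON layout. The snippet below is an illustrative sketch of that layout, assembled in Python; the question, context, answer, and file name are made-up placeholders, not real BioASQ content.

```python
# Illustrative SQuAD v1.1-style JSON layout for finetuning/evaluation data.
import json

context = "BRCA1 is a human tumor suppressor gene."
answer = "tumor suppressor gene"

example = {
    "version": "v1.1",
    "data": [
        {
            "title": "example_document",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What kind of gene is BRCA1?",
                            "answers": [
                                {
                                    "text": answer,
                                    # Character offset of the answer in the context.
                                    "answer_start": context.index(answer),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

# Hypothetical output path, used here only for illustration.
with open("train.json", "w") as f:
    json.dump(example, f, indent=2)
```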
The following shows how to finetune and evaluate on BioASQ 7B. For a complete walkthrough, visit https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/biobert_notebooks/biobert_qa.ipynb