BERT for biomedical text-mining.
In the original BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper, pre-training is done on Wikipedia and Books Corpus, with state-of-the-art results demonstrated on SQuAD (Stanford Question Answering Dataset) benchmark.
Meanwhile, many works, including BioBERT, SciBERT, NCBI-BERT, ClinicalBERT (MIT), ClinicalBERT (NYU, Princeton), and others at BioNLP'19 workshop, show that additional pre-training of BERT on large biomedical text corpus such as PubMed results in better performance in biomedical text-mining tasks.
This repository provides scripts and recipe to adopt the NVIDIA BERT code-base to achieve state-of-the-art results in the following biomedical text-mining benchmark tasks:
BC5CDR-disease A Named-Entity-Recognition task to recognize diseases mentioned in a collection of 1500 PubMed titles and abstracts (Li et al., 2016)
BC5CDR-chemical A Named-Entity-Recognition task to recognize chemicals mentioned in a collection of 1500 PubMed titles and abstracts (Li et al., 2016)
ChemProt A Relation-Extraction task to determine chemical-protein interactions in a collection of 1820 PubMed abstracts (Krallinger et al., 2017)
This model was trained using script available on NGC and in GitHub repo.
The following datasets were used to train this model:
Performance numbers for this model are available in NGC.