April 2022: We have released our first clinical-domain pre-trained model checkpoints in Clara NLP, GatorTron-OG and GatorTron-S, in collaboration with the University of Florida Health System!
NVIDIA Clara NLP is a collection of models and resources that support natural language processing and understanding workflows in healthcare and life sciences. It enables developers to build services and data-processing pipelines that extract knowledge from clinical and biomedical text, and it includes state-of-the-art biomedical and clinical NLP models as well as highly optimized pipelines for training them.
Clara NLP includes pre-trained Megatron [1, 2] checkpoints for both biomedical and clinical domain tasks. These include BioMegatron [3], a state-of-the-art biomedical language model pre-trained on billions of words of PubMed abstracts and full-text documents. We also provide NeMo checkpoints for other models targeting biomedical language understanding, including BioBERT [4].
New in 2022 is the release of our first clinical-domain pre-trained model checkpoints in Clara NLP, GatorTron-OG and GatorTron-S [5]. These models are released on NGC by the University of Florida Health System through a collaboration with NVIDIA to make state-of-the-art clinical NLP accessible to the community. GatorTron-OG provides a state-of-the-art checkpoint pre-trained on diverse, de-identified clinical free text from across the UF Health System, while GatorTron-S represents innovative new work on pre-training with data produced by a synthetic, generative transformer model.
Clara NLP also makes use of optimized Docker images for NVIDIA NeMo, our open-source toolkit for developing state-of-the-art conversational AI and NLP models.
To pre-train your own customized language models, check out our Megatron training recipes on GitHub and the new pre-training support for many model architectures in NeMo.
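As a concrete, unofficial illustration, the sketch below assembles a single-GPU BERT-style pre-training run with the Megatron-LM repository's pretrain_bert.py script. The flag values mirror the 345M-parameter configuration used by the checkpoints in this collection, but the data, vocabulary, and output paths are placeholders, several arguments from the repository's example script are omitted for brevity, and flag names can change between Megatron-LM releases.

```python
# Minimal single-GPU sketch (not an official recipe) of launching BERT-style pre-training
# with the Megatron-LM repository's pretrain_bert.py, run from a Megatron-LM checkout.
# Flag names follow the repo's examples/pretrain_bert.sh and may change between releases;
# all paths are placeholders, and several arguments (warm-up, logging/save intervals, etc.)
# are omitted for brevity.
import subprocess

DATA_PREFIX = "/data/my_corpus_text_sentence"  # output prefix from tools/preprocess_data.py
VOCAB_FILE = "/data/clinical-vocab-50k.txt"    # custom WordPiece vocabulary (placeholder)
SAVE_DIR = "/checkpoints/my-clinical-bert"     # where checkpoints will be written

cmd = [
    "python", "pretrain_bert.py",
    # 345M-parameter configuration matching the checkpoints in this collection
    "--num-layers", "24",
    "--hidden-size", "1024",
    "--num-attention-heads", "16",
    "--seq-length", "512",
    "--max-position-embeddings", "512",
    "--micro-batch-size", "4",
    "--global-batch-size", "32",
    "--train-iters", "1000000",
    "--lr", "0.0001",
    "--lr-decay-style", "linear",
    "--tokenizer-type", "BertWordPieceCase",
    "--vocab-file", VOCAB_FILE,
    "--data-path", DATA_PREFIX,
    "--split", "949,50,1",
    "--save", SAVE_DIR,
    "--fp16",
]
subprocess.run(cmd, check=True)
```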
To fine-tune BioMegatron, GatorTron, or your own Megatron pre-trained model, check out our tutorials section in NeMo. From there, you can start with an out-of-the-box BioMegatron model or download any of the collection's checkpoints and vocabularies for use with NeMo.
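For illustration, here is a minimal sketch of what fine-tuning a token-classification (NER) head on top of one of these checkpoints can look like with NeMo 1.x. It assumes the token_classification_config.yaml from NeMo's examples/nlp/token_classification directory; the checkpoint, vocabulary, and data paths are placeholders, and config key names may differ between NeMo releases.

```python
# Minimal fine-tuning sketch (assumptions: NeMo 1.x, the token_classification_config.yaml
# from NeMo's examples/nlp/token_classification directory, placeholder paths).
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.nlp.models import TokenClassificationModel

cfg = OmegaConf.load("token_classification_config.yaml")

# NER data in NeMo's text.txt / labels.txt format (placeholder path).
cfg.model.dataset.data_dir = "/data/ner"

# Point the encoder at a checkpoint and vocabulary downloaded from this collection
# (placeholder paths; verify the pretrained_model_name against your NeMo version).
cfg.model.language_model.pretrained_model_name = "megatron-bert-345m-cased"
cfg.model.language_model.lm_checkpoint = "/models/BioMegatron345mCased.pt"
cfg.model.tokenizer.vocab_file = "/models/biovocab-50k-cased.txt"

trainer = pl.Trainer(gpus=1, precision=16, max_epochs=3)
model = TokenClassificationModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)
```

The same pattern applies to other NeMo task models, such as text classification, by swapping in the corresponding example config and model class.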
GatorTron-OG is a 345 million parameter cased Megatron model that was pre-trained on a de-identified set of clinical notes at the University of Florida Health System and includes a 50K clinical domain vocabulary.
GatorTron-S is a 345 million parameter cased Megatron model that was pre-trained on a collection of synthetic discharge summaries generated by the University of Florida's SynGatorTron 5B NLG and includes a 50K clinical domain vocabulary.
BioMegatron345mCased is a 345 million parameter Megatron model that was pre-trained on a cased biomedical PubMed dataset. Additional pre-trained checkpoints using a biomedical vocabulary are available:
BioMegatron345m-biovocab-30k-cased is a 345 million parameter Megatron model that was pre-trained on a cased biomedical PubMed dataset, using a 30K cased biomedical domain vocabulary.
BioMegatron345m-biovocab-50k-cased is a 345 million parameter Megatron model that was pre-trained on a cased biomedical PubMed dataset, using a 50K cased biomedical domain vocabulary.
BioMegatron345mUncased is a 345 million parameter Megatron model that was pre-trained on an uncased biomedical PubMed dataset. Additional pre-trained checkpoints using a biomedical vocabulary are available:
BioMegatron345m-biovocab-30k-uncased is a 345 million parameter Megatron model that was pre-trained on an uncased biomedical PubMed dataset, using a 30K uncased biomedical domain vocabulary.
BioMegatron345m-biovocab-50k-uncased is a 345 million parameter Megatron model that was pre-trained on an uncased biomedical PubMed dataset, using a 50K uncased biomedical domain vocabulary.
BioBERTBaseCasedForNeMo is a NeMo-compatible checkpoint for the BioBERT Base Cased model, converted from the original BioBERT checkpoint release [4].
BioBERTLargeCasedForNeMo is a NeMo-compatible checkpoint for the BioBERT Large Cased model, converted from the original BioBERT checkpoint release [4].
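Once a checkpoint and its vocabulary have been downloaded from NGC, a quick sanity check can confirm the advertised parameter count and vocabulary size for any of the models listed above. The sketch below uses only PyTorch; the file names are placeholders, and because checkpoint layouts vary (and may bundle optimizer state, which would inflate the count), it simply walks the loaded object and counts every tensor it finds.

```python
# Sanity-check sketch (placeholder file names): count parameters and vocabulary entries
# for a checkpoint and vocab file downloaded from this collection.
import torch

def count_params(obj):
    """Recursively sum the element counts of all tensors in a (nested) checkpoint dict."""
    if torch.is_tensor(obj):
        return obj.numel()
    if isinstance(obj, dict):
        return sum(count_params(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(count_params(v) for v in obj)
    return 0

ckpt = torch.load("BioMegatron345mCased.pt", map_location="cpu")  # placeholder file name
print(f"parameters: {count_params(ckpt) / 1e6:.1f}M")             # expect roughly 345M

with open("biovocab-50k-cased.txt", encoding="utf-8") as f:       # placeholder file name
    print(f"vocabulary size: {sum(1 for _ in f)}")                # expect roughly 50K entries
```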
Use the NVIDIA DevTalk Forum for questions regarding this software.
[1] Shoeybi, M. et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv [cs.CL] (2019)
[2] Narayanan, D. et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv [cs.CL] (2021) doi:10.1145/3458817.3476209
[3] Shin, H.-C. et al. BioMegatron: Larger Biomedical Domain Language Model. arXiv [cs.CL] (2020)
[4] Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv [cs.CL] (2019) doi:10.1093/bioinformatics/btz682
[5] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022)