Clara NLP

Description
Clara NLP is a collection of state-of-the-art biomedical pre-trained language models, as well as highly optimized pipelines for training NLP models on biomedical and clinical text.
Curator
NVIDIA
Modified
April 4, 2023
Contents: Containers, Helm Charts, Models, Resources

What's New?

April 2022: We have released our first clinical-domain pre-trained model checkpoints in Clara NLP, GatorTron-OG and GatorTron-S, in collaboration with the University of Florida Health System!

What is Clara NLP?

NVIDIA Clara NLP is a collection of models and resources that support natural language processing and understanding workflows in healthcare and life sciences. It enables developers to build services and data processing pipelines that extract knowledge from clinical and biomedical text, and it includes state-of-the-art biomedical and clinical NLP models as well as highly optimized pipelines for training NLP models.

BioMegatron

Clara NLP includes pre-trained Megatron [1, 2] checkpoints for both biomedical and clinical domain tasks. These include BioMegatron [3], a state-of-the-art biomedical language model pre-trained on billions of words of PubMed abstracts and full-text documents. We also provide NeMo checkpoints for other models targeting biomedical language understanding, including BioBERT [4].

GatorTron

New in 2022 is the release of our first clinical-domain pre-trained model checkpoints in Clara NLP, GatorTron-OG and GatorTron-S [5]. These models are released on NGC by the University of Florida Health System through a collaboration with NVIDIA to make state-of-the-art clinical NLP accessible to the community. They provide pre-trained checkpoints trained on diverse, de-identified clinical free text from across the UF Health System, along with innovative new work on pre-training models with data produced by synthetic, generative transformer models.

NVIDIA NeMo

Clara NLP also makes use of optimized Docker images for NVIDIA NeMo, our open-source toolkit for developing state-of-the-art conversational AI and NLP models.

Getting Started

To pre-train your own customized language models, check out our Megatron training recipes on GitHub and the pre-training support for many model architectures in NeMo.
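As an illustration, the minimal sketch below mirrors the Hydra-driven pre-training script shipped with NeMo. It is a sketch only, assuming a NeMo 1.x environment: the config path, dataset prefix, vocabulary file, and trainer settings are placeholders and not shipped defaults, and exact class and config locations may vary between NeMo releases.

    # Minimal pre-training sketch (assumes a NeMo 1.x environment).
    # The config path, data prefix, and vocabulary file are placeholders, not shipped defaults.
    import pytorch_lightning as pl
    from omegaconf import OmegaConf
    from nemo.collections.nlp.models.language_modeling.megatron_bert_model import MegatronBertModel
    from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

    # Base config from the NeMo repository (assumed location inside the NeMo container).
    cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_bert_config.yaml")
    cfg.model.data.data_prefix = ["/data/pubmed_text_document"]      # hypothetical preprocessed corpus
    cfg.model.tokenizer.vocab_file = "/data/biovocab-30k-cased.txt"  # hypothetical vocabulary file

    trainer = pl.Trainer(
        devices=8,
        accelerator="gpu",
        precision=16,
        strategy=NLPDDPStrategy(),  # NeMo's distributed strategy for Megatron-style training
        max_steps=100000,
    )

    model = MegatronBertModel(cfg.model, trainer=trainer)
    trainer.fit(model)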

To fine-tune BioMegatron, GatorTron, or your own Megatron pre-trained model, check out our tutorial section in NeMo. From here, you can start with an out-of-the-box BioMegatron model or download any of the collection's models and vocabularies for use with NeMo.
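For example, a named entity recognition head can be fine-tuned on top of a BioMegatron encoder with NeMo's token classification model. The sketch below assumes a NeMo 1.x environment; the config path, dataset directory, and encoder name string are illustrative placeholders, so check the NeMo tutorial for the identifiers that match the checkpoint you download.

    # Fine-tuning sketch (assumes NeMo 1.x; paths and the encoder name are placeholders).
    import pytorch_lightning as pl
    from omegaconf import OmegaConf
    from nemo.collections import nlp as nemo_nlp

    cfg = OmegaConf.load("examples/nlp/token_classification/conf/token_classification_config.yaml")
    cfg.model.dataset.data_dir = "/data/ncbi_disease"  # hypothetical NER dataset in NeMo's text/labels format
    cfg.model.language_model.pretrained_model_name = "biomegatron345m_biovocab_30k_cased"  # assumed name

    trainer = pl.Trainer(devices=1, accelerator="gpu", precision=16, max_epochs=5)
    model = nemo_nlp.models.TokenClassificationModel(cfg.model, trainer=trainer)
    trainer.fit(model)

    # Quick sanity check on clinical text after training.
    print(model.add_predictions(["The patient was started on metformin for type 2 diabetes mellitus."]))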

What's In This Collection?

GatorTron

  • GatorTron-OG is a 345 million parameter cased Megatron model that was pre-trained on a de-identified set of clinical notes at the University of Florida Health System and includes a 50K clinical domain vocabulary.

  • GatorTron-S is a 345 million parameter cased Megatron model that was pre-trained on a collection of synthetic discharge summaries generated by the University of Florida's SynGatorTron 5B NLG and includes a 50K clinical domain vocabulary.
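Once downloaded from this collection, either GatorTron checkpoint can be restored in NeMo and then fine-tuned like any other Megatron encoder. The sketch below is illustrative only: the filename is a placeholder, and it assumes the download is packaged as a .nemo archive; if the checkpoint instead ships as a raw Megatron checkpoint plus a vocabulary file, point the fine-tuning config's language-model checkpoint and vocabulary settings at those files.

    # Restoring a GatorTron checkpoint (assumes NeMo 1.x and a .nemo archive; filename is a placeholder).
    import pytorch_lightning as pl
    from nemo.collections.nlp.models.language_modeling.megatron_bert_model import MegatronBertModel

    trainer = pl.Trainer(devices=1, accelerator="gpu", precision=16)
    gatortron = MegatronBertModel.restore_from("gatortron_og.nemo", trainer=trainer)

    # The 345M Megatron-BERT configuration is typically 24 layers with a hidden size of 1024.
    print(gatortron.cfg.num_layers, gatortron.cfg.hidden_size)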

BioMegatron

  • BioMegatron345mCased is a 345 million parameter Megatron model that was pretrained on a cased biomedical PubMed dataset. Additional pre-trained checkpoints using a biomedical vocabulary are available:

    • BioMegatron345m-biovocab-30k-cased is a 345 million parameter Megatron model that was pretrained on a cased biomedical PubMed dataset, using a 30K cased biomedical domain vocabulary.

    • BioMegatron345m-biovocab-50k-cased is a 345 million parameter Megatron model that was pretrained on a cased biomedical PubMed dataset, using a 50K cased biomedical domain vocabulary.

  • BioMegatron345mUncased is a 345 million parameter Megatron model that was pretrained on an uncased biomedical PubMed dataset. Additional pre-trained checkpoints using a biomedical vocabulary are available:

    • BioMegatron345m-biovocab-30k-uncased is a 345 million parameter Megatron model that was pretrained on an uncased biomedical PubMed dataset, using a 30K uncased biomedical domain vocabulary.

    • BioMegatron345m-biovocab-50k-uncased is a 345 million parameter Megatron model that was pretrained on an uncased biomedical PubMed dataset, using a 50K uncased biomedical domain vocabulary.

BioBERT

  • BioBERTBaseCasedForNeMo is a NeMo-compatible checkpoint for the BioBERT Base Cased model, converted from the original BioBERT release checkpoint.

  • BioBERTLargeCasedForNeMo is a NeMo-compatible checkpoint for the BioBERT Large Cased model, converted from the original BioBERT release checkpoint.

NVIDIA NeMo

  • NVIDIA NeMo is an open-source toolkit for conversational AI. It is built for data scientists and researchers who want to assemble new state-of-the-art speech and NLP networks from API-compatible building blocks, as sketched below.
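As a small illustration of that building-block API, every NeMo model class can list the pre-trained checkpoints published for it on NGC and instantiate one by name. The model name used below is illustrative rather than a confirmed identifier; use list_available_models() to see the names actually available in your NeMo version.

    # Discovering and loading pre-trained NeMo checkpoints (the model name below is illustrative).
    from nemo.collections.nlp.models import TokenClassificationModel

    # Each entry describes a checkpoint hosted on NGC.
    for entry in TokenClassificationModel.list_available_models():
        print(entry.pretrained_model_name)

    model = TokenClassificationModel.from_pretrained(model_name="ner_en_bert")  # illustrative name
    print(model.add_predictions(["Aspirin was prescribed for chest pain."]))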

Technical Support

Use the NVIDIA DevTalk Forum for questions regarding this software.

References

[1] Shoeybi, M. et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv [cs.CL] (2019)

[2] Narayanan, D. et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv [cs.CL] (2021) doi:10.1145/3458817.3476209

[3] Shin, H.-C. et al. BioMegatron: Larger Biomedical Domain Language Model. arXiv [cs.CL] (2020)

[4] Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv [cs.CL] (2019) doi:10.1093/bioinformatics/btz682

[5] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022)