
BioMegatron345m-biovocab-30k-cased

Description
Megatron 345M-parameter model with a cased biomedical vocabulary (30k tokens), pre-trained on the PubMed biomedical text corpus.
Publisher
NVIDIA NeMo
Latest Version
1
Modified
April 4, 2023
Size
1.26 GB

Overview

This is a .nemo checkpoint for BioMegatron 345M with a cased biomedical domain vocabulary (30k tokens). BioMegatron is Megatron pre-trained on PubMed, a biomedical domain dataset, which improves results on a range of biomedical downstream tasks. The model has around 345 million parameters.

Please be sure to download the latest version in order to ensure compatibility with the latest NeMo release.

Model Architecture

NeMo Megatron is a capability in the NeMo framework that allows developers to efficiently train and scale language models to billions of parameters. Unlike BERT, the positions of layer normalization and the residual connection in the model architecture (similar to the GPT-2 architecture) are swapped, which allows the models to continue to improve as they are scaled up. This model reaches higher scores than BERT on a range of Natural Language Processing (NLP) tasks. BioMegatron has the same network architecture as Megatron but is pre-trained on a different dataset, PubMed, a large biomedical text corpus, and therefore achieves better performance on biomedical downstream tasks than the original Megatron.

This 345M-parameter model has 24 layers (Transformer blocks), a hidden size of 1024, and 16 attention heads. It uses the cased biomedical domain vocabulary (30k tokens).
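As a sanity check, the rough parameter count can be reproduced from these architecture numbers. The sketch below is a back-of-envelope estimate (the maximum sequence length of 512 is an assumption, and biases, LayerNorm weights, and the output head are ignored), so it lands somewhat below the nominal 345M:

```python
hidden = 1024
layers = 24
vocab = 30_000   # biomedical vocabulary, ~30k tokens
max_pos = 512    # assumed maximum sequence length

# Embeddings: token + position + token-type (BERT-style)
embed = vocab * hidden + max_pos * hidden + 2 * hidden

# Per Transformer block: QKV + attention output projections (4*h^2)
# plus the 4x-expanded feed-forward network (8*h^2)
per_layer = 4 * hidden**2 + 8 * hidden**2

total = embed + layers * per_layer
print(f"{total / 1e6:.0f}M parameters (approx.)")  # ~333M
```

The remaining gap to 345M is accounted for by the terms ignored above (biases, LayerNorms, pooler, and language-model head).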

For more information about NeMo Megatron, visit https://github.com/NVIDIA/NeMo

Training

BioMegatron was trained using the PyTorch-based Megatron-LM codebase.

The entire pre-training takes about 400 hours on 8 DGX-2 machines with Tesla V100 GPUs. The loss function and hyper-parameter settings are the same as for pre-training BERT language models with the Megatron-LM codebase.
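For a rough sense of the compute budget: a DGX-2 contains 16 Tesla V100 GPUs, so the quoted 400 hours on 8 machines corresponds to about 51,200 GPU-hours:

```python
machines = 8
gpus_per_machine = 16   # a DGX-2 holds 16 Tesla V100 GPUs
hours = 400

gpu_hours = machines * gpus_per_machine * hours
print(gpu_hours)  # 51200
```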

Dataset

Pre-training was done on the 4.5-billion-word PubMed abstract set and the 1.6-billion-word CC0-licensed Commercial Use Collection of the PMC full-text corpus.

How to use this Model

NVIDIA NeMo can be used for easy fine-tuning on a number of different tasks. Tutorial notebooks on fine-tuning the model for Named Entity Recognition and Relation Extraction can be found on the NeMo tutorials page.

The source code and developer guide are available at https://github.com/NVIDIA/NeMo. Refer to the documentation at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html

The following examples show how to fine-tune BioMegatron on different downstream tasks.
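At a high level, the tutorial notebooks configure fine-tuning by pointing the model's language-model section at the downloaded .nemo checkpoint and supplying a task dataset. The sketch below mirrors that configuration as a plain Python dict; the key names are assumptions modeled on NeMo's Hydra config layout, not authoritative settings — consult the linked notebooks for the exact keys:

```python
# Illustrative fine-tuning configuration, written as a plain dict.
# Key names and values are assumptions modeled on NeMo's Hydra configs;
# the tutorial notebooks linked below are the authoritative reference.
config = {
    "model": {
        "language_model": {
            # path to the downloaded BioMegatron checkpoint (hypothetical filename)
            "lm_checkpoint": "biomegatron345m_biovocab_30k_cased.nemo",
        },
        "dataset": {
            "data_dir": "path/to/ner_data",  # task data in NeMo's expected format
            "max_seq_length": 128,
        },
        "optim": {"name": "adamw", "lr": 5e-5},
    },
    "trainer": {"devices": 1, "max_epochs": 5, "precision": 16},
}

print(config["model"]["language_model"]["lm_checkpoint"])
```

The same pattern applies to both tutorials: only the dataset section and the task-specific head change between relation extraction and named entity recognition.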

Usage example 1: Fine-tune on the relation extraction (RE) dataset ChemProt: https://github.com/NVIDIA/NeMo/blob/r1.7.2/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb

Usage example 2: Fine-tune on the named entity recognition (NER) dataset NCBI: https://github.com/NVIDIA/NeMo/blob/r1.7.2/tutorials/nlp/Token_Classification-BioMegatron.ipynb

Limitations

There are no known limitations at this time.

References

  1. Shin, H.-C., Zhang, Y., Bakhturina, E., Puri, R., Patwary, M., Shoeybi, M., and Mani, R. (2020, November). BioMegatron: Larger Biomedical Domain Language Model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4700-4706.

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.