NeMo Natural Language Processing models

The NeMo Natural Language Processing collection contains models for punctuation and capitalization, named entity recognition, and text classification, as well as base pretrained models.

Overview

The NVIDIA NeMo toolkit supports multiple Natural Language Processing (NLP) tasks, from text classification and language modelling to GLUE benchmarking. The NLP field has taken a large leap in recent years thanks to transfer learning enabled by pretrained language models. BERT, RoBERTa, Megatron-LM, and many other language models achieve state-of-the-art results on NLP tasks such as question answering, sentiment analysis, and named entity recognition.

In NeMo, most NLP models consist of a pretrained language model followed by a Token Classification layer, a Sequence Classification layer, or a combination of both. Changing the language model can improve the performance of the final model on the specific downstream task you are solving. With NeMo you can either pretrain a BERT model on your own data or use a pretrained language model from the HuggingFace transformers or Megatron-LM libraries.

All NLP models require text tokenization as a data preprocessing step. The available tokenizers can be found in nemo.collections.common.tokenizers and include the WordPiece tokenizer, the SentencePiece tokenizer, and simple tokenizers such as the Word tokenizer.
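To make the tokenization step more concrete, here is a minimal sketch of greedy longest-match subword splitting, the general idea behind WordPiece-style tokenizers. It is not NeMo's implementation: the function name, the "##" continuation prefix, and the toy vocabulary are assumptions chosen for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first.

    Illustrative only; `vocab` is a hypothetical set of known pieces.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real checkpoint.
toy_vocab = {"un", "##afford", "##able", "play", "##ing", "[UNK]"}
tokens = [
    piece
    for word in "unaffordable playing xyz".split()
    for piece in wordpiece_tokenize(word, toy_vocab)
]
print(tokens)
```

A word-level tokenizer would map "unaffordable" to a single unknown token unless the whole word were in the vocabulary, which is one reason subword tokenizers are common with pretrained language models.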

Language Modelling - Assigns a probability distribution over a sequence of words. Can be either generative (e.g. a left-to-right transformer) or BERT with a masked language model loss.
Text Classification - Classifies an entire text based on its content into predefined categories, e.g. news, finance, science, etc. These models are BERT-based and can be used for applications such as sentiment analysis and relationship extraction.
Token Classification - Classifies each input token separately. Models are based on BERT. Applications include named entity recognition, punctuation and capitalization, etc.
Intent Slot Classification - Used for joint recognition of Intents and Slots (Entities) for building conversational assistants.
Question Answering - Currently only SQuAD is supported. The model takes a question and a passage as input and predicts a span in the passage, from which the answer is extracted.
GLUE Benchmarks - A benchmark of nine sentence- or sentence-pair language understanding tasks.

Use the Jupyter notebooks to quickly get started with using the pre-trained checkpoints or pretraining BERT.

Usage

You can instantiate all of these models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.nlp as nemo_nlp

Then choose the type of model you would like to instantiate; see the table below for the list of model base classes. Call the base class's from_pretrained(...) method, for example:

pretrained_ner_model = nemo_nlp.models.TokenClassificationModel.from_pretrained(model_name="NERModel")

Note that you can also list all available models through the API by calling the base_class.list_available_models(...) method.

You can also download the models' ".nemo" files from the "File Browser" tab and then instantiate them with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure that your NeMo version matches the version of the models.
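The two loading paths described above can be wrapped in one small helper. This is a sketch under stated assumptions: load_model is a hypothetical convenience function, not part of NeMo, and it relies only on the from_pretrained(model_name=...) and restore_from(path) calls shown on this page.

```python
def load_model(base_class, name_or_path):
    """Load a model from a local ".nemo" file or by its NGC model name.

    Hypothetical helper: `base_class` is any class that exposes the
    restore_from(...) and from_pretrained(model_name=...) methods.
    """
    name_or_path = str(name_or_path)
    if name_or_path.endswith(".nemo"):
        # Local checkpoint previously downloaded from the "File Browser" tab.
        return base_class.restore_from(name_or_path)
    # Otherwise treat the argument as a model name from the table below.
    return base_class.from_pretrained(model_name=name_or_path)


# Usage with NeMo installed (not executed here):
# import nemo.collections.nlp as nemo_nlp
# ner = load_model(nemo_nlp.models.TokenClassificationModel, "NERModel")
# ner = load_model(nemo_nlp.models.TokenClassificationModel, "PATH_TO_DOTNEMO_FILE.nemo")
```

Passing the base class in explicitly keeps the helper independent of any particular model type, so the same function works for token classification, question answering, or intent and slot models.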

Here is a list of currently available models together with their base classes and short descriptions.

Model name | Model Base Class | Description
NERModel | TokenClassificationModel | The model was trained on the GMB (Groningen Meaning Bank) corpus for entity recognition and achieves a macro F1 score of 74.61.
Punctuation_Capitalization_with_BERT | TokenClassificationModel | The model was trained with the NeMo BERT base uncased checkpoint on a subset of data from the following sources: Tatoeba sentences, books from Project Gutenberg, Fisher transcripts.
Punctuation_Capitalization_with_DistilBERT | TokenClassificationModel | The model was trained with the DistilBERT base uncased checkpoint from HuggingFace on a subset of data from the following sources: Tatoeba sentences, books from Project Gutenberg, Fisher transcripts.
BERTBaseUncasedSQuADv1.1 | QAModel | Question answering model finetuned from NeMo BERT Base Uncased on the SQuAD v1.1 dataset; obtains an exact match (EM) score of 82.43% and an F1 score of 89.59%.
BERTBaseUncasedSQuADv2.0 | QAModel | Question answering model finetuned from NeMo BERT Base Uncased on the SQuAD v2.0 dataset; obtains an exact match (EM) score of 73.35% and an F1 score of 76.44%.
BERTLargeUncasedSQuADv1.1 | QAModel | Question answering model finetuned from NeMo BERT Large Uncased on the SQuAD v1.1 dataset; obtains an exact match (EM) score of 85.47% and an F1 score of 92.10%.
BERTLargeUncasedSQuADv2.0 | QAModel | Question answering model finetuned from NeMo BERT Large Uncased on the SQuAD v2.0 dataset; obtains an exact match (EM) score of 78.8% and an F1 score of 81.85%.
Joint_Intent_Slot_Assistant | IntentSlotClassificationModel | This model is trained on the https://github.com/xliuhw/NLU-Evaluation-Data dataset, which includes 64 intents and 55 slots. Final Intent accuracy is about 87%; Slot accuracy is about 89%.
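For readers unfamiliar with the exact match (EM) and F1 scores quoted for the question answering models, the following is a simplified sketch of how such metrics are commonly computed for a single prediction. It is not the evaluation code used to produce the numbers above. Normalization here is limited to lowercasing and whitespace splitting, whereas the official SQuAD script also strips punctuation and articles.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction contains one extra token.
print(exact_match("the NeMo toolkit", "NeMo toolkit"))
print(token_f1("the NeMo toolkit", "NeMo toolkit"))
```

Dataset-level scores are typically the average of these per-question values, which is why F1 is always at least as high as EM in the table.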
Publisher: NVIDIA
Latest Version: 1.0.0a5
Updated: April 4, 2023 UTC
Compressed Size: 4.43 GB
