The TokenClassification Model in TAO supports Named Entity Recognition (NER) and other token-level classification tasks, as long as the data follows the format specified below. This model card will focus on the NER task.
Named entity recognition (NER), also referred to as entity chunking, identification, or extraction, is the task of detecting and classifying key information (entities) in text. In other words, an NER model takes a piece of text as input and, for each word in the text, identifies the category the word belongs to.
For example, in the sentence "Mary lives in Santa Clara and works at NVIDIA", the model should detect that Mary is a person, Santa Clara is a location, and NVIDIA is a company.
Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. Here is a pre-trained Riva model for NER with BERT in English.
The current version of the Named Entity Recognition model consists of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model followed by a token classification head. All model parameters are jointly fine-tuned on the downstream task. More specifically, an input text is fed to the BERT model, and then the representation of each token in the sequence is passed to the classification layer(s), which predict a label for that token.
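To make the role of the token classification head concrete, here is a minimal sketch in plain Python. It assumes the head is a single linear layer applied independently to every token vector, followed by an argmax over label scores. The dimensions (hidden size 4, 3 labels), the weights, and the function name token_classification_head are made up for illustration and are not taken from the NeMo implementation.

```python
def token_classification_head(hidden_states, weights, bias):
    """Score every token vector against every label and pick the best label id.

    hidden_states: one vector per token (toy stand-in for encoder outputs)
    weights:       one row of coefficients per label
    bias:          one offset per label
    """
    label_ids = []
    for vec in hidden_states:
        # Linear layer: one logit per label for this token.
        logits = [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)
        ]
        # Argmax over labels.
        label_ids.append(max(range(len(logits)), key=logits.__getitem__))
    return label_ids


# Toy inputs: 3 tokens, hidden size 4, 3 labels (all values are illustrative).
toy_hidden = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.2, 0.0],
    [0.0, 0.1, 0.7, 0.3],
]
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
toy_bias = [0.0, 0.0, 0.0]

toy_predictions = token_classification_head(toy_hidden, toy_weights, toy_bias)
print(toy_predictions)
```

The point of the sketch is that the output has one label per token, in contrast to sentence-level classification, where a single pooled vector yields one label for the whole input.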
The model was trained starting from the NeMo BERT base uncased checkpoint.
The model was trained on the GMB (Groningen Meaning Bank) corpus for entity recognition. The GMB dataset is a fairly large corpus with many annotations. Note that GMB is not completely human-annotated and is not considered 100% correct. The data is labeled using the IOB format (short for inside, outside, beginning). The following classes appear in the dataset: GPE, LOC, ORG, PER, TIME, ART, EVE, and NAT.
For this model, the classes ART, EVE, and NAT were combined into a MISC class due to the small number of examples for these classes.
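The IOB labeling and the merge into MISC can be sketched as follows. This is an illustration only, not the preprocessing script used for the model: the helper to_iob, the span representation (start index, exclusive end index, class), and the example annotations are assumptions made for the sketch.

```python
# Classes that this model card says were folded into MISC.
MISC_SOURCES = {"ART", "EVE", "NAT"}


def to_iob(words, spans):
    """Convert entity spans to one IOB tag per word.

    spans: list of (start, end_exclusive, entity_class) over word indices.
    The first word of an entity gets B-<class>, the rest get I-<class>,
    and every word outside an entity gets O.
    """
    tags = ["O"] * len(words)
    for start, end, entity_class in spans:
        if entity_class in MISC_SOURCES:
            entity_class = "MISC"
        tags[start] = "B-" + entity_class
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_class
    return tags


# Sentence from the task description, with hand-written spans.
sentence = "Mary lives in Santa Clara and works at NVIDIA".split()
sentence_tags = to_iob(sentence, [(0, 1, "PER"), (3, 5, "LOC"), (8, 9, "ORG")])

# Hypothetical EVE annotation, to show the merge into MISC.
event_words = ["the", "Olympic", "Games"]
event_tags = to_iob(event_words, [(1, 3, "EVE")])

for word, tag in zip(sentence, sentence_tags):
    print(word, tag)
```

With this scheme, a two-word entity such as "Santa Clara" receives B-LOC on the first word and I-LOC on the second, which is why both B- and I- variants of each class appear in the evaluation report.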
The pre-processed data that was used for training and evaluation can be found here.
Each word in the input sequence may be split into one or more tokens. As a result, there are two possible ways to evaluate the model:
Here, the first approach was applied, and the prediction for the first token of each input word was used to label the whole word. Because the classes are highly imbalanced, the suggested metric for this model is the F1 score (with macro averaging).
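The first-token rule can be sketched in a few lines of Python. The function name word_level_labels, the word_ids convention (None for special tokens, otherwise the index of the word a token came from), and the example tokenization of "nvidia" into two sub-tokens are assumptions for illustration. They do not reproduce the NeMo evaluation code.

```python
def word_level_labels(word_ids, token_predictions):
    """Keep only the prediction made for the first sub-token of each word.

    word_ids:          for each token, the index of its source word,
                       or None for special tokens such as [CLS] and [SEP]
    token_predictions: one predicted label per token
    """
    labels = []
    previous_word = None
    for word_id, prediction in zip(word_ids, token_predictions):
        if word_id is None:
            continue  # special token, not part of any word
        if word_id != previous_word:
            labels.append(prediction)  # first sub-token of a new word
        previous_word = word_id
    return labels


# Hypothetical tokenization of "nvidia is in santa clara":
# [CLS] nv ##idia is in santa clara [SEP]
example_word_ids = [None, 0, 0, 1, 2, 3, 4, None]
example_token_preds = ["O", "B-ORG", "O", "O", "O", "B-LOC", "I-LOC", "O"]

example_word_labels = word_level_labels(example_word_ids, example_token_preds)
print(example_word_labels)
```

In this example the prediction for the continuation piece "##idia" is ignored, so the word "nvidia" is labeled B-ORG regardless of what the model predicted for its second sub-token.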
Evaluation on the GMB dataset dev set:
                        precision    recall  f1-score   support
    O (label id: 0)        0.9913    0.9917    0.9915    131141
    B-GPE (label id: 1)    0.9574    0.9420    0.9496      2362
    B-LOC (label id: 2)    0.8402    0.9037    0.8708      5346
    B-MISC (label id: 3)   0.4124    0.3077    0.3524       130
    B-ORG (label id: 4)    0.7732    0.6805    0.7239      2980
    B-PER (label id: 5)    0.8335    0.8510    0.8422      2577
    B-TIME (label id: 6)   0.9176    0.9133    0.9154      2975
    I-GPE (label id: 7)    0.8889    0.3478    0.5000        23
    I-LOC (label id: 8)    0.7782    0.7835    0.7808      1030
    I-MISC (label id: 9)   0.3036    0.2267    0.2595        75
    I-ORG (label id: 10)   0.7712    0.7466    0.7587      2384
    I-PER (label id: 11)   0.8710    0.8820    0.8765      2687
    I-TIME (label id: 12)  0.8255    0.8273    0.8264       938

    accuracy                                   0.9689    154648
    macro avg              0.7818    0.7234    0.7421    154648
    weighted avg           0.9685    0.9689    0.9686    154648
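The gap between the macro and weighted averages in the report follows directly from how each is computed. The sketch below recomputes both F1 averages from the per-class rows of the report. The variable and function names are illustrative, and because the inputs are already rounded to four decimals, the results should agree with the report only to about three decimal places.

```python
# (label, f1-score, support) copied from the per-class rows of the report.
ROWS = [
    ("O", 0.9915, 131141),
    ("B-GPE", 0.9496, 2362),
    ("B-LOC", 0.8708, 5346),
    ("B-MISC", 0.3524, 130),
    ("B-ORG", 0.7239, 2980),
    ("B-PER", 0.8422, 2577),
    ("B-TIME", 0.9154, 2975),
    ("I-GPE", 0.5000, 23),
    ("I-LOC", 0.7808, 1030),
    ("I-MISC", 0.2595, 75),
    ("I-ORG", 0.7587, 2384),
    ("I-PER", 0.8765, 2687),
    ("I-TIME", 0.8264, 938),
]


def macro_average(rows):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_average(rows):
    """Support-weighted mean: frequent classes dominate the result."""
    total_support = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total_support


macro_f1 = macro_average(ROWS)
weighted_f1 = weighted_average(ROWS)
print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

Since the O class accounts for most of the support, the weighted average mostly reflects how well O is predicted, while the macro average exposes the weaker rare classes such as B-MISC and I-MISC.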
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
The model takes text as input.
The model outputs category labels for each token in the input text.
    import nemo
    import nemo.collections.nlp as nemo_nlp

    # Download the pre-trained checkpoint and restore the model.
    model = nemo_nlp.models.TokenClassificationModel.from_pretrained(model_name="ner_en_bert")

    # Run inference on a list of input sentences.
    model.add_predictions(['we bought four shirts from the nvidia gear store in santa clara.', 'NVIDIA is a company.'])
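If the annotated text needs to be post-processed, something like the following sketch may help. It assumes, and this should be checked against the installed NeMo version, that each returned string marks entity words by appending the label in square brackets (for example word[B-LOC]) and leaves other words unchanged. The helper parse_annotated and the sample string are hand-written for illustration and are not actual model output.

```python
import re

# A word followed by a bracketed IOB label and optional trailing punctuation.
TAGGED_WORD = re.compile(r"^(?P<word>.*?)\[(?P<label>[BI]-[A-Z]+)\](?P<trail>\W*)$")


def parse_annotated(line):
    """Split an annotated line into (word, label) pairs.

    Words without a bracketed label are assigned "O".
    """
    pairs = []
    for chunk in line.split():
        match = TAGGED_WORD.match(chunk)
        if match:
            pairs.append((match.group("word"), match.group("label")))
        else:
            pairs.append((chunk, "O"))
    return pairs


# Hand-written sample in the assumed format (not real model output).
sample_line = "Mary[B-PER] lives in Santa[B-LOC] Clara[I-LOC] and works at NVIDIA[B-ORG]."
sample_pairs = parse_annotated(sample_line)
for word, label in sample_pairs:
    print(word, label)
```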
The length of the input text is currently constrained by the maximum sequence length of the BERT base uncased model, which is 512 tokens after tokenization.
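One common way to work within a 512-token limit is to split a long token sequence into windows before inference. The sketch below is a generic illustration, not a NeMo feature: the function split_into_windows is hypothetical, and the special-token ids 101 ([CLS]) and 102 ([SEP]) are assumed to match the vocabulary in use.

```python
def split_into_windows(token_ids, max_len=512, cls_id=101, sep_id=102):
    """Split token ids into consecutive windows that fit the length limit.

    Two positions per window are reserved for the [CLS] and [SEP] tokens,
    so each window carries at most max_len - 2 content tokens.
    """
    body_len = max_len - 2
    if body_len <= 0:
        raise ValueError("max_len must leave room for at least one content token")
    return [
        [cls_id] + token_ids[start:start + body_len] + [sep_id]
        for start in range(0, len(token_ids), body_len)
    ]


# 1200 placeholder content tokens (the id values are arbitrary).
long_input = list(range(1000, 2200))
windows = split_into_windows(long_input)
print([len(window) for window in windows])
```

A plain split like this can cut an entity in half at a window boundary, so in practice it may be preferable to split at sentence boundaries or to use overlapping windows.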