
BERT Base Uncased


Description

BERT Base Uncased trained on English Wikipedia and BookCorpus

Publisher

NVIDIA

Latest Version

1.0.0rc1

Modified

April 4, 2023

Size

390.62 MB

Model Overview

This is a pre-trained autoencoding language model trained on English Wikipedia and BookCorpus using a sequence length of 512. The model is based on the architecture presented in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" [1].

Model Architecture

The model is based on the architecture presented in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" [1]. This particular instance has 12 Transformer blocks and uses the WordPiece tokenizer [2].
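
The effect of WordPiece tokenization is easy to inspect. The short sketch below is an illustration only: it uses the Hugging Face transformers tokenizer for the equivalent "bert-base-uncased" vocabulary, which is not part of this card; the NeMo checkpoint ships its own tokenizer (see "How to use this model" below).

# Illustration only (assumes the Hugging Face transformers package, not referenced by this card).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Uncommon words are split into subword pieces; continuation pieces carry a "##" prefix.
print(tokenizer.tokenize("Tokenization splits uncommon words into subwords."))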

Training

The model was trained from scratch on preprocessed English Wikipedia and BookCorpus using a sequence length of 512.

Dataset

The training data consists of English Wikipedia and BookCorpus, preprocessed with the NVIDIA Deep Learning Examples scripts [4].

Performance

The accuracy of language models is often measured on downstream tasks such as SQuAD [3]. On SQuAD v1.1 this model reaches EM = 82.78 and F1 = 89.97; on SQuAD v2.0 it reaches EM = 75.04 and F1 = 78.08.

How to use this model

The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo
import nemo.collections.nlp as nemo_nlp

# Download the checkpoint from NGC (cached locally after the first call) and restore the model.
model = nemo_nlp.models.language_modeling.BERTLMModel.from_pretrained(model_name="bertbaseuncased")
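
Once restored, the model object also carries the WordPiece tokenizer it was trained with, so text can be converted to tokens and IDs directly. The lines below are a minimal sketch that assumes the restored configuration includes the tokenizer; the method names follow NeMo's TokenizerSpec interface (text_to_tokens, text_to_ids), and list_available_models() is the standard NeMo way to enumerate checkpoints usable with from_pretrained().

# Minimal sketch (assumes the restored config includes the tokenizer).
# Enumerate checkpoint names accepted by from_pretrained():
for info in nemo_nlp.models.language_modeling.BERTLMModel.list_available_models():
    print(info.pretrained_model_name)

# Convert text to WordPiece tokens and token IDs with the model's own tokenizer.
text = "NeMo makes BERT easy to use."
print(model.tokenizer.text_to_tokens(text))
print(model.tokenizer.text_to_ids(text))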

Training the model

python [NEMO_GIT_FOLDER]/examples/nlp/language_modeling/bert_pretraining.py --config-name=bert_pretraining_from_preprocessed_config.yaml

Input

The model takes preprocessed data as input.

Output

The model outputs the masked language model loss and, optionally, the next sentence prediction loss.

Limitations

The length of the input text is currently constrained by the maximum sequence length of the model, which is 512 tokens after tokenization.
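
In practice this means that longer inputs must be truncated (or split into overlapping windows) before being passed to the model, and the 512-token budget includes the special [CLS] and [SEP] tokens. The sketch below is an illustration only and again uses the Hugging Face tokenizer as a stand-in for the checkpoint's own tokenizer.

# Illustration only: truncate long inputs to the 512-token maximum.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 1000  # deliberately longer than the model accepts
encoded = tokenizer(long_text, max_length=512, truncation=True)
print(len(encoded["input_ids"]))  # 512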

References

[1] https://arxiv.org/pdf/1810.04805.pdf

[2] https://arxiv.org/abs/1609.08144

[3] https://rajpurkar.github.io/SQuAD-explorer/

[4] https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh

[5] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.