Automatic Speech Recognition (ASR) systems typically generate text without punctuation or capitalization. Besides being hard to read, such output is often passed on to named entity recognition, machine translation or text-to-speech models. If that input text is correctly punctuated and capitalized, the performance of these downstream models could improve.
For each word in the input text, the model:
- predicts a punctuation mark that should follow the word (if any). The model supports commas, periods and question marks.
- predicts if the word should be capitalized or not.
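To make the two per-word predictions concrete, here is a minimal sketch of how they could be combined back into readable text. The label format (a punctuation string and a boolean flag per word) and the `restore` helper are assumptions made for illustration, not the model's actual output format.

```python
from typing import List


def restore(words: List[str], punct: List[str], capitalize: List[bool]) -> str:
    """Rebuild text from hypothetical per-word labels.

    punct[i] is "", ",", "." or "?" and follows words[i];
    capitalize[i] says whether words[i] starts with an upper-case letter.
    """
    if not (len(words) == len(punct) == len(capitalize)):
        raise ValueError("words and both label lists must have the same length")
    out = []
    for word, mark, cap in zip(words, punct, capitalize):
        # Upper-case only the first character, leave the rest untouched.
        token = word[:1].upper() + word[1:] if cap else word
        out.append(token + mark)
    return " ".join(out)


words = "great how about you".split()
print(restore(words, [".", "", "", "?"], [True, True, False, False]))
# Great. How about you?
```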
The Punctuation and Capitalization model consists of the pre-trained DistilBERT followed by two token classification heads. One classification head is responsible for the punctuation task, the other one handles the capitalization task. Both token-level classification heads take the DistilBERT encoded representation of each input token as input. This architecture allows the model to solve both tasks with a single pass through DistilBERT. All parameters are fine-tuned jointly on the two tasks.
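The idea of two heads sharing one encoder pass can be sketched in plain Python. This is a toy illustration only: the encoder is omitted, the 2-dimensional token vectors and the head weights are made up, and the names `tag_tokens`, `PUNCT_LABELS` and `CAPIT_LABELS` are hypothetical.

```python
from typing import List, Sequence, Tuple

PUNCT_LABELS = ["", ",", ".", "?"]   # no mark, comma, period, question mark
CAPIT_LABELS = ["lower", "upper"]    # keep as is, or capitalize the word

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
Head = Tuple[Matrix, Vector]         # (weights, bias) of one linear head


def linear(x: Vector, weights: Matrix, bias: Vector) -> List[float]:
    """One score per output label: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def argmax(scores: Sequence[float]) -> int:
    return max(range(len(scores)), key=lambda i: scores[i])


def tag_tokens(encoded: Sequence[Vector], punct_head: Head, capit_head: Head):
    """Apply both heads to the same per-token vectors from one encoder pass."""
    punct = [PUNCT_LABELS[argmax(linear(h, *punct_head))] for h in encoded]
    capit = [CAPIT_LABELS[argmax(linear(h, *capit_head))] for h in encoded]
    return punct, capit


# Toy "encoder output" for two tokens, plus made-up head weights.
encoded = [[1.0, 0.0], [0.0, 1.0]]
punct_head = ([[1, 0], [0, 0], [0, 0], [0, 1]], [0, 0, 0, 0])
capit_head = ([[0, 1], [1, 0]], [0, 0])
print(tag_tokens(encoded, punct_head, capit_head))
# (['', '?'], ['upper', 'lower'])
```

The encoder output is computed once and read by both heads, which is why adding the second task costs only one extra linear layer at inference time.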
The model was initialized from the Hugging Face DistilBERT base uncased checkpoint.
The model was trained on a subset of data from the following sources:
- Tatoeba sentences
- Books from Project Gutenberg that were used as part of the LibriSpeech corpus
- Transcripts from Fisher English Training Speech
Each word in the input sequence can be split into one or more tokens. As a result, there are two possible ways to evaluate the model:
- marking the whole entity as a single label
- performing the evaluation at the sub-token level
During training, the first approach was applied, and the prediction for the first token of each word was used to label the whole word. Each task is evaluated separately. Because the classes are highly imbalanced, the suggested metric for this model is the macro-averaged F1 score.
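Labeling a whole word from its first sub-token can be sketched as follows. The `word_ids` input (the index of the word each sub-token belongs to, with special tokens already removed) and the example split are assumptions for illustration; a real tokenizer may split the words differently.

```python
from typing import List


def word_level_labels(subtoken_labels: List[str], word_ids: List[int]) -> List[str]:
    """Keep the label predicted for the first sub-token of each word."""
    if len(subtoken_labels) != len(word_ids):
        raise ValueError("need exactly one word id per sub-token label")
    labels = []
    previous = None
    for label, word_id in zip(subtoken_labels, word_ids):
        if word_id != previous:      # first sub-token of a new word
            labels.append(label)
            previous = word_id
        # predictions for the remaining sub-tokens of the word are ignored
    return labels


# Illustrative split: "unbelievable news" -> ["un", "##believ", "##able", "news"]
word_ids = [0, 0, 0, 1]
subtoken_labels = [",", "", "?", "."]
print(word_level_labels(subtoken_labels, word_ids))
# [',', '.']
```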
This model was evaluated on an internal dataset, where it reached an F1 score of 77%.
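For reference, a minimal sketch of macro-averaged F1: the F1 score is computed per class and then averaged with equal weight, so rare punctuation classes count as much as the dominant "no punctuation" class. The labels and predictions below are made-up toy data, not results from this model.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never matches.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: "O" means no punctuation after the word.
y_true = ["O", "O", "O", "O", ",", "."]
y_pred = ["O", "O", "O", "O", "O", "."]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))
# 0.833 0.63
```

In this toy case the missed comma barely moves accuracy but pulls macro F1 down sharply, which is the behavior wanted for imbalanced classes.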
How to use this model
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically load the model from NGC
import nemo.collections.nlp as nemo_nlp
model = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained(model_name="punctuation_en_distilbert")
Use the model to add punctuation and capitalization
model.add_punctuation_capitalization(['how are you', 'great how about you'])
Input: lowercased English text without punctuation marks.
Output: the same text with punctuation and capitalization restored.
The length of the input text is currently constrained by the maximum sequence length of the DistilBERT base uncased model, which is 512 tokens after tokenization. The punctuation model supports commas, periods and question marks.
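One way to work within the 512-token limit is to split long input into chunks before calling the model. The sketch below is an assumption-laden illustration, not part of the model's API: `split_by_token_budget` is a hypothetical helper, `count_tokens` stands in for whatever tokenizer is used to count sub-tokens per word, and two positions are reserved for the [CLS] and [SEP] special tokens that BERT-style tokenizers add.

```python
from typing import Callable, List


def split_by_token_budget(
    words: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
    reserved: int = 2,
) -> List[List[str]]:
    """Greedily pack words into chunks whose sub-token count fits the budget."""
    budget = max_tokens - reserved
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for word in words:
        n = count_tokens(word)
        if n > budget:
            raise ValueError(f"word {word!r} alone exceeds the token budget")
        if current and used + n > budget:
            chunks.append(current)   # close the full chunk, start a new one
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(current)
    return chunks


# Toy run: every word counts as one token, budget of 5 - 2 = 3 tokens per chunk.
print(split_by_token_budget(list("abcdefg"), lambda w: 1, max_tokens=5))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

Cutting at arbitrary word boundaries may reduce prediction quality near chunk edges, since the model loses context there; overlapping chunks are a possible refinement.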
Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).