Linux / amd64
The Domain Specific - NeMo Automatic Speech Recognition (ASR) Application facilitates training, evaluation, and performance comparison of ASR models. It lets you train or fine-tune pre-trained ASR models with your own data, so you can build ASR models tailored to your domain. Unlike with a “black box” ASR tool available in the cloud, you keep complete control over your data, which lets you create better-performing ASR models for your use case.
The Domain Specific - NeMo ASR Application is a packaged, easy-to-use, end-to-end ASR system that facilitates:
We use NVIDIA Neural Modules (NeMo) as the underlying ASR engine. NeMo is a toolkit for building Conversational AI applications. Through modular Deep Neural Network (DNN) development, NeMo enables fast experimentation by connecting, mixing, and matching modules. Neural Modules are conceptual blocks of neural networks that take typed inputs and produce typed outputs. They typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. The toolkit comes with extendable collections of pre-built modules for automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS).
Please review the following resources. You can also use them to post your questions.
Getting started with the application is very simple.
This image contains the complete Domain Specific NeMo ASR application (including NeMo, notebooks, tools and scripts).
1. Download the container from NGC
docker pull nvcr.io/nvidia/nemo_asr_app_img:20.07
2. Run the application
To run the application, you need to mount the directory where the training datasets live and the results will be saved to. In the run command below, we use the
/data/asr directory as the
export DATA_DIR="/data/asr"
docker run --gpus all -it --rm --name run_nemo_asr_app \
  --ipc=host \
  --env DATA_DIR="$DATA_DIR" \
  -v "$DATA_DIR:$DATA_DIR" \
  -p 8888:8888 nvcr.io/nvidia/nemo_asr_app_img:20.07
Note: Depending on your Docker version, you may have to use ‘docker run --runtime=nvidia’ or remove ‘--gpus all’.
In our application, we provide a complete end-to-end workflow to enable domain adaptation of ASR models using your own data. This is done through the following steps:
The application is divided into the following structure:
In our application we cover the following use cases:
English to Spanish Cross Language Finetuning:
In this use case, we fine-tune an English acoustic model with Spanish data to create a state-of-the-art (SOTA) Spanish ASR model. Specifically, we fine-tune a pre-trained English ASR model, trained on five English datasets, with the open-source Spanish dataset from Common Voice. Common Voice is a multi-language dataset that anyone can use to train speech-enabled applications.
Wall Street Journal Finetuning:
For this second use case, we perform transfer learning, or domain adaptation, from old fiction books to modern business news. We use a pre-trained model, QuartzNet 15x5, found in NGC. This model is pre-trained on the open-source English datasets LibriSpeech and Mozilla Common Voice (English). We fine-tune this model with The Wall Street Journal (WSJ) news datasets, WSJ CSR-1 and WSJ CSR-2. To run this use case, you need to bring your own dataset or download and prepare the WSJ datasets used in our example, WSJ CSR-1 and WSJ CSR-2. Note that you need a license to download the WSJ datasets.
To simplify and enable reproducibility of the ASR workflow, our application allows you to create a project which enables the tracking of datasets, models and configurations. Everything related to a project is saved to disk in a manifest that can be accessed through its
project_id. At the start of the project, the manifest is pre-populated with the baseline pre-trained model.
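As a rough illustration of the project idea described above, the following Python sketch writes a manifest to disk, keys it by a project ID, and pre-populates it with a baseline model. It is not the application's actual manifest format or API: the function names (`create_project`, `load_project`), the JSON layout, the field names, and the example model name are all hypothetical and chosen only for this example.

```python
import json
import tempfile
from pathlib import Path


def create_project(root, project_id, baseline_model):
    """Write a new manifest, pre-populated with the baseline model (hypothetical layout)."""
    manifest = {
        "project_id": project_id,
        "models": [{"name": baseline_model, "role": "baseline"}],
        "datasets": [],
        "configs": [],
    }
    path = Path(root) / f"{project_id}.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


def load_project(root, project_id):
    """Read a manifest back from disk using only its project ID."""
    path = Path(root) / f"{project_id}.json"
    return json.loads(path.read_text())


# Usage: create a project in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as root:
    create_project(root, "wsj_finetune", "quartznet_15x5_baseline")
    manifest = load_project(root, "wsj_finetune")
    print(manifest["models"][0]["name"])
```

Keeping everything in one file per project means a later run can look up the datasets, models, and configurations from the ID alone, which is the reproducibility benefit the text describes.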
Additionally, we enable experiment tracking with Weights & Biases, which provides in-depth hyperparameter tracking and collaboration on projects and experiments.
The NeMo ASR acoustic model used in this application is QuartzNet. The QuartzNet model is based on Jasper and can achieve the same performance with fewer parameters (about 19M instead of about 333M). This model consists of separable convolutions and larger filters, and is often denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D separable convolution, batch normalization, ReLU, and dropout. To learn more about NeMo’s ASR models refer to this tutorial.
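To make the effect of separable convolutions concrete, the sketch below compares the weight count of a standard 1-D convolution with that of a depthwise separable one (a per-channel depthwise filter followed by a 1x1 pointwise convolution). The channel count and kernel size are illustrative values picked for this example, not a statement of the model's configuration, and biases and batch-normalization parameters are ignored.

```python
def conv1d_weights(in_channels, out_channels, kernel_size):
    """Weights in a standard 1-D convolution (bias ignored)."""
    return in_channels * out_channels * kernel_size


def separable_conv1d_weights(in_channels, out_channels, kernel_size):
    """Weights in a depthwise separable 1-D convolution (bias ignored)."""
    depthwise = in_channels * kernel_size    # one filter per input channel
    pointwise = in_channels * out_channels   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer shape (an assumption for this sketch).
channels, kernel = 256, 33

standard = conv1d_weights(channels, channels, kernel)
separable = separable_conv1d_weights(channels, channels, kernel)

print(f"standard conv weights:  {standard:,}")
print(f"separable conv weights: {separable:,}")
print(f"reduction factor:       {standard / separable:.1f}x")

# In the QuartzNet_[BxR] naming, a 15x5 model has B * R sub-blocks inside its blocks.
B, R = 15, 5
print(f"sub-blocks within blocks: {B * R}")
```

For this layer shape the separable version needs roughly 29 times fewer weights, which shows how a model built from such layers can shrink substantially while still using large filters.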
Jasper and QuartzNet are CTC-based end-to-end models, which can predict a transcript directly from an audio input, without additional alignment information.
To learn more about the NeMo ASR engine training and evaluation workflows, refer to the
jasper_eval.py scripts inside the tools folder.
Image Source: CTC Networks and Language Models: Prefix Beam Search Explained
The typical ASR-CTC pipeline is shown in the figure above. Here, the CTC network turns the audio input into a CTC probability matrix, in which each column represents a timestep and each row corresponds to a letter in our alphabet. The probabilities in each column (across all letters) sum to 1; in other words, the output layer is a softmax applied at every timestep. For prediction using max decoding (also called greedy decoding), the letter with the highest probability at each timestep is chosen. Next, repeated characters are collapsed and blank tokens are discarded. Additionally, a language model can be used to resolve ambiguities in the transcription: a prefix beam search over the softmax output incorporates the linguistic knowledge that the language model provides. To learn more refer to this link.
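The greedy decoding steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the application's or NeMo's decoder: the four-symbol alphabet, the `_` blank symbol, and the probability values are assumptions made up for this example.

```python
# Toy alphabet; "_" stands for the CTC blank token (an assumption for this sketch).
BLANK = "_"
ALPHABET = [BLANK, "a", "c", "t"]


def greedy_ctc_decode(prob_matrix, alphabet=ALPHABET, blank=BLANK):
    """Greedy (max) decoding; prob_matrix[r][t] is P(alphabet[r]) at timestep t."""
    num_steps = len(prob_matrix[0])

    # 1. Pick the most probable symbol in every column (timestep).
    best_path = []
    for t in range(num_steps):
        column = [row[t] for row in prob_matrix]
        best_path.append(alphabet[column.index(max(column))])

    # 2. Collapse consecutive repeats of the same symbol.
    collapsed = []
    previous = None
    for symbol in best_path:
        if symbol != previous:
            collapsed.append(symbol)
        previous = symbol

    # 3. Discard blank tokens.
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Rows follow ALPHABET order; each column sums to 1.
PROBS = [
    [0.1, 0.2, 0.7, 0.10, 0.1, 0.2],  # "_"
    [0.1, 0.1, 0.1, 0.80, 0.1, 0.1],  # "a"
    [0.7, 0.6, 0.1, 0.05, 0.1, 0.1],  # "c"
    [0.1, 0.1, 0.1, 0.05, 0.7, 0.6],  # "t"
]

print(greedy_ctc_decode(PROBS))  # best path "cc_att" collapses to "cat"
```

The blank token is what allows genuine double letters to survive the collapse step: a best path of `a`, `_`, `a` decodes to `aa`, while `a`, `a` decodes to a single `a`.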
The CTC-ASR training pipeline followed by NeMo is shown in the following figure:
Image Source: NVIDIA Neural Modules: NeMo