
speech-to-text-training.ipynb

Train Adapt Optimize (TAO) Toolkit

Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for taking purpose-built, pre-trained AI models and customizing them with your own data.

Transfer learning extracts learned features from an existing neural network and reuses them in a new one. It is often used when creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent AI apps and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

Train Adapt Optimize (TAO) Toolkit

The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which enables data scientists to run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Automatic Speech Recognition!

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is often the first step in building a Conversational AI model. An ASR model converts audible speech into text. The main metric for these models is Word Error Rate (WER), which we aim to minimize while transcribing the text. Simply put, the goal is to take an audio file and transcribe it.
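
As a quick illustration of the metric, WER is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. The cell below is a minimal, self-contained sketch of that computation for clarity only; TAO reports WER for you during evaluation, so this is not part of the workflow.

In [ ]:
# Minimal, illustrative WER computation: word-level edit distance / number of reference words.
# TAO computes WER automatically during evaluation; this cell is only to clarify the metric.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("the" -> "a") over 6 reference words: WER ~ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))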

In this work, we are going to discuss the CitriNet model, an end-to-end ASR model that takes in audio and produces text.

CitriNet is a descendant of QuartzNet that adds squeeze-and-excitation (SE) blocks and subword tokenization, and it achieves better accuracy and performance than QuartzNet.

CitriNet with CTC
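
To give a feel for the squeeze-and-excitation mechanism mentioned above, the cell below sketches a minimal 1D SE block in PyTorch. This is an illustrative sketch only, under our own simplifying assumptions; it is not the actual CitriNet/NeMo implementation used by TAO.

In [ ]:
# Minimal 1D squeeze-and-excitation (SE) block sketch in PyTorch.
# Illustrative only -- not the actual CitriNet/NeMo implementation.
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, time)
        scale = self.fc(x.mean(dim=-1))    # "squeeze" over time, then "excite" per channel
        return x * scale.unsqueeze(-1)     # re-weight the channels

se = SqueezeExcite1d(channels=256)
print(se(torch.randn(4, 256, 100)).shape)  # torch.Size([4, 256, 100])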


Let's Dig in: ASR using TAO

Installing and setting up TAO

For ease of use, please install TAO inside a python virtual environment. We recommend performing this step first and then launching the notebook from the virtual environment.

In addition to installing the TAO Python package, please make sure the following software requirements are met:

  1. python 3.6.9
  2. docker-ce > 19.03.5
  3. docker-API 1.40
  4. nvidia-container-toolkit > 1.3.0-1
  5. nvidia-container-runtime > 3.4.0-1
  6. nvidia-docker2 > 2.5.0-1
  7. nvidia-driver >= 455.23

Let's install TAO. It is a simple pip install!

In [1]:
! pip install nvidia-pyindex
! pip install nvidia-tao

After installing TAO, the next step is to set up the mounts for TAO. The TAO launcher uses docker containers under the hood, and for our data and results directories to be visible to the docker container, they need to be mapped. The launcher can be configured using the config file ~/.tao_mounts.json. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO launcher.

IMPORTANT NOTE: The code below creates a sample ~/.tao_mounts.json file. Here, we can map directories in which we save the data, specs, results and cache. You should configure it for your specific case so these directories are correctly visible to the docker container.

In [2]:
# please define these paths on your local host machine
%env HOST_DATA_DIR=/path/to/your/host/data
%env HOST_SPECS_DIR=/path/to/your/host/specs
%env HOST_RESULTS_DIR=/path/to/your/host/results
In [3]:
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR
In [4]:
# Map the local directories into the TAO docker container.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)
In [5]:
!cat ~/.tao_mounts.json

You can check the docker image versions and the tasks that they perform. You can also check this out with tao --help or the command below:

In [6]:
! tao info --verbose

Set Relevant Paths

In [7]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key, and use the same key for all commands
KEY = 'tlt_encode'

Now that everything is set up, we would like to take a bit of time to explain the tao interface for ease of use. The command structure can be broken down as follows: tao <task name> <subcommand>

Let's see this in further detail.
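
For example, all of the commands used in this notebook follow this pattern; each is run in full in the cells below:

  • tao speech_to_text_citrinet download_specs <args>
  • tao speech_to_text_citrinet dataset_convert <args>
  • tao speech_to_text_citrinet train <args>
  • tao speech_to_text_citrinet evaluate <args>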

Downloading Specs

TAO's Conversational AI Toolkit works off of spec files, which make it easy to edit hyperparameters on the fly. We can proceed to download the spec files. The user may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the download_specs command.

The -o argument indicates the folder where the default specification files will be downloaded, and -r instructs the script where to save the logs. Make sure -o points to an empty folder!

In [8]:
# NOTE: delete the specs directory first if it already exists, to avoid errors
! tao speech_to_text_citrinet download_specs \
    -r $RESULTS_DIR/speech_to_text_citrinet \
    -o $SPECS_DIR/speech_to_text_citrinet

Download Data

For the purposes of demonstration we will use the popular AN4 dataset. Let's download it.

In [9]:
! wget http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

After downloading, untar the dataset, and move it to the correct directory.

In [10]:
! tar -xvf an4_sphere.tar.gz 
! mv an4 $HOST_DATA_DIR

Pre-Processing

This step converts the sphere (.sph) audio files into .wav files and splits the data into training and testing sets. It also generates manifest ("meta-data") files to be consumed by the dataloader for training and testing.

In [11]:
! tao speech_to_text_citrinet dataset_convert \
    -e $SPECS_DIR/speech_to_text_citrinet/dataset_convert_an4.yaml \
    -r $RESULTS_DIR/citrinet/dataset_convert \
    source_data_dir=$DATA_DIR/an4 \
    target_data_dir=$DATA_DIR/an4_converted
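
As a quick sanity check, you can peek at the generated training manifest from the host before training. This is a hedged sketch: the path follows from the target_data_dir used above, and NeMo-style manifests are JSON-lines files that typically contain fields such as audio_filepath, duration, and text; adjust if your version differs.

In [ ]:
# Inspect the first few entries of the generated manifest (JSON-lines format).
# The path is derived from target_data_dir above; the field names (audio_filepath,
# duration, text) are the usual NeMo manifest keys and may differ by version.
import json
import os

manifest = os.path.join(os.environ["HOST_DATA_DIR"], "an4_converted", "train_manifest.json")
with open(manifest) as f:
    for line in list(f)[:3]:
        print(json.dumps(json.loads(line), indent=2))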

Let's take a listen to a sample audio file.

In [12]:
# change path of the file here
import os
import IPython.display as ipd
path = os.environ["HOST_DATA_DIR"] + '/an4_converted/wavs/an268-mbmg-b.wav'
ipd.Audio(path)

The training commands for CitriNet are similar to those of QuartzNet. Let's have a look!

Training

Create Tokenizer

Before we can do the actual training, we need to pre-process the text. This step, called subword tokenization, creates a subword vocabulary for the text. This is different from Jasper/QuartzNet, where only single characters are regarded as elements of the vocabulary, whereas in CitriNet a subword can be one or more characters. We can use the create_tokenizer command to create the tokenizer that generates the subword vocabulary for us to use in training below.

In [13]:
!tao speech_to_text_citrinet create_tokenizer \
-e $SPECS_DIR/speech_to_text_citrinet/create_tokenizer.yaml \
-r $RESULTS_DIR/citrinet/create_tokenizer \
manifests=$DATA_DIR/an4_converted/train_manifest.json \
output_root=$DATA_DIR/an4 \
vocab_size=32
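
If you would like to see what the resulting subword vocabulary looks like, you can load the generated SentencePiece model from the host and tokenize a sample phrase. This is an optional, hedged sketch: it assumes the sentencepiece package is installed and that the tokenizer model is written as tokenizer.model inside the tokenizer_spe_unigram_v32 output directory, which may vary by version.

In [ ]:
# Optional: inspect the subword vocabulary produced by create_tokenizer.
# Assumes `pip install sentencepiece` and that the model file is named tokenizer.model
# inside the tokenizer_spe_unigram_v32 output directory (names may differ by version).
import os
import sentencepiece as spm

tok_path = os.path.join(os.environ["HOST_DATA_DIR"], "an4", "tokenizer_spe_unigram_v32", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=tok_path)
print("vocab size:", sp.get_piece_size())
print(sp.encode("please call an ambulance", out_type=str))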

We have a very neat interface that allows the end user to configure training parameters from the command line.

The process of opening the training script, finding the parameters of interest (which might be spread across multiple files), making the changes needed, and double-checking everything is replaced by a much easier-to-use and more visible command-line interface.

For instance, if the number of epochs needs to be modified along with a change in learning rate, the user can add trainer.max_epochs=10 and optim.lr=0.02 to the command and train the model. Sample commands are given below.

A list of some of the customizable parameters along with their default values is as follows:

trainer:

  • gpus: 1
  • num_nodes: 1
  • max_epochs: 5
  • max_steps: null
  • checkpoint_callback: false

training_ds:

  • sample_rate: 16000
  • batch_size: 32
  • trim_silence: true
  • max_duration: 16.7
  • shuffle: true
  • is_tarred: false
  • tarred_audio_filepaths: null

validation_ds:

  • sample_rate: 16000
  • batch_size: 32
  • shuffle: false

optim:

  • name: adam
  • lr: 0.1
  • betas: [0.9, 0.999]
  • weight_decay: 0.0001

The steps below might take considerable time depending on the GPU being used. For best experience, we recommend using an A100 GPU.

For training an ASR CitriNet model in TAO, we use the tao speech_to_text_citrinet train command with the following args:

  • -e : Path to the spec file
  • -g : Number of GPUs to use
  • -r : Path to the results folder
  • -m : Path to the model
  • -k : User specified encryption key to use while saving/loading the model
  • Any overrides to the spec file, e.g. trainer.max_epochs

Training CitriNet

In [14]:
!tao speech_to_text_citrinet train \
     -e $SPECS_DIR/speech_to_text_citrinet/train_citrinet_bpe.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/citrinet/train \
     training_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     training_ds.num_workers=4 \
     validation_ds.num_workers=4 \
     model.tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v32

ASR evaluation

Now that we have a model trained, we need to check how well it performs.

In [15]:
!tao speech_to_text_citrinet evaluate \
     -e $SPECS_DIR/speech_to_text_citrinet/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/evaluate \
     test_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json

ASR finetuning

Once the model is trained and evaluated and there is a need for fine-tuning, the following command can be used to fine-tune the ASR model. This step can also be used for transfer learning by making changes in the train.json and dev.json files to add new data.

The list of customizations is the same as for the training parameters, with the exception of parameters that affect the model architecture. Also, instead of training_ds we have finetuning_ds.

Note: If you wish to proceed with a model trained on a larger dataset for better inference results, you can find a .nemo model here.

Simply rename the .nemo file to .tlt and pass it through the finetune pipeline.
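
For example, assuming the downloaded checkpoint is named stt_en_citrinet.nemo (a placeholder name), the rename is just a move; you would then point the -m argument of the finetune command at the renamed file.

In [ ]:
# Hypothetical example: rename a downloaded .nemo checkpoint so the TAO finetune
# command can consume it. The filename below is a placeholder, not a real asset name.
! mv stt_en_citrinet.nemo stt_en_citrinet.tlt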

Note: The finetune spec file contains specifics to finetune the English model we just trained to Russian. If you wish to proceed with English, please make the changes in the spec file finetune.yaml, which you can find in the SPECS_DIR folder you mapped. Be sure to delete older finetuning checkpoints if you choose to change the language after finetuning as is.

In [16]:
!tao speech_to_text_citrinet finetune \
     -e $SPECS_DIR/speech_to_text_citrinet/finetune.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/finetune \
     finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     finetuning_ds.num_workers=20 \
     validation_ds.num_workers=20 \
     trainer.gpus=1 \
     tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v32

ASR model export

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs! The same command used for exporting to ONNX can be used here. The only small variation is the configuration for export_format in the spec file!

Export to ONNX

In [17]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/export \
     export_format=ONNX

Export to Riva

In [18]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/riva \
     export_format=RIVA \
     export_to=asr-model.riva

ASR Inference

You might have to work with the infer.yaml file to select the files you want to run inference on.

In [19]:
!tao speech_to_text_citrinet infer \
     -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/infer \
     file_paths=[$DATA_DIR/an4_converted/wavs/an268-mbmg-b.wav]

ASR Inference using ONNX

TAO provides the capability to use the exported .eonnx model for inference. The command tao speech_to_text_citrinet infer_onnx is very similar to the inference command for .tlt models. Again, the inputs in the spec file are just for demo purposes; you may choose to try out your custom input!

In [20]:
!tao speech_to_text_citrinet infer_onnx \
     -e $SPECS_DIR/speech_to_text_citrinet/infer_onnx.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/export/exported-model.eonnx \
     -r $RESULTS_DIR/infer_onnx \
     file_paths=[$DATA_DIR/an4_converted/wavs/an268-mbmg-b.wav]

What's Next?

You could use TAO to build custom models for your own applications, or you could deploy the custom model to NVIDIA Riva!