
N-gram Language Modelling using Transfer Learning Toolkit

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique where you take a model trained on one task and re-train it for a different task.

Train Adapt Optimize (TAO) Toolkit is a simple and easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with users' own data. Developers, researchers and software partners building Conversational AI and Vision AI can leverage TAO Toolkit to avoid the hassle of training from scratch and significantly accelerate their workflow.


Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO Toolkit to:

  • Preprocess/convert the training and evaluation datasets
  • Train and optionally fine-tune an N-gram language model
  • Evaluate the trained model and run inference on sample inputs
  • Export the trained model to a .riva file for deployment with NVIDIA Riva

The earlier sections in the notebook give a brief introduction to the N-gram Language Modelling task and the datasets used for training and evaluating our N-gram language model. If you are already familiar with these and want to jump right in, you can skip ahead to the Preparing the dataset section.


Pre-requisites

For ease of use, please install TAO Toolkit inside a Python virtual environment. We recommend performing this step first and then launching the notebook from the virtual environment.

Let's install TAO Toolkit. It is a simple pip install!

In [1]:
! pip install nvidia-pyindex
! pip install nvidia-tao

To see the docker image versions and the tasks that tao can perform, use the tao info command.

In [2]:
!tao info --verbose

In addition to installing the TAO Toolkit package, please make sure the following software requirements are met:

  1. python 3.6.9
  2. docker-ce > 19.03.5
  3. docker-API 1.40
  4. nvidia-container-toolkit > 1.3.0-1
  5. nvidia-container-runtime > 3.4.0-1
  6. nvidia-docker2 > 2.5.0-1
  7. nvidia-driver >= 455.23

Check that the GPU device(s) are visible:
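Running nvidia-smi from a notebook cell should list the available GPU(s), assuming the NVIDIA driver is installed on the host:

# Lists visible NVIDIA GPUs, the driver version and current memory usage
!nvidia-smi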


Language Modelling

Task Description

Language modelling assigns a probability to a sequence of words. In addition to scoring entire sequences, a language model also assigns a probability to a given word (or sequence of words) following a given sequence of preceding words.

For example, the sentence "all of a sudden I notice three guys standing on the sidewalk" would be scored higher by the language model than the sentence "on guys all I of notice sidewalk three a sudden standing the".

A language model trained on a large corpus can significantly improve the accuracy of an Automatic Speech Recognition system, as much recent research has shown.

N-gram Language Model

There are primarily two types of Language Models:

  • N-gram Language Models: These models use the frequency of n-grams to learn the probability distribution over words. Two benefits of N-gram Language Models are simplicity and scalability – with a larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently (a toy counting sketch follows this list).
  • Neural Language Models: These models use different kinds of Neural Networks to model the probability distribution over words. They have surpassed N-gram language models in the ability to model language, but are generally slower to evaluate.
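As a quick illustration of the counting idea behind N-gram models, the toy sketch below estimates maximum-likelihood bigram probabilities directly from raw counts. It is only an illustration; TAO Toolkit's n_gram task builds its models with KenLM under the hood (as the train_n_gram.kenlm_intermediate output later in this notebook suggests), not with code like this.

# Toy illustration (not TAO Toolkit code): maximum-likelihood bigram estimates from raw counts.
from collections import Counter

corpus = [
    "all of a sudden i notice three guys standing on the sidewalk",
    "three guys were standing on the corner",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

print(bigram_prob("standing", "on"))  # 1.0 in this tiny corpus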

In this notebook, we will show how to train, evaluate and optionally fine-tune an N-gram language model leveraging TAO Toolkit.


Preparing the dataset

Librispeech LM Normalized dataset

For this tutorial, we use the normalized version of Librispeech LM dataset to train our N-gram language model. The normalized version of Librispeech LM dataset is available here.

Librispeech dev-clean dataset

For this tutorial, we also use the clean version of Librispeech development set to evaluate our N-gram language model. The clean version of Librispeech development set is available here.

Downloading the dataset

Librispeech LM Normalized dataset

The training data is publicly available here and can be downloaded directly.

In [3]:
import os
# IMPORTANT NOTE: Set path to a folder where you want your data and results to be saved
# TODO
DATA_DOWNLOAD_DIR = "<YOUR_PATH_TO_DATA_DIR>"
assert os.path.exists(DATA_DOWNLOAD_DIR), "Provided DATA_DOWNLOAD_DIR does not exist."
In [4]:
# NOTE: Ensure that wget and unzip utilities are available. If not, please install them
!wget 'https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz' -P $DATA_DOWNLOAD_DIR

# Extract the data
!gzip -dk $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt.gz

Librispeech dev-clean dataset

The evaluation data is publicly available here and can be downloaded directly. We provide a Python script below to download and preprocess the dataset.

In [5]:
"""
Scripts to download and preprocess LibriSpeech dev-clean
"""
from multiprocessing import Pool

LOG_STR = " To regenerate this file, please, remove it."

def find_transcript_files(dir):
    files = []
    for dirpath, _, filenames in os.walk(dir):
        for filename in filenames:
            if filename.endswith(".trans.txt"):
                files.append(os.path.join(dirpath, filename))
    return files

def transcript_to_list(file):
    audio_path = os.path.dirname(file)
    ret = []
    with open(file, "r") as f:
        for line in f:
            file_id, trans = line.strip().split(" ", 1)
            audio_file = os.path.abspath(os.path.join(audio_path, file_id + ".flac"))
            duration = 0  # We are not using the audio
            ret.append([file_id, audio_file, str(duration), trans.lower()])

    return ret


if __name__ == "__main__":
    
    name = "dev-clean"
    data_path = os.path.join(DATA_DOWNLOAD_DIR, "eval_data")
    text_path = os.path.join(DATA_DOWNLOAD_DIR, "text")
    lists_path = os.path.join(DATA_DOWNLOAD_DIR, "lists")
    os.makedirs(data_path, exist_ok=True)
    os.makedirs(text_path, exist_ok=True)
    os.makedirs(lists_path, exist_ok=True)
    data_http = "http://www.openslr.org/resources/12/"

    # Download the audio data
    print("Downloading the evaluation data.", flush=True)
    if not os.path.exists(os.path.join(data_path, "LibriSpeech", name)):
        print("Downloading and unpacking {}...".format(name))
        cmd = """wget -c {http}{name}.tar.gz -P {path};
                 yes n 2>/dev/null | gunzip {path}/{name}.tar.gz;
                 tar -C {path} -xf {path}/{name}.tar"""
        os.system(cmd.format(path=data_path, http=data_http, name=name))
    else:
        log_str = "{} part of data exists, skip its downloading and unpacking"
        print(log_str.format(name) + LOG_STR, flush=True)

    # Prepare the audio data
    print("Converting data into necessary format.", flush=True)
    word_dict = {}
    word_dict[name] = set()
    src = os.path.join(data_path, "LibriSpeech", name)
    assert os.path.exists(src), "Unable to find the directory - '{src}'".format(
        src=src
    )

    dst_list = os.path.join(lists_path, name + ".lst")
    if os.path.exists(dst_list):
        print(
            "Path {} exists, skip its generation.".format(dst_list) + LOG_STR,
            flush=True,
        )
    else:
        print("Analyzing {src}...".format(src=src), flush=True)
        transcript_files = find_transcript_files(src)
        transcript_files.sort()

        print("Writing to {dst}...".format(dst=dst_list), flush=True)
        with Pool(processes=8) as p:
            samples = list(p.imap(transcript_to_list, transcript_files))

        # Write one line per utterance: "<id> <audio path> <duration> <transcript>"
        with open(dst_list, "w") as fout:
            for sp in samples:
                for s in sp:
                    word_dict[name].update(s[-1].split(" "))
                    s[0] = name + "-" + s[0]
                    fout.write(" ".join(s) + "\n")

    current_path = os.path.join(text_path, name + ".txt")
    if not os.path.exists(current_path):
        with open(os.path.join(lists_path, name + ".lst"), "r") as flist, open(
            os.path.join(text_path, name + ".txt"), "w"
        ) as fout:
            for line in flist:
                fout.write(" ".join(line.strip().split(" ")[3:]) + "\n")
    else:
        print(
            "Path {} exists, skip its generation.".format(current_path) + LOG_STR,
            flush=True,
        )

print("Done!", flush=True)

To reduce the time this demo takes, we use only a random subset of the training dataset. Feel free to modify the number of lines used.

In [6]:
# Use a random 10,000 lines for training
!shuf -n 10000 $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt  > $DATA_DOWNLOAD_DIR/reduced_training.txt
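If the shuf utility is not available on your system, the same random subsampling can be done in Python. The fallback sketch below uses reservoir sampling so the large corpus never has to be loaded into memory; it assumes DATA_DOWNLOAD_DIR is set as in the earlier cell.

# Fallback (only needed if `shuf` is unavailable): reservoir-sample 10,000 lines.
import os
import random

src = os.path.join(DATA_DOWNLOAD_DIR, "librispeech-lm-norm.txt")
dst = os.path.join(DATA_DOWNLOAD_DIR, "reduced_training.txt")

k = 10000
sample = []
with open(src, "r", encoding="utf-8") as fin:
    for i, line in enumerate(fin):
        if i < k:
            sample.append(line)
        else:
            # Keep each later line with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                sample[j] = line

with open(dst, "w", encoding="utf-8") as fout:
    fout.writelines(sample)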

TAO Toolkit workflow

The rest of the notebook shows what a sample TAO Toolkit workflow looks like.

Setting TAO Toolkit Mounts

Now that our dataset has been downloaded, an important step in using TAO Toolkit is to set up the directory mounts. The TAO Toolkit launcher uses Docker containers under the hood, and for our data and results directories to be visible to the Docker container, they need to be mapped. The launcher can be configured using the config file ~/.tao_mounts.json. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO Toolkit launcher.

IMPORTANT NOTE: The code below creates a sample ~/.tao_mounts.json file. Here, we map the directories in which we save the data, specs, results and cache. You should configure it for your specific case so that these directories are correctly visible to the Docker container. Please also ensure that the source directories exist on your machine!

In [7]:
%%bash
tee ~/.tao_mounts.json <<'EOF'
{
   "Mounts":[
       {
           "source": "<YOUR_PATH_TO_DATA_DIR>",
           "destination": "/data"
       },
       {
           "source": "<YOUR_PATH_TO_SPECS_DIR>",
           "destination": "/specs"
       },
       {
           "source": "<YOUR_PATH_TO_RESULTS_DIR>",
           "destination": "/results"
       },
       {
           "source": "<YOUR_PATH_TO_CACHE_DIR eg. /home/user/.cache>",
           "destination": "/root/.cache"
       }
   ]
}
EOF
In [8]:
# Make sure the source directories exist, if not, create them
# ! mkdir <YOUR_PATH_TO_SPECS_DIR>
# ! mkdir <YOUR_PATH_TO_RESULTS_DIR>
# ! mkdir <YOUR_PATH_TO_CACHE_DIR>
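Optionally, you can sanity-check the mount configuration before running any tao commands. The sketch below is not part of TAO Toolkit itself; it simply parses ~/.tao_mounts.json and reports whether each source directory exists on the host.

# Optional sanity check: verify every mount source in ~/.tao_mounts.json exists on the host.
import json
import os

with open(os.path.expanduser("~/.tao_mounts.json")) as f:
    mounts = json.load(f)["Mounts"]

for m in mounts:
    status = "OK" if os.path.isdir(m["source"]) else "MISSING"
    print("{:7s} {} -> {}".format(status, m["source"], m["destination"]))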

The rest of the notebook exemplifies the simplicity of the TAO Toolkit workflow. Users with basic knowledge of Deep Learning can get started building their own custom models using a simple specification file. It's essentially just one command each to run data preprocessing, training, fine-tuning, evaluation, inference, and export! All configuration happens through YAML spec files.


Configuration/Specification Files

The essence of all commands in TAO Toolkit lies in the YAML spec files. There are sample spec files already available for you to use directly or as reference to create your own. Through these spec files, you can tune many knobs like the model, dataset, hyperparameters etc. Each command (like train, finetune, evaluate etc.) should have a dedicated spec file with configurations pertinent to it.

Here is an example of the training spec file:


model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_file: ???

vocab_file: ""
encryption_key: "tlt_encode"
...

Set Relevant Paths

Please set these paths according to your environment.

In [9]:
# NOTE: The following paths are set from the perspective of the TAO Toolkit Docker. 

# The data is saved here
DATA_DIR='/data'

# The configuration files are stored here
SPECS_DIR='/specs/n_gram'

# The results are saved at this path
RESULTS_DIR='/results/n_gram'

# Set your encryption key, and use the same key for all commands
KEY='tlt_encode'

Downloading Specs

We can proceed to downloading the spec files. The user may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the download_specs command.

The -o argument indicates the folder where the default specification files will be downloaded, and -r instructs the script where to save the logs. Make sure the -o argument points to an empty folder!

In [10]:
!tao n_gram download_specs \
    -r $RESULTS_DIR \
    -o $SPECS_DIR

Data Convert

In preparation for training/fine-tuning, we need to preprocess the dataset. The tao n_gram dataset_convert command can be used in conjunction with the appropriate configuration in the spec file. Here is the sample dataset_convert.yaml spec file we use:

# Dataset. Available options: [assistant]
dataset_name: assistant

# Extension of the files containing in dataset
extension: ???

# Path to the folder containing the dataset source files.
source_data_dir: ???

# Path to the output folder.
target_data_file: ???

We encourage you to take a look at the .yaml spec files we provide! As we show below, you can override the source_data_dir and target_data_file options with appropriate paths.

In [11]:
# Preprocess training data (Librispeech LM Normalized)
!tao n_gram dataset_convert \
            -e $SPECS_DIR/dataset_convert.yaml \
            -r $RESULTS_DIR/dataset_convert \
            extension=*.txt \
            source_data_dir=$DATA_DIR/reduced_training.txt \
            target_data_file=$DATA_DIR/preprocessed.txt

# Preprocess evaluation data (Librispeech dev-clean)
!tao n_gram dataset_convert \
            -e $SPECS_DIR/dataset_convert.yaml \
            -r $RESULTS_DIR/dataset_convert \
            extension=*.txt \
            source_data_dir=$DATA_DIR/text/dev-clean.txt \
            target_data_file=$DATA_DIR/preprocessed_dev_clean.txt

These commands preprocess the training and evaluation datasets using basic text preprocessing steps (lowercasing, normalization, punctuation removal, ...) and write the results into files named preprocessed.txt and preprocessed_dev_clean.txt for training and evaluation respectively. In both preprocessed.txt and preprocessed_dev_clean.txt, each preprocessed sentence is written on its own line.
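For intuition, the preprocessing is of the kind illustrated by the toy function below (illustrative only; the exact normalization performed by dataset_convert may differ):

# Illustration only: lowercase the text and strip punctuation, one sentence per line.
import string

def normalize_line(line: str) -> str:
    line = line.lower().strip()
    return line.translate(str.maketrans("", "", string.punctuation))

print(normalize_line("All of a sudden, I notice THREE guys!"))
# -> all of a sudden i notice three guys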


Training / Fine-tuning

Training a model using TAO Toolkit is as simple as configuring your spec file and running the train command. The code cell below uses the train.yaml spec file available for users as reference. The spec file configurations can easily be overridden using the tao-launcher CLI as shown below. For instance, below we override model.order, model.pruning and training_ds.data_file configurations to suit our needs.

For training a N-gram language model in TAO Toolkit, we use the tao n_gram train command with the following args:

  • -e: Path to the spec file
  • -k: User specified encryption key to use while saving/loading the model
  • -r: Path to a folder where the outputs should be written. Make sure this is mapped in ~/.tao_mounts.json
  • Any overrides to the spec file eg. model.order

More details about these arguments are present in the TAO Toolkit Getting Started Guide.
Note: All file paths correspond to the destination mounted directories that are visible inside the TAO Toolkit Docker container used in the backend.

In [12]:
!tao n_gram train \
            -e $SPECS_DIR/train.yaml \
            -r $RESULTS_DIR/train \
            training_ds.data_file=$DATA_DIR/preprocessed.txt \
            model.order=3 \
            model.pruning=[0,0,1]

The train command produces three files, train_n_gram.arpa, train_n_gram.vocab and train_n_gram.kenlm_intermediate, saved at $RESULTS_DIR/train/checkpoints.
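If you are curious, you can inspect the header of the generated ARPA file on the host. The path below is a host-side path you will need to adapt: replace <YOUR_PATH_TO_RESULTS_DIR> with the directory you mounted to /results earlier.

# Peek at the ARPA header; the "\data\" section at the top lists the number of n-grams per order.
arpa_path = "<YOUR_PATH_TO_RESULTS_DIR>/n_gram/train/checkpoints/train_n_gram.arpa"
with open(arpa_path, "r", encoding="utf-8", errors="ignore") as f:
    for _ in range(8):
        print(f.readline().rstrip())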


Evaluation

The evaluation spec .yaml is as simple as:

# Name of the .arpa or .binary file where trained model will be restored from.
restore_from: ???

test_ds:
  data_file: ???
  
In [13]:
!tao n_gram evaluate \
     -e $SPECS_DIR/evaluate.yaml \
     -r $RESULTS_DIR/evaluate \
     restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa \
     test_ds.data_file=$DATA_DIR/preprocessed_dev_clean.txt

The output of evaluation gives us the perplexity of the N-gram language model on the evaluation (Librispeech dev-clean) dataset!
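As a reminder, perplexity is the exponential of the average negative log-likelihood per word, so lower is better. A minimal illustration with made-up per-word probabilities:

# Illustration only: perplexity = exp(-(1/N) * sum_i log P(w_i | history)).
import math

word_probs = [0.1, 0.05, 0.2]  # made-up per-word probabilities from a language model
perplexity = math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
print(round(perplexity, 2))  # 10.0 for these values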


Inference

Inference using a trained .arpa or .binary model uses the tao n_gram infer command.
The infer.yaml spec is also very simple, and we can directly provide the inputs on which the model should run inference.

# "Simulate" user input:
input_batch:
  - 'set alarm for seven thirty am'
  - 'lower volume by fifty percent'
  - 'what is my schedule for tomorrow'

restore_from: ???

We encourage you to try out your own inputs as an exercise!

In [14]:
!tao n_gram infer \
            -e $SPECS_DIR/infer.yaml \
            -r $RESULTS_DIR/infer \
            restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa

This command returns the log likelihood, perplexity and all n-grams for each of the input sequences you provided.


Export to RIVA

With TAO Toolkit, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs! The export command converts the trained language model from .arpa to .binary, with the option of quantizing the model binary. We will set export_format in the spec file to RIVA to create a .riva file that contains the language model binary and its corresponding vocabulary.

In [15]:
!tao n_gram export \
            -e $SPECS_DIR/export.yaml \
            -r $RESULTS_DIR/export \
            export_format=RIVA \
            export_to=exported-model.riva \
            restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa \
            binary_type=trie \
            binary_q_bits=8 \
            binary_b_bits=7 \
            binary_a_bits=256
            

The model is exported as exported-model.riva which is in a format suited for deployment in Riva.


What's Next?

You could use TAO Toolkit to build custom models for your own applications, or you could deploy the custom model to NVIDIA Riva!