NVIDIA Deep Learning Examples
NVIDIA Deep Learning Examples
wav2vec 2.0 Base pre-trained
Model
NVIDIA Deep Learning Examples
NVIDIA Deep Learning Examples
wav2vec 2.0 Base pre-trained

wav2vec 2.0 Base PyTorch checkpoint pre-trained on LibriSpeech 960 h

Model Overview

A framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations.

Model Architecture

The model takes raw waveforms as its input. A fully convolutional feature extractor reduces the resolution of the signal to a single vector roughly every 20 ms. Most of the computation is performed in the transformer encoder part of the model. The outputs of the transformer, and quantized outputs from the feature extractor, serve as inputs to the contrastive loss. During fine-tuning, this loss is replaced with the CTC loss, and quantization is not performed.

wav2vec 2.0 model architecture

Figure 1. The architecture of wav2vec 2.0 ([source](https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf)). The model is composed of a convolutional feature extractor, and a transformer encoder. During fine-tuning, quantization is disabled and contrastive loss is replaced with the CTC loss function.

In addition, our model uses the Hourglass Transformer architecture for the encoder. This architecture uses fixed-sized pooling in order to reduce the time dimension T of the signal, and thus, lower the O(T²) cost of the self-attention mechanism.

The Hourglass Transformer module

Figure 2. The Hourglass Transformer module ([source](https://arxiv.org/abs/2110.13711)). The signal is processed by the initial layers and downsampled. Most of the layers operate on the downsampled signal. Finally, the signal is upsampled for the final layers. The Hourglass Transformer replaced a regular stack of transformer layers, typically improving throughput and lowering memory consumption.

Training

This model was trained using script available on NGC and in GitHub repo.

Dataset

The following datasets were used to train this model:

  • LibriSpeech - Corpus of approximately 1000 hours of 16kHz read English speech derived from audiobooks from the LibriVox project, carefully segmented and aligned.

Performance

Performance numbers for this model are available in NGC.

References

License

This model was trained using open-source software available in Deep Learning Examples repository.
For terms of use, please refer to the license of the script and the datasets the model was derived from.

Publisher
NVIDIA Deep Learning Examples
NVIDIA Deep Learning Examples
Latest Version22.09.0_amp-bf16
UpdatedApril 4, 2023 UTC
Compressed Size909.32 MB

NVIDIA uses cookies to improve your experience on our web site. We and our third-party partners also use cookies and other tools to collect and record information you provide as well as information about your interactions with our websites for performance improvement, analytics, and to assist in marketing efforts. By clicking "Accept All", you consent to our use of cookies and other tools as described in our Cookie Policy. You can manage your cookie settings by clicking on "Manage Settings." By continuing to use this site or by clicking one of the buttons below, you agree to our Terms of Service (which contains important waivers). Please see our Privacy Policy for more information on our privacy practices.