wav2vec 2.0 Base fine-tuned

wav2vec 2.0 Base fine-tuned

Logo for wav2vec 2.0 Base fine-tuned
wav2vec 2.0 base PyTorch checkpoint pre-trained on LibriSpeech 960 h, fine-tuned on LibriSpeech 960 h
NVIDIA Deep Learning Examples
Latest Version
April 4, 2023
915.49 MB

Model Overview

A framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations.

Model Architecture

The model takes raw waveforms as its input. A fully convolutional feature extractor reduces the resolution of the signal to a single vector roughly every 20 ms. Most of the computation is performed in the transformer encoder part of the model. The outputs of the transformer, and quantized outputs from the feature extractor, serve as inputs to the contrastive loss. During fine-tuning, this loss is replaced with the CTC loss, and quantization is not performed.

wav2vec 2.0 model architecture

Figure 1. The architecture of wav2vec 2.0 ([source](https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf)). The model is composed of a convolutional feature extractor, and a transformer encoder. During fine-tuning, quantization is disabled and contrastive loss is replaced with the CTC loss function.

In addition, our model uses the Hourglass Transformer architecture for the encoder. This architecture uses fixed-sized pooling in order to reduce the time dimension T of the signal, and thus, lower the O(T²) cost of the self-attention mechanism.

The Hourglass Transformer module

Figure 2. The Hourglass Transformer module ([source](https://arxiv.org/abs/2110.13711)). The signal is processed by the initial layers and downsampled. Most of the layers operate on the downsampled signal. Finally, the signal is upsampled for the final layers. The Hourglass Transformer replaced a regular stack of transformer layers, typically improving throughput and lowering memory consumption.


This model was trained using script available on NGC and in GitHub repo.


The following datasets were used to train this model:

  • LibriSpeech - Corpus of approximately 1000 hours of 16kHz read English speech derived from audiobooks from the LibriVox project, carefully segmented and aligned.


Performance numbers for this model are available in NGC.



This model was trained using open-source software available in Deep Learning Examples repository. For terms of use, please refer to the license of the script and the datasets the model was derived from.