NGC | Catalog

Audio Codec 16kHz Small

Description
This model card contains a Small Audio Codec model trained on the Libri-Light audiobook recordings dataset, comprising approximately 60,000 hours of English language speech with a 16kHz sampling rate.
Publisher
NVIDIA
Latest Version
v1
Modified
February 27, 2024
Size
56.37 MB

Model Overview

This model card contains a NeMo Audio Codec model trained on the Libri-Light audiobook recordings dataset, comprising approximately 60,000 hours of English language speech with a 16kHz sampling rate. An audio codec model provides a low-bitrate discrete representation of audio. The current 14M-parameter model encodes audio at a bitrate of 6.4 kbps.

Model Architecture

The NeMo Audio Codec model is a non-autoregressive convolutional encoder-quantizer-decoder model for learning discrete audio representations. This 14M-parameter model is trained end-to-end using time-domain, frequency-domain, and discriminative losses, similar to other neural audio codecs such as EnCodec [3].

Figure: NeMo Audio Codec Model Architecture

| Model | Sampling rate (kHz) | Size (M) | Latent dim | Codebook size | Num of codebooks | Framerate (fps) | Bitrate (kbps) |
|---|---|---|---|---|---|---|---|
| EnCodec | 24 | 14.85 | 128 | 1024 | 8 | 75 | 6 |
| Audio Codec | 16 | 13.75 | 128 | 1024 | 8 | 80 | 6.4 |
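
The bitrate in the table follows directly from the quantizer configuration: each codebook emits one index of log2(codebook size) = 10 bits per frame, for every frame of audio. A quick sanity check (an illustrative helper, not part of NeMo):

```python
import math

def codec_bitrate_kbps(num_codebooks: int, codebook_size: int, framerate_fps: float) -> float:
    """Bitrate of a multi-codebook codec: bits per index x codebooks x frames per second."""
    bits_per_index = math.log2(codebook_size)
    return num_codebooks * bits_per_index * framerate_fps / 1000.0

print(codec_bitrate_kbps(8, 1024, 80))  # Audio Codec 16kHz -> 6.4
print(codec_bitrate_kbps(8, 1024, 75))  # EnCodec 24kHz -> 6.0
```

This reproduces both bitrate entries in the table above; the extra 0.4 kbps of the 16kHz model comes entirely from its higher framerate (80 vs 75 fps).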

Training

The NeMo Audio Codec model was trained using the NVIDIA NeMo [4] toolkit for 130k steps with an effective batch size of 256 on a single node of 8 NVIDIA V100 GPUs. The current model can be trained using this example script and this base configuration.

Datasets

The NeMo Audio Codec model is trained on over 60,000 hours of LibriVox audio recordings, also known as the Libri-Light dataset [5], with a sampling frequency of 16kHz. All recordings in this dataset are in the English language.

Performance

We assess the NeMo Audio Codec model on various datasets to gauge the semantic and perceptual quality of audio reconstructed from its discrete tokens.

Perceptual audio quality of the reconstructed audio is evaluated using the Virtual Speech Quality Objective Listener (VISQOL) metric [2], and signal reconstruction is evaluated using the scale-invariant signal-to-distortion ratio (SI-SDR) [1]. VISQOL, an objective full-reference metric for perceived audio quality, is run in audio mode at 48kHz and in speech mode at 16kHz. SI-SDR measures the relative energy of the distortion in the reconstructed signal while remaining insensitive to scale or gain differences.
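
For reference, SI-SDR first finds the optimal scaling of the reference toward the estimate (removing sensitivity to overall gain) and then measures the energy ratio of the scaled target to the residual distortion. A minimal NumPy sketch of the standard definition from [1]:

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB (Le Roux et al., 2019).

    The reference is projected onto the estimate so that a global gain
    difference does not affect the score.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling factor for the reference
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(distortion, distortion) + eps))
```

A perfectly rescaled copy of the signal scores arbitrarily high, while additive noise pulls the score down, which is exactly the gain-invariance property motivating its use here.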

We evaluated audio codec model performance on three scenarios:

  1. Semantic content reconstruction: transcribing the reconstructed audio with a pretrained stt_en_fastconformer_ctc_large model.
  2. Speaker voice retention: speaker verification on the reconstructed audio using the TitaNet-L model.
  3. Downstream ASR: training an ASR model on the discrete tokens and evaluating Word Error Rate (WER).
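
The WER used throughout the tables below is the standard word-level Levenshtein distance between hypothesis and reference, normalized by reference length. A self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with a standard dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution in three words
```

In the tables, WER is reported as a percentage, i.e. this ratio multiplied by 100.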

Evaluating Semantic Content Reconstruction

| Model | VISQOL (audio mode) | VISQOL (speech mode) | SI-SDR (dB) | SNR (dB) | WER (%) | WER (%), original audio |
|---|---|---|---|---|---|---|
| EnCodec 24 kHz | 4.34 | 4.27 | 4.83 | 6.22 | 2.53 | 2.08 |
| Audio Codec 16 kHz | 4.53 | 4.61 | 4.28 | 5.97 | 2.29 | 2.08 |

Table: NeMo Audio Codec model performance on the LibriSpeech test-clean set, using various perceptual quality metrics; WER on the reconstructed audio is also shown.

| Model | VISQOL (audio mode) | VISQOL (speech mode) | SI-SDR (dB) | SNR (dB) | WER (%) | WER (%), original audio |
|---|---|---|---|---|---|---|
| EnCodec 24 kHz | 4.34 | 4.17 | 5.58 | 6.89 | 5.27 | 4.22 |
| Audio Codec 16 kHz | 4.52 | 4.56 | 5.02 | 6.62 | 4.78 | 4.22 |

Table: NeMo Audio Codec model performance on the LibriSpeech test-other set, using various perceptual quality metrics; WER on the reconstructed audio is also shown.

| Model | Fisher | MCV11 | SPGI | VoxPopuli |
|---|---|---|---|---|
| Original audio | 11.19 | 6.97 | 6.53 | 5.69 |
| EnCodec 24kHz | 14.28 | 9.95 | 6.81 | 6.18 |
| Audio Codec 16kHz | 12.84 | 8.44 | 6.73 | 6.06 |

Table: WER (%) on additional evaluation datasets using reconstructed audio. The "Original audio" row corresponds to transcribing the original audio with the stt_en_fastconformer_ctc_large model.

Evaluating Speaker Voice Retaining Ability

| Model | Reconstructed audio EER | Original audio EER |
|---|---|---|
| EnCodec 24kHz | 1.66 | 0.65 |
| Audio Codec 16kHz | 1.21 | 0.65 |

Table: Speaker verification EER on the VoxCeleb test-clean set for reconstructed and original audio, using the TitaNet-L model.
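
EER is the operating point at which the verifier's false-accept rate (impostor trials accepted) equals its false-reject rate (target trials rejected). A minimal sketch computed from raw trial scores (the inputs here are hypothetical; this is not the actual evaluation pipeline):

```python
import numpy as np

def equal_error_rate(scores, labels) -> float:
    """EER from similarity scores and binary labels (1 = same speaker).

    Sweeps a threshold over every observed score and returns the point
    where false-accept and false-reject rates are closest to equal.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, best_eer = 1.0, None
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # impostors accepted
        frr = np.mean(~accept[labels == 1])  # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```

With perfectly separable scores the EER is 0; the table above reports it as a percentage.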

Evaluating Codes for Downstream Tasks

To evaluate the performance of audio codecs on downstream tasks, we trained an stt_en_fastconformer_rnnt_large model on these audio codes/tokens for the speech recognition task and report WER (%) on the LibriSpeech dev and test sets.

| Model | Bitrate (kbps) | Num of codebooks | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|
| EnCodec 24kHz | 12 | 16 | 2.26 | 5.77 | 2.45 | 5.80 |
| EnCodec 24kHz | 6 | 8 | 2.23 | 6.02 | 2.35 | 5.96 |
| Audio Codec 16kHz | 6.4 | 8 | 2.19 | 5.72 | 2.4 | 5.76 |

How to Use this Model

The model is available for use in the NVIDIA NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

from nemo.collections.tts.models import AudioCodecModel
nemo_codec_model = AudioCodecModel.from_pretrained('audio_codec_16khz_small')

Getting discrete tokens from Audio

import librosa
import torch

# Load audio at the model's 16 kHz sampling rate
audio, sr = librosa.load("<path_to_audio>", sr=16000)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = nemo_codec_model.to(device)

# Add a batch dimension and record the length in samples
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

# Freeze the model (inference mode) and encode the waveform into discrete tokens
nemo_codec_model.freeze()
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

Reconstructing audio from discrete tokens

import soundfile as sf

# Decode the discrete tokens back into a waveform
reconstructed_audio = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

Listen to audio

output_audio = reconstructed_audio.cpu().numpy()[0]
sf.write("<path_to_output_audio>", output_audio, samplerate=16000)

Input

This model accepts single-channel raw audio signal sampled at 16000 Hz as input.

Output

The model encodes audio into discrete tokens and decodes those tokens back into reconstructed audio, as shown in the code snippets above.

References

  1. J. Le Roux et al., SDR – Half-baked or Well Done?, Proc. ICASSP, 2019.
  2. Chinen et al., ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric, Proc. Intl. Conf. on Quality of Multimedia Experience (QoMEX), 2020.
  3. Défossez et al., High Fidelity Neural Audio Compression, 2022.
  4. NVIDIA NeMo Toolkit
  5. Libri-Light: A Benchmark for ASR with Limited or No Supervision