NGC | Catalog

Audio Codec 16kHz Small

Description
This model card contains a Small Audio Codec model trained on the Libri-Light audiobook recordings dataset, comprising approximately 60,000 hours of English language speech with a 16kHz sampling rate.
Publisher
NVIDIA
Latest Version
v1
Modified
February 27, 2024
Size
56.37 MB

Model Overview

This model card contains a NeMo Audio Codec model trained on the Libri-Light audiobook recordings dataset, comprising approximately 60,000 hours of English language speech with a 16kHz sampling rate. An audio codec model provides a low-bitrate discrete representation of audio. The current 14M-parameter model encodes audio at a bitrate of 6.4 kbps.

Model Architecture

The NeMo Audio Codec model is a non-autoregressive convolutional encoder-quantizer-decoder model for learning discrete audio representations. This 14M-parameter model is trained end-to-end using time-domain, frequency-domain, and discriminative losses, similar to other neural audio codecs such as EnCodec [3].

Figure: NeMo Audio Codec Model Architecture

| Model | Sampling rate (kHz) | Size (M) | Latent dim | Codebook size | Num of codebooks | Framerate (fps) | Bitrate (kbps) |
|---|---|---|---|---|---|---|---|
| EnCodec | 24 | 14.85 | 128 | 1024 | 8 | 75 | 6 |
| Audio Codec | 16 | 13.75 | 128 | 1024 | 8 | 80 | 6.4 |
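
The bitrate in the table follows directly from the quantizer configuration: each codebook emits one index of log2(codebook size) = 10 bits per frame, for every frame of audio. A quick sanity check (an illustrative helper, not part of NeMo):

```python
import math

def codec_bitrate_kbps(num_codebooks: int, codebook_size: int, framerate_fps: float) -> float:
    """Bitrate of a multi-codebook codec: bits per index x codebooks x frames per second."""
    bits_per_index = math.log2(codebook_size)
    return num_codebooks * bits_per_index * framerate_fps / 1000.0

print(codec_bitrate_kbps(8, 1024, 80))  # Audio Codec 16kHz -> 6.4
print(codec_bitrate_kbps(8, 1024, 75))  # EnCodec 24kHz -> 6.0
```

This reproduces both bitrate entries in the table above; the extra 0.4 kbps of the 16kHz model comes entirely from its higher framerate (80 vs 75 fps).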

Training

The NeMo Audio Codec model was trained using the NVIDIA NeMo [4] toolkit for 130k steps with an effective batch size of 256 on a single node of 8 NVIDIA V100 GPUs. The current model can be trained using this example script and this base configuration.

Datasets

The NeMo Audio Codec model is trained on over 60,000 hours of LibriVox audio recordings, also known as the Libri-Light dataset [5], with a sampling frequency of 16kHz. All recordings in this dataset are in the English language.

Performance

We assess the NeMo Audio Codec model on various datasets to gauge the semantic and perceptual quality of audio reconstructed from its discrete tokens.

Perceptual audio quality of the reconstructed audio is evaluated using the Virtual Speech Quality Objective Listener (VISQOL) metric [2], and signal reconstruction is evaluated using the scale-invariant signal-to-distortion ratio (SI-SDR) [1]. VISQOL, an objective full-reference metric for perceived audio quality, is run in audio mode at 48kHz and in speech mode at 16kHz. SI-SDR measures the relative energy of the distortion in the reconstructed signal while remaining insensitive to scale or gain differences.
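
For reference, SI-SDR first finds the optimal scaling of the reference toward the estimate (removing sensitivity to overall gain) and then measures the energy ratio of the scaled target to the residual distortion. A minimal NumPy sketch of the standard definition from [1]:

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB (Le Roux et al., 2019).

    The reference is projected onto the estimate so that a global gain
    difference does not affect the score.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling factor for the reference
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(distortion, distortion) + eps))
```

A perfectly rescaled copy of the signal scores arbitrarily high, while additive noise pulls the score down, which is exactly the gain-invariance property motivating its use here.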

We evaluated audio codec model performance on three scenarios:

  1. Semantic content reconstruction: transcribing the reconstructed audio with a pretrained stt_en_fastconformer_ctc_large model.
  2. Speaker voice retention: speaker verification on the reconstructed audio using the TitaNet-L model.
  3. Downstream ASR: training an ASR model on the discrete tokens and evaluating Word Error Rate (WER).
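
The WER used throughout the tables below is the standard word-level Levenshtein distance between hypothesis and reference, normalized by reference length. A self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with a standard dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution in three words
```

In the tables, WER is reported as a percentage, i.e. this ratio multiplied by 100.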

Evaluating Semantic Content Reconstruction

| Model | VISQOL (audio mode) | VISQOL (speech mode) | SI-SDR (dB) | SNR (dB) | WER (%) | WER (%), original audio |
|---|---|---|---|---|---|---|
| EnCodec 24 kHz | 4.34 | 4.27 | 4.83 | 6.22 | 2.53 | 2.08 |
| Audio Codec 16 kHz | 4.53 | 4.61 | 4.28 | 5.97 | 2.29 | 2.08 |

Table: NeMo Audio Codec model performance on the LibriSpeech test-clean set, using various perceptual quality metrics; WER on the reconstructed audio is also shown.

| Model | VISQOL (audio mode) | VISQOL (speech mode) | SI-SDR (dB) | SNR (dB) | WER (%) | WER (%), original audio |
|---|---|---|---|---|---|---|
| EnCodec 24 kHz | 4.34 | 4.17 | 5.58 | 6.89 | 5.27 | 4.22 |
| Audio Codec 16 kHz | 4.52 | 4.56 | 5.02 | 6.62 | 4.78 | 4.22 |

Table: NeMo Audio Codec model performance on the LibriSpeech test-other set, using various perceptual quality metrics; WER on the reconstructed audio is also shown.

| Model | Fisher | MCV11 | SPGI | VoxPopuli |
|---|---|---|---|---|
| Original audio | 11.19 | 6.97 | 6.53 | 5.69 |
| EnCodec 24kHz | 14.28 | 9.95 | 6.81 | 6.18 |
| Audio Codec 16kHz | 12.84 | 8.44 | 6.73 | 6.06 |

Table: WER (%) on additional evaluation datasets using reconstructed audio. The "Original audio" row corresponds to transcribing the original audio with the stt_en_fastconformer_ctc_large model.

Evaluating Speaker Voice Retaining Ability

| Model | Reconstructed audio EER | Original audio EER |
|---|---|---|
| EnCodec 24kHz | 1.66 | 0.65 |
| Audio Codec 16kHz | 1.21 | 0.65 |

Table: Speaker verification EER on the VoxCeleb test-clean set for reconstructed and original audio, using the TitaNet-L model.
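
EER is the operating point at which the verifier's false-accept rate (impostor trials accepted) equals its false-reject rate (target trials rejected). A minimal sketch computed from raw trial scores (the inputs here are hypothetical; this is not the actual evaluation pipeline):

```python
import numpy as np

def equal_error_rate(scores, labels) -> float:
    """EER from similarity scores and binary labels (1 = same speaker).

    Sweeps a threshold over every observed score and returns the point
    where false-accept and false-reject rates are closest to equal.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, best_eer = 1.0, None
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # impostors accepted
        frr = np.mean(~accept[labels == 1])  # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```

With perfectly separable scores the EER is 0; the table above reports it as a percentage.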

Evaluating Codes for Downstream Tasks

To evaluate the performance of audio codecs on downstream tasks, we trained an stt_en_fastconformer_rnnt_large model on these audio codes/tokens for the speech recognition task and report WER (%) on the LibriSpeech dev and test sets.

| Model | Bitrate (kbps) | Num of codebooks | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|
| EnCodec 24kHz | 12 | 16 | 2.26 | 5.77 | 2.45 | 5.80 |
| EnCodec 24kHz | 6 | 8 | 2.23 | 6.02 | 2.35 | 5.96 |
| Audio Codec 16kHz | 6.4 | 8 | 2.19 | 5.72 | 2.4 | 5.76 |

How to Use this Model

The model is available for use in the NVIDIA NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

from nemo.collections.tts.models import AudioCodecModel
nemo_codec_model = AudioCodecModel.from_pretrained('audio_codec_16khz_small')

Getting discrete tokens from Audio

import librosa
import torch

# Load audio at the model's 16 kHz sampling rate
audio, sr = librosa.load("<path_to_audio>", sr=16000)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = nemo_codec_model.to(device)

# Add a batch dimension and record the length in samples
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

# Freeze the model (inference mode) and encode the waveform into discrete tokens
nemo_codec_model.freeze()
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

Reconstructing audio from discrete tokens

import soundfile as sf

# Decode the discrete tokens back into a waveform
reconstructed_audio = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

Listen to audio

output_audio = reconstructed_audio.cpu().numpy()[0]
sf.write("<path_to_output_audio>", output_audio, samplerate=16000)

Input

This model accepts single-channel raw audio signal sampled at 16000 Hz as input.

Output

The model encodes audio into discrete tokens and decodes those tokens back into reconstructed audio, as shown in the code snippets above.

References

  1. J. Le Roux et al., SDR – Half-baked or Well Done?, Proc. ICASSP, 2019.
  2. Chinen et al., ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric, Proc. Intl. Conf. on Quality of Multimedia Experience (QoMEX), 2020.
  3. Défossez et al., High Fidelity Neural Audio Compression, 2022.
  4. NVIDIA NeMo Toolkit
  5. Libri-Light: A Benchmark for ASR with Limited or No Supervision