This model card describes a NeMo Audio Codec model trained on the Libri-Light audiobook dataset, which comprises approximately 60,000 hours of English speech sampled at 16 kHz. An audio codec model provides a low-bitrate discrete representation of audio. The current 14M-parameter model encodes audio at a bitrate of 6.4 kbps.
The NeMo Audio Codec model is a non-autoregressive convolutional encoder-quantizer-decoder model for extracting discrete audio codes. The 14M-parameter model is trained end to end with a combination of time-domain, frequency-domain, and discriminative losses, similar to other neural audio codecs such as EnCodec [3].
Figure: NeMo Audio Codec Model Architecture
Model | Sampling rate (kHz) | Size (M) | Latent dim | Codebook size | Num of codebooks | Framerate (fps) | Bitrate (kbps) |
---|---|---|---|---|---|---|---|
EnCodec | 24 | 14.85 | 128 | 1024 | 8 | 75 | 6 |
Audio Codec | 16 | 13.75 | 128 | 1024 | 8 | 80 | 6.4 |
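The bitrate follows directly from the quantizer configuration in the table above: each of the 8 codebooks with 1024 entries contributes log2(1024) = 10 bits per frame, and frames are produced at 80 fps, giving 8 × 10 × 80 = 6400 bits/s = 6.4 kbps. A small sanity check of this arithmetic:

```python
import math

# Quantizer configuration from the table above.
num_codebooks = 8
codebook_size = 1024
frame_rate_fps = 80

bits_per_frame = num_codebooks * math.log2(codebook_size)  # 8 * 10 = 80 bits
bitrate_kbps = bits_per_frame * frame_rate_fps / 1000      # 80 * 80 / 1000 = 6.4 kbps
print(bitrate_kbps)  # 6.4
```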
The NeMo Audio Codec model was trained using the NVIDIA NeMo toolkit [4] for 130k steps with an effective batch size of 256 on a single node of 8 NVIDIA V100 GPUs. The model can be trained using this example script and this base configuration.
The NeMo Audio Codec model is trained on over 60,000 hours of LibriVox audio recordings, also known as the Libri-Light dataset [5], with a sampling frequency of 16kHz. All recordings in this dataset are in the English language.
We assess the NeMo Audio Codec model on various datasets to gauge the semantic and perceptual quality of audio reconstructed from the discrete tokens.
Perceptual quality of the reconstructed audio is evaluated using the Virtual Speech Quality Objective Listener (VISQOL) metric [2], and signal reconstruction is evaluated using the scale-invariant signal-to-distortion ratio (SI-SDR) [1]. VISQOL is an objective full-reference metric for perceived audio quality and is computed in audio mode at 48 kHz and in speech mode at 16 kHz. SI-SDR measures the relative energy of the distortion in the reconstructed signal while remaining invariant to scale (magnitude) differences.
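For reference, SI-SDR can be computed from a reference waveform and its reconstruction as sketched below; this is a minimal NumPy implementation of the standard SI-SDR definition, not the exact evaluation code used for the numbers reported here.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR (dB) between a reference signal and its reconstruction."""
    # Optimal scaling of the reference, making the metric invariant to gain differences.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference        # scaled reference (target component)
    distortion = estimate - target    # residual distortion
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(distortion ** 2) + eps))
```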
We evaluated the audio codec model's performance in three scenarios:
Model | VISQOL (audio mode) | VISQOL (speech mode) | SI-SDR (dB) | SNR (dB) | WER (%) | WER (%) on original audio |
---|---|---|---|---|---|---|
EnCodec 24 kHz | 4.34 | 4.27 | 4.83 | 6.22 | 2.53 | 2.08 |
Audio Codec 16 kHz | 4.53 | 4.61 | 4.28 | 5.97 | 2.29 | 2.08 |
Table: NeMo Audio Codec model performance on the LibriSpeech test-clean set, using perceptual quality metrics; WER on the reconstructed audio is also shown.
Model | VISQOL (audio mode) | VISQOL (speech mode) | SI-SDR (dB) | SNR (dB) | WER (%) | WER (%) on original audio |
---|---|---|---|---|---|---|
EnCodec 24 kHz | 4.34 | 4.17 | 5.58 | 6.89 | 5.27 | 4.22 |
Audio Codec 16 kHz | 4.52 | 4.56 | 5.02 | 6.62 | 4.78 | 4.22 |
Table: NeMo Audio Codec model performance on the LibriSpeech test-other set, using perceptual quality metrics; WER on the reconstructed audio is also shown.
Model | Fisher | MCV11 | SPGI | VoxPopuli |
---|---|---|---|---|
Original audio | 11.19 | 6.97 | 6.53 | 5.69 |
EnCodec 24kHz | 14.28 | 9.95 | 6.81 | 6.18 |
Audio Codec 16kHz | 12.84 | 8.44 | 6.73 | 6.06 |
Table: Word error rates (%) on additional evaluation datasets for reconstructed audio. The Original audio row corresponds to evaluating the original (uncoded) audio with the stt_en_fastconformer_ctc_large model.
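As a rough sketch of how such a WER evaluation can be reproduced, reconstructed audio can be transcribed with the ASR model named above and scored against reference transcripts. The snippet below assumes a placeholder reconstructed.wav file and a known reference transcript; it only illustrates the transcription and scoring step, not the full evaluation pipeline.

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate

# Load the ASR model used for the WER numbers in the table above.
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large")

# Transcribe a codec-reconstructed audio file ("reconstructed.wav" is a placeholder path).
hypotheses = asr_model.transcribe(["reconstructed.wav"])

# Score against the reference transcript (placeholder text) to obtain WER.
references = ["the reference transcript of this utterance"]
print(word_error_rate(hypotheses=hypotheses, references=references))
```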
Model | Reconstructed audio EER (%) | Original audio EER (%) |
---|---|---|
EnCodec 24kHz | 1.66 | 0.65 |
Audio Codec 16kHz | 1.21 | 0.65 |
Table: Speaker verification EER on the VoxCeleb test clean set for reconstructed and original audio, using the TitaNet-L model.
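Below is a minimal sketch of how TitaNet-L speaker embeddings can be extracted from original and reconstructed audio and compared with cosine similarity, assuming the NeMo EncDecSpeakerLabelModel API and placeholder file paths; the reported EER is computed over the full VoxCeleb trial list rather than a single pair.

```python
import torch
import nemo.collections.asr as nemo_asr

# Load the TitaNet-L speaker embedding model.
spk_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

# Embeddings for an original utterance and its codec reconstruction (placeholder paths).
emb_orig = spk_model.get_embedding("original.wav").squeeze()
emb_recon = spk_model.get_embedding("reconstructed.wav").squeeze()

# Cosine similarity between the two embeddings; sweeping a threshold over
# trial-list scores like this one yields the equal error rate (EER).
score = torch.nn.functional.cosine_similarity(emb_orig, emb_recon, dim=0)
print(float(score))
```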
To evaluate the performance of audio codecs on downstream tasks, we trained stt_en_fastconformer_rnnt_large on these audio codes/tokens for the speech recognition task.
Model | Bitrate (kbps) | Number of codebooks | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|---|---|
Encodec 24kHz | 12 | 16 | 2.26 | 5.77 | 2.45 | 5.80 |
Encodec 24kHz | 6 | 8 | 2.23 | 6.02 | 2.35 | 5.96 |
Audio Codec 16kHz | 6.4 | 8 | 2.19 | 5.72 | 2.4 | 5.76 |
Table: WER (%) on LibriSpeech dev and test sets for ASR models trained on audio codec tokens.
The model is available for use in the NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
```python
import librosa
import soundfile as sf
import torch
from nemo.collections.tts.models import AudioCodecModel

# Load the pre-trained codec checkpoint and move it to the available device.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = AudioCodecModel.from_pretrained('audio_codec_16khz_small').to(device)
nemo_codec_model.freeze()

# Load a single-channel audio file, resampled to the model's 16 kHz sampling rate.
audio, sr = librosa.load("<path_to_audio>", sr=16000)
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

# Encode the audio into discrete codec tokens.
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

# Decode the tokens back into a waveform and write it to disk.
reconstructed_audio = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)
output_audio = reconstructed_audio.cpu().numpy()[0]
sf.write("<path_to_output_audio>", output_audio, samplerate=16000)
```
This model accepts single-channel raw audio sampled at 16000 Hz as input.
The model encodes audio into discrete tokens and decodes them back into reconstructed audio, as shown in the code snippet above.