Mel Codec 22kHz fullband medium

Description

22.05kHz full-band Mel Codec model trained on multi-lingual speech.

Modified

July 5, 2024

Size

406.11 MB

Model Overview

The NeMo Mel Codec is a spectrogram-based audio codec which compresses audio into a quantized representation by encoding mel-spectrogram features. Compared to standard audio codecs, the mel codec produces better audio quality when used with speech synthesis models.

The model works with full-bandwidth 22.05kHz audio. The quantized features (tokens) are encoded at a rate of 86.1 tokens per second with 80 bits per token, which gives a bitrate of 6.9 kbps.
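The bitrate follows from multiplying the token rate by the bits per token. A minimal sketch of that arithmetic, using only the two figures quoted here (the constant and function names are illustrative, not part of NeMo):

```python
# Figures quoted in the model overview.
TOKEN_RATE_HZ = 86.1   # tokens per second
BITS_PER_TOKEN = 80    # bits per token


def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Return the bitrate in kilobits per second."""
    return token_rate_hz * bits_per_token / 1000.0


kbps = bitrate_kbps(TOKEN_RATE_HZ, BITS_PER_TOKEN)
print(f"{kbps:.1f} kbps")  # prints "6.9 kbps"
```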

Model Architecture

The NeMo Mel Codec model is a non-autoregressive encoder-quantizer-decoder model for audio token extraction. This 64M parameter model is trained end-to-end using mel and STFT reconstruction losses, together with adversarial training against a multi-period discriminator and a multi-scale complex STFT discriminator.

The model uses the mel-spectrogram FSQ configuration described in [1].

Training

The model was trained using the NVIDIA NeMo toolkit [2] for 100k steps on 64 NVIDIA A100 GPUs with a batch size of 128 per GPU.
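For a sense of scale, the per-GPU batch size can be turned into a global batch size and a rough count of training examples processed. This is a back-of-the-envelope sketch: it assumes no gradient accumulation and that every step uses a full batch on every GPU, neither of which is stated in the model card.

```python
# Figures quoted in the training description.
NUM_GPUS = 64
BATCH_SIZE_PER_GPU = 128
TRAIN_STEPS = 100_000

# Assumption: one optimizer step consumes exactly one batch per GPU
# (no gradient accumulation, no dropped or partial batches).
global_batch = NUM_GPUS * BATCH_SIZE_PER_GPU
examples_seen = global_batch * TRAIN_STEPS

print(global_batch)   # 8192
print(examples_seen)  # 819200000
```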

Datasets

The model was trained on an internal dataset containing 25.5k hours of English LibriVox data from 4.3k speakers, and on the Common Voice 13 [3] dataset, which contains 3.2k hours of multi-lingual data covering 105 languages and 100k speakers. All training data was full-bandwidth 22.05kHz audio.

Performance

We evaluate our codec using several objective audio quality metrics: ViSQOL [4] and PESQ [5] for perceptual quality, ESTOI [6] for intelligibility, and the mel-spectrogram and STFT distances for reconstruction accuracy. Metrics are reported on the test sets of both the LibriVox and CommonVoice data. The model has not been trained or evaluated on non-speech audio.

Dataset	ViSQOL	PESQ	ESTOI	Mel Distance	STFT Distance
LibriVox	4.48	3.43	0.92	0.069	0.034
CommonVoice	4.51	3.21	0.91	0.100	0.057

How to Use This Model

The model is available for use in the NVIDIA NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

from nemo.collections.tts.models import AudioCodecModel
nemo_codec_model = AudioCodecModel.from_pretrained('mel_codec_22khz_fullband_medium')

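Getting discrete tokens from audio

import librosa
import torch
# Load the file as mono audio, resampled to the model's sample rate
audio, _ = librosa.load("<path_to_audio>", sr=nemo_codec_model.sample_rate)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Add a batch dimension: shape [1, num_samples]
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
# Length of each batch item, in samples
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)
# Freeze the parameters and switch the model to eval mode for inference
nemo_codec_model.freeze()
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)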

Reconstructing audio from discrete tokens

reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

Listen to audio

import soundfile as sf
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write("<path_to_output_audio>", output_audio, samplerate=nemo_codec_model.sample_rate)

Input

The model accepts single-channel audio sampled at 22050 Hz as input.
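If a WAV file is going to be passed to the model without going through `librosa.load` (which resamples to the requested rate and downmixes to mono by default), it can help to check the file header first. The following is a minimal sketch using only the Python standard library; `check_wav_format` is a hypothetical helper written for this example, not a NeMo function.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 22050  # Hz, from the input description
EXPECTED_CHANNELS = 1         # single-channel (mono)


def check_wav_format(path: str) -> list:
    """Return a list of format problems; an empty list means the header matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(
                f"expected {EXPECTED_CHANNELS} channel, found {wav.getnchannels()}"
            )
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(
                f"expected {EXPECTED_SAMPLE_RATE} Hz, found {wav.getframerate()} Hz"
            )
    return problems


# Demo: write one second of 16-bit mono silence at 22050 Hz and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(22050)
    wav.writeframes(b"\x00\x00" * 22050)

print(check_wav_format(demo_path))  # []
```

A file that fails the check can be resampled or downmixed before encoding, for example by loading it through `librosa.load` as in the usage snippet.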

Output

The model encodes audio to discrete tokens and decodes the discrete tokens to reconstruct the original audio.
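To get a feel for how compact the token representation is, the token rate and bits per token from the model overview can be used to estimate the number of tokens and the payload size for a clip of a given duration. This is only an estimate: the exact number of tokens returned by the model depends on its framing and padding, which are not described here. The function names are illustrative.

```python
# Figures quoted in the model overview.
TOKEN_RATE_HZ = 86.1   # tokens per second
BITS_PER_TOKEN = 80    # bits per token


def estimate_token_count(duration_s: float) -> int:
    """Approximate number of tokens for a clip of the given duration."""
    return round(duration_s * TOKEN_RATE_HZ)


def estimate_payload_bytes(duration_s: float) -> float:
    """Approximate size in bytes of the quantized representation."""
    return estimate_token_count(duration_s) * BITS_PER_TOKEN / 8


print(estimate_token_count(10.0))    # 861
print(estimate_payload_bytes(10.0))  # 8610.0
```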

Limitations

The model is trained on 22.05 kHz speech data. The reconstructed audio might not be accurate for low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or for non-speech audio.

References

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.