Mel Codec 44kHz fullband medium

44.1kHz full-band Mel Codec model trained on multi-lingual speech.

Model Overview

The NeMo Mel Codec is a spectrogram-based audio codec which compresses audio into a quantized representation by encoding mel-spectrogram features. Compared to standard audio codecs, the mel codec produces better audio quality when used with speech synthesis models.

The model works with full-bandwidth 44.1kHz audio. The quantized features (tokens) are encoded at a rate of 86.1 tokens per second with 80 bits per token, which gives a bitrate of about 6.9 kbps.
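The bitrate follows directly from the token rate and the bits per token. A minimal sketch of that arithmetic is shown here; the function names are illustrative and not part of NeMo, and the token-count estimate is only approximate because it ignores any padding the model may apply at the edges of a clip.

```python
TOKEN_RATE_HZ = 86.1   # tokens per second, as stated in this model card
BITS_PER_TOKEN = 80    # bits per token, as stated in this model card


def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Return the bitrate in kilobits per second (1 kbps = 1000 bits/s)."""
    return token_rate_hz * bits_per_token / 1000.0


def approx_token_count(duration_s: float, token_rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Rough number of tokens for a clip of the given duration (ignores edge padding)."""
    return round(duration_s * token_rate_hz)


print(f"{bitrate_kbps(TOKEN_RATE_HZ, BITS_PER_TOKEN):.1f} kbps")  # 6.9 kbps
print(approx_token_count(10.0))  # about 861 tokens for a 10-second clip
```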

Model Architecture

The NeMo Mel Codec model is a non-autoregressive encoder-quantizer-decoder model for audio token extraction. This 64M parameter model is trained end-to-end using mel and STFT reconstruction losses, together with adversarial training against a multi-period discriminator and a multi-scale complex STFT discriminator.

The model uses the mel-spectrogram FSQ configuration described in [1].

Training

The model was trained using the NVIDIA NeMo toolkit [2] for 100k steps on 64 NVIDIA A100 GPUs with a batch size of 96 per GPU.
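For a sense of scale, the GPU count and per-GPU batch size imply a global batch size and a total number of training examples processed. This is a quick sketch of that arithmetic, assuming plain data parallelism with no gradient accumulation (the card states neither):

```python
NUM_GPUS = 64          # from this model card
BATCH_PER_GPU = 96     # from this model card
TRAIN_STEPS = 100_000  # from this model card

# Assumption: one optimizer step consumes one batch from every GPU.
global_batch = NUM_GPUS * BATCH_PER_GPU
examples_seen = global_batch * TRAIN_STEPS

print(global_batch)   # 6144
print(examples_seen)  # 614400000
```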

Datasets

The model was trained on an internal dataset containing 12.8k hours of English LibriVox data from 2.7k speakers, and on the Common Voice 13 [3] dataset, which contains 1.4k hours of multi-lingual data covering 79 languages and 50k speakers. All training data was full-bandwidth 44.1kHz audio.

Performance

We evaluate our codec using several objective audio quality metrics: ViSQOL [4] and PESQ [5] for perceptual quality, ESTOI [6] for intelligibility, and the mel-spectrogram and STFT distances for reconstruction accuracy. Metrics are reported on the test sets of both the LibriVox and CommonVoice data. The model has not been trained or evaluated on non-speech audio.

Dataset	ViSQOL	PESQ	ESTOI	Mel Distance	STFT Distance
LibriVox	4.51	3.20	0.92	0.092	0.032
CommonVoice	4.52	2.93	0.90	0.126	0.054

How to Use This Model

The model is available for use in the NVIDIA NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

from nemo.collections.tts.models import AudioCodecModel
nemo_codec_model = AudioCodecModel.from_pretrained('mel_codec_44khz_fullband_medium')

Getting discrete tokens from Audio

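import librosa
import torch

# Load the file as mono audio, resampled to the codec's sample rate.
audio, _ = librosa.load("<path_to_audio>", sr=nemo_codec_model.sample_rate)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = nemo_codec_model.to(device)  # keep the model and its inputs on the same device
# Shape (1, num_samples): a batch containing a single utterance.
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)
# Switch to eval mode and disable gradients before inference.
nemo_codec_model.freeze()
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)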

Reconstructing audio from discrete tokens

reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

Listen to audio

import soundfile as sf
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write("<path_to_output_audio>", output_audio, samplerate=nemo_codec_model.sample_rate)

Input

The model accepts single-channel audio sampled at 44100 Hz as input.
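When audio is loaded by some means other than the `librosa.load` call shown earlier (which resamples to the requested rate and downmixes to mono by default), it can help to verify the file format first. This is a minimal sketch using only the Python standard library; `check_wav_format` is a hypothetical helper, not part of NeMo, and the `wave` module reads only uncompressed PCM WAV files.

```python
import wave
from typing import List

EXPECTED_SAMPLE_RATE = 44100  # Hz, from this model card
EXPECTED_CHANNELS = 1         # single-channel (mono), from this model card


def check_wav_format(path: str) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channel, got {channels}")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected {EXPECTED_SAMPLE_RATE} Hz, got {sample_rate} Hz")
    return problems
```

A file that fails either check should be downmixed or resampled before encoding.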

Output

The model encodes audio to discrete tokens and decodes the discrete tokens to reconstruct the original audio.

Limitations

The model is trained on 44.1 kHz speech data. The reconstructed audio might not be accurate for low-bandwidth speech (e.g. 16kHz speech upsampled to 44.1kHz) or for non-speech audio.

References

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.

Latest Version: v1

Updated: July 5, 2024 UTC

Compressed Size: 406.78 MB

Labels: Audio Synthesis, NeMo, Speech to Text