The NeMo Mel Codec is a spectrogram-based audio codec that compresses audio into a quantized representation by encoding mel-spectrogram features. Compared to standard audio codecs, the mel codec produces better audio quality when used with speech synthesis models.
The model works with full-bandwidth 44.1 kHz audio. The quantized features (tokens) are encoded at a rate of 86.1 tokens per second with 80 bits per token, resulting in a bitrate of 6.9 kbps.
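As a quick check of the arithmetic: 86.1 tokens/s × 80 bits/token = 6,888 bits/s ≈ 6.9 kbps.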
The NeMo Mel Codec model is a non-autoregressive encoder-quantizer-decoder model for audio token extraction. This 64M parameter model is trained end-to-end using mel and STFT reconstruction losses, along with adversarial training using a multi-period discriminator and a multi-scale complex STFT discriminator.
The model uses the multi-band mel-spectrogram FSQ configuration described in [1].
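For readers unfamiliar with finite scalar quantization (FSQ), the sketch below illustrates the core quantization step in isolation: each dimension of a low-dimensional projection is bounded and then rounded to a small fixed grid of levels, with a straight-through gradient for training. The dimensionality and level counts here are illustrative placeholders, not this model's actual configuration; see [1] for the multi-band setup actually used.

import torch

def fsq_quantize(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # z: (..., D) projected encoder output; levels: (D,) number of levels per
    # dimension (illustrative values, e.g. levels = torch.tensor([5, 5, 5, 5])).
    z = torch.tanh(z)                   # bound each dimension to (-1, 1)
    half = (levels - 1) / 2
    z_q = torch.round(z * half) / half  # snap to the fixed grid of `levels` values
    return z + (z_q - z).detach()       # straight-through gradient for training

Each quantized vector then indexes one of prod(levels) discrete codes, and those indices are what the codec serializes as tokens.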
The model was trained using the NVIDIA NeMo toolkit [2] for 100k steps on 64 NVIDIA A100 GPUs with a batch size of 96 per GPU.
The model was trained on an internal dataset containing 12.8k hours of English LibriVox data from 2.7k speakers, and the Common Voice 13 [3] dataset, containing 1.4k hours of multilingual data spanning 79 languages and 50k speakers. All training data was full-bandwidth 44.1 kHz audio.
We evaluate our codec using several objective audio quality metrics: ViSQOL [4] and PESQ [5] for perceptual quality, ESTOI [6] for intelligibility, and mel-spectrogram and STFT distances for reconstruction accuracy. Metrics are reported on the test sets of both the LibriVox and Common Voice data. The model has not been trained or evaluated on non-speech audio.
| Dataset | ViSQOL (↑) | PESQ (↑) | ESTOI (↑) | Mel Distance (↓) | STFT Distance (↓) |
|---|---|---|---|---|---|
| LibriVox | 4.51 | 3.16 | 0.90 | 0.095 | 0.033 |
| Common Voice | 4.55 | 2.67 | 0.88 | 0.128 | 0.054 |
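As an illustration of the reconstruction metrics, the function below computes a log-mel L1 distance between reference and reconstructed audio. The STFT parameters and the exact distance definition used in the official evaluation are assumptions here, so treat this as a sketch rather than the reported metric.

import librosa
import numpy as np

def mel_distance(reference: np.ndarray, reconstruction: np.ndarray, sr: int = 44100) -> float:
    # n_fft, hop_length, and n_mels are assumed values, not the official
    # evaluation configuration.
    kwargs = dict(sr=sr, n_fft=2048, hop_length=512, n_mels=80)
    mel_ref = librosa.feature.melspectrogram(y=reference, **kwargs)
    mel_rec = librosa.feature.melspectrogram(y=reconstruction, **kwargs)
    return float(np.mean(np.abs(np.log(mel_ref + 1e-5) - np.log(mel_rec + 1e-5))))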
The model is available in the NVIDIA NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. The snippet below loads the checkpoint and runs a full encode-decode round trip.
import librosa
import soundfile as sf
import torch

from nemo.collections.tts.models import AudioCodecModel

# Load the pre-trained codec and freeze it for inference.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = AudioCodecModel.from_pretrained('mel_codec_44khz_medium').to(device)
nemo_codec_model.freeze()

# Load the input audio at the codec's sample rate (44.1 kHz).
audio, _ = librosa.load("<path_to_audio>", sr=nemo_codec_model.sample_rate)
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)  # shape: (1, num_samples)
audio_len = torch.tensor([audio_tensor.shape[1]]).to(device)

# Encode the audio to discrete tokens, then decode the tokens back to audio.
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)
reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# Save the reconstructed audio to disk.
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write("<path_to_output_audio>", output_audio, samplerate=nemo_codec_model.sample_rate)
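After encoding, you can sanity-check the token rate quoted earlier by relating the number of encoded frames to the input duration (this assumes encoded_len counts token frames, as returned by the snippet above):

duration_sec = audio_len.item() / nemo_codec_model.sample_rate
frames_per_sec = encoded_len.item() / duration_sec
print(f"~{frames_per_sec:.1f} token frames per second")  # should be close to 86.1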
The model accepts single-channel audio sampled at 44100 Hz as input.
The model encodes audio to discrete tokens and decodes the discrete tokens to reconstruct the original audio.
The model is trained on 44.1 kHz speech data. The reconstructed audio may not be accurate for low-bandwidth speech (e.g., 16 kHz speech upsampled to 44.1 kHz) or for non-speech audio.
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.