44.1kHz full-band Mel Codec model trained on multi-lingual speech.
Model Overview
The NeMo Mel Codec is a spectrogram-based audio codec which compresses audio into a quantized representation by encoding mel-spectrogram features. Compared to standard audio codecs, the mel codec produces better audio quality when used with speech synthesis models.
The model works with full-bandwidth 44.1kHz audio. The quantized features (tokens) are encoded at a token rate of 86.1 tokens per second with 80 bits per token, resulting in a bitrate of 6.9 kbps.
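The bitrate figure follows directly from the token rate and the token size. The short sketch below reproduces that arithmetic; the function names are illustrative only and are not part of NeMo. The samples-per-token value is an inference from the stated numbers, not a documented model parameter.

```python
# Figures taken from the model overview.
SAMPLE_RATE_HZ = 44100
TOKEN_RATE_HZ = 86.1      # tokens per second
BITS_PER_TOKEN = 80


def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate in kilobits per second for a fixed-rate token stream."""
    return token_rate_hz * bits_per_token / 1000.0


def samples_per_token(sample_rate_hz: int, token_rate_hz: float) -> float:
    """Average number of input audio samples represented by one token."""
    return sample_rate_hz / token_rate_hz


kbps = bitrate_kbps(TOKEN_RATE_HZ, BITS_PER_TOKEN)
hop = samples_per_token(SAMPLE_RATE_HZ, TOKEN_RATE_HZ)

print(f"bitrate: {kbps:.1f} kbps")
print(f"audio samples per token: about {hop:.0f}")
```

Each token therefore covers roughly 512 input samples, or about 11.6 ms of audio.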
Model Architecture

The NeMo Mel Codec model is a non-autoregressive encoder-quantizer-decoder model for audio token extraction. This 64M parameter model is trained end-to-end using mel and STFT reconstruction losses, together with adversarial training against a multi-period discriminator and a multi-scale complex STFT discriminator.
The model uses the mel-spectrogram FSQ configuration described in [1].
Training
The model was trained using the NVIDIA NeMo toolkit [2] for 100k steps on 64 NVIDIA A100 GPUs with a batch size of 96 per GPU.
Datasets
The model was trained on an internal dataset containing 12.8k hours of English LibriVox data from 2.7k speakers, and on the Common Voice 13 [3] dataset containing 1.4k hours of multi-lingual data covering 79 languages and 50k speakers. All training data was full-bandwidth 44.1kHz audio.
Performance
We evaluate the codec using several objective audio quality metrics: ViSQOL [4] and PESQ [5] for perceptual quality, ESTOI [6] for intelligibility, and mel-spectrogram and STFT distances for reconstruction accuracy. Metrics are reported on the test sets of both the LibriVox and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
| Dataset | ViSQOL | PESQ | ESTOI | Mel Distance | STFT Distance |
|---|---|---|---|---|---|
| LibriVox | 4.51 | 3.20 | 0.92 | 0.092 | 0.032 |
| CommonVoice | 4.52 | 2.93 | 0.90 | 0.126 | 0.054 |
How to Use This Model
The model is available for use in the NVIDIA NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically load the model from NGC
Getting discrete tokens from Audio
Reconstructing audio from discrete tokens
Listen to audio
Input
The model accepts single-channel audio sampled at 44100 Hz as input.
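As an illustration of the input requirement, the following sketch checks the sample rate and downmixes multi-channel audio to a single channel before it would be passed to the codec. It uses only the Python standard library. The helper names (`to_mono`, `prepare_input`) are hypothetical and not part of NeMo, and averaging channels is just one common way to downmix.

```python
EXPECTED_SAMPLE_RATE = 44100  # Hz, as stated for the model input


def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def prepare_input(frames, sample_rate):
    """Reject audio at the wrong sample rate, then downmix to one channel."""
    if sample_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sample_rate} Hz; "
            "resample before encoding"
        )
    return to_mono(frames)


# Three stereo frames with float samples in [-1.0, 1.0].
stereo = [(0.5, 0.25), (-0.25, 0.25), (1.0, 0.0)]
mono = prepare_input(stereo, 44100)
print(mono)
```

Audio recorded at another rate needs to be resampled to 44100 Hz first; the sketch deliberately raises an error instead of resampling silently.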
Output
The model encodes audio to discrete tokens and decodes the discrete tokens to reconstruct the original audio.
Limitations
The model is trained on 44.1 kHz speech data. The reconstructed audio might not be accurate for low-bandwidth speech (e.g. 16kHz speech upsampled to 44.1kHz) or for non-speech audio.
References
License
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.