
TTS Es Multispeaker FastPitch HiFiGAN

Description
This collection contains two models. 1) Multi-speaker 44100Hz FastPitch trained on approximately 20 hours of Latin American Spanish speech from 174 speakers. 2) HiFiGAN trained on mel spectrograms produced by the Multi-speaker FastPitch in (1).
Publisher
NVIDIA
Latest Version
1.15.0
Modified
April 4, 2023
Size
501 MB

Model Overview

This collection contains two models:

  1. Multi-speaker 44100Hz FastPitch (around 50M parameters) trained on approximately 20 hours of Latin American Spanish speech from 174 speakers (1 to 10 minutes of audio per speaker) spread across 6 locales (es-AR, es-CL, es-CO, es-PE, es-PR, es-VE) from OpenSLR.

  2. HiFiGAN trained on mel spectrograms produced by the Multi-speaker FastPitch in (1).

Model Architecture

FastPitch [1] is a fully parallel text-to-speech model based on FastSpeech and conditioned on fundamental frequency contours. The model predicts pitch contours during inference. Altering these predictions can make the generated speech more expressive, better matched to the semantics of the utterance, and ultimately more engaging to the listener. FastPitch uses a fully parallel Transformer architecture, which gives it a much higher real-time factor than Tacotron2 for mel-spectrogram synthesis of a typical utterance. It also uses an unsupervised speech-text aligner [2] and is trained using a mixed text representation of graphemes and IPA phonemes [3].
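
As a rough illustration of what altering the pitch predictions can mean, the sketch below shifts a fundamental-frequency contour by a number of semitones and scales its variation around the mean of the voiced frames. The function names, the use of values in Hz, and the convention that 0.0 marks an unvoiced frame are assumptions made for this example; FastPitch may represent pitch differently internally (for instance as normalized values).

```python
from statistics import mean

def shift_semitones(f0_hz, semitones):
    """Shift voiced frames by `semitones`; unvoiced frames (0.0) stay 0.0."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]

def scale_variation(f0_hz, amount):
    """Expand (amount > 1) or flatten (amount < 1) the contour around its voiced mean."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return list(f0_hz)
    center = mean(voiced)
    return [center + (f - center) * amount if f > 0 else 0.0 for f in f0_hz]

# Hypothetical contour: two unvoiced frames around three voiced ones.
contour = [0.0, 110.0, 120.0, 130.0, 0.0]
raised = shift_semitones(contour, 12)   # 12 semitones is one octave, so values double
flat = scale_variation(contour, 0.0)    # amount=0 collapses voiced frames to their mean
```

Shifting changes the overall register of the voice, while scaling the variation makes the delivery more or less monotone.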

HiFiGAN [4] is a generative adversarial network (GAN) model that generates audio from mel spectrograms, in this case those produced by the Multi-speaker FastPitch in (1). The generator uses transposed convolutions to upsample mel spectrograms to audio.
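
To make the upsampling concrete, the sketch below computes how many audio samples a stack of 1-D transposed convolutions produces for a given number of mel frames. The length formula is the standard one for a transposed convolution with dilation 1 and no output padding. The upsampling rates and kernel sizes are those of the V1 configuration in the HiFi-GAN paper [4] and are an assumption here; this checkpoint may use different values.

```python
def conv_transpose_out_len(length, kernel_size, stride, padding):
    """Output length of a 1-D transposed convolution (dilation=1, output_padding=0)."""
    return (length - 1) * stride - 2 * padding + kernel_size

def generator_out_len(n_frames, rates, kernels):
    """Apply each upsampling stage in turn, padding by (kernel - rate) // 2."""
    length = n_frames
    for rate, kernel in zip(rates, kernels):
        length = conv_transpose_out_len(length, kernel, rate, (kernel - rate) // 2)
    return length

# Assumed values (HiFi-GAN paper, V1 config), not read from this checkpoint.
RATES = (8, 8, 2, 2)
KERNELS = (16, 16, 4, 4)

# With this padding each stage multiplies the length by its rate,
# so 100 mel frames become 100 * 8 * 8 * 2 * 2 audio samples.
n_samples = generator_out_len(100, RATES, KERNELS)
```

The product of the upsampling rates therefore has to match the hop length used to compute the mel spectrograms, so that each frame maps back to one hop of audio.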

Dataset

The original OpenSLR dataset contains approximately 38 hours of audio sampled at 48000Hz. For training we downsample to 44100Hz, and trim the silence from the beginning and end of each file (totaling about 18 hours of silence), reducing the dataset size from 38 hours to 20 hours.

https://www.openslr.org/61

https://www.openslr.org/71

https://www.openslr.org/72

https://www.openslr.org/73

https://www.openslr.org/74

https://www.openslr.org/75

https://research.google/pubs/pub49150/
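
The preprocessing described above can be sketched in plain Python. The example works out the 48000Hz to 44100Hz resampling ratio and trims leading and trailing silence with a simple amplitude threshold. The model card does not say which tool or threshold was used, so `trim_silence`, its `threshold` value, and the sample-level criterion are assumptions for illustration only; real pipelines typically use a frame-energy criterion.

```python
from fractions import Fraction

SOURCE_RATE = 48000
TARGET_RATE = 44100

def resampled_length(n_samples, source_rate=SOURCE_RATE, target_rate=TARGET_RATE):
    """Approximate number of samples after resampling (rounded to nearest)."""
    return round(n_samples * Fraction(target_rate, source_rate))

def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

ratio = Fraction(TARGET_RATE, SOURCE_RATE)   # reduces to 147/160
one_second = resampled_length(48000)         # one second of audio at the target rate
# Quiet samples in the middle of the clip are kept; only the edges are trimmed.
trimmed = trim_silence([0.0, 0.001, 0.5, -0.3, 0.0, 0.2, 0.004, 0.0])
```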

Performance

No performance information available at this time.

How to Use this Model

The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To generate a spectrogram for a particular speaker, provide a speaker ID to FastPitch. Speaker IDs range from 0 to 173.
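
Because an out-of-range speaker ID is an easy mistake to make, it can help to check the value before calling the model. The helper below is a hypothetical convenience function, not part of NeMo; it only encodes the 0 to 173 range stated above.

```python
NUM_SPEAKERS = 174  # from the model description; valid IDs are 0..173

def validate_speaker_id(speaker):
    """Return `speaker` unchanged, or raise ValueError if it is not a valid ID."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(speaker, bool) or not isinstance(speaker, int):
        raise ValueError(f"speaker ID must be an integer, got {speaker!r}")
    if not 0 <= speaker < NUM_SPEAKERS:
        raise ValueError(
            f"speaker ID must be in [0, {NUM_SPEAKERS - 1}], got {speaker}"
        )
    return speaker
```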

NOTE: For best results you should use the vocoder (HiFiGAN) checkpoint in this model card along with the mel spectrogram generator (FastPitch) checkpoint.

import soundfile as sf
from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel

fastpitch_name = "tts_es_fastpitch_multispeaker"
hifigan_name = "tts_es_hifigan_ft_fastpitch_multispeaker"

# Load spectrogram generator
spec_generator = FastPitchModel.from_pretrained(fastpitch_name)

# Load Vocoder
model = HifiGanModel.from_pretrained(hifigan_name)

# Generate audio
text = "Escribe tu texto aquí."
# Optionally, provide custom IPA input escaped with |
# text = "|e s k ɾ ˈ i β e| tu |t ˈ e k s t o| aquí."

parsed = spec_generator.parse(text, normalize=False)
speaker = 5
spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=speaker)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
# The vocoder returns a batch of shape [1, num_samples]; keep the single waveform
audio = audio.detach().cpu().numpy()[0]

# Save the audio to disk in a file called speech.wav
sample_rate = 44100
sf.write("speech.wav", audio, sample_rate)
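
The commented-out line in the example above shows IPA input escaped with `|`. As a way to picture how such a mixed string divides into grapheme and IPA spans, here is a minimal sketch. `split_mixed_text` is a hypothetical helper written for this illustration; it is not NeMo's tokenizer and does not reflect how NeMo handles the input internally.

```python
def split_mixed_text(text):
    """Split text into ("graphemes", str) and ("ipa", str) segments on '|' delimiters."""
    parts = text.split("|")
    # Balanced delimiters always give an odd number of parts.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced '|' in input")
    segments = []
    for index, part in enumerate(parts):
        if not part:
            continue  # skip empty pieces, e.g. when the text starts with '|'
        kind = "ipa" if index % 2 == 1 else "graphemes"
        segments.append((kind, part))
    return segments

segments = split_mixed_text("|e s k ɾ ˈ i β e| tu |t ˈ e k s t o| aquí.")
kinds = [kind for kind, _ in segments]
```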

Input/Output

The FastPitch model accepts batches of text and speaker IDs and outputs mel spectrograms. The HiFiGAN model accepts batches of mel spectrograms and outputs audio.
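
Batching text usually means padding token sequences to a common length and supplying one speaker ID per utterance. The sketch below shows that bookkeeping in plain Python. The token IDs are made up, `pad_id=0` is an assumption, and the dictionary keys are hypothetical names rather than NeMo's actual batch format.

```python
def pad_token_batch(token_seqs, pad_id=0):
    """Right-pad each token sequence to the longest one; also return true lengths."""
    lengths = [len(seq) for seq in token_seqs]
    max_len = max(lengths)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in token_seqs]
    return padded, lengths

def make_batch(token_seqs, speaker_ids, pad_id=0):
    """Pair padded token sequences with one speaker ID per utterance."""
    if len(token_seqs) != len(speaker_ids):
        raise ValueError("need exactly one speaker ID per utterance")
    padded, lengths = pad_token_batch(token_seqs, pad_id)
    return {"tokens": padded, "token_lengths": lengths, "speakers": list(speaker_ids)}

# Two utterances of different length, spoken by speakers 5 and 12.
batch = make_batch([[5, 9, 2], [7, 4]], speaker_ids=[5, 12])
```

Keeping the true lengths alongside the padded tokens lets downstream code ignore the padding when it measures or trims the output.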

Limitations

This model was trained on a limited amount of publicly available speech data. Performance might degrade for speech that includes terms or vernacular the model has not been trained on. The model might also perform worse on certain locales, regional dialects, or accents.

Versions

1.15.0 (current): Added IPA support. Contains a built-in pronunciation dictionary and accepts mixed inputs of graphemes and IPA phonemes. Trained with speaker-level pitch normalization.

1.14.0: The original version, released with NeMo 1.14.0, which supports Spanish grapheme input.

References

[1] FastPitch: https://arxiv.org/abs/2006.06873

[2] One TTS Alignment To Rule Them All: https://arxiv.org/abs/2108.10447

[3] Mixed representation training: https://arxiv.org/abs/1811.07240

[4] HiFiGAN paper: https://arxiv.org/abs/2010.05646

[5] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo

[6] OpenSLR data paper: https://research.google/pubs/pub49150/

NVIDIA License

1. Definitions

"Licensor" means any person or entity that distributes its Work.

"Work" means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.

The terms "reproduce," "reproduction," "derivative works," and "distribution" have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.

Works are "made available" under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.

2. License Grant

2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.

3. Limitations

3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.

3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work ("Your Terms") only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.

3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, "non-commercially" means for research or evaluation purposes only.

3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.

3.5 Trademarks. This license does not grant any rights to use any Licensor's or its affiliates' names, logos, or trademarks, except as necessary to reproduce the notices described in this license.

3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.

4. Disclaimer of Warranty.

THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.

5. Limitation of Liability.

EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.