NVIDIA

Riva TTS A²-Flow for Nv IGI SDK

Model

Riva TTS A²-Flow model for the NVIDIA In-Game Inferencing (NVIGI) SDK.

Runs on RTX

Speech Synthesis: Multilingual Zero-shot Voice Characterization - A2-Flow Model Overview

Description:

The Multilingual Zero-shot Voice Characterization A2-Flow model analyzes a speaker’s voice and replicates voice qualities such as pitch, timbre, and speech rate from an audio prompt of 5 seconds or less. It achieves a speaker similarity of over 70% and a mean opinion score (MOS) of 4.40. Because it preserves the characteristics that make up a speaker’s unique audio signature, it can produce high-quality speech when used in combination with a vocoder model such as BigVGAN [1].

A2-Flow [2] is an alignment-aware pre-training method that builds on the E2 TTS [3] training framework to learn the alignment between unit sequences and speech frames. By using de-duplicated units that retain only phonetic content, A2-Flow learns this alignment without relying on a phoneme duration predictor. This allows direct application to zero-shot voice conversion, where phonetic content is transferred to the target speaker’s voice without additional fine-tuning. The model is packaged with BigVGAN, a universal vocoder that generalizes well to out-of-distribution scenarios without fine-tuning.
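The description does not spell out how units are de-duplicated. A minimal sketch, assuming it means collapsing runs of identical consecutive units (a common convention for discrete speech units), could look like the following. The function name and the example unit IDs are hypothetical and are not taken from the A2-Flow implementation.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse each run of identical consecutive units into one unit.

    Assumption: this is what "de-duplicated units" refers to. Duration
    information (how long each unit lasted) is discarded on purpose.
    """
    return [unit for unit, _run in groupby(units)]


# Hypothetical frame-level unit IDs, one per speech frame.
frame_units = [12, 12, 12, 7, 7, 31, 12, 12]
print(deduplicate_units(frame_units))  # [12, 7, 31, 12]
```

Under this assumption the de-duplicated sequence no longer says how many frames each unit spans, which is why the model has to learn the alignment to speech frames itself.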

This model is ready for commercial use.

License/Terms of Use:

NVIDIA AI Foundation Models Community License Agreement

References:

[1] BigVGAN: A Universal Neural Vocoder with Large-Scale Training
[2] (A2-Flow Paper coming soon)
[3] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
[4] Flow Matching for Generative Modeling

Model Architecture:

Architecture Type: Flow Matching
Network Architecture: Optimal Transport Conditional Flow Matching (OT-CFM)-based Masked Speech Modeling

Flow Matching [4] (FM) is a simulation-free approach for training Continuous Normalizing Flows (CNFs) based on regressing vector fields of fixed conditional probability paths. It is compatible with a general family of Gaussian probability paths for transforming between noise and data samples, which subsumes existing diffusion paths as specific instances. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization.
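To make the OT conditional path concrete, the sketch below computes the interpolated sample and the regression target from the Flow Matching paper [4] for a single vector. It is an illustration in plain Python, not this model's training code; the `SIGMA_MIN` value and the three-element example frame are assumptions chosen for readability.

```python
import random

SIGMA_MIN = 1e-4  # small terminal noise scale; illustrative value only


def ot_conditional_path(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for the OT conditional path of Flow Matching.

    x0: noise sample, x1: data sample, t: time in [0, 1].
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0   (the vector field to regress)
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def flow_matching_loss(predicted, target):
    """Mean squared error between a predicted and a target vector field."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


random.seed(0)
x1 = [0.5, -1.0, 2.0]                      # stand-in for one data frame
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
t = random.random()                        # uniformly sampled time step
x_t, u_t = ot_conditional_path(x0, x1, t)

# A network would predict the field at (x_t, t); a perfect prediction
# equals u_t, so the loss is zero.
print(flow_matching_loss(u_t, u_t))  # 0.0
```

Note that the target `u_t` does not depend on `t`: along the OT path each sample moves in a straight line at constant velocity, which is one intuition for why these paths are cheaper to sample than diffusion paths.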

Input:

Input Type: Text + Audio
Input Format:
For Text: Strings (Graphemes in US English, US Spanish, French or German)
For Audio: binary voice file

Note: A sample set of binary voice files will be included with the NVIGI SDK plugin. To generate your own binary voice files, please Contact Us.

Input Parameters:
For text: One-Dimensional (1D)
For audio prompt: Two-Dimensional (batch x time)
Other Properties related to Input:
For Text: Text strings are limited to 400 characters. Keep the text short enough that the synthesized audio does not exceed 20 seconds; longer outputs significantly degrade quality.
For Audio: Mono, PCM-encoded 16 bit audio; sampling rate of 22.05 kHz; between 3 and 5 second duration.
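The limits above can be checked before a request is sent. The sketch below is a hypothetical pre-flight check and is not part of the NVIGI SDK. It assumes the audio prompt is still available as a WAV file, that is, before it is converted to the binary voice file the plugin consumes, and it uses only the Python standard library.

```python
import wave

MAX_TEXT_CHARS = 400
PROMPT_SAMPLE_RATE_HZ = 22050
PROMPT_SAMPLE_WIDTH_BYTES = 2   # 16-bit PCM
PROMPT_MIN_SECONDS = 3.0
PROMPT_MAX_SECONDS = 5.0


def check_text(text):
    """Return a list of problems with the input text (empty if none)."""
    problems = []
    if len(text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(text)} characters, limit is {MAX_TEXT_CHARS}")
    return problems


def check_prompt_wav(path):
    """Return a list of problems with a WAV audio prompt (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("prompt must be mono")
        if wav.getsampwidth() != PROMPT_SAMPLE_WIDTH_BYTES:
            problems.append("prompt must be 16-bit PCM")
        if wav.getframerate() != PROMPT_SAMPLE_RATE_HZ:
            problems.append("prompt must be sampled at 22.05 kHz")
        duration = wav.getnframes() / wav.getframerate()
        if not PROMPT_MIN_SECONDS <= duration <= PROMPT_MAX_SECONDS:
            problems.append(f"prompt is {duration:.2f} s, expected 3 to 5 s")
    return problems


print(check_text("Hello there."))  # []
```

The character limit is easy to verify up front, but the 20-second ceiling on synthesized audio depends on speech rate and can only be estimated from the text.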

Output:

Output Type: Audio
Output Format: Audio of shape (batch x time) in WAV format
Output Parameters: Two-Dimensional (batch x time)
Other Properties related to Output: Mono, PCM-encoded 16 bit audio; sampling rate of 22.05 kHz; 20 Second Maximum Length.
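These output properties fix the worst-case buffer size for one utterance. The sketch below works out that size and writes raw samples to a WAV file with the Python standard library. It assumes the caller receives raw 16-bit mono PCM bytes; `write_mono_pcm16` is a hypothetical helper, not an SDK function.

```python
import wave

SAMPLE_RATE_HZ = 22050
BYTES_PER_SAMPLE = 2        # 16-bit PCM
MAX_OUTPUT_SECONDS = 20

# Worst-case size of one synthesized utterance.
MAX_OUTPUT_SAMPLES = SAMPLE_RATE_HZ * MAX_OUTPUT_SECONDS
MAX_OUTPUT_BYTES = MAX_OUTPUT_SAMPLES * BYTES_PER_SAMPLE


def write_mono_pcm16(path, pcm_bytes):
    """Write raw 16-bit mono PCM bytes to a 22.05 kHz WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm_bytes)


print(MAX_OUTPUT_SAMPLES, MAX_OUTPUT_BYTES)  # 441000 882000
```

At under 1 MB per utterance, a fixed-size buffer of `MAX_OUTPUT_BYTES` is a reasonable choice when allocations at synthesis time need to be avoided.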

Software Integration:

Runtime Engine(s): TensorRT

Supported Hardware Platform(s):

NVIDIA Lovelace and later.

Supported Operating System(s):

Windows
Linux

Model Version(s):

Riva_TTS_A2-Flow_v1

Training & Evaluation Datasets:

Training Dataset:

Link: DATA IS PRIVATE AND ONLY NVIDIA-INTERNAL
Datasets PLC and Legal Approval (MCAT, T5 and A2Flow)
Dataset License(s): NVIDIA proprietary data. NSpect ID: see doc

Data Collection Method by dataset: Human
Properties: ~62k hours of TTS, speech and audio data combining proprietary and public datasets.

Evaluation Dataset:

Link: DATA IS PRIVATE AND ONLY NVIDIA-INTERNAL
Datasets PLC and Legal Approval (MCAT, T5 and A2Flow)
Dataset License(s): NVIDIA proprietary data. NSpect ID: see doc

Data Collection Method by dataset: Human
Properties: 30 minutes of TTS, speech and audio data combining proprietary and public datasets.

Inference:

Engine: TensorRT
Test Hardware:

NVIDIA RTX 4090

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Publisher

NVIDIA

Latest Version: 1.3

Updated: May 1, 2025 UTC

Compressed Size: 1.16 GB

Labels

Gaming, NSPECT-0GBH-AP7A, Nv IGI SDK, Text to Speech, TTS