The Multilingual Zero-shot Voice Characterization A2-Flow model analyzes a speaker's voice and replicates voice qualities such as pitch, timbre, and speech rate from an audio prompt of 5 seconds or less. It achieves a speaker similarity of over 70% and a mean opinion score (MOS) of 4.40. While preserving the characteristics that make up a speaker's unique voice signature, it generates high-quality speech when combined with a vocoder model such as BigVGAN [1].
A2-Flow [2] is an alignment-aware pre-training method that builds on the E2 TTS [3] training framework to learn the alignment between unit sequences and speech frames. By using de-duplicated units that retain only phonetic content, A2-Flow learns this alignment without relying on a phoneme duration predictor. This allows direct application to zero-shot voice conversion, where phonetic content is transferred to the target speaker's voice without additional fine-tuning. The model is packaged with BigVGAN, a universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning.
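The de-duplication step mentioned above can be illustrated with a minimal sketch. It assumes that "de-duplicated units" means collapsing runs of consecutive identical frame-level units into one unit, which is the common convention for discrete speech units. The A2-Flow paper is not yet available, so the actual procedure may differ. The function name and the unit IDs are hypothetical.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse each run of consecutive identical units into a single unit.

    Assumption: this mirrors the usual convention for discrete speech units;
    it is not taken from the A2-Flow implementation.
    """
    return [unit for unit, _ in groupby(units)]


# Hypothetical frame-level unit IDs, one per speech frame.
frame_units = [12, 12, 12, 7, 7, 31, 31, 31, 31, 7]
print(deduplicate_units(frame_units))  # [12, 7, 31, 7]
```

After de-duplication the sequence no longer encodes how many frames each unit spans, so the model has to learn the unit-to-frame alignment itself.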
This model is ready for commercial use.
NVIDIA AI Foundation Models Community License Agreement
[1] BigVGAN: A Universal Neural Vocoder with Large-Scale Training
[2] (A2-Flow Paper coming soon)
[3] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
[4] Flow Matching for Generative Modeling
Architecture Type: Flow Matching
Network Architecture: Optimal Transport Conditional Flow Matching (OT-CFM)-based Masked Speech Modeling
Flow Matching [4] (FM) is a simulation-free approach for training Continuous Normalizing Flows (CNFs) based on regressing vector fields of fixed conditional probability paths. It is compatible with a general family of Gaussian probability paths for transforming between noise and data samples, which subsumes existing diffusion paths as specific instances. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization.
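The OT conditional path described above can be sketched in a few lines. This follows the formulation in the Flow Matching paper [4]: a noise sample x0 and a data sample x1 are linearly interpolated, and the regression target is the constant vector field along that line. It is an illustrative sketch using plain Python lists, not the model's training code. The function names and the `sigma_min` default are assumptions.

```python
import random


def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, u_t) for the OT conditional path from noise x0 to data x1.

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1   (point on the path)
    u_t = x1 - (1 - sigma_min) * x0                 (target vector field)
    """
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted_field, target_field):
    """Mean squared error between a predicted and a target vector field."""
    return sum((p - q) ** 2 for p, q in zip(predicted_field, target_field)) / len(target_field)


random.seed(0)
x1 = [0.5, -1.0, 2.0]                       # stand-in for one data sample
x0 = [random.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = random.random()                         # time drawn uniformly from [0, 1)
x_t, u_t = ot_cfm_pair(x0, x1, t)

# A real model would predict the field from (x_t, t); a perfect prediction gives zero loss.
print(cfm_loss(u_t, u_t))  # 0.0
```

Because the target field does not depend on t, samples move along straight lines from noise to data, which is the property behind the efficiency claim above.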
Input Type: Text + Audio
Input Format:
For Text: Strings (Graphemes in US English, US Spanish, French or German)
For Audio: binary voice file
Input Parameters:
For text: One-Dimensional (1D)
For audio prompt: Two-Dimensional (batch x time)
Other Properties related to Input:
For Text: Limit of 400 characters per text string. The text should also produce no more than 20 seconds of synthesized audio, because longer outputs degrade quality significantly.
For Audio: Mono, PCM-encoded 16-bit audio; sampling rate of 22.05 kHz; duration between 3 and 5 seconds.
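The input constraints above can be checked on the client side before a request is sent. The sketch below uses only Python's standard `wave` module and assumes the audio prompt is stored as a WAV file. The function and constant names are hypothetical and are not part of any Riva API.

```python
import wave

MAX_TEXT_CHARS = 400
SAMPLE_RATE_HZ = 22050
MIN_PROMPT_S, MAX_PROMPT_S = 3.0, 5.0


def check_text(text):
    """Return True if the text respects the 400-character limit."""
    return len(text) <= MAX_TEXT_CHARS


def check_audio_prompt(path):
    """Return a list of constraint violations for a WAV prompt (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("audio must be mono")
        if wav.getsampwidth() != 2:  # sample width in bytes; 2 bytes = 16 bits
            problems.append("audio must be 16-bit PCM")
        if wav.getframerate() != SAMPLE_RATE_HZ:
            problems.append("sampling rate must be 22.05 kHz")
        duration_s = wav.getnframes() / wav.getframerate()
        if not MIN_PROMPT_S <= duration_s <= MAX_PROMPT_S:
            problems.append("duration must be between 3 and 5 seconds")
    return problems
```

The character limit is checked directly. The 20-second limit on synthesized audio cannot be verified from the text alone, so it is left out of this sketch.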
Output Type: Audio
Output Format: Audio of shape (batch x time) in wav format
Output Parameters: Two-Dimensional (batch x time)
Other Properties related to Output: Mono, PCM-encoded 16-bit audio; sampling rate of 22.05 kHz; maximum length of 20 seconds.
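As a rough illustration of these output properties, the sketch below computes the maximum number of samples per utterance and writes one row of a (batch x time) output to a mono 16-bit WAV file. It assumes the samples are floats in the range [-1.0, 1.0], which the card does not state. The function and constant names are hypothetical.

```python
import struct
import wave

SAMPLE_RATE_HZ = 22050
MAX_OUTPUT_S = 20
MAX_OUTPUT_SAMPLES = SAMPLE_RATE_HZ * MAX_OUTPUT_S  # 441000 samples per utterance


def write_mono_pcm16(path, samples):
    """Write float samples (assumed to lie in [-1.0, 1.0]) as a mono 16-bit WAV file."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    # Scale to the signed 16-bit range and pack as little-endian integers.
    frames = struct.pack("<%dh" % len(clipped), *(int(round(s * 32767)) for s in clipped))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(2)               # 2 bytes = 16 bits
        wav.setframerate(SAMPLE_RATE_HZ)  # 22.05 kHz
        wav.writeframes(frames)
```

At 22.05 kHz and 2 bytes per sample, a full 20-second utterance holds 882,000 bytes of audio data, plus the WAV header.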
Runtime Engine(s): TensorRT
Supported Hardware Platform(s):
Supported Operating System(s):
Riva_TTS_A2-Flow_v1
Link: DATA IS PRIVATE AND ONLY NVIDIA-INTERNAL
Datasets PLC and Legal Approval (MCAT, T5 and A2Flow)
Dataset License(s): NVIDIA proprietary data. NSpect ID: see doc
** Data Collection Method by dataset
Link: DATA IS PRIVATE AND ONLY NVIDIA-INTERNAL
Datasets PLC and Legal Approval (MCAT, T5 and A2Flow)
Dataset License(s): NVIDIA proprietary data. NSpect ID: see doc
** Data Collection Method by dataset
Engine: TensorRT
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.