
OpenVoice

Description
A collection of models to enable OpenVoice support for the NVIDIA In-Game Inferencing (NVIGI) SDK.
Publisher
-
Latest Version
OpenVoice v3
Modified
January 29, 2025
Size
195.04 MB

Overview

This is a collection of models to enable OpenVoice support for the NVIDIA In-Game Inferencing (NVIGI) SDK. Please see each overview section below for details on the following models:

  • BERT base model (uncased)
  • MeloTTS
  • OpenVoice Converter
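
As the overview sections below describe, the three models form a chain: BERT turns the input text into embeddings, MeloTTS synthesizes base speech, and the OpenVoice converter applies the reference speaker's voice. The following Python sketch only illustrates that ordering. Every function name and stub body here is a hypothetical placeholder, not the NVIGI SDK API; a real implementation would run the ONNX models at each step.

```python
from typing import List, Tuple

# Hypothetical stand-ins for the three ONNX models. Each stub records that
# it ran, so the order of the stages is visible in the returned trace.

def bert_embed(text: str, trace: List[str]) -> List[List[float]]:
    trace.append("bert")
    # Placeholder: one dummy embedding row per whitespace-separated word.
    return [[0.0] for _ in text.split()]

def melo_tts(text: str, embeddings: List[List[float]], trace: List[str]) -> List[float]:
    trace.append("melotts")
    # Placeholder: four samples of silence stand in for the base audio.
    return [0.0] * 4

def convert_voice(base_audio: List[float], reference: List[float], trace: List[str]) -> List[float]:
    trace.append("converter")
    # Placeholder: a real converter would re-voice the audio using the reference.
    return list(base_audio)

def synthesize(text: str, reference: List[float]) -> Tuple[List[float], List[str]]:
    """Run the three stages in the order the model card describes."""
    trace: List[str] = []
    embeddings = bert_embed(text, trace)
    base_audio = melo_tts(text, embeddings, trace)
    output = convert_voice(base_audio, reference, trace)
    return output, trace
```

Calling `synthesize("hello world", [0.1])` returns the placeholder audio together with the stage trace, which makes the data flow easy to inspect before wiring in real models.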

Model Overview : BERT embedding

Description:

BERT base (uncased) is a text embedding model. The OpenVoice Text-to-Speech (TTS) solution uses it to generate embeddings from input text. This model is intended for use with the NVIGI SDK OpenVoice TTS plugin.

This model is ready for commercial and non-commercial use.

Model Developer: Google

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see BERT base model (uncased).

License/Terms of Use:

This model is distributed under the Apache 2.0 license. Please refer to the BERT base model (uncased) Model Card for further details.

Reference(s):

BERT base model (uncased) Model Card

Model Architecture:

Architecture Type: Transformer
Network Architecture: BERT

Input:

Input Type(s): Text
Input Format(s): Int tokens
Input Parameters: 1D

Output:

Output Type(s): Embedding vectors
Output Format: Vector
Output Parameters: Two Dimensional (2D)
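
The input and output fields above can be read as: a 1D sequence of integer token ids goes in, and a 2D array of embedding vectors comes out. The Python sketch below illustrates those shapes without loading the model. It assumes the 2D output has one row per token and 768 columns (the hidden width of BERT base). The special-token ids are those of the bert-base-uncased vocabulary, while `TOY_VOCAB` and its ids are hypothetical and stand in for the real WordPiece vocabulary.

```python
from typing import Dict, List, Tuple

HIDDEN_SIZE = 768                         # hidden width of BERT base
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100    # special-token ids in bert-base-uncased

# Hypothetical toy vocabulary with made-up ids, for illustration only.
TOY_VOCAB: Dict[str, int] = {"hello": 1000, "world": 1001}

def encode(text: str) -> List[int]:
    """Lowercase the text (the model is uncased), split on whitespace,
    and map each word to an integer id, wrapping with [CLS] and [SEP]."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    return [CLS_ID] + ids + [SEP_ID]

def embedding_shape(token_ids: List[int]) -> Tuple[int, int]:
    """Assumed shape of the 2D output: one embedding vector per input token."""
    return (len(token_ids), HIDDEN_SIZE)
```

Because the model is uncased, `encode("Hello WORLD")` and `encode("hello world")` yield the same ids.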

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ada Lovelace

Supported Operating System(s):

  • Linux, Windows

Model Version(s):

  • BERT base model (uncased) q4_k_s 1.0

Training and Evaluation Datasets:

Training Dataset:
Links: legacy-datasets/wikipedia and bookcorpus/bookcorpus
Data Collection Method by dataset: Unknown
Labeling Method by dataset: Unknown
Properties:

  • Wikipedia dataset
    • ~12 GB of data based on Wikipedia articles
    • More information on Hugging Face model card
  • BookCorpus
    • 7,185 unique books
    • More information on Hugging Face model card

Evaluation Dataset:
Data Collection Method by dataset: Unknown
Labeling Method by dataset: Unknown
Properties: See Model Card under “GLUE Test Results.”

Inference:

Engine: ONNX
Test Hardware: RTX 4090

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.


Model Overview : MeloTTS v2 and v3

Description:

MeloTTS is a Text-to-Speech (TTS) model used in the OpenVoice TTS solution as the base model, whose output is then passed to voice conversion. V2 and V3 differ only in their weights. V3 produces more realistic voice quality but supports fewer speakers.

This model is ready for commercial/non-commercial use.

Model Developer: Myshell-ai

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see:

  • myshell-ai/MeloTTS-English-v2.
  • myshell-ai/MeloTTS-English-v3.

License/Terms of Use:

This model is distributed under the MIT license. Please refer to the MeloTTS Model Card for further details.

Reference(s):

  • myshell-ai/MeloTTS-English-v2.
  • myshell-ai/MeloTTS-English-v3.

Model Architecture:

Architecture Type: Transformer/Flow Network
Network Architecture: VITS architecture

Input:

Input Type(s): Phonemes, tones, embedding, speaker ID
Input Format(s): Int, Int, Float, Int
Input Parameters: 1D for all parameters

Output:

Output Type(s): Audio at a sampling rate of 44100 Hz
Output Format: Vector
Output Parameters: 2D
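
To make the input and output fields above concrete, the Python sketch below groups the four inputs into one structure and converts an output sample count into seconds at the stated 44100 Hz rate. The class name, field names, and validation rules are hypothetical. In particular, the check that there is one tone per phoneme is an assumption, not something this model card states.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_RATE_HZ = 44100  # output sampling rate stated in the model card

@dataclass
class MeloInputs:
    """Hypothetical container for the four model inputs (all 1D or scalar)."""
    phonemes: List[int]      # integer phoneme ids
    tones: List[int]         # integer tone ids
    embedding: List[float]   # float text embedding values
    speaker_id: int          # integer speaker index

    def validate(self) -> None:
        # Assumption: each phoneme carries exactly one tone.
        if len(self.phonemes) != len(self.tones):
            raise ValueError("phonemes and tones must have the same length")
        if self.speaker_id < 0:
            raise ValueError("speaker_id must be non-negative")

def duration_seconds(num_samples: int, rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Length of an audio clip in seconds, given its number of samples."""
    return num_samples / rate_hz
```

For example, a clip of 88200 samples at 44100 Hz lasts 2 seconds.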

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ada

Supported Operating System(s):

  • Windows, Linux

Model Version(s):

  • MELLO TTS float16 v2
  • MELLO TTS float16 v3

Training, Testing, and Evaluation Datasets:

Data Collection Method by dataset: Unknown
Labeling Method by dataset: Unknown
Properties: The dataset used to train this model is not known. The full testing dataset is also not known; however, the developer's website features numerous sample outputs.

Inference:

Engine: ONNX
Test Hardware: RTX 4090

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.


Model Overview : OpenVoice Converter Model

Description:

The OpenVoice converter model is used in the OpenVoice TTS solution to clone the voice of a reference speaker and apply it to the audio produced by the base model.

This model is ready for commercial/non-commercial use.

Model Developer: Myshell-ai

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see:

  • myshell-ai/OpenVoiceV2.

License/Terms of Use:

This model is distributed under the MIT license. Please refer to myshell-ai/OpenVoiceV2 for further details.

Reference(s):

  • myshell-ai/OpenVoiceV2.
  • OpenVoice paper

Model Architecture:

Architecture Type: Transformer/Flow Network
Network Architecture: Encoder-Decoder structure with an invertible normalizing flow.

Input:

Input Type(s): Audio at a sampling rate of 22050 Hz, reference spectrogram
Input Format(s): Float, Float
Input Parameters: 1D

Output:

Output Type(s): Audio at a sampling rate of 22050 Hz
Output Format: Vector
Output Parameters: 2D
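
The base model emits audio at 44100 Hz, while the converter takes audio at 22050 Hz, so the rate has to be halved somewhere between the two. This model card does not say where or how that happens. The Python sketch below assumes a plain 2:1 downsample that averages adjacent sample pairs. The function name is hypothetical, and a production pipeline would more likely use a proper anti-aliasing resampler.

```python
from typing import List

def downsample_by_two(samples: List[float]) -> List[float]:
    """Halve the sample rate by averaging each adjacent pair of samples.

    Averaging acts as a crude low-pass filter before decimation.
    A trailing unpaired sample is dropped.
    """
    pairs = len(samples) // 2
    return [(samples[2 * i] + samples[2 * i + 1]) / 2.0 for i in range(pairs)]
```

One second of 44100 Hz audio (44100 samples) becomes 22050 samples, which is one second at the converter's input rate.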

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ada

Supported Operating System(s):

  • Windows, Linux

Model Version(s):

  • OpenVoice converter TTS float16 v2

Training, Testing, and Evaluation Datasets:

Data Collection Method by dataset: Unknown
Labeling Method by dataset: Unknown
Properties: The dataset used to train this model is not known. The full testing dataset is also not known; however, the developer's website features numerous sample outputs.

Inference:

Engine: ONNX
Test Hardware: RTX 4090

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.