Whisper-large-v3-turbo is used to transcribe short-form audio files and is designed to be compatible with OpenAI's sequential long-form transcription algorithm. It is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on more than 5 million hours of labeled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. Whisper-large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3, with the number of decoding layers reduced from 32 to 4. As a result, the model transcribes significantly faster with minimal degradation in accuracy. See the [paper](https://arxiv.org/abs/2311.00430) for more information.
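For context, a minimal transcription sketch using the Hugging Face Transformers pipeline is shown below; the `openai/whisper-large-v3-turbo` checkpoint identifier and the example file name are assumptions for illustration and are not defined by this card.

```python
# Minimal sketch: short-form transcription with the Transformers ASR pipeline.
# The checkpoint id and audio file name below are assumptions for illustration.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
    device=device,
)

# Transcribe a short clip (up to ~30 seconds of audio).
result = asr("sample_speech.wav")
print(result["text"])
```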
This model version is optimized to run with NVIDIA TensorRT-LLM.
This model is ready for commercial use.
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA (Whisper-Large-v3-turbo) Model Card.
GOVERNING TERMS: Use of the model is governed by the NVIDIA Community Model License (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). ADDITIONAL INFORMATION: MIT license.
Global
Developers or end users for speech transcription use cases.
03/07/2025
Whisper website
Whisper paper:
@misc{radford2022robust,
title={Robust Speech Recognition via Large-Scale Weak Supervision},
author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
year={2022},
eprint={2212.04356},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
Architecture Type: Transformer (Encoder-Decoder)
Network Architecture: Whisper
Input Type(s): Audio, Text-Prompt
Input Format(s): Linear PCM 16-bit 1 channel (Audio), String (Text Prompt)
Input Parameters: One-Dimensional (1D), One-Dimensional (1D)
Other Properties Related to Input: Audio duration: (0 to 30 sec), prompt tokens: (5 to 114 tokens)
Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
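To match the input format above, incoming audio should be mono, 16-bit linear PCM, and no longer than 30 seconds. The sketch below shows one way to perform that conversion; it assumes Whisper's standard 16 kHz sample rate, which is not specified in this card.

```python
# Sketch: convert an audio file to mono 16-bit linear PCM, <= 30 s.
# The 16 kHz target sample rate is an assumption (Whisper's standard rate).
import numpy as np
import soundfile as sf
import librosa

def prepare_clip(path: str, target_sr: int = 16_000, max_seconds: float = 30.0) -> np.ndarray:
    """Load an audio file, downmix to mono, resample, and trim to 30 s."""
    audio, sr = sf.read(path, dtype="float32", always_2d=True)
    mono = audio.mean(axis=1)                      # downmix to a single channel
    if sr != target_sr:
        mono = librosa.resample(mono, orig_sr=sr, target_sr=target_sr)
    mono = mono[: int(max_seconds * target_sr)]    # keep at most 30 seconds
    mono = np.clip(mono, -1.0, 1.0)
    return (mono * 32767).astype(np.int16)         # 16-bit linear PCM samples

if __name__ == "__main__":
    samples = prepare_clip("speech.wav")           # "speech.wav" is a placeholder path
    sf.write("speech_16k_mono.wav", samples, 16_000, subtype="PCM_16")
```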
## Software Integration:
Runtime Engine: NVIDIA TensorRT-LLM
Supported Hardware Microarchitecture Compatibility:
Supported Operating System(s): Linux
Large-v3-turbo: Whisper large-v3-turbo has the same architecture as the large-v3 model, except for one minor difference: the number of decoding layers is reduced from 32 to 4.
For more details on model usage, evaluation, training dataset and implications, please refer to Whisper large-v3-turbo Model Card.
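The reduced decoder depth can be verified from the model configuration; the check below is illustrative and assumes the `openai/whisper-large-v3-turbo` and `openai/whisper-large-v3` checkpoints on the Hugging Face Hub.

```python
# Illustrative check of the pruned decoder: large-v3-turbo keeps the full
# encoder but uses only 4 decoder layers, versus 32 in large-v3.
from transformers import AutoConfig

turbo = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
large = AutoConfig.from_pretrained("openai/whisper-large-v3")
print(turbo.encoder_layers, turbo.decoder_layers)  # expected: 32 4
print(large.encoder_layers, large.decoder_layers)  # expected: 32 32
```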
Data Collection Method by dataset: [Hybrid: Human, Automatic]
Labeling Method by dataset: [Automated]
Additional details on model evaluations can be found here.
Engine: TensorRT-LLM, Triton
Test Hardware:
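When the model is served through Triton Inference Server, a basic readiness check with the Triton Python client can look like the sketch below; the server URL and the model name "whisper" are placeholders for whatever your deployment actually uses.

```python
# Hedged sketch: verify a Triton Inference Server deployment is ready.
# The URL and model name are placeholders, not values defined by this card.
from tritonclient.http import InferenceServerClient

client = InferenceServerClient(url="localhost:8000")
if client.is_server_ready() and client.is_model_ready("whisper"):
    print("Triton is up and the ASR model is loaded.")
```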
Please review the Whisper-Large-v3-Turbo Model Card for more information regarding limitations. The publisher (OpenAI) has included cautions against certain uses under "Evaluated Use" and highlighted the model's limitations under "Performance and Limitations" and "Broader Implications."
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.