Multilingual Distil Whisper Large-v3

NVIDIA

Model Overview

Description:

This model transcribes short-form audio files and is designed to be compatible with OpenAI's sequential long-form transcription algorithm. Distil-Whisper was proposed in the paper Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. This is the third installment of the Distil-Whisper English series. It is the knowledge-distilled version of OpenAI's Whisper large-v3, which is one of the 5 Whisper configurations and has 1550M parameters. This model version is optimized to run with NVIDIA TensorRT-LLM. This model is ready for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see https://huggingface.co/distil-whisper/distil-large-v3

License/Terms of Use:

This model is governed by the NVIDIA RIVA License Agreement.

Disclaimer: AI models generate responses and outputs based on complex algorithms and machine learning techniques, and those responses or outputs may be inaccurate or offensive. By downloading a model, you assume the risk of any harm caused by any response or output of the model. By using this software or model, you are agreeing to the terms and conditions of the license, acceptable use policy, and Hugging Face privacy policy. Distil-Whisper is released under the MIT License.

References:

Distil-whisper paper:

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Whisper paper:

@misc{radford2022robust,
      title={Robust Speech Recognition via Large-Scale Weak Supervision}, 
      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
      year={2022},
      eprint={2212.04356},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Model Architecture:

Architecture Type: Transformer (Encoder-Decoder)
Network Architecture: Whisper

Input:

Input Type(s): Audio, Text-Prompt
Input Format(s): Linear PCM 16-bit 1 channel (Audio), String (Text Prompt)
Input Parameters: One-Dimensional (1D)
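
The audio input format above (linear PCM, 16-bit, 1 channel) can be checked before a file is sent for transcription. The following is a minimal sketch using only the Python standard library; the helper names is_pcm16_mono and write_silence are hypothetical, and the 16 kHz default sample rate is an assumption based on common Whisper usage, since this card does not state a sample rate.

```python
import wave


def is_pcm16_mono(path: str) -> bool:
    """Return True if the WAV file holds uncompressed 16-bit, single-channel PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getcomptype() == "NONE"   # uncompressed linear PCM
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16-bit
            and wav.getnchannels() == 1   # 1 channel = mono
        )


def write_silence(path: str, seconds: float = 1.0, sample_rate: int = 16000) -> None:
    """Write a silent 16-bit mono WAV file as a known-good test input.

    The 16000 Hz default is an assumption, not a value taken from this card.
    """
    n_frames = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
```

A file that fails this check (for example stereo or 24-bit audio) would need to be converted with an audio tool before use.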

Output:

Output Type(s): Text
Output Format: String
Output Parameters: 1D

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell

Supported Operating System(s):

Linux

Model Version(s):

Large-v3: knowledge-distilled version of OpenAI's Whisper large-v3.

Training Dataset:

Data Collection Method by dataset: Human
Labeling Method by dataset: Automatic
Properties (Quantity, Dataset Descriptions, Sensor(s)): 22,000 hours of audio data from nine open-source, permissively licensed speech datasets on the Hugging Face Hub, covering 50k speakers across 10 distinct domains.

Inference:

Engine: TensorRT-LLM, Triton
Test Hardware:

A100
H100

For more detail on model usage, evaluation, training dataset, and implications, please refer to the Whisper distil-large-v3 Model Card.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.

Publisher

NVIDIA

Latest Version: 3.0

Updated: October 1, 2024 UTC

Compressed Size: 2.82 GB