NGC Catalog

Kotoba Whisper
Publisher: NVIDIA
Latest Version: 2.2
Modified: March 19, 2025
Size: 2.82 GB

Model Overview

Description:

Kotoba-Whisper is a collection of distilled Whisper models for Japanese Automatic Speech Recognition (ASR). It transcribes short-form audio files and is designed to be compatible with OpenAI's sequential long-form transcription algorithm.

Kotoba-Whisper v2.2 uses OpenAI's Whisper large-v3 as the base model: it keeps the full encoder of large-v3 and pairs it with a two-layer decoder initialized from the first and last layers of the large-v3 decoder.

This model version is optimized to run with NVIDIA TensorRT-LLM.

This model is ready for commercial use.
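As a minimal sketch of short-form transcription with this model, assuming the Hugging Face transformers `pipeline` API and the `kotoba-tech/kotoba-whisper-v2.2` model id from the third-party model card (pipeline parameters are illustrative, not verified here):

```python
# Sketch: building an ASR pipeline for Kotoba-Whisper v2.2 via Hugging Face
# transformers. The model id and trust_remote_code flag are assumptions taken
# from the linked third-party model card.
from typing import Any


def build_transcriber(device: str = "cpu") -> Any:
    """Lazily construct an automatic-speech-recognition pipeline."""
    from transformers import pipeline  # heavyweight import kept local

    return pipeline(
        "automatic-speech-recognition",
        model="kotoba-tech/kotoba-whisper-v2.2",
        trust_remote_code=True,  # the v2.2 card ships a custom pipeline
        device=device,
    )


# Example usage (requires network access and the transformers package):
#   asr = build_transcriber()
#   print(asr("sample_ja.wav"))  # clip should be <= 30 s, 16-bit mono PCM
```

The model download and load happen only when `build_transcriber` is called, so the sketch can be imported without pulling the 2.82 GB checkpoint.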

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA [Kotoba-Whisper-v2.2 Model Card](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2).

License/Terms of Use:

GOVERNING TERMS: Use of the model is governed by the NVIDIA Community Model License (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). ADDITIONAL INFORMATION: Apache 2.0 license.

Deployment Geography:

Global

Use Case:

Developers or end users for speech transcription use cases.

Release Date:

03/07/2025

References:

Distil-whisper paper:

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Whisper paper:

@misc{radford2022robust,
      title={Robust Speech Recognition via Large-Scale Weak Supervision}, 
      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
      year={2022},
      eprint={2212.04356},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Model Architecture:

Architecture Type: Transformer (Encoder-Decoder)
Network Architecture: Whisper

Input:

Input Type(s): Audio, Text-Prompt
Input Format(s): Linear PCM 16-bit 1 channel (Audio), String (Text-Prompt)
Input Parameters: One-Dimensional (1D), One-Dimensional (1D)
Other Properties Related to Input: Audio duration: 0 to 30 sec; prompt tokens: 5 to 114 tokens

Output:

Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
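The input constraints above (16-bit linear PCM, mono, at most 30 seconds per clip) can be checked before sending audio to the model. A minimal sketch using only the Python standard library `wave` module:

```python
# Sketch: validating a WAV file against the documented input format
# (Linear PCM 16-bit, 1 channel, duration in (0, 30] seconds).
import wave

MAX_SECONDS = 30.0


def check_input_wav(path: str) -> float:
    """Return the clip duration in seconds, raising ValueError if the
    file violates the model's documented input constraints."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            raise ValueError("expected 1-channel (mono) audio")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit linear PCM samples")
        duration = wf.getnframes() / wf.getframerate()
    if not 0.0 < duration <= MAX_SECONDS:
        raise ValueError(f"duration {duration:.2f}s outside (0, {MAX_SECONDS}] s")
    return duration
```

Clips longer than 30 seconds would need to be chunked (or handed to a sequential long-form decoding loop) before transcription.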

Software Integration:

Runtime Engine:

  • Riva - 2.19.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell

Supported Operating System(s):

  • Linux

Model Version(s):

  • v2.2: fine-tuned and knowledge-distilled version of [OpenAI's Whisper large-v3](https://huggingface.co/openai/whisper-large-v3).

Training Dataset:

For more details on model usage, evaluation, training dataset, and implications, please refer to the Kotoba-Whisper Model Card.

Link: [here](https://github.com/kotoba-tech/kotoba-whisper)
Data Collection Method by dataset: Human
Labeling Method by dataset: Automatic
Properties: 10,000 hours of audio data from ReazonSpeech

Inference:

Engine: TensorRT-LLM, Triton

Test Hardware:

  • A100
  • H100
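When the model is served through the Riva runtime listed above, a client can request offline (batch) transcription over gRPC. A hedged sketch assuming a reachable Riva ASR deployment and the `nvidia-riva-client` Python package (server URI and recognition parameters are illustrative):

```python
# Sketch: offline transcription against a Riva server hosting this model.
# Assumes a running Riva ASR deployment; client calls are from the
# nvidia-riva-client package (pip install nvidia-riva-client).
def transcribe_offline(wav_path: str, uri: str = "localhost:50051") -> str:
    """Send a whole WAV file to a Riva server and return the transcript."""
    import riva.client  # imported lazily; requires nvidia-riva-client

    auth = riva.client.Auth(uri=uri)
    asr = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="ja-JP",
        max_alternatives=1,
        enable_automatic_punctuation=True,
    )
    with open(wav_path, "rb") as f:
        audio_bytes = f.read()
    response = asr.offline_recognize(audio_bytes, config)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```

Offline recognition sends the entire clip in one request, which matches the short-form (up to 30 s) input profile of this model; streaming recognition would be the alternative for live audio.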

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.