Many AI applications share common needs: classification, object detection, language translation, text-to-speech, recommender engines, sentiment analysis, and more. When developing applications with these capabilities, it is much faster to start with a pre-trained model and then fine-tune it for a specific use case. The NGC catalog offers pre-trained models for a variety of common AI tasks that are optimized for NVIDIA Tensor Core GPUs and can be easily re-trained by updating just a few layers, saving valuable time.
This collection contains two models: 1) a multi-speaker 44100Hz FastPitch model trained on approximately 20 hours of Latin American Spanish speech from 174 speakers, and 2) a HiFiGAN vocoder trained on mel spectrograms produced by the multi-speaker FastPitch model in (1).
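The two models connect through mel spectrograms: FastPitch predicts them from text, and HiFiGAN converts them to audio. A minimal numpy sketch of the mel filterbank that defines this intermediate representation (the 44100 Hz sample rate matches this collection; the 80-band, 1024-point FFT settings here are illustrative assumptions, not the models' exact configuration):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=44100, n_fft=1024, n_mels=80, fmin=0.0, fmax=None):
    """Triangular mel filters mapping one FFT magnitude frame to mel bands."""
    fmax = fmax or sr / 2
    # Filter edges are evenly spaced on the mel scale, then mapped to FFT bins
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):          # rising slope of the triangle
            fb[i, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):          # falling slope of the triangle
            fb[i, k] = (hi - k) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()
frame = np.abs(np.random.randn(513))  # one FFT magnitude frame
mel_frame = fb @ frame                # -> 80 mel-band energies
```

In a two-stage pipeline like this one, the vocoder must be trained (or fine-tuned) on spectrograms with the same filterbank settings the acoustic model produces, which is why this collection pairs HiFiGAN with spectrograms generated by its own FastPitch.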
For each word in the input text, the model: 1) predicts the punctuation mark, if any, that should follow the word (commas, periods, and question marks are supported), and 2) predicts whether the word should be capitalized.
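A hedged sketch of how such per-word predictions could be applied to raw lowercase ASR output (the label scheme and the `apply_punct_caps` helper are hypothetical illustrations, not the model's actual output format):

```python
def apply_punct_caps(words, punct_labels, cap_labels):
    """Rebuild readable text from per-word punctuation and capitalization labels.

    punct_labels: one of '', ',', '.', '?' per word (the mark following the word)
    cap_labels:   True where the word should be capitalized
    """
    out = []
    for word, punct, cap in zip(words, punct_labels, cap_labels):
        w = word.capitalize() if cap else word
        out.append(w + punct)
    return " ".join(out)

text = apply_punct_caps(
    ["how", "are", "you", "i", "am", "fine"],
    ["", "", "?", "", "", "."],
    [True, False, False, True, False, False],
)
# -> "How are you? I am fine."
```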
Conformer-CTC-Large model for Russian Automatic Speech Recognition, trained on Mozilla Common Voice 10.0 (Russian), Golos (Russian), Russian LibriSpeech (RuLS) and SOVA (RuAudiobooksDevices, RuDevices) datasets.
Conformer-Transducer-Large model for Russian Automatic Speech Recognition, trained on Mozilla Common Voice 10.0 (Russian), Golos (Russian), Russian LibriSpeech (RuLS) and SOVA (RuAudiobooksDevices, RuDevices) datasets.
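The CTC and Transducer variants above differ mainly in their decoders. A minimal sketch of CTC greedy decoding, the simplest inference mode for the CTC model (the toy label IDs and blank index are illustrative assumptions):

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Apply the standard CTC collapse rule: merge repeated per-frame labels,
    then drop blanks. A blank between two identical labels keeps both."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# Toy example with blank_id=0: the run of 3s collapses to one token,
# but the blank at position 3 separates two distinct 3s.
decoded = ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0])
# -> [3, 3, 5]
```

A Transducer model instead uses a prediction network conditioned on previously emitted tokens, which typically improves accuracy at some cost in inference complexity.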
This model performs joint intent classification and slot filling, directly from audio input. The model treats the problem as an audio-to-text problem, where the output text is the flattened string representation of the semantics annotation.
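A hedged sketch of turning one plausible flattened semantics string back into structured intent and slots (this bracketed serialization and the `parse_semantics` helper are assumptions for illustration; consult the model card for the actual format the model emits):

```python
import re

def parse_semantics(flat):
    """Parse an assumed 'intent: NAME [slot : value] ...' flattened string."""
    intent_match = re.match(r"intent:\s*(\S+)", flat)
    intent = intent_match.group(1) if intent_match else None
    # Each "[slot : value]" pair becomes one dictionary entry
    slots = {k.strip(): v.strip()
             for k, v in re.findall(r"\[([^\]:]+):([^\]]+)\]", flat)}
    return {"intent": intent, "slots": slots}

result = parse_semantics("intent: set_alarm [time : seven am] [date : tomorrow]")
# -> {'intent': 'set_alarm', 'slots': {'time': 'seven am', 'date': 'tomorrow'}}
```

Treating spoken language understanding as audio-to-text in this way lets a single sequence-to-sequence model emit both the intent and its slots without a separate NLU stage.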