nvCLIP4STR
Description
nvCLIP4STR is an optical character recognition network that recognizes characters in images. One pretrained nvCLIP4STR model is delivered; it is trained on seven datasets with alphanumeric labels. This model is ready for commercial use.
Publisher
NVIDIA
Latest Version
deployable_v1.0
Modified
August 5, 2025
Size
1.62 GB

nvCLIP4STR Model Card

nvCLIP4STR Overview

nvCLIP4STR is an optical character recognition network that recognizes characters in images. One pretrained nvCLIP4STR model is delivered; it is trained on seven datasets with alphanumeric labels. This model is ready for commercial use.

Deployment Geography:

Global

Use Case

This model can be used to recognize text in images in computer vision applications.

Release Date:

NGC [05/30/2025]

License

Use of this model is governed by the NVIDIA Community Models License

Model Architecture

Architecture Type: Convolutional Neural Network + Transformer Encoder-Decoder. Network Architecture: This model is based on CLIP4STR, an optical character recognition model. nvCLIP4STR has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch produces an initial prediction from the visual features, and the cross-modal branch refines this prediction by addressing discrepancies between the visual features and the text semantics.
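The two-branch flow described above can be sketched as a toy pipeline. Everything below is a hypothetical stub (the function names, the lookup-based "refinement", and the hard-coded strings are illustrations, not the real model); it only shows the order of operations: the visual branch decodes first, then the cross-modal branch refines.

```python
# Toy sketch of nvCLIP4STR's two-branch inference flow (hypothetical stubs,
# not the real model).

def visual_branch(image_features):
    """Stub: decode an initial character sequence from visual features."""
    # The real model runs a transformer decoder over image embeddings; here we
    # return a fixed guess containing a visually plausible error (0 vs O).
    return "HELL0"

def cross_modal_branch(image_features, initial_prediction):
    """Stub: refine the prediction using text semantics."""
    # The real cross-modal branch re-encodes the initial text and attends over
    # both modalities; a lookup table stands in for that refinement here.
    corrections = {"HELL0": "HELLO"}
    return corrections.get(initial_prediction, initial_prediction)

def recognize(image_features):
    first_pass = visual_branch(image_features)
    return cross_modal_branch(image_features, first_pass)

print(recognize(None))  # HELLO
```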

Input

  • Input Type: Image
  • Input Formats: Red, Green, Blue (RGB)
  • Input Parameters: Two-Dimensional (2D)
  • Other Properties Related to Input: 224x224 resolution required; no alpha channel
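The input contract above can be enforced with a small pre-flight check. The helper name `prepare_input` is an illustration (not part of any NVIDIA API): it drops an alpha channel if one is present and verifies the 224x224 RGB shape.

```python
import numpy as np

# Hypothetical pre-flight check for the documented input contract:
# a 2D RGB image at 224x224 resolution, with no alpha channel.

def prepare_input(image: np.ndarray) -> np.ndarray:
    """Drop an alpha channel if present and verify the 224x224 RGB shape."""
    if image.ndim != 3:
        raise ValueError("expected an HxWxC image array")
    if image.shape[2] == 4:          # RGBA input: strip the alpha channel
        image = image[:, :, :3]
    if image.shape != (224, 224, 3):
        raise ValueError(f"expected 224x224 RGB, got {image.shape}")
    return image

rgba = np.zeros((224, 224, 4), dtype=np.uint8)
print(prepare_input(rgba).shape)  # (224, 224, 3)
```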

Output

  • Output Type(s): Text
  • Output Format: String
  • Output Parameters: One-Dimensional (1D)
  • Other Properties Related to Output: Text strings that may include digits, uppercase and lowercase letters, and symbols.
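The exact character vocabulary is not published here; the sketch below assumes the printable-ASCII set that the output description implies (digits, uppercase and lowercase letters, symbols) and shows how predicted token indices would map to an output string. Both the charset ordering and the `decode` helper are assumptions for illustration.

```python
import string

# Assumed output vocabulary: digits, uppercase, lowercase, then symbols.
# The real model's charset and index ordering may differ.
CHARSET = (string.digits + string.ascii_uppercase
           + string.ascii_lowercase + string.punctuation)

def decode(token_ids):
    """Map predicted token indices to an output string (hypothetical mapping)."""
    return "".join(CHARSET[i] for i in token_ids)

# Digits occupy indices 0-9, so uppercase letters start at index 10.
print(decode([17, 14, 21, 21, 24]))  # HELLO
```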

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TAO 6.0.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Volta

Supported Operating System(s):

  • Linux

Model versions:

  • deployable_v1.0 - Models deployable with CLIP-based backbone.

Training and Evaluation Datasets

  • Total size: approximately 6M images
  • Total number of datasets: 7 training datasets, 3 evaluation datasets
  • Dataset partition: training and evaluation use disjoint datasets

Training Datasets

Link:

  • ICDAR15
  • MLT19
  • Uber-Text
  • COCO-Text v2.0
  • OpenVINO
  • TextOCR
  • Union14M-L

Data Collection Method by dataset:

  • Hybrid: Automated, Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties:

Dataset          Training images
ICDAR2015                  4,468
MLT19                     45,551
Uber-Text                 91,978
COCO-Text v2.0            59,820
OpenVINO               1,912,794
TextOCR                  714,770
Union14M-L             3,220,666
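As a quick sanity check, the per-dataset counts above can be summed to confirm the "6M images" total stated in the dataset overview:

```python
# Sum the per-dataset training image counts from the table above.
train_counts = {
    "ICDAR2015": 4468,
    "MLT19": 45551,
    "Uber-Text": 91978,
    "COCO-Text v2.0": 59820,
    "OpenVINO": 1912794,
    "TextOCR": 714770,
    "Union14M-L": 3220666,
}
total = sum(train_counts.values())
print(total)  # 6050047, i.e. roughly 6M images
```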

Evaluation Datasets

Link:

  • ICDAR13
  • ICDAR15
  • Uber-Text

Data Collection Method by dataset:

  • Hybrid: Automated, Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties:

Dataset      Evaluation images
ICDAR13                  1,015
ICDAR15                  2,077
Uber-Text               80,551

Performance

We evaluate the nvCLIP4STR model on three datasets: the Uber-Text validation set and the ICDAR13 and ICDAR15 scene-text benchmarks, and compare it against the previous OCRNet models in TAO.

Methodology and KPI

The key performance indicator is character recognition accuracy: a prediction is counted as correct only when every character in a text region is recognized correctly. The KPIs for the evaluation data are reported below.
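The KPI above is a word-level exact-match accuracy, which can be implemented in a few lines. This is a minimal sketch of the metric as described; whether the comparison is case-sensitive or normalizes punctuation depends on the benchmark protocol and is not specified here.

```python
# Word-level exact-match accuracy: a prediction counts as correct only if
# every character matches the ground-truth label.

def recognition_accuracy(predictions, labels):
    """Fraction of text regions whose predicted string exactly matches the label."""
    assert len(predictions) == len(labels)
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

preds  = ["STOP", "EXIT", "20l9"]   # last region misreads '1' as 'l'
truths = ["STOP", "EXIT", "2019"]
print(recognition_accuracy(preds, truths))  # ~0.667 (2 of 3 regions correct)
```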

Model                     Dataset                         Accuracy
nvCLIP4STR                Uber-Text                         89.34%
nvCLIP4STR                ICDAR13                           98.42%
nvCLIP4STR                ICDAR15                           90.08%
ocrnet_resnet50_unpruned  Uber-Text + TextOCR validation    77.1%
ocrnet_resnet50_unpruned  ICDAR13                           91.8%
ocrnet_resnet50_unpruned  ICDAR15                           78.6%
ocrnet_resnet50_unpruned  Internal PCB validation           74.1%
ocrnet_resnet50_pruned    ICDAR13                           92.6%
ocrnet-vit                Uber-Text + TextOCR validation    83.7%
ocrnet-vit                ICDAR13                           95.5%
ocrnet-vit                ICDAR15                           84.7%
ocrnet-vit-pcb            Internal PCB validation           84.2%

Inference

Acceleration Engine: TensorRT

Test Hardware:

  • DGX A100

How to use this model

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. This model can only be used with the NVIDIA-Optical-Character-Detection-and-Recognition-Solution (nvOCDR) library; you can deploy nvCLIP4STR and run C++ inference through that library.

Primary use case intended for this model is to recognize the characters from the detected text region.

There is one type of nvCLIP4STR provided:

  • deployable (unpruned)

The deployable model is in ONNX format and can be deployed with TensorRT and nvOCDR.

Instructions to use the model with nvOCDR

The nvOCDR library now supports nvCLIP4STR. For more information on C++ inference, refer to the nvCLIP4STR C++ Sample.

Using TAO Pre-trained Models

  • Get TAO Container
  • Get other purpose-built models from the NGC model registry:
    • TrafficCamNet
    • PeopleNet
    • PeopleNet-Transformer
    • DashCamNet
    • FaceDetectIR
    • VehicleMakeNet
    • VehicleTypeNet
    • PeopleSegNet
    • PeopleSemSegNet
    • License Plate Detection
    • License Plate Recognition
    • Gaze Estimation
    • Facial Landmark
    • Heart Rate Estimation
    • Gesture Recognition
    • Emotion Recognition
    • FaceDetect
    • 2D Body Pose Estimation
    • ActionRecognitionNet
    • PoseClassificationNet
    • People ReIdentification
    • PointPillarNet
    • CitySegFormer
    • Retail Object Detection
    • Retail Object Embedding
    • Optical Inspection
    • Optical Character Detection
    • Optical Character Recognition
    • PCB Classification
    • PeopleSemSegFormer

Technical blogs

  • Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
  • Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
  • Train like a ‘pro’ without being an AI expert using TAO AutoML
  • Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
  • Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
  • Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
  • Customize Action Recognition with TAO and deploy with DeepStream
  • Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
  • Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important, learn how you can achieve state of the art accuracy for classification and object detection models using TAO

Suggested reading

  • More information about TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
  • Read the TAO Getting Started guide and release notes.
  • If you have any questions or feedback, please refer to the discussions on TAO Toolkit Developer Forums
  • Deploy your model on the edge using DeepStream. Learn more about DeepStream SDK https://developer.nvidia.com/deepstream-sdk

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Reference

  • Zhao, Shuai, et al. "CLIP4STR: a simple baseline for scene text recognition with pre-trained vision-language model." IEEE Transactions on Image Processing (2024).
  • Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.