nvCLIP4STR is an optical character recognition (OCR) network that recognizes characters in images. One pretrained nvCLIP4STR model is delivered; it is trained on seven datasets with alphanumeric labels. This model is ready for commercial use.
Global
This model can be used to recognize text in images in computer vision applications.
NGC [05/30/2025]
Use of this model is governed by the NVIDIA Community Models License
Architecture Type: Convolutional Neural Network + Transformer Encoder-Decoder. Network Architecture: This model was developed based on CLIP4STR, an optical character recognition model. nvCLIP4STR has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual features, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual features and text semantics.
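The dual-branch flow described above can be illustrated with a minimal NumPy sketch. All dimensions, weights, and the additive fusion step are toy assumptions for illustration only; the real model uses learned transformer encoders/decoders and cross-attention rather than random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not the real model sizes).
SEQ_LEN, VOCAB, D = 8, 37, 16   # 37 ≈ 36 alphanumeric classes + 1 blank

def visual_branch(image_feats):
    """Predict character logits directly from visual features."""
    W = rng.normal(size=(D, VOCAB))
    return image_feats @ W            # (SEQ_LEN, VOCAB)

def cross_modal_branch(image_feats, text_ids):
    """Refine the initial prediction by combining visual features with
    embeddings of the initially decoded text (semantic context)."""
    embed = rng.normal(size=(VOCAB, D))
    text_feats = embed[text_ids]      # (SEQ_LEN, D)
    fused = image_feats + text_feats  # toy fusion; the real model uses attention
    W = rng.normal(size=(D, VOCAB))
    return fused @ W                  # refined logits

# Visual branch produces an initial prediction ...
image_feats = rng.normal(size=(SEQ_LEN, D))
logits_v = visual_branch(image_feats)
initial_pred = logits_v.argmax(axis=-1)

# ... which the cross-modal branch refines.
logits_c = cross_modal_branch(image_feats, initial_pred)
final_pred = logits_c.argmax(axis=-1)
```

The key point is the two-stage decoding: the cross-modal branch sees both the image features and the text hypothesis, so it can correct characters that are visually ambiguous but semantically implausible.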
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Link:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
| dataset | image numbers |
|---|---|
| ICDAR2015 | 4468 |
| MLT19 | 45551 |
| Uber-Text | 91978 |
| COCO-Text v2.0 | 59820 |
| OpenVINO | 1912794 |
| TextOCR | 714770 |
| Union14M-L | 3220666 |
Link:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
| dataset | eval image numbers |
|---|---|
| ICDAR13 | 1015 |
| ICDAR15 | 2077 |
| Uber-Text | 80551 |
We test the nvCLIP4STR model on three datasets: the Uber-Text validation dataset and the ICDAR13 and ICDAR15 scene text benchmark datasets. We also compare nvCLIP4STR against the previous OCRNet models in TAO.
The key performance indicator is the accuracy of character recognition: a prediction is counted as correct only if every character in a text region is recognized correctly. The KPIs for the evaluation data are reported below.
| model | dataset | accuracy |
|---|---|---|
| nvCLIP4STR | Uber-Text | 89.34% |
| nvCLIP4STR | ICDAR13 | 98.42% |
| nvCLIP4STR | ICDAR15 | 90.08% |
| ocrnet_resnet50_unpruned | Uber-Text + TextOCR validation | 77.1% |
| ocrnet_resnet50_unpruned | ICDAR13 | 91.8% |
| ocrnet_resnet50_unpruned | ICDAR15 | 78.6% |
| ocrnet_resnet50_unpruned | Internal PCB validation | 74.1% |
| ocrnet_resnet50_pruned | ICDAR13 | 92.6% |
| ocrnet-vit | Uber-Text + TextOCR validation | 83.7% |
| ocrnet-vit | ICDAR13 | 95.5% |
| ocrnet-vit | ICDAR15 | 84.7% |
| ocrnet-vit-pcb | Internal PCB validation | 84.2% |
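The word-level exact-match accuracy used above can be sketched as a small Python helper. Note this sketch compares strings verbatim; some benchmark protocols additionally normalize case or punctuation, which is an assumption not confirmed by this model card.

```python
def exact_match_accuracy(preds, labels):
    """Word-level accuracy: a prediction counts as correct only if
    every character in the text region matches the label exactly."""
    assert len(preds) == len(labels) and labels, "need paired, non-empty lists"
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

# Two of three crops fully correct ("N0" vs "NO" fails) -> 2/3
acc = exact_match_accuracy(["STOP", "EXIT", "N0"], ["STOP", "EXIT", "NO"])
```

A single wrong character makes the whole word incorrect, which is why this metric is stricter than per-character accuracy.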
Acceleration Engine: TensorRT
Test Hardware:
This model needs to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. This model can only be used with the NVIDIA-Optical-Character-Detection-and-Recognition-Solution (nvOCDR lib); you can deploy nvCLIP4STR and run C++ inference with this nvOCDR lib.
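Since TensorRT is the acceleration engine, the deployable ONNX model can be converted to a TensorRT engine with the standard `trtexec` tool before inference. The file names below are placeholders, and `--fp16` is an optional precision choice; consult the nvOCDR documentation for the exact input shapes your deployment expects.

```shell
# Build a TensorRT engine from the deployable ONNX model.
# File names are placeholders; adjust to your downloaded model.
trtexec --onnx=nvclip4str.onnx \
        --saveEngine=nvclip4str.engine \
        --fp16
```

The resulting `.engine` file is hardware-specific and should be rebuilt on the target GPU (including Jetson devices).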
Primary use case intended for this model is to recognize the characters from the detected text region.
There is one type of nvCLIP4STR provided: the deployable model, in ONNX format. The deployable model can be deployed with TensorRT and nvOCDR.
The nvOCDR lib now supports nvCLIP4STR. For more information on C++ inference, please refer to the nvCLIP4STR C++ Sample.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.