nvCLIP4STR
Description
nvCLIP4STR is an optical character recognition network that recognizes characters in images. One pretrained nvCLIP4STR model is delivered; it is trained on seven datasets with alphanumeric labels. This model is ready for commercial use.
Publisher
NVIDIA
Latest Version
deployable_v1.0
Modified
August 5, 2025
Size
1.62 GB

nvCLIP4STR Model Card

nvCLIP4STR Overview

nvCLIP4STR is an optical character recognition network that recognizes characters in images. One pretrained nvCLIP4STR model is delivered; it is trained on seven datasets with alphanumeric labels. This model is ready for commercial use.

Deployment Geography:

Global

Use Case

This model can be used to recognize text in images in computer vision applications.

Release Date:

NGC [05/30/2025]

License

Use of this model is governed by the NVIDIA Community Models License

Model Architecture

Architecture Type: Convolutional Neural Network + Transformer Encoder-Decoder. Network Architecture: This model is based on CLIP4STR, an optical character recognition model. nvCLIP4STR has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch produces an initial prediction from the visual features, and the cross-modal branch refines this prediction by addressing discrepancies between the visual features and the text semantics.
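The two-branch flow described above can be sketched as a toy pipeline. Everything below is a hypothetical stub (the function names, the lookup-based "refinement", and the hard-coded strings are illustrations, not the real model); it only shows the order of operations: the visual branch decodes first, then the cross-modal branch refines.

```python
# Toy sketch of nvCLIP4STR's two-branch inference flow (hypothetical stubs,
# not the real model).

def visual_branch(image_features):
    """Stub: decode an initial character sequence from visual features."""
    # The real model runs a transformer decoder over image embeddings; here we
    # return a fixed guess containing a visually plausible error (0 vs O).
    return "HELL0"

def cross_modal_branch(image_features, initial_prediction):
    """Stub: refine the prediction using text semantics."""
    # The real cross-modal branch re-encodes the initial text and attends over
    # both modalities; a lookup table stands in for that refinement here.
    corrections = {"HELL0": "HELLO"}
    return corrections.get(initial_prediction, initial_prediction)

def recognize(image_features):
    first_pass = visual_branch(image_features)
    return cross_modal_branch(image_features, first_pass)

print(recognize(None))  # HELLO
```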

Input

  • Input Type: Image
  • Input Formats: Red, Green, Blue (RGB)
  • Input Parameters: Two-Dimensional (2D)
  • Other Properties Related to Input: 224x224 resolution required; no alpha channel
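The input contract above can be enforced with a small pre-flight check. The helper name `prepare_input` is an illustration (not part of any NVIDIA API): it drops an alpha channel if one is present and verifies the 224x224 RGB shape.

```python
import numpy as np

# Hypothetical pre-flight check for the documented input contract:
# a 2D RGB image at 224x224 resolution, with no alpha channel.

def prepare_input(image: np.ndarray) -> np.ndarray:
    """Drop an alpha channel if present and verify the 224x224 RGB shape."""
    if image.ndim != 3:
        raise ValueError("expected an HxWxC image array")
    if image.shape[2] == 4:          # RGBA input: strip the alpha channel
        image = image[:, :, :3]
    if image.shape != (224, 224, 3):
        raise ValueError(f"expected 224x224 RGB, got {image.shape}")
    return image

rgba = np.zeros((224, 224, 4), dtype=np.uint8)
print(prepare_input(rgba).shape)  # (224, 224, 3)
```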

Output

  • Output Type(s): Text
  • Output Format: String
  • Output Parameters: One-Dimensional (1D)
  • Other Properties Related to Output: Text strings that may include digits, uppercase and lowercase letters, and symbols.
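The exact character vocabulary is not published here; the sketch below assumes the printable-ASCII set that the output description implies (digits, uppercase and lowercase letters, symbols) and shows how predicted token indices would map to an output string. Both the charset ordering and the `decode` helper are assumptions for illustration.

```python
import string

# Assumed output vocabulary: digits, uppercase, lowercase, then symbols.
# The real model's charset and index ordering may differ.
CHARSET = (string.digits + string.ascii_uppercase
           + string.ascii_lowercase + string.punctuation)

def decode(token_ids):
    """Map predicted token indices to an output string (hypothetical mapping)."""
    return "".join(CHARSET[i] for i in token_ids)

# Digits occupy indices 0-9, so uppercase letters start at index 10.
print(decode([17, 14, 21, 21, 24]))  # HELLO
```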

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TAO 6.0.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Volta

Supported Operating System(s):

  • Linux

Model versions:

  • deployable_v1.0 - Models deployable with CLIP-based backbone.

Training and Evaluation Datasets

  • Total size: approximately 6M images
  • Total number of datasets: 7 training datasets, 3 evaluation datasets
  • Dataset partition: training and evaluation use disjoint datasets

Training Datasets

Link:

  • ICDAR15
  • MLT19
  • Uber-Text
  • COCO-Text v2.0
  • OpenVINO
  • TextOCR
  • Union14M-L

Data Collection Method by dataset:

  • Hybrid: Automated, Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties:

Dataset          Training images
ICDAR2015                  4,468
MLT19                     45,551
Uber-Text                 91,978
COCO-Text v2.0            59,820
OpenVINO               1,912,794
TextOCR                  714,770
Union14M-L             3,220,666
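As a quick sanity check, the per-dataset counts above can be summed to confirm the "6M images" total stated in the dataset overview:

```python
# Sum the per-dataset training image counts from the table above.
train_counts = {
    "ICDAR2015": 4468,
    "MLT19": 45551,
    "Uber-Text": 91978,
    "COCO-Text v2.0": 59820,
    "OpenVINO": 1912794,
    "TextOCR": 714770,
    "Union14M-L": 3220666,
}
total = sum(train_counts.values())
print(total)  # 6050047, i.e. roughly 6M images
```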

Evaluation Datasets

Link:

  • ICDAR13
  • ICDAR15
  • Uber-Text

Data Collection Method by dataset:

  • Hybrid: Automated, Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties:

Dataset      Evaluation images
ICDAR13                  1,015
ICDAR15                  2,077
Uber-Text               80,551

Performance

We evaluate the nvCLIP4STR model on three datasets: the Uber-Text validation set and the ICDAR13 and ICDAR15 scene-text benchmarks, and compare it against the previous OCRNet models in TAO.

Methodology and KPI

The key performance indicator is character recognition accuracy: a prediction is counted as correct only when every character in a text region is recognized correctly. The KPIs for the evaluation data are reported below.
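The KPI above is a word-level exact-match accuracy, which can be implemented in a few lines. This is a minimal sketch of the metric as described; whether the comparison is case-sensitive or normalizes punctuation depends on the benchmark protocol and is not specified here.

```python
# Word-level exact-match accuracy: a prediction counts as correct only if
# every character matches the ground-truth label.

def recognition_accuracy(predictions, labels):
    """Fraction of text regions whose predicted string exactly matches the label."""
    assert len(predictions) == len(labels)
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

preds  = ["STOP", "EXIT", "20l9"]   # last region misreads '1' as 'l'
truths = ["STOP", "EXIT", "2019"]
print(recognition_accuracy(preds, truths))  # ~0.667 (2 of 3 regions correct)
```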

Model                     Dataset                         Accuracy
nvCLIP4STR                Uber-Text                         89.34%
nvCLIP4STR                ICDAR13                           98.42%
nvCLIP4STR                ICDAR15                           90.08%
ocrnet_resnet50_unpruned  Uber-Text + TextOCR validation    77.1%
ocrnet_resnet50_unpruned  ICDAR13                           91.8%
ocrnet_resnet50_unpruned  ICDAR15                           78.6%
ocrnet_resnet50_unpruned  Internal PCB validation           74.1%
ocrnet_resnet50_pruned    ICDAR13                           92.6%
ocrnet-vit                Uber-Text + TextOCR validation    83.7%
ocrnet-vit                ICDAR13                           95.5%
ocrnet-vit                ICDAR15                           84.7%
ocrnet-vit-pcb            Internal PCB validation           84.2%

Inference

Acceleration Engine: TensorRT

Test Hardware:

  • DGX A100

How to use this model

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. This model can only be used with the NVIDIA-Optical-Character-Detection-and-Recognition-Solution (nvOCDR) library; you can deploy nvCLIP4STR and run C++ inference through that library.

Primary use case intended for this model is to recognize the characters from the detected text region.

There is one type of nvCLIP4STR provided:

  • deployable (unpruned)

The deployable model is in ONNX format and can be deployed with TensorRT and nvOCDR.

Instructions to use the model with nvOCDR

The nvOCDR library now supports nvCLIP4STR. For more information on C++ inference, refer to the nvCLIP4STR C++ Sample.

Using TAO Pre-trained Models

  • Get TAO Container
  • Get other purpose-built models from the NGC model registry:
    • TrafficCamNet
    • PeopleNet
    • PeopleNet-Transformer
    • DashCamNet
    • FaceDetectIR
    • VehicleMakeNet
    • VehicleTypeNet
    • PeopleSegNet
    • PeopleSemSegNet
    • License Plate Detection
    • License Plate Recognition
    • Gaze Estimation
    • Facial Landmark
    • Heart Rate Estimation
    • Gesture Recognition
    • Emotion Recognition
    • FaceDetect
    • 2D Body Pose Estimation
    • ActionRecognitionNet
    • PoseClassificationNet
    • People ReIdentification
    • PointPillarNet
    • CitySegFormer
    • Retail Object Detection
    • Retail Object Embedding
    • Optical Inspection
    • Optical Character Detection
    • Optical Character Recognition
    • PCB Classification
    • PeopleSemSegFormer

Technical blogs

  • Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
  • Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
  • Train like a ‘pro’ without being an AI expert using TAO AutoML
  • Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
  • Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
  • Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
  • Customize Action Recognition with TAO and deploy with DeepStream
  • Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
  • Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important, learn how you can achieve state of the art accuracy for classification and object detection models using TAO

Suggested reading

  • More information about TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
  • Read the TAO Getting Started guide and release notes.
  • If you have any questions or feedback, please refer to the discussions on TAO Toolkit Developer Forums
  • Deploy your model on the edge using DeepStream. Learn more about DeepStream SDK https://developer.nvidia.com/deepstream-sdk

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Reference

  • Zhao, Shuai, et al. "CLIP4STR: a simple baseline for scene text recognition with pre-trained vision-language model." IEEE Transactions on Image Processing (2024).
  • Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.