Optical Character Detection

Network to detect characters in an image.
Latest Version
March 1, 2024
33.91 MB

OCDNet Model Card

Model Overview

The model described in this card is an optical character detection network that detects text in images. Trainable and deployable OCDNet models are provided; separate models are trained on the Uber-Text dataset and on the ICDAR2015 dataset.

Model Architecture

This model is based on DBNet, a network architecture for real-time scene text detection with differentiable binarization. DBNet addresses text localization and segmentation in natural images with complex backgrounds and varied text shapes.


The training algorithm inserts the binarization operation into the segmentation network and jointly optimizes it so that the network can learn to separate foreground and background pixels more effectively. The binarization threshold is learned by minimizing the IoU loss between the predicted binary map and the ground truth binary map.
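
As a rough illustration of the binarization step described above, the sketch below applies the approximate step function from the DBNet paper (Liao et al., 2020), B = 1 / (1 + exp(-k (P - T))), to a few pixels. The amplifying factor k = 50 is the value reported in that paper; the probability and threshold values are made up for illustration, and this is not the TAO Toolkit implementation.

```python
import math


def differentiable_binarization(p: float, t: float, k: float = 50.0) -> float:
    """Approximate binary value for one pixel: 1 / (1 + exp(-k * (p - t))).

    p is the predicted text probability, t is the predicted threshold,
    and k is the amplifying factor (50 in the DBNet paper).
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def hard_binarization(p: float, t: float) -> int:
    """Standard step function; not differentiable, so it cannot be trained through."""
    return 1 if p >= t else 0


# Hypothetical values for four pixels in one row of the probability map.
prob_row = [0.05, 0.28, 0.32, 0.90]
thresh_row = [0.30, 0.30, 0.30, 0.30]

soft = [differentiable_binarization(p, t) for p, t in zip(prob_row, thresh_row)]
hard = [hard_binarization(p, t) for p, t in zip(prob_row, thresh_row)]

for p, s, h in zip(prob_row, soft, hard):
    print(f"p={p:.2f}  soft={s:.4f}  hard={h}")
```

Pixels far from the threshold saturate toward 0 or 1, while pixels near it keep a usable gradient, which is what lets the threshold be optimized jointly with the segmentation network.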

Training Data

The trainable models were trained separately on the Uber-Text dataset and the ICDAR2015 dataset. The Uber-Text dataset contains street-level images collected from car-mounted sensors, with ground truths annotated by a team of image analysts. The train_4Kx4K, train_1Kx1K, val_4Kx4K, val_1Kx1K, and test_4Kx4K subsets were used for training and test_1Kx1K was used for validation, giving 107812 training images and 10157 validation images. The ICDAR2015 dataset contains 1000 training images and 500 test images. The deployable models are ONNX models exported from the trainable models.


Evaluation Data

The OCDNet model was evaluated using the Uber-Text test dataset and ICDAR2015 test dataset.

Methodology and KPI

The key performance indicator (KPI) is the hmean of detection, that is, the harmonic mean of detection precision and recall. The KPIs on the evaluation data are reported below.

Model                                Test dataset             hmean
ocdnet_deformable_resnet18           Uber-Text                81.1%
ocdnet_deformable_resnet50           Uber-Text                82.2%
ocdnet_fan_tiny_2x_ubertext.pth      Uber-Text                86.0%
ocdnet_fan_tiny_2x_icdar.pth         ICDAR2015                85.3%
ocdnet_fan_tiny_2x_icdar_pruned.pth  ICDAR2015                84.8%
ocdnet_vit_pcb.pth                   Internal PCB validation  69.3%
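
For readers unfamiliar with the metric, the following minimal sketch shows how an hmean value is derived from detection counts. The rule for matching a predicted region to a ground-truth region is not specified in this card, so the function simply takes the counts as inputs; the counts in the example are hypothetical and are not results for any model listed above.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also known as F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(true_positives: int, num_predictions: int, num_ground_truths: int):
    """Return (precision, recall, hmean) from raw detection counts."""
    precision = true_positives / num_predictions if num_predictions else 0.0
    recall = true_positives / num_ground_truths if num_ground_truths else 0.0
    return precision, recall, hmean(precision, recall)


# Hypothetical counts: 100 predicted regions, 90 annotated regions, 80 matches.
precision, recall, score = detection_scores(80, 100, 90)
print(f"precision={precision:.3f} recall={recall:.3f} hmean={score:.3f}")
```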

Real-time Inference Performance

Inference uses FP16 precision with an input shape of <batch>x3x640x640. Performance is measured on an OCDNet deployable model with trtexec on AGX Orin, Orin NX, Orin Nano, NVIDIA T4, NVIDIA L4, and NVIDIA A100 GPUs. The Jetson devices run in the Max-N configuration for maximum system performance. The figures are for inference only; end-to-end performance with streaming video data might vary slightly depending on the application's use case.

Model                            Device              Precision  Batch size  FPS
ocdnet_deformable_resnet18       Orin Nano           FP16       32          31
ocdnet_deformable_resnet18       Orin NX             FP16       32          46
ocdnet_deformable_resnet18       AGX Orin            FP16       32          122
ocdnet_deformable_resnet18       T4                  FP16       32          294
ocdnet_deformable_resnet18       L4                  FP16       32          432
ocdnet_deformable_resnet18       A100                FP16       32          1786
ocdnet_fan_tiny_2x_icdar         Orin Nano           FP16       1           0.57
ocdnet_fan_tiny_2x_icdar         AGX Orin            FP16       1           2.24
ocdnet_fan_tiny_2x_icdar         T4                  FP16       1           2.74
ocdnet_fan_tiny_2x_icdar         L4                  FP16       1           5.36
ocdnet_fan_tiny_2x_icdar         A30                 FP16       1           8.34
ocdnet_fan_tiny_2x_icdar         L40                 FP16       1           15.01
ocdnet_fan_tiny_2x_icdar         A100-sxm4-80gb      FP16       1           16.61
ocdnet_fan_tiny_2x_icdar         H100-sxm-80gb-hbm3  FP16       1           29.13
ocdnet_fan_tiny_2x_icdar_pruned  Orin Nano           FP16       2           0.79
ocdnet_fan_tiny_2x_icdar_pruned  Orin NX             FP16       2           1.18
ocdnet_fan_tiny_2x_icdar_pruned  AGX Orin            FP16       2           3.08
ocdnet_fan_tiny_2x_icdar_pruned  A2                  FP16       1           2.30
ocdnet_fan_tiny_2x_icdar_pruned  T4                  FP16       2           3.51
ocdnet_fan_tiny_2x_icdar_pruned  L4                  FP16       1           7.23
ocdnet_fan_tiny_2x_icdar_pruned  A30                 FP16       2           11.37
ocdnet_fan_tiny_2x_icdar_pruned  L40                 FP16       2           19.04
ocdnet_fan_tiny_2x_icdar_pruned  A100-sxm4-80gb      FP16       2           22.66
ocdnet_fan_tiny_2x_icdar_pruned  H100-sxm-80gb-hbm3  FP16       2           40.07
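
The exact trtexec invocation behind these numbers is not given in this card. As a hedged sketch, a comparable FP16 benchmark command for a given batch size could be assembled as follows. The flags --onnx, --fp16, and --shapes are standard trtexec options; the input tensor name "input" and the file name ocdnet.onnx are assumptions and should be checked against the actual exported model.

```python
import shlex


def build_trtexec_command(onnx_path: str, batch_size: int,
                          input_name: str = "input",  # assumed tensor name
                          height: int = 640, width: int = 640) -> list:
    """Assemble a trtexec argument list for an FP16 run at a fixed input shape."""
    shape = f"{input_name}:{batch_size}x3x{height}x{width}"
    return [
        "trtexec",
        f"--onnx={onnx_path}",  # deployable ONNX model (hypothetical path)
        "--fp16",               # build and run the engine in FP16 precision
        f"--shapes={shape}",    # <batch>x3x640x640, as described above
    ]


cmd = build_trtexec_command("ocdnet.onnx", batch_size=32)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run on a machine where TensorRT is installed; printing it first makes it easy to review the command before running it.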

How to Use This Model

This model must be used with NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, with TAO Toolkit, DeepStream SDK, or TensorRT.

The primary use case for this model is to detect text on images.

Two types of models are provided:

  • trainable
  • deployable

The trainable models are intended for training with the user's own dataset using TAO Toolkit. This can provide high-fidelity models that are adapted to the use case. A Jupyter notebook is available as a part of the TAO container and can be used to re-train.

The deployable models share the same structure as the trainable models but are in ONNX format. The deployable models can be deployed using TensorRT, nvOCDR, and DeepStream.


Input

Images of C x H x W (H and W should be multiples of 32).

Output

BBox or polygon coordinates for each detected text region in the input image.
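
The multiple-of-32 requirement on input height and width can be met by rounding each dimension up before padding or resizing the image. The helper below is a minimal sketch of that arithmetic; the function names are illustrative, and whether to pad or resize to the resulting shape is an application choice that this card does not prescribe.

```python
def round_up_to_multiple(value: int, multiple: int = 32) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return ((value + multiple - 1) // multiple) * multiple


def valid_input_shape(height: int, width: int, multiple: int = 32):
    """Return (H, W) rounded up so that both are multiples of `multiple`."""
    return round_up_to_multiple(height, multiple), round_up_to_multiple(width, multiple)


# A 720x1280 frame needs its height raised to 736; 1280 is already valid.
print(valid_input_shape(720, 1280))
# A 640x640 image already satisfies the constraint.
print(valid_input_shape(640, 640))
```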

Instructions to Use the Model with TAO

To use these models as pretrained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file to train an OCDNet model. For more information on the experiment spec file, refer to the TAO Toolkit User Guide.

To use the trainable_resnet18_v1.0 model:

  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_deformable_resnet18.pth'
  backbone: deformable_resnet18

To use the trainable_ocdnet_vit_v1.0 model:

  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_fan_tiny_2x_icdar.pth'
  backbone: fan_tiny_8_p4_hybrid
  enlarge_feature_map_size: True
  activation_checkpoint: True

Instructions to deploy the model with DeepStream

To create an end-to-end video analytics application, deploy this model with DeepStream SDK. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample app.

To deploy this model with DeepStream, follow these instructions.


Restricted Usage in Different Fields

The NVIDIA OCDNet trainable models are trained on the Uber-Text, ICDAR2015, and PCB text datasets, which cover a narrow range of scenes, mainly street-view images. To get better accuracy in a specific field, fine-tune the pre-trained model on additional data from that field with TAO Toolkit.

Model versions:

  • trainable_resnet18_v1.0 - Pre-trained model with deformable-resnet18 backbone, trained on the Uber-Text dataset.
  • trainable_resnet50_v1.0 - Pre-trained model with deformable-resnet50 backbone, trained on the Uber-Text dataset.
  • trainable_ocdnet_vit_v1.0 - Pre-trained model with fan-tiny backbone, trained on the ICDAR2015 dataset.
  • trainable_ocdnet_vit_v1.1 - Pre-trained model with fan-tiny backbone, trained on the Uber-Text dataset.
  • trainable_ocdnet_vit_v1.2 - Pre-trained model with fan-tiny backbone, trained on the PCB dataset.
  • trainable_ocdnet_vit_v1.3 - Pre-trained model with fan-tiny backbone, trained on the ImageNet2012 dataset.
  • trainable_ocdnet_vit_v1.4 - Pre-trained model with fan-tiny backbone, trained on the ICDAR2015 dataset and then pruned.
  • deployable_v1.0 - Deployable model with deformable-resnet backbone.
  • deployable_v2.0 - Deployable model with fan-tiny backbone, trained on ICDAR2015.
  • deployable_v2.1 - Deployable model with fan-tiny backbone, trained on Uber-Text.
  • deployable_v2.2 - Deployable model with fan-tiny backbone, trained on the PCB dataset.
  • deployable_v2.3 - Deployable model with fan-tiny backbone, trained on ICDAR2015 and then pruned.



References

  • Liao, M., Wan, Z., Yao, C., Chen, K., & Bai, X. (2020). Real-time scene text detection with differentiable binarization.
  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks.
  • He, W., Zhang, X., Yin, F., & Liu, C. (2017). Deep direct regression for multi-oriented scene text detection.
  • Zhang, Y., Gueguen, L., Zharkov, I., Zhang, P., Seifert, K., & Kadlec, B. (2017, July). Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop-CVPR (Vol. 2017, p. 5).
  • Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., & Alvarez, J. M. (2022, June). Understanding the robustness in vision transformers. In International Conference on Machine Learning (pp. 27378-27394). PMLR.
  • Kuo, C. W., Ashmore, J. D., Huggins, D., & Kira, Z. (2019, January). Data-efficient graph embedding learning for PCB component detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 551-560). IEEE.

License

The license to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical AI

The NVIDIA OCDNet model detects optical characters.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developers to ensure that it meets the requirements for the relevant industry and use case, that the necessary instructions and documentation are provided to understand error rates, confidence intervals, and results, and that the model is being used under the conditions and in the manner intended.