
Optical Character Recognition



Model to recognize characters in text regions detected by a preceding OCDNet model.



Latest Version

July 27, 2023 · 186.93 MB

OCRNet Model Card

Model Overview

The model described in this card is an optical character recognition network that recognizes characters from grayscale images. One pretrained OCRNet model is delivered, trained on the Uber-Text and TextOCR datasets with alphanumeric labels.

Model Architecture

This model is a sequence classification model with a ResNet50 backbone and a TPS (thin-plate spline) module. It takes a grayscale image as input and produces a sequence as output.


The training algorithm optimizes the network to minimize the connectionist temporal classification (CTC) loss between the ground-truth character sequence of a text image and the predicted character sequence. Characters are then decoded from the model's sequence output with the best-path (greedy) decoding method.
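Best-path decoding can be sketched as follows: take the argmax class at each time step, collapse consecutive repeats, then drop the blank symbol. The blank-at-index-0 convention and the `charset` mapping below are illustrative assumptions, not taken from the model card.

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, charset: str, blank: int = 0) -> str:
    """Best-path (greedy) CTC decoding. logits: (T, C) per-timestep scores."""
    best_path = logits.argmax(axis=1)          # (T,) class ids along the best path
    decoded = []
    prev = blank
    for cls in best_path:
        if cls != blank and cls != prev:       # collapse repeats, skip blanks
            decoded.append(charset[cls - 1])   # shift by 1 for the blank at index 0
        prev = cls
    return "".join(decoded)

# Illustrative alphanumeric character set (digits then lowercase letters).
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
T, C = 8, len(charset) + 1                     # +1 for the blank class
logits = np.zeros((T, C))
# Path: c c a a <blank> t t <blank>  ->  "cat"
for t, cls in enumerate([13, 13, 11, 11, 0, 30, 30, 0]):
    logits[t, cls] = 1.0
print(ctc_greedy_decode(logits, charset))      # cat
```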

Training Data

The OCRNet pretrained model was trained on the Uber-Text and TextOCR datasets. Uber-Text contains street-level images collected from car-mounted sensors, with ground truth annotated by a team of image analysts. We chose train_1Kx1K, train_4Kx4K, and val_4Kx4K as part of the training dataset and val_1Kx1K as part of the validation dataset. TextOCR consists of images with annotated text from the OpenImages dataset. After collecting the original data from Uber-Text and TextOCR, we removed all text images with a * label in Uber-Text and kept only alphanumeric text images with a maximum length of 25 in both datasets. The final dataset contains 805,007 text images for training and 24,388 images for validation.
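The filtering rule above (drop labels containing *, keep alphanumeric labels of at most 25 characters) can be sketched as below; the function and sample names are illustrative, not part of the TAO tooling.

```python
import re

# Keep only alphanumeric labels of length 1..25; reject anything with '*'
# or other non-alphanumeric characters.
ALNUM_RE = re.compile(r"^[0-9a-zA-Z]{1,25}$")

def keep_label(label: str) -> bool:
    return "*" not in label and bool(ALNUM_RE.match(label))

samples = [("0000.jpg", "abc"), ("0001.jpg", "no*good"), ("0002.jpg", "x" * 26)]
kept = [(img, lbl) for img, lbl in samples if keep_label(lbl)]
print(kept)  # [('0000.jpg', 'abc')]
```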

  • Characters distribution:

character  count
    0 66593
    1 78427
    2 57371
    3 41161
    4 35940
    5 38532
    6 29962
    7 32832
    8 25638
    9 24722
    a 266112
    b 58961
    c 113112
    d 109646
    e 338070
    f 63478
    g 67516
    h 104027
    i 213779
    j 10182
    k 36094
    l 144891
    m 86323
    n 202957
    o 224892
    p 74268
    q 5241
    r 203800
    s 186173
    t 221474
    u 87616
    v 35857
    w 43865
    x 12512
    y 52413
    z 9849
  • Character length distribution

character length  count
    1 94941
    2 120952
    3 146410
    4 146889
    5 82595
    6 67097
    7 55711
    8 37333
    9 23728
    10 14186
    11 7803
    12 3892
    13 1990
    14 708
    15 352
    16 157
    17 101
    18 62
    19 32
    20 18
    21 11
    22 14
    23 11
    24 9
    25 5

Data Format

The training data must be organized in the following format.


Each image contains a single line of text. gt_list.txt holds the ground truth for all images; each image and its corresponding text take one line:

0000.jpg abc
0001.jpg defg
0002.jpg zxv

There is also a characters_list.txt that contains all the characters found in the dataset, one character per line.
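A minimal sketch of deriving characters_list.txt from gt_list.txt, using the two-column "image label" format shown above. The file names follow the card; everything else (reading from an in-memory buffer, sorting) is illustrative.

```python
from io import StringIO

# Stand-in for open("gt_list.txt"); each line is "<image> <label>".
gt_list = StringIO("0000.jpg abc\n0001.jpg defg\n0002.jpg zxv\n")

chars = set()
for line in gt_list:
    _, label = line.split(maxsplit=1)   # split image name from label
    chars.update(label.strip())         # collect every character in the label

# One character per line, as characters_list.txt expects.
characters_list = "\n".join(sorted(chars))
print(characters_list)
```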


Evaluation Data

We test the OCRNet model on three different datasets: Uber-Text + TextOCR validation dataset, ICDAR13 and ICDAR15 scene text benchmark dataset.

Methodology and KPI

The key performance indicator is the accuracy of character recognition: a prediction counts as accurate only when all the characters in a text region are recognized correctly. The KPIs for the evaluation data are reported below.

model                      dataset                          accuracy
ocrnet_resnet50_unpruned   Uber-Text + TextOCR validation   77.1%
ocrnet_resnet50_unpruned   ICDAR13                          91.8%
ocrnet_resnet50_unpruned   ICDAR15                          78.6%
ocrnet_resnet50_pruned     ICDAR13                          92.6%
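A minimal sketch of this exact-match metric (illustrative code, not the TAO evaluation tooling): a sample is correct only if the predicted string equals the ground truth exactly.

```python
def exact_match_accuracy(preds, truths):
    """Fraction of samples whose predicted string matches the truth exactly."""
    assert len(preds) == len(truths)
    correct = sum(p == t for p, t in zip(preds, truths))
    return correct / len(truths)

# "cart" vs "card" fails the exact match even though 3 of 4 characters agree.
acc = exact_match_accuracy(["cat", "dog", "cart"], ["cat", "dog", "card"])
print(f"{acc:.1%}")  # 66.7%
```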

Real-time Inference Performance

Inference uses FP16 precision. The performance below was measured with trtexec on AGX Orin, Orin NX, Orin Nano, and an NVIDIA L4 GPU. The Jetson devices run in the Max-N configuration for maximum system performance. The numbers are inference-only performance; end-to-end performance with streaming video data might vary slightly depending on the application.

Model                    Device     Precision  Batch size  FPS
ocrnet_resnet50_pruned   Orin Nano  FP16       128         981
ocrnet_resnet50_pruned   Orin NX    FP16       128         1399
ocrnet_resnet50_pruned   AGX Orin   FP16       128         3921
ocrnet_resnet50_pruned   L4         FP16       512         6404
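The FPS column is throughput: batch size divided by per-batch latency. For example, with a hypothetical per-batch GPU time (the 32.6 ms below is illustrative, not from trtexec output):

```python
batch_size = 128
latency_ms = 32.6                          # illustrative per-batch latency
fps = batch_size / (latency_ms / 1000.0)   # images per second
print(round(fps))  # 3926
```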

How to use this model

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. The model can only be used with the TAO Toolkit (TAO), the DeepStream SDK, or TensorRT.

The primary use case for this model is recognizing the characters in detected text regions.

There are two types of models provided:

  • trainable (unpruned)
  • deployable (unpruned)

The trainable (unpruned) models are intended for training with the TAO Toolkit and the user's own dataset, which can yield high-fidelity models adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train.

The deployable models share the same structure as the trainable (unpruned) models but are in ONNX format. They can be deployed with TensorRT, nvOCDR, and DeepStream.

The trainable models are encrypted and can be decrypted with the following key:

  • Model load key: nvidia_tao

Please make sure to use this key for all TAO Toolkit commands that require a model load key.


Input

Grayscale images of shape 1 x 32 x 100 (C x H x W).


Output

Character ID sequence.
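As a rough illustration of preparing the 1 x 32 x 100 input, the sketch below resizes an arbitrary grayscale crop with nearest-neighbor sampling to keep the example dependency-free; a real pipeline would use a proper image library, and the [0, 1] scaling is an assumption, not taken from the model card.

```python
import numpy as np

def preprocess(gray: np.ndarray, out_h: int = 32, out_w: int = 100) -> np.ndarray:
    """Resize a (H, W) uint8 grayscale crop to the (1, 32, 100) network input."""
    h, w = gray.shape
    rows = np.arange(out_h) * h // out_h          # nearest-neighbor source rows
    cols = np.arange(out_w) * w // out_w          # nearest-neighbor source cols
    resized = gray[rows[:, None], cols]           # (out_h, out_w)
    scaled = resized.astype(np.float32) / 255.0   # assumed [0, 1] normalization
    return scaled[None, :, :]                     # add channel dim -> (1, H, W)

crop = np.random.randint(0, 256, (48, 160), dtype=np.uint8)
inp = preprocess(crop)
print(inp.shape)  # (1, 32, 100)
```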

Instructions to use the model with TAO

In order to use these models as pretrained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file when training an OCRNet model. For more information on the experiment spec file, please refer to the TAO Toolkit User Guide.

  TPS: True
  backbone: ResNet
  feature_channel: 512
  sequence: BiLSTM
  hidden_size: 256
  prediction: CTC
  quantize: False

Instructions to deploy the model with DeepStream

To create an end-to-end video analytics application, deploy this model with the DeepStream SDK. The DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample apps.

To deploy this model with DeepStream, please follow the instructions in this [documentation](@TODO: url to nvOCDR_ds.rst).


Restricted usage in different fields:

The NVIDIA OCRNet model is trained on Uber-Text and TextOCR. In Uber-Text, all the images are street-view images; in TextOCR, most images contain text in a variety of scenes. In general, to get better accuracy in a specific domain, more data is needed to fine-tune the pretrained model through the TAO Toolkit.

Model versions:

  • trainable_v1.0 - Pre-trained models for fine-tuning.
  • deployable_v1.0 - Models deployable with DeepStream.

References



  • Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., ... & Lee, H. (2019). What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4715-4723).
  • Zhang, Y., Gueguen, L., Zharkov, I., Zhang, P., Seifert, K., & Kadlec, B. (2017). Uber-Text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop, CVPR.
  • Singh, A., Pang, G., Toh, M., Huang, J., Galuba, W., & Hassner, T. (2021). TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8802-8812).
  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.



License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.


Ethical AI

NVIDIA OCRNet model recognizes optical characters.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.