
Optical Character Recognition

Description
Model that recognizes characters in text regions detected by a preceding OCDNet model.
Publisher
NVIDIA
Latest Version
deployable_v2.0
Modified
December 12, 2023
Size
44.39 MB

OCRNet Model Card

Model Overview

The model described in this card is an optical character recognition (OCR) network that recognizes characters from grayscale images. One pretrained OCRNet model is delivered; it is trained on the Uber-Text and TextOCR datasets with alphanumeric labels.

Model Architecture

This model is a sequence classification network with a ResNet50 backbone and a TPS (thin-plate spline) module. It takes a grayscale image as input and produces a character sequence as output.

Training

The training algorithm optimizes the network to minimize the connectionist temporal classification (CTC) loss between the ground-truth character sequence of a text image and the predicted character sequence. Characters are then decoded from the sequence output of the model through best-path decoding (greedy decoding): the most probable class is taken at each time step, after which repeated characters and blanks are collapsed.
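
The following is a minimal sketch of best-path decoding, not the TAO implementation; the assumption that class index 0 is the CTC blank and the character mapping used here are illustrative only.

import numpy as np

# Minimal sketch of best-path (greedy) CTC decoding. Assumes `logits` has shape
# (T, C) with class index 0 reserved for the CTC blank; `charset` maps the
# remaining indices to characters. Both assumptions are illustrative only.
def greedy_ctc_decode(logits: np.ndarray, charset: str) -> str:
    best = logits.argmax(axis=1)      # most probable class at each time step
    chars, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:  # collapse repeats, drop blanks
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)

# Example over the classes [blank, 'a', 'b']: this 4-step output decodes to "ab".
logits = np.array([[0.1, 0.8, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.9, 0.05, 0.05],
                   [0.1, 0.1, 0.8]])
print(greedy_ctc_decode(logits, "ab"))  # -> ab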

Training Data

The OCRNet pretrained model was trained on the Uber-Text and TextOCR datasets. Uber-Text contains street-level images collected from car-mounted sensors, with ground truth annotated by a team of image analysts. We chose train_1Kx1K, train_4Kx4K and val_4Kx4K as part of the training dataset and val_1Kx1K as part of the validation dataset. TextOCR consists of images with annotated text from the OpenImages dataset. After collecting the original data from Uber-Text and TextOCR, we removed all text images with the * label in Uber-Text and kept only alphanumeric text images with a maximum length of 25 in both datasets. The final dataset contains 805,007 text images for training and 24,388 images for validation.
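
The filtering rule is roughly the following; this is an illustrative sketch, not the actual preprocessing script used to build the dataset.

import string

ALPHANUMERIC = set(string.ascii_letters + string.digits)

# Illustrative sketch of the label filtering described above: drop '*' labels
# and keep only alphanumeric labels of at most 25 characters.
def keep_label(label: str, max_len: int = 25) -> bool:
    if label == "*" or len(label) > max_len:
        return False
    return all(ch in ALPHANUMERIC for ch in label)

print(keep_label("Main"))   # True
print(keep_label("*"))      # False (unreadable-text marker in Uber-Text)
print(keep_label("no.42"))  # False ('.' is not alphanumeric)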

  • Character distribution:

    character  count
    0 66593
    1 78427
    2 57371
    3 41161
    4 35940
    5 38532
    6 29962
    7 32832
    8 25638
    9 24722
    a 266112
    b 58961
    c 113112
    d 109646
    e 338070
    f 63478
    g 67516
    h 104027
    i 213779
    j 10182
    k 36094
    l 144891
    m 86323
    n 202957
    o 224892
    p 74268
    q 5241
    r 203800
    s 186173
    t 221474
    u 87616
    v 35857
    w 43865
    x 12512
    y 52413
    z 9849
  • Character length distribution:

    length (characters)  count
    1 94941
    2 120952
    3 146410
    4 146889
    5 82595
    6 67097
    7 55711
    8 37333
    9 23728
    10 14186
    11 7803
    12 3892
    13 1990
    14 708
    15 352
    16 157
    17 101
    18 62
    19 32
    20 18
    21 11
    22 14
    23 11
    24 9
    25 5

Data Format

The dataset must be organized in the following structure:

/Dataset_01
    /images
        0000.jpg
        0001.jpg
        0002.jpg
        ...
        ...
        ...
        N.jpg
/gt_list.txt
/characters_list.txt

Each image contains a single line of text. gt_list.txt contains the ground-truth text for all the images; each image and its corresponding text occupy one line, in the form:

0000.jpg abc
0001.jpg defg
0002.jpg zxv
...

characters_list.txt lists all the characters found in the dataset, one character per line.
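
A minimal sketch of reading this layout is shown below; the paths and variable names follow the example structure above and are not part of the TAO Toolkit API.

from pathlib import Path

# Illustrative sketch of loading the dataset layout described above.
dataset_root = Path("Dataset_01")

# characters_list.txt: one character per line.
characters = [line.rstrip("\n") for line in open("characters_list.txt") if line.strip()]

# gt_list.txt: "<image_name> <label>" per line.
samples = []
for line in open("gt_list.txt"):
    line = line.strip()
    if not line:
        continue
    name, label = line.split(maxsplit=1)
    samples.append((dataset_root / "images" / name, label))

print(f"{len(characters)} characters, {len(samples)} labeled images")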

Performance

Evaluation Data

We evaluate the OCRNet models on the Uber-Text + TextOCR validation set, the ICDAR13 and ICDAR15 scene-text benchmarks, and an internal PCB validation set.

Methodology and KPI

The key performance indicator is the accuracy of character recognition. A prediction counts as correct only when every character in the text region is recognized correctly (sequence-level exact match); a minimal sketch of this metric follows the table. The KPIs for the evaluation data are reported below.

| Model | Dataset | Accuracy |
|---|---|---|
| ocrnet_resnet50_unpruned | Uber-Text + TextOCR validation | 77.1% |
| ocrnet_resnet50_unpruned | ICDAR13 | 91.8% |
| ocrnet_resnet50_unpruned | ICDAR15 | 78.6% |
| ocrnet_resnet50_unpruned | Internal PCB validation | 74.1% |
| ocrnet_resnet50_pruned | ICDAR13 | 92.6% |
| ocrnet-vit | Uber-Text + TextOCR validation | 83.7% |
| ocrnet-vit | ICDAR13 | 95.5% |
| ocrnet-vit | ICDAR15 | 84.7% |
| ocrnet-vit-pcb | Internal PCB validation | 84.2% |
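
A minimal sketch of the sequence-level exact-match metric; the predictions and ground truths below are made up for illustration.

# Sequence-level exact-match accuracy: a region counts as correct only if the
# whole predicted string matches the ground truth.
def exact_match_accuracy(predictions, ground_truths):
    assert len(predictions) == len(ground_truths)
    correct = sum(pred == gt for pred, gt in zip(predictions, ground_truths))
    return correct / len(ground_truths)

preds = ["stop", "ex1t", "main"]
gts   = ["stop", "exit", "main"]
print(exact_match_accuracy(preds, gts))  # 0.666... -> two of three regions fully correct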

Real-time Inference Performance

Inference performance was measured with trtexec on Jetson AGX Orin, Orin NX, Orin Nano, and NVIDIA data-center GPUs, at the precisions and batch sizes listed below. The Jetson devices run in the Max-N configuration for maximum system performance. The numbers are inference-only; end-to-end performance with streaming video data may vary slightly depending on the application.

| Model | Device | Precision | Batch size | FPS |
|---|---|---|---|---|
| ocrnet_resnet50_pruned | Orin Nano | FP16 | 128 | 981 |
| ocrnet_resnet50_pruned | Orin NX | FP16 | 128 | 1399 |
| ocrnet_resnet50_pruned | AGX Orin | FP16 | 128 | 3921 |
| ocrnet_resnet50_pruned | L4 | FP16 | 512 | 6404 |
| ocrnet-vit | Orin Nano | FP16+INT8 | 4 | 100 |
| ocrnet-vit | Orin NX | FP16+INT8 | 4 | 147 |
| ocrnet-vit | AGX Orin | FP16+INT8 | 8 | 393 |
| ocrnet-vit | T4 | FP32 | 16 | 680 |
| ocrnet-vit | A2 | FP32 | 32 | 428 |
| ocrnet-vit | A30 | FP32 | 32 | 1645 |
| ocrnet-vit | L4 | FP32 | 32 | 1471 |
| ocrnet-vit | L40 | FP32 | 32 | 4320 |
| ocrnet-vit | A100 | FP32 | 32 | 2513 |
| ocrnet-vit | H100 | FP32 | 32 | 3846 |

How to use this model

This model must be used with NVIDIA hardware and software. The model can run on any NVIDIA GPU, including NVIDIA Jetson devices. It can only be used with the TAO Toolkit (TAO), the DeepStream SDK, or TensorRT.

The primary use case for this model is recognizing characters within detected text regions.

There are two types of models provided:

  • trainable (unpruned)
  • deployable (unpruned)

The trainable (unpruned) models are intended for training with the TAO Toolkit on the user's own dataset. This can yield high-fidelity models adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train.

The deployable models share the same structure as the trainable (unpruned) models but are provided in ONNX format. They can be deployed with TensorRT, nvOCDR, and DeepStream.

The trainable models are encrypted and can be decrypted with the following key:

  • Model load key: nvidia_tao

Please make sure to use this as the key for all TAO commands that require a model load key.

Input

  • Grayscale images of 1 x 32 x 100 (C x H x W) for trainable_v1.0/deployable_v1.0
  • Grayscale images of 1 x 64 x 200 (C x H x W) for trainable_v2.0/trainable_v2.1/deployable_v2.0/deployable_v2.1

Output

A sequence of character IDs.
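
As an illustration, the following sketch prepares a cropped text image for the v2.x deployable (ONNX) model and runs it with ONNX Runtime. The file names, the resize-and-scale preprocessing, and the assumption that the model exposes a single image input are not taken from this card; check the ONNX graph and the nvOCDR/TAO documentation for the exact preprocessing.

import numpy as np
import onnxruntime as ort
from PIL import Image

# Illustrative preprocessing for the 1 x 64 x 200 (C x H x W) input of the
# v2.x models; the 1/255 scaling is an assumption, not taken from this card.
def preprocess(path: str, width: int = 200, height: int = 64) -> np.ndarray:
    img = Image.open(path).convert("L").resize((width, height))  # grayscale, W x H
    x = np.asarray(img, dtype=np.float32) / 255.0
    return x[None, None, :, :]                                   # N x C x H x W

# The file names are placeholders; the input name is read from the graph
# rather than guessed.
session = ort.InferenceSession("ocrnet_deployable_v2.0.onnx")
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: preprocess("crop_0000.jpg")})
print([o.shape for o in outputs])  # character-ID sequence (decode with characters_list.txt)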

Instructions to use the model with TAO

To use these models as pretrained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file when training an OCRNet model. For more information on the experiment spec file, refer to the TAO Toolkit User Guide.

To use the trainable_v1.0 model:

model:
  TPS: True
  backbone: ResNet
  feature_channel: 512
  sequence: BiLSTM
  hidden_size: 256
  prediction: CTC
  quantize: False
  input_width: 100                                                                  
  input_height: 32                                                                  
  input_channel: 1

To use the trainable_v2.0/trainable_v2.1 models:

model:
  TPS: True
  backbone: FAN_tiny_2x
  sequence: BiLSTM
  hidden_size: 256
  prediction: Attn
  quantize: False
  input_width: 200                                                                  
  input_height: 64                                                                  
  input_channel: 1

Instructions to deploy the model with DeepStream

To create an end-to-end video analytics application, deploy this model with the DeepStream SDK. The DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample app.

To deploy this model with DeepStream, follow the instructions in Deploying nvOCDR to DeepStream.

Limitations

Restricted usage in different fields:

The NVIDIA OCRNet models are trained on the Uber-Text, TextOCR, and PCB text datasets. The Uber-Text images are street-level views, while most TextOCR images contain text in a variety of scenes. In general, to achieve better accuracy in a specific domain, more data is needed to fine-tune the pretrained model through the TAO Toolkit.

Model versions:

  • trainable_v1.0 - Pre-trained model with ResNet backbone on scene text.
  • deployable_v1.0 - Model deployable with ResNet backbone.
  • trainable_v2.0 - Pre-trained model with FAN backbone on scene text.
  • deployable_v2.0 - Model deployable with FAN backbone on scene text.
  • trainable_v2.1 - Pre-trained model with FAN backbone on PCB text.
  • deployable_v2.1 - Model deployable with FAN backbone on PCB text.

Reference

Citations

  • Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., ... & Lee, H. (2019). What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4715-4723).
  • Zhang, Y., Gueguen, L., Zharkov, I., Zhang, P., Seifert, K., & Kadlec, B. (2017, July). Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop-CVPR (Vol. 2017, p. 5).
  • Singh, A., Pang, G., Toh, M., Huang, J., Galuba, W., & Hassner, T. (2021). Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8802-8812)
  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376).
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
  • Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., & Alvarez, J. M. (2022, June). Understanding the robustness in vision transformers. In International Conference on Machine Learning (pp. 27378-27394). PMLR.
  • Kuo, C. W., Ashmore, J. D., Huggins, D., & Kira, Z. (2019, January). Data-efficient graph embedding learning for PCB component detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 551-560). IEEE.

License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical AI

The NVIDIA OCRNet model recognizes optical characters.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.