The model described in this card is an optical character recognition (OCR) network that recognizes characters from grayscale images. One pretrained OCRNet model is delivered; it was trained on the Uber-Text and TextOCR datasets with alphanumeric labels.
This model is a sequence classification model with a ResNet50 backbone and a TPS (thin-plate spline) module. It takes a grayscale image as network input and produces a sequence output.
The training algorithm optimizes the network to minimize the connectionist temporal classification (CTC) loss between the ground-truth character sequence of a text image and the predicted character sequence. Characters are then decoded from the model's sequence output with the best-path decoding method (greedy decoding).
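To make best-path decoding concrete: take the argmax class at each timestep, collapse consecutive repeats, and drop CTC blanks. The sketch below illustrates this under assumed conventions (blank at index 0, `charset` covering all classes); it is not the TAO implementation.

```python
import numpy as np

def ctc_greedy_decode(logits, charset, blank=0):
    """Best-path (greedy) CTC decoding.

    logits:  (T, C) per-timestep class scores.
    charset: list mapping class index -> character, with a
             placeholder at the blank index (assumed to be 0).
    """
    best_path = logits.argmax(axis=1)       # argmax at each timestep
    decoded, prev = [], blank
    for idx in best_path:
        if idx != blank and idx != prev:    # collapse repeats, drop blanks
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)
```

For the 36-character alphanumeric set used here, C would be 37 (36 characters plus the blank).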
The OCRNet pretrained model was trained on the Uber-Text and TextOCR datasets. Uber-Text contains street-level images collected from car-mounted sensors, with ground truths annotated by a team of image analysts. We chose train_1Kx1K, train_4Kx4K, and val_4Kx4K as part of the training dataset and val_1Kx1K as part of the validation dataset. TextOCR consists of images with annotated text from the OpenImages dataset. After collecting the original data from Uber-Text and TextOCR, we removed all text images labeled `*` in Uber-Text and kept only alphanumeric text images with a maximum length of 25 in both datasets. The final dataset contains 805,007 text images for training and 24,388 images for validation.
Character distribution:
Character | Count |
---|---|
0 | 66593 |
1 | 78427 |
2 | 57371 |
3 | 41161 |
4 | 35940 |
5 | 38532 |
6 | 29962 |
7 | 32832 |
8 | 25638 |
9 | 24722 |
a | 266112 |
b | 58961 |
c | 113112 |
d | 109646 |
e | 338070 |
f | 63478 |
g | 67516 |
h | 104027 |
i | 213779 |
j | 10182 |
k | 36094 |
l | 144891 |
m | 86323 |
n | 202957 |
o | 224892 |
p | 74268 |
q | 5241 |
r | 203800 |
s | 186173 |
t | 221474 |
u | 87616 |
v | 35857 |
w | 43865 |
x | 12512 |
y | 52413 |
z | 9849 |
Character length distribution:
Character length | Count |
---|---|
1 | 94941 |
2 | 120952 |
3 | 146410 |
4 | 146889 |
5 | 82595 |
6 | 67097 |
7 | 55711 |
8 | 37333 |
9 | 23728 |
10 | 14186 |
11 | 7803 |
12 | 3892 |
13 | 1990 |
14 | 708 |
15 | 352 |
16 | 157 |
17 | 101 |
18 | 62 |
19 | 32 |
20 | 18 |
21 | 11 |
22 | 14 |
23 | 11 |
24 | 9 |
25 | 5 |
The data must be organized in the following structure:
/Dataset_01
    /images
        0000.jpg
        0001.jpg
        0002.jpg
        ...
        N.jpg
    /gt_list.txt
    /characters_list.txt
Each image contains a single line of text. `gt_list.txt` contains the ground-truth text for all the images; each image and its corresponding text take one line, as in:
0000.jpg abc
0001.jpg defg
0002.jpg zxv
...
There is also a `characters_list.txt`, which contains all the characters found in the dataset, one character per line.
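Since `characters_list.txt` must enumerate exactly the characters present in the labels, it can be generated from `gt_list.txt`. A minimal sketch under the format above (the alphanumeric/length filter mirrors the training-data preparation described earlier):

```python
# Derive characters_list.txt from gt_list.txt ("<image> <label>" per line).
labels = {}
with open("gt_list.txt", encoding="utf-8") as f:
    for line in f:
        name, _, text = line.strip().partition(" ")
        # keep only alphanumeric labels of length <= 25, as in this card
        if text and text.isalnum() and len(text) <= 25:
            labels[name] = text

chars = sorted(set("".join(labels.values())))
with open("characters_list.txt", "w", encoding="utf-8") as f:
    f.writelines(c + "\n" for c in chars)
```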
We test the OCRNet model on three different datasets: the Uber-Text + TextOCR validation dataset, and the ICDAR13 and ICDAR15 scene text benchmark datasets.
The key performance indicator is the accuracy of character recognition: a text area counts as accurately recognized only if all the characters in it are recognized correctly. The KPIs for the evaluation data are reported below, with a short sketch of the metric after the table.
Model | Dataset | Accuracy |
---|---|---|
ocrnet_resnet50_unpruned | Uber-Text + TextOCR validation | 77.1% |
ocrnet_resnet50_unpruned | ICDAR13 | 91.8% |
ocrnet_resnet50_unpruned | ICDAR15 | 78.6% |
ocrnet_resnet50_pruned | ICDAR13 | 92.6% |
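For clarity, the metric is an exact full-string match per text region; a minimal sketch (treating the comparison as case-insensitive is an assumption, consistent with the lowercase character set above):

```python
def recognition_accuracy(predictions, ground_truths):
    """Fraction of text regions whose predicted string matches exactly."""
    correct = sum(p.lower() == g.lower()  # case-insensitive: an assumption
                  for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```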
Inference uses FP16 precision. The inference performance was measured with trtexec on AGX Orin, Orin NX, Orin Nano, and the NVIDIA L4 GPU; an example trtexec invocation follows the table. The Jetson devices run at the Max-N configuration for maximum system performance. The numbers are inference-only performance; end-to-end performance with streaming video data may vary slightly depending on the application use case.
Model | Device | Precision | Batch size | FPS |
---|---|---|---|---|
ocrnet_resnet50_pruned | Orin Nano | FP16 | 128 | 981 |
ocrnet_resnet50_pruned | Orin NX | FP16 | 128 | 1399 |
ocrnet_resnet50_pruned | AGX Orin | FP16 | 128 | 3921 |
ocrnet_resnet50_pruned | L4 | FP16 | 512 | 6404 |
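A representative trtexec invocation for this kind of measurement might look as follows. The ONNX file name and the input tensor name `input` are assumptions; check the actual names in the deployable model, and note that `--shapes` applies only if the batch dimension is dynamic:

```sh
trtexec --onnx=ocrnet_resnet50.onnx \
        --fp16 \
        --shapes=input:128x1x32x100 \
        --avgRuns=100
```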
This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. This model can only be used with TAO Toolkit (TAO), the DeepStream SDK, or TensorRT.
The primary intended use case for this model is recognizing characters from detected text regions.
There are two types of models provided:
- The `trainable` (or `unpruned`) models, intended for training with TAO Toolkit and the user's own dataset. This can provide high-fidelity models that are adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train.
- The `deployable` models, which share the same structure as the `trainable`/`unpruned` models but are in `onnx` format. The `deployable` models can be deployed in TensorRT, nvOCDR, and DeepStream.
The `trainable` models are encrypted and can be decrypted with the following key: `nvidia_tao`. Please make sure to use this as the key for all TLT commands that require a model load key.
Input: grayscale images of 1 x 32 x 100 (C x H x W).
Output: character ID sequence.
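For illustration, preparing a single image to match this input shape amounts to grayscale conversion, resizing to 100x32, and adding batch and channel dimensions. A sketch using OpenCV (the normalization scheme is an assumption; match it to the pipeline you deploy with):

```python
import cv2
import numpy as np

def preprocess(image_path):
    """Prepare one image as a 1 x 1 x 32 x 100 (N C H W) float tensor."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # H x W, uint8
    img = cv2.resize(img, (100, 32))                    # (width, height)
    img = img.astype(np.float32) / 127.5 - 1.0          # scale to [-1, 1]: assumed
    return img[np.newaxis, np.newaxis, :, :]            # add N and C dims
```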
To use these models as pretrained weights for transfer learning, use the snippet below as a template for the `model` component of the experiment spec file used to train an OCRNet model. For more information on the experiment spec file, please refer to the TAO Toolkit User Guide.
model:
  TPS: True
  backbone: ResNet
  feature_channel: 512
  sequence: BiLSTM
  hidden_size: 256
  prediction: CTC
  quantize: False
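Training is then launched by pointing TAO at a spec file containing this `model` section. The command form and paths below are illustrative assumptions; consult the TAO Toolkit User Guide for the authoritative syntax:

```sh
tao ocrnet train -e /workspace/specs/experiment.yaml \
                 -r /workspace/results \
                 -k nvidia_tao
```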
To create an end-to-end video analytics application, deploy this model with the DeepStream SDK. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample apps.
To deploy this model with DeepStream, please follow the instructions in this [documentation](@TODO: url to nvOCDR_ds.rst).
The NVIDIA OCRNet model is trained on Uber-Text and TextOCR. In Uber-Text, all the images are street-view images, while in TextOCR most images contain text in a variety of scenes. In general, to get better accuracy in a specific domain, more data is needed to fine-tune the pretrained model through TAO Toolkit.
The license to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.
The NVIDIA OCRNet model recognizes optical characters.
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.