
Optical Character Recognition



Model to recognize characters in text regions detected by a preceding OCDNet model.



Latest Version

July 27, 2023 · 186.93 MB

OCRNet Model Card

Model Overview

The model described in this card is an optical character recognition network that recognizes characters from grayscale images. One pretrained OCRNet model is delivered, trained on the Uber-Text and TextOCR datasets with alphanumeric labels.

Model Architecture

This model is a sequence classification model with a ResNet50 backbone and a TPS (thin-plate spline) module. It takes a grayscale image as input and produces a sequence as output.


The training algorithm optimizes the network to minimize the connectionist temporal classification (CTC) loss between the ground-truth character sequence of a text image and the predicted character sequence. Characters are then decoded from the model's sequence output with the best-path (greedy) decoding method.
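Best-path decoding can be sketched as follows: take the argmax class at each time step, collapse consecutive repeats, then drop the blank symbol. The blank-at-index-0 convention and the `charset` mapping below are illustrative assumptions, not taken from the model card.

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, charset: str, blank: int = 0) -> str:
    """Best-path (greedy) CTC decoding. logits: (T, C) per-timestep scores."""
    best_path = logits.argmax(axis=1)          # (T,) class ids along the best path
    decoded = []
    prev = blank
    for cls in best_path:
        if cls != blank and cls != prev:       # collapse repeats, skip blanks
            decoded.append(charset[cls - 1])   # shift by 1 for the blank at index 0
        prev = cls
    return "".join(decoded)

# Illustrative alphanumeric character set (digits then lowercase letters).
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
T, C = 8, len(charset) + 1                     # +1 for the blank class
logits = np.zeros((T, C))
# Path: c c a a <blank> t t <blank>  ->  "cat"
for t, cls in enumerate([13, 13, 11, 11, 0, 30, 30, 0]):
    logits[t, cls] = 1.0
print(ctc_greedy_decode(logits, charset))      # cat
```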

Training Data

The OCRNet pretrained model was trained on the Uber-Text and TextOCR datasets. Uber-Text contains street-level images collected from car-mounted sensors, with ground truth annotated by a team of image analysts. We chose train_1Kx1K, train_4Kx4K, and val_4Kx4K as part of the training dataset and val_1Kx1K as part of the validation dataset. TextOCR consists of images with annotated text from the OpenImages dataset. After collecting the original data from Uber-Text and TextOCR, we removed all text images with a * label in Uber-Text and kept only alphanumeric text images with a maximum length of 25 in both datasets. The final dataset contains 805,007 text images for training and 24,388 images for validation.
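The filtering rule above (drop labels containing *, keep alphanumeric labels of at most 25 characters) can be sketched as below; the function and sample names are illustrative, not part of the TAO tooling.

```python
import re

# Keep only alphanumeric labels of length 1..25; reject anything with '*'
# or other non-alphanumeric characters.
ALNUM_RE = re.compile(r"^[0-9a-zA-Z]{1,25}$")

def keep_label(label: str) -> bool:
    return "*" not in label and bool(ALNUM_RE.match(label))

samples = [("0000.jpg", "abc"), ("0001.jpg", "no*good"), ("0002.jpg", "x" * 26)]
kept = [(img, lbl) for img, lbl in samples if keep_label(lbl)]
print(kept)  # [('0000.jpg', 'abc')]
```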

  • Characters distribution:

character  count
    0 66593
    1 78427
    2 57371
    3 41161
    4 35940
    5 38532
    6 29962
    7 32832
    8 25638
    9 24722
    a 266112
    b 58961
    c 113112
    d 109646
    e 338070
    f 63478
    g 67516
    h 104027
    i 213779
    j 10182
    k 36094
    l 144891
    m 86323
    n 202957
    o 224892
    p 74268
    q 5241
    r 203800
    s 186173
    t 221474
    u 87616
    v 35857
    w 43865
    x 12512
    y 52413
    z 9849
  • Character length distribution

character length  count
    1 94941
    2 120952
    3 146410
    4 146889
    5 82595
    6 67097
    7 55711
    8 37333
    9 23728
    10 14186
    11 7803
    12 3892
    13 1990
    14 708
    15 352
    16 157
    17 101
    18 62
    19 32
    20 18
    21 11
    22 14
    23 11
    24 9
    25 5

Data Format

The training data must be organized in the following format.


Each image contains a single line of text. gt_list.txt holds the ground truth for all images; each image and its corresponding text take one line:

0000.jpg abc
0001.jpg defg
0002.jpg zxv

There is also a characters_list.txt that contains all the characters found in the dataset, one character per line.
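A minimal sketch of deriving characters_list.txt from gt_list.txt, using the two-column "image label" format shown above. The file names follow the card; everything else (reading from an in-memory buffer, sorting) is illustrative.

```python
from io import StringIO

# Stand-in for open("gt_list.txt"); each line is "<image> <label>".
gt_list = StringIO("0000.jpg abc\n0001.jpg defg\n0002.jpg zxv\n")

chars = set()
for line in gt_list:
    _, label = line.split(maxsplit=1)   # split image name from label
    chars.update(label.strip())         # collect every character in the label

# One character per line, as characters_list.txt expects.
characters_list = "\n".join(sorted(chars))
print(characters_list)
```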


Evaluation Data

We test the OCRNet model on three different datasets: Uber-Text + TextOCR validation dataset, ICDAR13 and ICDAR15 scene text benchmark dataset.

Methodology and KPI

The key performance indicator is the accuracy of character recognition: a prediction counts as accurate only when all the characters in a text region are recognized correctly. The KPIs for the evaluation data are reported below.

model                      dataset                          accuracy
ocrnet_resnet50_unpruned   Uber-Text + TextOCR validation   77.1%
ocrnet_resnet50_unpruned   ICDAR13                          91.8%
ocrnet_resnet50_unpruned   ICDAR15                          78.6%
ocrnet_resnet50_pruned     ICDAR13                          92.6%
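A minimal sketch of this exact-match metric (illustrative code, not the TAO evaluation tooling): a sample is correct only if the predicted string equals the ground truth exactly.

```python
def exact_match_accuracy(preds, truths):
    """Fraction of samples whose predicted string matches the truth exactly."""
    assert len(preds) == len(truths)
    correct = sum(p == t for p, t in zip(preds, truths))
    return correct / len(truths)

# "cart" vs "card" fails the exact match even though 3 of 4 characters agree.
acc = exact_match_accuracy(["cat", "dog", "cart"], ["cat", "dog", "card"])
print(f"{acc:.1%}")  # 66.7%
```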

Real-time Inference Performance

Inference uses FP16 precision. The performance below was measured with trtexec on AGX Orin, Orin NX, Orin Nano, and an NVIDIA L4 GPU. The Jetson devices run in the Max-N configuration for maximum system performance. The numbers are inference-only performance; end-to-end performance with streaming video data might vary slightly depending on the application.

Model                    Device     Precision  Batch size  FPS
ocrnet_resnet50_pruned   Orin Nano  FP16       128         981
ocrnet_resnet50_pruned   Orin NX    FP16       128         1399
ocrnet_resnet50_pruned   AGX Orin   FP16       128         3921
ocrnet_resnet50_pruned   L4         FP16       512         6404
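The FPS column is throughput: batch size divided by per-batch latency. For example, with a hypothetical per-batch GPU time (the 32.6 ms below is illustrative, not from trtexec output):

```python
batch_size = 128
latency_ms = 32.6                          # illustrative per-batch latency
fps = batch_size / (latency_ms / 1000.0)   # images per second
print(round(fps))  # 3926
```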

How to use this model

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. The model can only be used with the TAO Toolkit (TAO), the DeepStream SDK, or TensorRT.

The primary use case for this model is recognizing the characters in detected text regions.

There are two types of models provided:

  • trainable (unpruned)
  • deployable (unpruned)

The trainable (unpruned) models are intended for training with the TAO Toolkit and the user's own dataset, which can yield high-fidelity models adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train.

The deployable models share the same structure as the trainable (unpruned) models but are in ONNX format. They can be deployed with TensorRT, nvOCDR, and DeepStream.

The trainable models are encrypted and can be decrypted with the following key:

  • Model load key: nvidia_tao

Please make sure to use this key for all TAO Toolkit commands that require a model load key.


Input

Grayscale images of shape 1 x 32 x 100 (C x H x W).


Output

Character ID sequence.
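As a rough illustration of preparing the 1 x 32 x 100 input, the sketch below resizes an arbitrary grayscale crop with nearest-neighbor sampling to keep the example dependency-free; a real pipeline would use a proper image library, and the [0, 1] scaling is an assumption, not taken from the model card.

```python
import numpy as np

def preprocess(gray: np.ndarray, out_h: int = 32, out_w: int = 100) -> np.ndarray:
    """Resize a (H, W) uint8 grayscale crop to the (1, 32, 100) network input."""
    h, w = gray.shape
    rows = np.arange(out_h) * h // out_h          # nearest-neighbor source rows
    cols = np.arange(out_w) * w // out_w          # nearest-neighbor source cols
    resized = gray[rows[:, None], cols]           # (out_h, out_w)
    scaled = resized.astype(np.float32) / 255.0   # assumed [0, 1] normalization
    return scaled[None, :, :]                     # add channel dim -> (1, H, W)

crop = np.random.randint(0, 256, (48, 160), dtype=np.uint8)
inp = preprocess(crop)
print(inp.shape)  # (1, 32, 100)
```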

Instructions to use the model with TAO

In order to use these models as pretrained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file when training an OCRNet model. For more information on the experiment spec file, please refer to the TAO Toolkit User Guide.

  TPS: True
  backbone: ResNet
  feature_channel: 512
  sequence: BiLSTM
  hidden_size: 256
  prediction: CTC
  quantize: False

Instructions to deploy the model with DeepStream

To create an end-to-end video analytics application, deploy this model with the DeepStream SDK. The DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample apps.

To deploy this model with DeepStream, please follow the instructions in this [documentation](@TODO: url to nvOCDR_ds.rst).


Restricted usage in different fields:

The NVIDIA OCRNet model is trained on Uber-Text and TextOCR. In Uber-Text, all the images are street-view images; in TextOCR, most images contain text in a variety of scenes. In general, to get better accuracy in a specific domain, more data is needed to fine-tune the pretrained model through the TAO Toolkit.

Model versions:

  • trainable_v1.0 - Pre-trained models for fine-tuning.
  • deployable_v1.0 - Models deployable with DeepStream.

References



  • Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., ... & Lee, H. (2019). What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4715-4723).
  • Zhang, Y., Gueguen, L., Zharkov, I., Zhang, P., Seifert, K., & Kadlec, B. (2017). Uber-Text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop, CVPR.
  • Singh, A., Pang, G., Toh, M., Huang, J., Galuba, W., & Hassner, T. (2021). TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8802-8812).
  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.



License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.


Ethical AI

NVIDIA OCRNet model recognizes optical characters.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.