MixNet is an optical character detection model for detecting text in images. It is a deep learning model designed for accurate detection of challenging scene text in natural images, with a particular focus on small and irregularly positioned text under diverse lighting conditions and orientations. This model is ready for commercial/non-commercial use.
License to use these models is covered by the Model EULA. By downloading the models, you accept the terms and conditions of the NVIDIA Community Model License.
Global
This model can be used in any computer vision application that aims to detect text characters in images.
NGC [06/30/2025]
Architecture Type: Convolutional Neural Network + Transformer Block. Network Architecture: This model is based on MixNet, a cutting-edge scene text detection model notable for its hybrid CNN-Transformer design and strong benchmark performance. Its main components are FSNet (Feature Shuffle Network) and CTBlock (Central Transformer Block). FSNet serves as the backbone, introducing a novel feature-shuffling strategy that exchanges features across multiple scales. CTBlock exploits the 1D manifold constraint of scene text by focusing on center-line features, which helps distinguish closely located small text better than contour-based methods.
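To make the feature-shuffling idea concrete, below is a minimal, hypothetical PyTorch sketch of exchanging channels between two feature scales. It is illustrative only and not the actual FSNet implementation; the channel split, shuffle pattern, and tensor shapes are all assumptions.

```python
# Conceptual sketch of cross-scale feature shuffling (NOT the real FSNet;
# the half-channel swap and bilinear resizing are assumptions).
import torch
import torch.nn.functional as F

def shuffle_two_scales(feat_hi, feat_lo):
    """Exchange half of the channels between a high-resolution feature map
    (feat_hi, [N, C, H, W]) and a low-resolution one (feat_lo, [N, C, H/2, W/2]),
    resizing the swapped halves to the target resolution."""
    c = feat_hi.shape[1] // 2
    lo_to_hi = F.interpolate(feat_lo[:, :c], size=feat_hi.shape[2:],
                             mode="bilinear", align_corners=False)
    hi_to_lo = F.interpolate(feat_hi[:, :c], size=feat_lo.shape[2:],
                             mode="bilinear", align_corners=False)
    new_hi = torch.cat([feat_hi[:, c:], lo_to_hi], dim=1)
    new_lo = torch.cat([feat_lo[:, c:], hi_to_lo], dim=1)
    return new_hi, new_lo

hi, lo = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 64, 64)
new_hi, new_lo = shuffle_two_scales(hi, lo)
print(new_hi.shape, new_lo.shape)  # [1, 64, 128, 128], [1, 64, 64, 64]
```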
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Link:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
The MixNet pretrained model was trained on three datasets, totaling about 77K images.
The first is the ICDAR2015 training dataset, which we oversample to 20K images.
The second is the Uber-Text dataset. We filter the train_4Kx4K and train_1Kx1K splits so that each image contains word-only text rather than sentence text (see the filtering sketch below). This yields 29,992 images from the train_4Kx4K split and 12,312 images from the train_1Kx1K split.
The third is a synthetic dataset based on the SynthText background images, the official backgrounds on which any user can generate synthetic text. We generate 7,674 images containing large text and 7,331 images containing small single text instances.
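As a rough illustration of the word-only filtering mentioned above, the following Python sketch keeps only images whose annotated transcripts are single words. The annotation structure (a list of polygon/transcript pairs per image) is an assumption for illustration; the real Uber-Text format and filtering criteria may differ.

```python
# Hedged sketch: keep only images whose text instances are single words.
def is_word_only(annotations):
    """Return True if every transcript is one word (non-empty, no whitespace)."""
    return all(
        transcript.strip() and len(transcript.split()) == 1
        for _polygon, transcript in annotations
    )

# Hypothetical annotations: image -> [(polygon, transcript), ...]
dataset = {
    "img_0001.jpg": [([(0, 0), (10, 0), (10, 5), (0, 5)], "STOP")],
    "img_0002.jpg": [([(0, 0), (40, 0), (40, 5), (0, 5)], "NO PARKING")],
}
word_only = {k: v for k, v in dataset.items() if is_word_only(v)}
print(sorted(word_only))  # ['img_0001.jpg']
```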
Link:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
We evaluate the MixNet model on two datasets: word-only images from the Uber-Text 1Kx1K dataset and the ICDAR15 test dataset.
Dataset | Number of evaluation images |
---|---|
ICDAR15 | 500 |
Uber-Text | 7461 |
For text character detection models, the F1-score is a key performance metric that measures how well the model balances precision (the proportion of correct positive detections among all positive detections) and recall (the proportion of actual positives that are correctly detected). The KPIs for the evaluation data are reported below; a minimal sketch of the F1 computation follows the table.
Model | Dataset | F1-score |
---|---|---|
MixNet | ICDAR15 | 86.5% |
MixNet | Uber-Text | 88.0% |
ocdnet-vit | ICDAR15 | 85.3% |
ocdnet-vit | Uber-Text | 86.0% |
ocdnet_deformable_resnet50 | Uber-Text | 82.2% |
ocdnet_deformable_resnet18 | Uber-Text | 81.1% |
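As referenced above, here is a minimal Python sketch of how F1 relates precision and recall, computed from true-positive, false-positive, and false-negative counts. The counts are illustrative and not taken from the evaluation above.

```python
# Minimal F1 computation from detection counts (illustrative numbers).
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f"{f1_score(tp=865, fp=120, fn=150):.3f}")  # ~0.865
```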
Acceleration Engine: TensorRT
Test Hardware: NVIDIA A40
Inference uses FP16 precision. Inference performance was measured against the deployable model with trtexec on an NVIDIA A40 GPU; a sketch of the invocation follows the table below. The data is for inference-only performance; end-to-end performance with streaming video data might vary slightly depending on the application's use case.
Model | GPU | Precision | Input size | Batch size | FPS |
---|---|---|---|---|---|
MixNet | A40 | FP16 | 1024x1024 | 1 | 52 |
MixNet | A40 | FP16 | 960x960 | 1 | 59 |
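As referenced above, a hypothetical reproduction of this benchmark with trtexec might look like the following Python sketch. The ONNX file name and the input tensor name ("input") are assumptions; inspect the model (e.g., with Netron) to confirm them.

```python
# Hedged sketch of benchmarking the deployable ONNX model with trtexec.
import subprocess

cmd = [
    "trtexec",
    "--onnx=mixnet_deployable.onnx",  # assumed file name
    "--fp16",                         # FP16 precision, as in the table above
    # Only needed if the model has dynamic input shapes; "input" is an
    # assumed tensor name (batch 1, NCHW, 1024x1024).
    "--shapes=input:1x3x1024x1024",
]
subprocess.run(cmd, check=True)
```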
To find more performance data for other kinds of detection models, please refer to ocdnet.
This model is suggested to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model can run with NVIDIA-Optical-Character-Detection-and-Recognition-Solution or onnxruntime.
There are two types of models provided:
The deployable model is in onnx format. The deployable models can be deployed in TensorRT and nvOCDR, or onnxruntime.
Please refer to the C++ Sample.
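For quick experimentation outside of TensorRT/nvOCDR, a minimal onnxruntime sketch in Python might look like the following. The file name, input tensor name, preprocessing, and output semantics are assumptions; consult the C++ sample and nvOCDR for the exact pre- and post-processing.

```python
# Hedged sketch: run the deployable ONNX model with onnxruntime.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "mixnet_deployable.onnx",  # assumed file name
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

# Dummy 1024x1024 RGB input in NCHW layout; real use requires the model's
# actual resizing and normalization.
image = np.random.rand(1, 3, 1024, 1024).astype(np.float32)
outputs = session.run(None, {input_name: image})
for out, meta in zip(outputs, session.get_outputs()):
    print(meta.name, out.shape)  # output semantics depend on the model head
```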
The model was trained on the ICDAR2015, Uber-Text, and augmented SynthText datasets. Its generalization performance may be inadequate in scenarios that differ significantly from the training data, such as images containing numerous small text elements on PCB boards. In these cases, the model may struggle to detect text accurately. To address this limitation, augmenting the training dataset with PCB-specific images can enhance the model's ability to generalize to such challenging scenarios. Generally, achieving a better F1-score in a specific domain requires more data from that domain.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.