MixNet

Description: MixNet is an optical character detection model that aims to detect text in images.
Publisher: NVIDIA
Latest Version: deployable_v1.0
Modified: August 5, 2025
Size: 112.63 MB

Model Overview

Description

MixNet is an optical character detection model that aims to detect text in images. It is a deep learning model designed for accurate detection of challenging scene text in natural images, particularly focusing on small and irregularly positioned text under diverse lighting and orientations. This model is ready for commercial/non-commercial use.

License/Terms of Use

License to use these models is covered by the Model EULA. By downloading the models, you accept the terms and conditions of the NVIDIA Community Model License.

Deployment Geography

Global

Use Case

This model can be used in any computer vision application that aims to detect text characters in images.

Release Date

NGC: 06/30/2025

Model Architecture

Architecture Type: Convolutional Neural Network + Transformer Block. Network Architecture: This model is based on MixNet, a scene text detection model notable for its hybrid CNN-Transformer design and strong benchmark performance. Its main components are FSNet (Feature Shuffle Network) and CTBlock (Central Transformer Block). FSNet serves as the backbone and introduces a novel feature shuffling strategy to exchange features across multiple scales. CTBlock exploits the 1D manifold constraint of scene text by focusing on center-line features, which helps it distinguish closely located small text better than contour-based methods.

Input

  • Input Type: Image
  • Input Formats: Red, Green, Blue (RGB)
  • Input Parameters: Two-Dimensional (2D)
  • Other Properties Related to Input: Width and height are multiples of 32.
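
As a minimal illustration of this size constraint, the sketch below pads an RGB image on the bottom and right so its width and height become multiples of 32; the padding strategy and channel handling are assumptions, since preprocessing details are not specified on this card.

```python
import cv2
import numpy as np

def pad_to_multiple_of_32(image: np.ndarray) -> np.ndarray:
    """Pad an HxWx3 RGB image on the bottom/right so that H and W are multiples of 32."""
    h, w = image.shape[:2]
    new_h = (h + 31) // 32 * 32
    new_w = (w + 31) // 32 * 32
    padded = np.zeros((new_h, new_w, 3), dtype=image.dtype)
    padded[:h, :w] = image
    return padded

# Example: OpenCV loads BGR, so convert to RGB first, since the model expects RGB input.
bgr = cv2.imread("sample.jpg")
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
model_input = pad_to_multiple_of_32(rgb)
```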

Output

  • Output Type: Image
  • Output Formats: Red, Green, Blue (RGB)
  • Output Parameters: Two-Dimensional (2D)
  • Other Properties Related to Output: Spatial maps or coordinates indicating detected text regions.
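
The exact output tensor layout is not specified on this card; as a rough, hypothetical post-processing sketch, the snippet below thresholds a per-pixel text probability map into axis-aligned boxes. Treat the map shape, value range, and thresholding choice as assumptions; the nvOCDR sample referenced later on this page implements the actual decoding.

```python
import cv2
import numpy as np

def spatial_map_to_boxes(prob_map: np.ndarray, threshold: float = 0.5):
    """Threshold an H x W text probability map (values assumed in [0, 1]) and
    return (x, y, w, h) bounding boxes of the connected text regions."""
    binary = (prob_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```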

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TAO 5.5.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Volta

Supported Operating System(s):

  • Linux

Model versions:

  • deployable_v1.0 - Deployable MixNet model in ONNX format.

Training and Evaluation Datasets

  • Total size: ~77K images
  • Total number of datasets: 3 training datasets, 2 evaluation datasets
  • Dataset partition: training and evaluation use separate datasets

Training Datasets

Link:

  • ICDAR15
  • Uber-Text
  • Synthetic-dataset

Data Collection Method by dataset:

  • Hybrid: Automated, Synthetic, Human

Labeling Method by dataset:

  • Hybrid: Automated, Synthetic

Properties:
The MixNet pretrained model was trained on three datasets totaling about 77K images. The first is the ICDAR2015 training dataset, which we oversample to 20K images. The second is the Uber-Text dataset: we filter the train_4Kx4K and train_1Kx1K subsets so that each image contains word-only text rather than sentence text, yielding 29,992 images from train_4Kx4K and 12,312 images from train_1Kx1K. The third is a synthetic dataset based on the official SynthText background images, on which any user can generate synthetic text; we generate 7,674 images with large text and 7,331 images with small single text instances.
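
As a rough illustration of the word-only filtering described above, the sketch below keeps an image only when every annotated transcription is a single word (contains no whitespace). The annotation structure and field names here are hypothetical, not the actual Uber-Text schema, and the real filtering criteria may differ.

```python
def is_word_only(transcriptions: list[str]) -> bool:
    """Keep an image only if every annotated text instance is a single word."""
    return all(len(t.split()) == 1 for t in transcriptions)

# Hypothetical annotations: image path -> list of transcriptions in that image.
annotations = {
    "img_0001.jpg": ["EXIT", "21"],            # word-only, kept
    "img_0002.jpg": ["NO PARKING ANY TIME"],   # sentence text, dropped
}
kept = [path for path, texts in annotations.items() if is_word_only(texts)]
```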

Evaluation Datasets

Link:

  • ICDAR15
  • Uber-Text

Data Collection Method by dataset:

  • Hybrid: Automated, Synthetic, Human

Labeling Method by dataset:

  • Hybrid: Automated, Synthetic

Properties:
We evaluate the MixNet model on two datasets: word-only images from the Uber-Text 1Kx1K dataset and the ICDAR15 test dataset.

Dataset     Evaluation images
ICDAR15     500
Uber-Text   7,461

Performance

Methodology and KPI

In text detection models, the F1-score is a key performance metric that measures how well the model balances precision (the proportion of correct positive detections among all positive detections) and recall (the proportion of actual positives that are correctly detected). The KPIs for the evaluation data are reported below.

Model                        Dataset     F1-score
MixNet                       ICDAR15     86.5%
MixNet                       Uber-Text   88.0%
ocdnet-vit                   ICDAR15     85.3%
ocdnet-vit                   Uber-Text   86.0%
ocdnet_deformable_resnet50   Uber-Text   82.2%
ocdnet_deformable_resnet18   Uber-Text   81.1%
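
For reference, the F1-score above is the harmonic mean of precision and recall; a minimal computation is shown below (with illustrative numbers, not the model's measured precision and recall):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: precision 0.88 and recall 0.85 give an F1-score of ~0.865.
print(round(f1_score(0.88, 0.85), 3))
```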

Inference

Acceleration Engine: TensorRT

Test Hardware:

  • A40

Inference performance was measured with trtexec against the deployable model on an NVIDIA A40 GPU at FP16 precision. The numbers reflect inference-only performance; end-to-end performance with streaming video data might vary slightly depending on the application's use case.

Model    GPU   Precision   Input size   Batch size   FPS
MixNet   A40   FP16        1024x1024    1            52
MixNet   A40   FP16        960x960      1            59

For performance data on other detection models, please refer to ocdnet.

How to use this model

This model is suggested to be used with NVIDIA Hardware and Software. For Hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. For software, this model can run with NVIDIA-Optical-Character-Detection-and-Recognition-Solution or onnxruntime.

The following type of model is provided:

  • deployable (unpruned)

The deployable model is in ONNX format and can be deployed with TensorRT and nvOCDR, or with onnxruntime.
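
As a minimal sketch of running the deployable ONNX model with onnxruntime, assuming an NCHW RGB input: the file name, input layout, and preprocessing here are assumptions, so inspect the model's input metadata and follow the nvOCDR sample below for the actual pre- and post-processing.

```python
import numpy as np
import onnxruntime as ort

# File name is an assumption; use the ONNX file downloaded from NGC.
session = ort.InferenceSession(
    "mixnet_deployable_v1.0.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Read the actual input name and shape from the model rather than hard-coding them.
inp = session.get_inputs()[0]
print(inp.name, inp.shape)

# Dummy NCHW input; height and width must be multiples of 32 (e.g., 960x960).
dummy = np.random.rand(1, 3, 960, 960).astype(np.float32)
outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```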

Instructions to use the model with nvOCDR

Please refer to the C++ Sample.

Limitations

Restricted usage in different fields:

The model was trained on the ICDAR2015, Uber-text, and augmented SynthText datasets. Its generalization performance may be inadequate in scenarios that differ significantly from the training data, such as images containing numerous small text elements on PCB boards. In these cases, the model may struggle to accurately detect text. To address this limitation, augmenting the training dataset with PCB-specific images can enhance the model's ability to generalize to such challenging scenarios. Generally, achieving a better F1-score in a specific domain requires more data.

Reference

Citations

  • Zeng, Yu-Xiang, et al. "MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild." arXiv preprint arXiv:2308.12817, 2023.

Using TAO Pre-trained Models

  • Get TAO Container
  • Get other purpose-built models from the NGC model registry:
    • TrafficCamNet
    • PeopleNet
    • PeopleNet-Transformer
    • DashCamNet
    • FaceDetectIR
    • VehicleMakeNet
    • VehicleTypeNet
    • PeopleSegNet
    • PeopleSemSegNet
    • License Plate Detection
    • License Plate Recognition
    • Gaze Estimation
    • Facial Landmark
    • Heart Rate Estimation
    • Gesture Recognition
    • Emotion Recognition
    • FaceDetect
    • 2D Body Pose Estimation
    • ActionRecognitionNet
    • PoseClassificationNet
    • People ReIdentification
    • PointPillarNet
    • CitySegFormer
    • Retail Object Detection
    • Retail Object Embedding
    • Optical Inspection
    • Optical Character Detection
    • Optical Character Recognition
    • PCB Classification
    • PeopleSemSegFormer

Technical blogs

  • Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
  • Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
  • Train like a ‘pro’ without being an AI expert using TAO AutoML
  • Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
  • Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
  • Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
  • Customize Action Recognition with TAO and deploy with DeepStream
  • Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
  • Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important; learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO

Suggested reading

  • More information about the TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
  • Read the TAO Getting Started guide and release notes.
  • If you have any questions or feedback, please refer to the discussions on the TAO Toolkit Developer Forums
  • Deploy your model on the edge using DeepStream. Learn more about the DeepStream SDK at https://developer.nvidia.com/deepstream-sdk

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.