MixNet

Description: MixNet is an optical character detection model that aims to detect text in images.
Publisher: NVIDIA
Latest Version: deployable_v1.0
Modified: August 5, 2025
Size: 112.63 MB

Model Overview

Description

MixNet is an optical character detection model that aims to detect text in images. It is a deep learning model designed for accurate detection of challenging scene text in natural images, particularly focusing on small and irregularly positioned text under diverse lighting and orientations. This model is ready for commercial/non-commercial use.

License/Terms of Use

License to use these models is covered by the Model EULA. By downloading the models, you accept the terms and conditions of the NVIDIA Community Model License.

Deployment Geography

Global

Use Case

This model can be used in any computer vision application that aims to detect text characters in images.

Release Date

NGC: 06/30/2025

Model Architecture

Architecture Type: Convolutional Neural Network + Transformer Block. Network Architecture: This model is based on MixNet, a scene text detection model notable for its hybrid CNN-Transformer design and strong benchmark performance. Its main components are FSNet (Feature Shuffle Network) and CTBlock (Central Transformer Block). FSNet serves as the backbone and introduces a novel feature shuffling strategy to exchange features across multiple scales. CTBlock exploits the 1D manifold constraint of scene text by focusing on center-line features, which helps it distinguish closely located small text better than contour-based methods.

Input

  • Input Type: Image
  • Input Formats: Red, Green, Blue (RGB)
  • Input Parameters: Two-Dimensional (2D)
  • Other Properties Related to Input: Width and height are multiples of 32.
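
As a minimal illustration of this size constraint, the sketch below pads an RGB image on the bottom and right so its width and height become multiples of 32; the padding strategy and channel handling are assumptions, since preprocessing details are not specified on this card.

```python
import cv2
import numpy as np

def pad_to_multiple_of_32(image: np.ndarray) -> np.ndarray:
    """Pad an HxWx3 RGB image on the bottom/right so that H and W are multiples of 32."""
    h, w = image.shape[:2]
    new_h = (h + 31) // 32 * 32
    new_w = (w + 31) // 32 * 32
    padded = np.zeros((new_h, new_w, 3), dtype=image.dtype)
    padded[:h, :w] = image
    return padded

# Example: OpenCV loads BGR, so convert to RGB first, since the model expects RGB input.
bgr = cv2.imread("sample.jpg")
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
model_input = pad_to_multiple_of_32(rgb)
```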

Output

  • Output Type: Image
  • Output Formats: Red, Green, Blue (RGB)
  • Output Parameters: Two-Dimensional (2D)
  • Other Properties Related to Output: Spatial maps or coordinates indicating detected text regions.
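
The exact output tensor layout is not specified on this card; as a rough, hypothetical post-processing sketch, the snippet below thresholds a per-pixel text probability map into axis-aligned boxes. Treat the map shape, value range, and thresholding choice as assumptions; the nvOCDR sample referenced later on this page implements the actual decoding.

```python
import cv2
import numpy as np

def spatial_map_to_boxes(prob_map: np.ndarray, threshold: float = 0.5):
    """Threshold an H x W text probability map (values assumed in [0, 1]) and
    return (x, y, w, h) bounding boxes of the connected text regions."""
    binary = (prob_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```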

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TAO 5.5.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Volta

Supported Operating System(s):

  • Linux

Model versions:

  • deployable_v1.0 - Deployable MixNet model in ONNX format.

Training and Evaluation Datasets

  • Total size: ~77K images
  • Total number of datasets: 3 training datasets, 2 evaluation datasets
  • Dataset partition: training and evaluation use separate datasets

Training Datasets

Link:

  • ICDAR15
  • Uber-Text
  • Synthetic-dataset

Data Collection Method by dataset:

  • Hybrid: Automated, Synthetic, Human

Labeling Method by dataset:

  • Hybrid: Automated, Synthetic

Properties:
The MixNet pretrained model was trained on three datasets totaling about 77K images. The first is the ICDAR2015 training dataset, which we oversample to 20K images. The second is the Uber-Text dataset: we filter the train_4Kx4K and train_1Kx1K subsets so that each image contains word-only text rather than sentence text, yielding 29,992 images from train_4Kx4K and 12,312 images from train_1Kx1K. The third is a synthetic dataset based on the official SynthText background images, on which any user can generate synthetic text; we generate 7,674 images with large text and 7,331 images with small single text instances.
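
As a rough illustration of the word-only filtering described above, the sketch below keeps an image only when every annotated transcription is a single word (contains no whitespace). The annotation structure and field names here are hypothetical, not the actual Uber-Text schema, and the real filtering criteria may differ.

```python
def is_word_only(transcriptions: list[str]) -> bool:
    """Keep an image only if every annotated text instance is a single word."""
    return all(len(t.split()) == 1 for t in transcriptions)

# Hypothetical annotations: image path -> list of transcriptions in that image.
annotations = {
    "img_0001.jpg": ["EXIT", "21"],            # word-only, kept
    "img_0002.jpg": ["NO PARKING ANY TIME"],   # sentence text, dropped
}
kept = [path for path, texts in annotations.items() if is_word_only(texts)]
```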

Evaluation Datasets

Link:

  • ICDAR15
  • Uber-Text

Data Collection Method by dataset:

  • Hybrid: Automated, Synthetic, Human

Labeling Method by dataset:

  • Hybrid: Automated, Synthetic

Properties:
We evaluate the MixNet model on two datasets: word-only images from the Uber-Text 1Kx1K dataset and the ICDAR15 test dataset.

Dataset     Evaluation images
ICDAR15     500
Uber-Text   7,461

Performance

Methodology and KPI

In text detection models, the F1-score is a key performance metric that measures how well the model balances precision (the proportion of correct positive detections among all positive detections) and recall (the proportion of actual positives that are correctly detected). The KPIs for the evaluation data are reported below.

Model                        Dataset     F1-score
MixNet                       ICDAR15     86.5%
MixNet                       Uber-Text   88.0%
ocdnet-vit                   ICDAR15     85.3%
ocdnet-vit                   Uber-Text   86.0%
ocdnet_deformable_resnet50   Uber-Text   82.2%
ocdnet_deformable_resnet18   Uber-Text   81.1%
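
For reference, the F1-score above is the harmonic mean of precision and recall; a minimal computation is shown below (with illustrative numbers, not the model's measured precision and recall):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: precision 0.88 and recall 0.85 give an F1-score of ~0.865.
print(round(f1_score(0.88, 0.85), 3))
```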

Inference

Acceleration Engine: TensorRT

Test Hardware:

  • A40

Inference performance was measured with trtexec against the deployable model on an NVIDIA A40 GPU at FP16 precision. The numbers reflect inference-only performance; end-to-end performance with streaming video data might vary slightly depending on the application's use case.

Model    GPU   Precision   Input size   Batch size   FPS
MixNet   A40   FP16        1024x1024    1            52
MixNet   A40   FP16        960x960      1            59

For performance data on other detection models, please refer to ocdnet.

How to use this model

This model is suggested to be used with NVIDIA Hardware and Software. For Hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. For software, this model can run with NVIDIA-Optical-Character-Detection-and-Recognition-Solution or onnxruntime.

The following type of model is provided:

  • deployable (unpruned)

The deployable model is in ONNX format and can be deployed with TensorRT and nvOCDR, or with onnxruntime.
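
As a minimal sketch of running the deployable ONNX model with onnxruntime, assuming an NCHW RGB input: the file name, input layout, and preprocessing here are assumptions, so inspect the model's input metadata and follow the nvOCDR sample below for the actual pre- and post-processing.

```python
import numpy as np
import onnxruntime as ort

# File name is an assumption; use the ONNX file downloaded from NGC.
session = ort.InferenceSession(
    "mixnet_deployable_v1.0.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Read the actual input name and shape from the model rather than hard-coding them.
inp = session.get_inputs()[0]
print(inp.name, inp.shape)

# Dummy NCHW input; height and width must be multiples of 32 (e.g., 960x960).
dummy = np.random.rand(1, 3, 960, 960).astype(np.float32)
outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```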

Instructions to use the model with nvOCDR

Please refer to the C++ Sample.

Limitations

Restricted usage in different fields:

The model was trained on the ICDAR2015, Uber-text, and augmented SynthText datasets. Its generalization performance may be inadequate in scenarios that differ significantly from the training data, such as images containing numerous small text elements on PCB boards. In these cases, the model may struggle to accurately detect text. To address this limitation, augmenting the training dataset with PCB-specific images can enhance the model's ability to generalize to such challenging scenarios. Generally, achieving a better F1-score in a specific domain requires more data.

Reference

Citations

  • Zeng, Yu-Xiang, et al. "MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild." arXiv preprint arXiv:2308.12817, 2023.

Using TAO Pre-trained Models

  • Get TAO Container
  • Get other purpose-built models from the NGC model registry:
    • TrafficCamNet
    • PeopleNet
    • PeopleNet-Transformer
    • DashCamNet
    • FaceDetectIR
    • VehicleMakeNet
    • VehicleTypeNet
    • PeopleSegNet
    • PeopleSemSegNet
    • License Plate Detection
    • License Plate Recognition
    • Gaze Estimation
    • Facial Landmark
    • Heart Rate Estimation
    • Gesture Recognition
    • Emotion Recognition
    • FaceDetect
    • 2D Body Pose Estimation
    • ActionRecognitionNet
    • PoseClassificationNet
    • People ReIdentification
    • PointPillarNet
    • CitySegFormer
    • Retail Object Detection
    • Retail Object Embedding
    • Optical Inspection
    • Optical Character Detection
    • Optical Character Recognition
    • PCB Classification
    • PeopleSemSegFormer

Technical blogs

  • Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
  • Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
  • Train like a ‘pro’ without being an AI expert using TAO AutoML
  • Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
  • Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
  • Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
  • Customize Action Recognition with TAO and deploy with DeepStream
  • Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
  • Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important; learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO

Suggested reading

  • More information about the TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
  • Read the TAO Getting Started guide and release notes.
  • If you have any questions or feedback, please refer to the discussions on the TAO Toolkit Developer Forums
  • Deploy your model on the edge using DeepStream. Learn more about the DeepStream SDK at https://developer.nvidia.com/deepstream-sdk

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.