Retail Object Recognition

Publisher: NVIDIA
Latest Version: trainable_v2.0
Modified: March 15, 2024
Size: 918.82 MB

Model Overview

The Retail Object Recognition model encodes retail objects as embedding vectors and predicts their labels from those vectors in the reference space. In addition to generating embeddings, the model provides a classification output that assigns objects to seen classes, that is, classes that appear in the training datasets.

Model Architecture

The Retail Object Recognition model is composed of three primary components: a trunk, a classification head, and an embedding head. The trunk uses the NV-Dinov2 model architecture for feature extraction. NV-Dinov2 is a visual foundation model trained on an NVIDIA proprietary large-scale dataset. Dinov2 is a self-supervised learning (SSL) method that combines two SSL techniques: DINO and iBOT. Such models can greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without fine-tuning. Trained on large curated datasets, the model has learned robust fine-grained representations useful for localization and classification tasks. It can be used as a foundation model for a variety of downstream tasks with few labeled examples. For more details on the method, please refer to Dinov2.

The embedding head uses four adaptors with an output size of 1024. Each adaptor consists of two fully connected layers, GELU activation functions, and a residual connection. The classifier is built from the adaptors followed by a one-layer perceptron whose output size equals the number of classes.
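As a rough illustration of the adaptor described above, the following sketch wires two fully connected layers, a GELU activation, and a residual connection together in plain Python. The exact ordering of the layers, the placement of the activation, and the toy width of 8 are assumptions made for illustration (the model card states an output size of 1024), and the helper names `gelu`, `linear`, `adaptor`, and `random_matrix` are hypothetical.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Fully connected layer; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def adaptor(x, w1, b1, w2, b2):
    """Assumed layout: FC -> GELU -> FC, then add the input back (residual)."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    out = linear(hidden, w2, b2)
    return [o + xi for o, xi in zip(out, x)]


def random_matrix(rows, cols, rng):
    """Small random weights, only to make the toy example runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8  # toy width for illustration; not the real feature size
x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
w1, w2 = random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)
b1, b2 = [0.0] * dim, [0.0] * dim

y = adaptor(x, w1, b1, w2, b2)
print(len(y))  # the residual connection requires input and output widths to match
```

Because of the residual connection, an adaptor whose weights are all zero simply returns its input, which is one reason this kind of block is convenient to place on top of a frozen trunk.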

Training

For the Retail Object Recognition model, the training algorithm optimizes the network to minimize the losses on the embedding features and the target classes of the objects. The trunk is initialized from the NV-Dinov2 pretrained model and remains frozen during training. The embedder is trained with triplet loss, which reduces the distance (measured by cosine similarity) between the anchor image and positive images while increasing the distance between the anchor image and negative images. The classifier is trained with cross-entropy loss, which penalizes the mismatch between the predicted class probabilities and the true class of each image. The classifier and embedder are trained jointly, which can potentially improve precision by leveraging the features from both branches.
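The two training objectives described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual training code: the margin value, the equal weighting of the two losses, and the function names are all assumptions, since the model card does not specify them.

```python
import math


def cosine_distance(a, b):
    """Distance derived from cosine similarity: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the positive closer than the negative by at least `margin` (assumed value)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax over logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(anchor, positive, negative, logits, target, weight=1.0):
    """Sum of both objectives; the weighting is an assumption for illustration."""
    return triplet_loss(anchor, positive, negative) + weight * cross_entropy(logits, target)


# Toy example: 2-D embeddings and a 3-class classifier output.
print(joint_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [2.0, 0.5, -1.0], target=0))
```

The triplet term becomes zero once the negative is farther from the anchor than the positive by the margin, so only hard or borderline triplets contribute to the embedder's update.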

Training Data

The training data for the Retail Object Recognition model was cropped from the images used for training and fine-tuning the Retail Object Detection model (see Retail Object Detection - TRAINING DATA). It therefore consists of both synthetic and real data. Mixing synthetic and real images in the training set helps the model bridge the gap between simulation and reality, since it learns feature representations from both image sources.

The training data encompasses multiple angles of the retail objects, which equips the model to identify a retail object from any given angle.

The Retail Object Recognition model was trained on a retail dataset with 315 distinct categories in total.

Specifically, training used a combined dataset of more than 80,000 synthetic images and 48,000 real images. In the training phase, both cross-entropy loss and triplet loss are optimized on this composite dataset. In the testing phase, the probabilities for each category are computed. In parallel, a similarity search is conducted, which uses synthetic data as the reference and real data as the query.

Dataset	Total # of images	Training images	Testing images
Synthetic Data	80,872	80,872	-
Real Data	129,012	48,140	58,699

Inference Data Ground-truth Labeling Guidelines

The real training images were cropped from the Retail Object Detection datasets using ground-truth bounding boxes and categories annotated by human labelers. To run inference on your own datasets, you may follow the guidelines below.

Reference Data Guidelines

Reference data is the database used for similarity search during the inference stage of the Retail Object Recognition model. The prediction for an inference image is determined by the L2 distances between extracted features. Specifically, the algorithm selects the reference object in the database with the smallest L2 distance to the query object, and the predicted class is the class of that reference object.
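The nearest-neighbor lookup described above can be sketched as follows. This is a minimal brute-force illustration under stated assumptions: the embeddings are toy 2-D vectors rather than real model outputs, the class labels are made up, and the names `l2_distance` and `predict_class` are hypothetical. A production system would likely use a vectorized or indexed search instead.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_class(query, reference_db):
    """Return the label of the reference embedding closest to the query.

    reference_db is a list of (embedding, label) pairs.
    """
    _, best_label = min(reference_db, key=lambda item: l2_distance(query, item[0]))
    return best_label


# Hypothetical reference database with toy embeddings and labels.
reference_db = [
    ([1.0, 0.0], "soda_can"),
    ([0.9, 0.2], "soda_can"),
    ([0.0, 1.0], "cereal_box"),
]

query = [0.1, 0.8]
print(predict_class(query, reference_db))
```

Since the prediction is always the label of some reference entry, a query whose true class has no reference images can only be assigned to a wrong class, which is why coverage of the reference set matters.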

Therefore, to achieve the highest accuracy for retail object recognition, the reference data should be as close to the inference data as possible in terms of background, occlusion, object orientation, and so on.

For instance, if you only want to recognize retail objects seen from the front, you can collect only front-facing images of the objects as reference data. On the other hand, if you want the Retail Object Recognition model to recognize objects from any angle, the reference dataset needs to cover more orientations of the retail objects.

Generally, 20-30 images per class of reference data is the most efficient choice. However, collecting more reference examples, say 100 images per class, is better still.

Below are the guidelines for the specific conditions of the images:

All objects should occupy at least 70% of the frame.
Noisy background: images with noisy backgrounds are fine.
Occlusion: objects occluded by distractors such as hands are fine, as long as at least 60% of the object is visible.
Truncation: objects truncated by the edge of the frame are fine if at least 60% of the object is visible.
Each image should be assigned to a specific class. This model does not accommodate grouping multiple classes into a catch-all "other"/"unknown" class.

Query Data Guidelines

Same as the reference data guidelines.

Note that the Retail Object Recognition model cannot correctly classify a retail object whose class is not in the reference dataset.

To get the most accurate predictions, avoid challenging the Retail Object Recognition model with ambiguous views, such as the top of a soda can, which can look the same across many different retail objects.

Performance

Evaluation Data

Here we present the evaluation results of the Retail Object Recognition models. Note that all test Key Performance Indicator (KPI) data is proprietary and derived from the test KPI data of the Retail Object Detection model. For more information about the test dataset, see Retail Object Detection - TRAINING AND TEST DATA.

Methodology and KPI

The performance of the Retail Object Recognition models is measured by accuracy, which is the proportion of correct predictions (across all classes) out of all predictions made by the model.
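The accuracy metric defined above amounts to a simple ratio. The sketch below shows one way to compute it; the function name `accuracy` and the sample labels are illustrative and not part of any NVIDIA tooling.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one prediction is required")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Toy example with made-up class names: 3 of 4 predictions are correct.
predicted = ["soda_can", "cereal_box", "soda_can", "chips_bag"]
expected = ["soda_can", "cereal_box", "chips_bag", "chips_bag"]
print(accuracy(predicted, expected))
```

The same function applies to both heads: for the classification head the predictions come from the class probabilities, and for the embedding head they come from the similarity search against the reference database.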

Accuracy of the Classification Head

Model	Model Architecture	Training dataset description	Test dataset description	Accuracy
Retail Object Recognition	NV-Dinov2	A mix of 80k synthetic images and 48k real images, 315 retail classes. Real images are obtained from 7 scenes (one more scene than the v1.0 model training dataset)	58,699 images, 315 retail classes, 7 scenes in total	0.9007

Accuracy of the Embedding Head

# of test images/class	# of images/class in reference database	NV-Dinov2 Accuracy
180	1	0.8490
180	2	0.8616
180	3	0.8659
180	4	0.8695
180	5	0.8689
180	6	0.8714
180	7	0.8736
180	8	0.8717
180	9	0.8723
180	10	0.8729
180	20	0.8748
180	30	0.8747
180	40	0.8759
180	50	0.8772
180	60	0.8774
180	70	0.8754
180	80	0.8774
180	90	0.8772
180	100	0.8777

Real-time Inference Performance

Inference is run on the provided unpruned model at FP16 precision. The model input resolution is 224x224. Inference performance is measured using trtexec on Orin NX 16GB, Jetson AGX Orin 64GB, A2, T4, A30, L4, L40, A100, and H100. The numbers shown here reflect inference-only performance. End-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.

Model	Device	Batch size	Latency (ms)	Images per second
Retail Object Recognition	Orin NX 16GB	4	135.76	29.46
Retail Object Recognition	AGX Orin 64GB	8	99.08	80.74
Retail Object Recognition	A2	16	220.39	72.6
Retail Object Recognition	T4	8	80.81	99.0
Retail Object Recognition	A30	16	34.76	460.3
Retail Object Recognition	L4	4	14.62	273.6
Retail Object Recognition	L40	8	14.01	571.1
Retail Object Recognition	A100	64	62.52	1,023.6
Retail Object Recognition	H100	64	25.63	2,496.6
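The throughput column in the table above appears to follow from the batch size and the per-batch latency, as images per second = batch size / (latency in seconds). The short sketch below recomputes it for a few rows; this relationship is inferred from the numbers rather than stated by the source, and some rows (for example L40 and H100) differ from the recomputed value by less than one image per second.

```python
def images_per_second(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by processing one batch every `latency_ms` milliseconds."""
    return batch_size / (latency_ms / 1000.0)


# (device, batch size, latency in ms) copied from the table above.
rows = [
    ("Orin NX 16GB", 4, 135.76),
    ("AGX Orin 64GB", 8, 99.08),
    ("T4", 8, 80.81),
    ("A30", 16, 34.76),
    ("H100", 64, 25.63),
]

for device, batch_size, latency_ms in rows:
    print(f"{device}: {images_per_second(batch_size, latency_ms):.2f} images/s")
```

Reading the table this way also makes clear that latency values are per batch, so devices measured at different batch sizes are not directly comparable on latency alone.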

How to use this model

Instructions to deploy Retail Object Recognition Models with DeepStream

We present an example of using Retail Object Recognition together with Retail Object Detection in an end-to-end video analytics application. To implement it, deploy the models with DeepStream, a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample app.

Note that, because of the DeepStream SDK 6.2 update, the instructions below produce only PGIE output, so Retail Object Recognition outputs are not yet available from DeepStream SDK 6.2. Complete instructions will be announced once the DeepStream SDK 6.2 patch is added.

Download and install DeepStream SDK. The installation instructions for DeepStream are provided in DeepStream development guide. The config files for the purpose-built models are located in:
/opt/nvidia/deepstream is the default DeepStream installation directory. This path will be different if you are installing in a different directory.
The primary GIE config files are in $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_detection_tao:

# meta-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_binary_dino_tao_config.yaml
pgie_retail_object_detection_binary_dino_tao_config.txt
# Binary-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_binary_dino_tao_config.yaml
pgie_retail_object_detection_binary_dino_tao_config.txt
pgie_retail_object_detection_binary_effdet_tao_config.yaml
pgie_retail_object_detection_binary_effdet_tao_config.txt

For more information, please refer to Retail Object Detection - INSTRUCTIONS TO DEPLOY THESE MODELS WITH DEEPSTREAM.

For the secondary GIE part of both version 1.0 and 2.0 models, please refer to the subsequent subsections.
Go to $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app and run:

cd $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app
deepstream-mdx-perception-app -m 3 -c ../../../configs/app/retail_object_detection_recognition.yml

Secondary GIE

You will need the config files from the following folder. These files are provided in NVIDIA-AI-IOT. Assuming the repo is cloned under $DS_TAO_APPS_HOME, they are in $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_recognition_tao:

# Embedder model (the secondary GIE module) inference settings
sgie_retail_object_recognition_tao_config.yml

Key Parameters in sgie_retail_object_recognition_tao_config.yml

property:
  gpu-id: 0
  net-scale-factor: 0.01735207357
  offsets: 123.657;116.28;103.53
  onnx-file: onnx_model.onnx
  model-engine-file: trt_model.engine
  tlt-model-key: nvidia_tlt
  infer-dims: 3;224;224
  batch-size: 16
  # 0=FP32, 1=INT8, 2=FP16
  network-mode: 0
  network-type: 100
  interval: 0
  process-mode: 2
  gie-unique-id: 3
  classifier-threshold: 0.0
  operate-on-gie-id: 1
  output-tensor-meta: 1
  model-color-format: 0
  maintain-aspect-ratio: 0
  output-blob-names: probs;embeddings
  operate-on-class-ids: 0;1;2;3
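As a rough illustration of how the net-scale-factor and offsets values in the config above are typically applied, the sketch below assumes the per-pixel normalization documented for the DeepStream Gst-nvinfer plugin, y = net-scale-factor * (x - offset), computed per color channel. The function name `normalize_pixel` is hypothetical, and this is not the plugin's actual implementation.

```python
# Values copied from sgie_retail_object_recognition_tao_config.yml above.
NET_SCALE_FACTOR = 0.01735207357
OFFSETS = (123.657, 116.28, 103.53)  # per-channel means, in RGB order (model-color-format 0)


def normalize_pixel(rgb):
    """Apply y = net-scale-factor * (x - offset) to each channel of one pixel."""
    return tuple(NET_SCALE_FACTOR * (value - offset) for value, offset in zip(rgb, OFFSETS))


# The scale factor is roughly the reciprocal of 57.63, so it acts like
# dividing the mean-subtracted pixel value by a fixed standard deviation.
print(1.0 / NET_SCALE_FACTOR)
print(normalize_pixel((255, 255, 255)))
print(normalize_pixel((0, 0, 0)))
```

If you compute embeddings for a reference database outside DeepStream, applying the same normalization to the 224x224 input keeps the reference and query features comparable.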

[Figures: input image and output image]

Limitations

Very Small Objects

NVIDIA Retail Object Recognition models are trained to classify objects larger than 10x10 pixels. They may therefore produce poor results when classifying objects smaller than 10x10 pixels.

Occluded Objects

When objects are occluded or truncated such that less than 40% of the object is visible, they may not be correctly classified by the Retail Object Recognition model. Partial occlusion by a hand is acceptable, as the model was trained with examples having random occlusions.

Monochrome or Infrared Camera Images

The Retail Object Recognition models are trained on RGB images. Therefore, images captured by a monochrome or infrared (IR) camera may not yield good recognition results.

Warped and Blurry Images

The Retail Object Recognition models are not trained on images from fisheye-lens cameras or moving cameras. Therefore, the models may not perform well on warped images or on images with motion-induced or other blur.

Model versions

Deployable: the decrypted ONNX file for the Retail Object Recognition model, which can be used for inference with DeepStream.

References

Citations

Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.
Na, Shi, Liu Xumin, and Guan Yong. "Research on k-means clustering algorithm: An improved k-means clustering algorithm." 2010 Third International Symposium on Intelligent Information Technology and Security Informatics. IEEE, 2010.
Zhou, Daquan, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, and Jose M. Alvarez. "Understanding The Robustness in Vision Transformers." International Conference on Machine Learning (ICML), 2022.
Oquab, Maxime, Timothée Darcet, et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193, 2023.

Technical blogs

Learn how to train a real-time license plate detection and recognition app with TAO and DeepStream.
Model accuracy is extremely important. Learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO.
Learn how to train and deploy real-time intelligent video analytics apps and services using DeepStream SDK

License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA Retail Object Recognition model classifies retail objects. No additional information, such as people or other distractors in the background, is inferred. The training and evaluation datasets consist mostly of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA's platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model's developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.
