
Retail Object Recognition



Embedding generator model to recognize objects on a checkout counter.



Use Case



Transfer Learning Toolkit

Latest Version



December 13, 2022


177.95 MB


Model Overview

This model encodes retail items to embedding vectors and predicts their labels based on the embedding vectors in the reference space.

Model Architecture

The model consists of a trunk and an embedder. The trunk uses the architecture of ResNet101 with its fully connected layer removed. The embedder is a one-layer Perceptron with an input size of 2048 (the output size of the Average Pool in ResNet101) and an output size of 2048. Thus the embedding dimension of the Retail Embedding model is 2048.


This model was trained with a triplet loss. Training optimizes the network to minimize the embedding distance (measured with cosine similarity) between the positive images and the anchor image while maximizing the distance between the negative images and the anchor image.
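
As an illustration of the objective described above, here is a minimal sketch of a triplet loss over cosine distance. It is not the training code for this model: the hinge form, the `margin` value, and the function names are assumptions chosen for the example, and the vectors are toy two-dimensional embeddings rather than 2048-dimensional ones.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity, so vectors with the same direction give 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss. The margin is an assumed hyperparameter."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    # Zero loss once the negative is at least `margin` farther than the positive.
    return max(0.0, d_pos - d_neg + margin)


anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # same item, slightly different view
negative = [0.0, 1.0]   # different item

loss_good = triplet_loss(anchor, positive, negative)  # well-separated triplet
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
print(loss_good, loss_bad)
```

The well-separated triplet contributes no loss, while the swapped triplet is penalized, which is the behavior the paragraph above describes.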

The trunk and embedder use different learning rates during training. The embedder uses a smaller learning rate than the trunk for a better fine-tuning effect.

Training Data

The training data of the Retail Item Embedding model was cropped from the images used for training and fine-tuning the Retail Item Detection model. It is therefore made up of both synthetic data and real data.

Specifically, the model was trained on a mixture of 0.6 million synthetic images and 50k real images. During the training phase, the triplet loss on the mixture is optimized; during the validation phase, the accuracy of the similarity search is calculated. The reference data for the validation set is synthetic, while the query data for the validation set is real. This setup helps the model overcome the simulation-to-reality gap, so that it learns class features from both synthetic and real image sources.

The training data covers multiple angles of each retail item, so the model was trained to recognize an item from an arbitrary viewing angle.

| dataset | total #images | train #images | val #images |
|---|---|---|---|
| Synthetic data | 600,000 | 570,000 | 30,000 |
| Real data | 53,476 | 47,569 | 5,907 |

Inference Data Ground-truth Labeling Guidelines

The real training images were cropped from Retail Item Detection datasets, using ground-truth bounding boxes and categories provided by human labellers. However, this model does not currently support retraining. It only supports inference, on both seen and unseen classes. To run inference on your own datasets, follow the guidelines below.

Reference Data Guidelines

Reference data is the database used for similarity search by the Retail Item Embedding model at inference time. The prediction for an inference image is decided by the L2 distances between the extracted features. Specifically, the algorithm uses Kmeans to select the reference object with the smallest L2 distance to the query object in the reference database, and the predicted class is the class of the selected reference object.
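
The selection rule above can be sketched as follows. This is an illustration only: it uses a plain brute-force nearest-neighbour scan rather than the Kmeans-based search mentioned above, and the function names, class labels, and two-dimensional embeddings are all hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_label(query, references):
    """Return the label of the reference embedding closest to the query.

    `references` is a list of (embedding, label) pairs.
    """
    _, best_label = min(references, key=lambda item: l2_distance(query, item[0]))
    return best_label


# Toy reference database with hypothetical classes.
references = [
    ([0.0, 0.0], "soda_can"),
    ([0.1, 0.0], "soda_can"),
    ([1.0, 1.0], "cereal_box"),
]

prediction = predict_label([0.9, 0.8], references)
print(prediction)  # cereal_box
```

Because the prediction is always the class of some reference object, a query whose class is absent from `references` is necessarily assigned a wrong label.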

Therefore, to achieve the highest accuracy for retail item recognition, the reference data needs to be as close to the inference data as possible, regarding the background, occlusion, object orientations, etc.

For instance, if you decide that you only want to infer the retail items with the front face, then you can collect the front side of the retail items only as reference data. On the other hand, if you want the Retail Item Embedding model to recognize the items with whatever angles presented, then more orientations of the retail items need to be collected in the reference dataset.

Generally, 20-30 images per class of reference data is the most efficient choice. However, collecting more reference examples, say 100 images per class, gives better accuracy.

Below are the guidelines for the specific conditions of the images:

  1. All objects should take at least 70% of the frame.
  2. Noisy background: images with noises in the backgrounds are fine.
  3. Occlusion: objects occluded by distractions such as hands are fine. But at least 60% of the object should be visible.
  4. Truncation: objects truncated by the edge of the frame with visibility of >= 60% are fine.
  5. Each image should be assigned to a specific class. This model does not support grouping multiple classes into a catch-all "other" or "unknown" class.
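
The numeric conditions in the list above can be checked programmatically before an image is added to the reference set. The sketch below is hypothetical: the `ImageStats` fields and the `guideline_violations` helper are invented for illustration, and it assumes you already have per-image estimates of frame coverage and visibility (for example, from annotation metadata).

```python
from dataclasses import dataclass

MIN_FRAME_COVERAGE = 0.70   # guideline 1: object takes at least 70% of the frame
MIN_VISIBLE = 0.60          # guidelines 3 and 4: at least 60% of the object visible
CATCH_ALL_LABELS = {"other", "unknown"}  # guideline 5: no catch-all classes


@dataclass
class ImageStats:
    frame_coverage: float    # fraction of the frame covered by the object
    visible_fraction: float  # fraction of the object not occluded or truncated
    label: str               # class assigned to the image


def guideline_violations(stats):
    """Return a list of human-readable guideline violations (empty if none)."""
    problems = []
    if stats.frame_coverage < MIN_FRAME_COVERAGE:
        problems.append("object covers less than 70% of the frame")
    if stats.visible_fraction < MIN_VISIBLE:
        problems.append("less than 60% of the object is visible")
    if not stats.label or stats.label.lower() in CATCH_ALL_LABELS:
        problems.append("image is not assigned to a specific class")
    return problems


ok_image = ImageStats(frame_coverage=0.85, visible_fraction=0.75, label="soda_can")
bad_image = ImageStats(frame_coverage=0.40, visible_fraction=0.50, label="unknown")
print(guideline_violations(ok_image))
print(guideline_violations(bad_image))
```

A noisy background (guideline 2) is acceptable, so the sketch deliberately has no check for it.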

Query Data Guidelines

Same as the reference data guidelines.

Notice that the Retail Item Embedding model can never correctly classify the retail item if the class is not in the reference dataset.

To get the most accurate predictions, you should avoid challenging the Retail Item Embedding model with some bad views, such as the top of a soda can (as this view can be the same across many different retail items).


Evaluation Data

The Retail Item Embedding model was evaluated against 100k images from 2000 classes of an Aliproducts subset. The 2000 classes were selected on the criterion that each has more than 150 train images. Note that the images selected for evaluation come from the train set of Aliproducts, because the validation set has only 2-4 images per class, which is not enough for our test. None of the test classes had been seen by the Retail Item Embedding model before.

Aliproducts subset classes: 2000 Aliproducts classes list

Methodology and KPI

Mean accuracy across the classes is calculated. The KPI for the evaluation data are reported in the table below.

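| #test images/class | #images/class in reference database | overall mean class accuracy (%) |
|---|---|---|
| 50 | 1 | 44.21 |
| 50 | 2 | 53.91 |
| 50 | 3 | 59.00 |
| 50 | 4 | 62.44 |
| 50 | 5 | 64.95 |
| 50 | 6 | 66.81 |
| 50 | 7 | 68.31 |
| 50 | 8 | 69.49 |
| 50 | 9 | 70.43 |
| 50 | 10 | 71.30 |
| 50 | 20 | 76.31 |
| 50 | 30 | 78.50 |
| 50 | 40 | 79.95 |
| 50 | 50 | 80.93 |
| 50 | 60 | 81.63 |
| 50 | 70 | 82.22 |
| 50 | 80 | 82.81 |
| 50 | 90 | 83.22 |
| 50 | 100 | 83.66 |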
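
For readers unfamiliar with the metric, the following sketch shows one common way to compute mean class accuracy: accuracy is computed per class and then averaged, so every class counts equally regardless of how many test images it has. The function name and the toy labels are illustrative assumptions, not the evaluation script used for the table above.

```python
from collections import defaultdict


def mean_class_accuracy(true_labels, predicted_labels):
    """Average of per-class accuracies, returned as a percentage."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    per_class = [correct[c] / total[c] for c in total]
    return 100.0 * sum(per_class) / len(per_class)


true_labels = ["a", "a", "a", "a", "b", "b"]
predicted_labels = ["a", "a", "a", "b", "b", "a"]

# Class "a": 3/4 correct, class "b": 1/2 correct.
result = mean_class_accuracy(true_labels, predicted_labels)
print(result)  # 62.5
```

In this toy example the plain overall accuracy would be 4/6 (about 66.7%), which differs from the mean class accuracy because the classes have different numbers of test images.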

Real-time Inference Performance

Inference is run on the provided unpruned model at FP16 precision. The model input resolution is 224x224. Performance is measured using trtexec on Jetson AGX Orin 64GB and A10. The numbers shown here are inference-only performance. End-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.

| model | device | batch size | Latency (ms) | Images per second |
|---|---|---|---|---|
| Retail Item Embedding | Jetson AGX Orin 64GB | 1 | 1.59 | 627 |
| Retail Item Embedding | Jetson AGX Orin 64GB | 16 | 12.83 | 1247 |
| Retail Item Embedding | Jetson AGX Orin 64GB | 32 | 23.61 | 1356 |
| Retail Item Embedding | Tesla A10 | 1 | 0.98 | 1018 |
| Retail Item Embedding | Tesla A10 | 16 | 5.95 | 2690 |
| Retail Item Embedding | Tesla A10 | 64 | 20.61 | 3106 |
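
As a sanity check on the table, throughput can be estimated from batch size and per-batch latency. The relation below (images per second ≈ batch size × 1000 / latency in ms) is an assumption about how the two columns relate, not a statement of how trtexec computes its numbers; it reproduces the reported values to within about 1%, with the small differences presumably due to rounding of the latency column.

```python
def images_per_second(batch_size, latency_ms):
    """Estimated throughput, assuming one batch completes every `latency_ms`."""
    return batch_size * 1000.0 / latency_ms


# (device, batch size, latency in ms, reported images per second) from the table.
rows = [
    ("Jetson AGX Orin 64GB", 1, 1.59, 627),
    ("Jetson AGX Orin 64GB", 16, 12.83, 1247),
    ("Jetson AGX Orin 64GB", 32, 23.61, 1356),
    ("Tesla A10", 1, 0.98, 1018),
    ("Tesla A10", 16, 5.95, 2690),
    ("Tesla A10", 64, 20.61, 3106),
]

estimates = [images_per_second(batch, latency) for _, batch, latency, _ in rows]
for (device, batch, _, reported), estimate in zip(rows, estimates):
    print(f"{device:22s} batch={batch:<3d} estimated={estimate:8.1f} reported={reported}")
```

The estimate also makes the trade-off visible: larger batches raise throughput on both devices, at the cost of higher latency per batch.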

How to use this model

Instructions to use unpruned model with TAO

This model cannot currently be used as pretrained weights for transfer learning.

Instructions to deploy these models with DeepStream

Here we give an example of using the Retail Item Embedder together with Retail Item Detection in an end-to-end video analytics application. To do so, deploy these models with DeepStream SDK 6.2. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. It supports direct integration of these models into the DeepStream sample app.

  1. Download and install DeepStream SDK. The installation instructions for DeepStream are provided in DeepStream development guide. The config files for the purpose-built models are located in:

  2. /opt/nvidia/deepstream is the default DeepStream installation directory. This path will be different if you are installing in a different directory.

You will need config files from two folders. These files are provided in NVIDIA-AI-IOT. Assuming the repo is cloned under $DS_TAO_APPS_HOME, the files in $DS_TAO_APPS_HOME/configs/retailEmbedder_tao are:

# Main config file driven by deepstream-mdx-perception 
# Header data for the metadata sent to a message broker
# Embedder model (the secondary GIE module) inference settings
# Defines the video sources

Key Parameters in sgie_retailEmbedder_tao_config.yml

  net-scale-factor: 0.003921568627451
  offsets: 0;0;0
  model-color-format: 0
  tlt-model-key: nvidia_tlt
  tlt-encoded-model: ../../models/retailEmbedder/retailEmbedder.etlt
  model-engine-file: ../../models/retailEmbedder/retailEmbedder.etlt_b16_gpu0_fp16.engine
  infer-dims: 3;224;224
  batch-size: 16
  ## 0=FP32, 1=INT8, 2=FP16 mode
  network-mode: 2
  network-type: 100
  interval: 0
  ## Infer Processing Mode 1=Primary Mode 2=Secondary Mode
  process-mode: 2
  output-tensor-meta: 1
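
To make the first two parameters concrete: `net-scale-factor` here is approximately 1/255, and with all `offsets` set to 0 the input pixels are mapped from [0, 255] to [0, 1]. The sketch below assumes the usual Gst-nvinfer preprocessing formula, y = net-scale-factor × (x − offset), applied per channel; the helper name is hypothetical and the code only illustrates the arithmetic, not DeepStream itself.

```python
NET_SCALE_FACTOR = 0.003921568627451   # value from the config, about 1/255
OFFSETS = (0.0, 0.0, 0.0)              # per-channel offsets from the config


def normalize_pixel(rgb):
    """Apply y = net-scale-factor * (x - offset) to each channel of one pixel."""
    return tuple(NET_SCALE_FACTOR * (c - o) for c, o in zip(rgb, OFFSETS))


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print(white, black)
```

Note also that the engine file name in the config (`..._b16_gpu0_fp16.engine`) is consistent with `batch-size: 16` and `network-mode: 2` (FP16).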

And in $DS_TAO_APPS_HOME/configs/retailDetector_tao,

# 100-class detector (the primary GIE) inference setting 
# Binary-class detector (the primary GIE) inference setting 

Go to $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app and run:

cd $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app
deepstream-mdx-perception-app -c <retailEmbedder directory>/config.yml

The "Deploying to DeepStream" chapter of TAO User Guide provides detailed documentation.



Very Small Objects

The NVIDIA Retail Item Embedding model was trained to classify objects larger than 10x10 pixels. It may therefore produce poor results when classifying objects smaller than 10x10 pixels.

Occluded Objects

When objects are occluded or truncated such that less than 40% of the object is visible, they may not be correctly classified by the Retail Item Embedding model. Partial occlusion by a hand is acceptable, as the model was trained with examples having random occlusions.

Monochrome or Infrared Camera Images

The Retail Item Embedding model was trained on RGB images. Therefore, images captured by a monochrome or IR camera may not give good recognition results.

Warped and Blurry Images

The Retail Item Embedding model was not trained on images from fish-eye lens cameras or moving cameras. Therefore, the model may not perform well on warped images or on images with motion-induced or other blur.

Model versions

  • Deployable_v1.0



  • Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International workshop on similarity-based pattern recognition. Springer, Cham, 2015.

  • Na, Shi, Liu Xumin, and Guan Yong. "Research on k-means clustering algorithm: An improved k-means clustering algorithm." 2010 Third International Symposium on Intelligent Information Technology and Security Informatics. IEEE, 2010.



License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA Retail Item Embedding model classifies retail items. No additional information, such as people or other distractors in the background, is inferred. The training and evaluation datasets mostly consist of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.