
Retail Object Recognition


Description

Embedding generator model to recognize objects on a checkout counter.

Publisher

-

Latest Version

deployable_v2.0

Modified

July 26, 2023

Size

391.7 MB


Model Overview

This model encodes retail items to embedding vectors and predicts their labels based on the embedding vectors in the reference space.

In version 2.0, the Retail Item Embedding model both generates embedding vectors in a reference space and provides a classification output to categorize the objects into seen classes, which refer to the classes that have appeared in the training datasets.

Both versions 1.0 and 2.0 are adept at executing few-shot learning, having been trained through the process of metric learning. For a more comprehensive understanding of metric learning, please consult the provided references.

Model Version Summary

| Model Version | Model Architecture | Input resolution | Model Size | Training Number of Classes | Task | Training data - synthetic images | Training data - real images | Decryption Code |
|---|---|---|---|---|---|---|---|---|
| Retail Item Embedding v1.0 | ResNet101 | 224x224 | 46.7M | 100 | Recognition of retail objects based on embedding vectors in reference space | 600k, 100 retail subjects | 50k, 100 retail subjects, cropped from 6 scenes | nvidia_tlt |
| Retail Item Embedding v2.0 | FAN-Base-Hybrid | 224x224 | 205.8M | 315 | Dual-output network that generates embedding vectors in reference space and a classification output for categorizing the objects into seen classes | 80k synthetic data with improved complexity and variance compared to the training data in v1.0 | 48k, 315 retail subjects, cropped from 7 scenes (one more scene than the real training data in v1.0) | None |

Model Architecture

ResNet

Version 1.0 consists of a trunk and an embedder. The trunk is a ResNet101 classification network with its fully connected layers removed. The embedder is a one-layer perceptron with an input size of 2048 (the output dimension of the average pooling layer of the ResNet101 trunk) and an output size of 2048. Thus, the embedding dimension of the Retail Item Embedding model v1.0 is 2048.

FAN-Base-Hybrid

The Retail Item Embedding model (version 2.0) is composed of three primary components: a trunk, a classification head, and an embedding head. The trunk uses the FAN-Base-Hybrid (Fully Attentional Network) model architecture for feature extraction. The embedding head uses four adaptors with an output size of 448. Each adaptor consists of two fully connected layers, GELU activation functions, and a residual connection. The classifier is derived from the adaptors along with a one-layer perceptron whose output size corresponds to the number of classes.

Training

Version 1.0

The Retail Item Embedding model (version 1.0) was trained with the Triplet Loss network algorithm. The training algorithm optimizes the network to minimize the embedding output distances (cosine similarity) between the positive images and the anchor image while maximizing the distances between the negative images and the anchor image.
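The objective described above can be illustrated with a minimal pure-Python sketch. It is not the TAO training code: the function names, the hinge formulation, and the margin value of 0.2 are assumptions chosen for illustration only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is. The margin is a hypothetical
    value, not one taken from the model card.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings: the positive points almost the same way as the anchor.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negative = [0.0, 1.0]

print(triplet_loss(anchor, positive, negative))  # well-separated triplet
print(triplet_loss(anchor, negative, positive))  # swapped roles, large loss
```

A well-separated triplet contributes no loss, so training effort concentrates on triplets where the negative is still too close to the anchor.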

The trunk and embedder use different learning rates during training. The embedder uses a smaller learning rate than the trunk for a better fine-tuning effect.

Version 2.0

For the Retail Item Embedding model (version 2.0), the training algorithm jointly optimizes the embedding features and the class predictions for the objects. The trunk is initialized from the FAN-Base-Hybrid pretrained model and remains frozen during training. The embedder is trained with the Triplet Loss algorithm, which reduces the distances (measured by cosine similarity) between the anchor image and positive images while increasing the distances between the anchor image and negative images. The classifier is trained with cross-entropy loss, which penalizes mismatches between the predicted class probabilities and the ground-truth classes. The classifier and embedder are trained jointly, which can potentially improve precision by leveraging the features from both branches.
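As a rough sketch of joint training, the classifier loss and the embedder loss can be combined into a single scalar. The model card does not state how the two terms are weighted, so the weighted sum and the `ce_weight` and `triplet_weight` parameters below are assumptions, and the triplet term is passed in as an already-computed number.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed via log-sum-exp."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(logits, target_index, triplet_term,
               ce_weight=1.0, triplet_weight=1.0):
    """Weighted sum of the classification loss and a triplet loss value.

    Both weights are hypothetical; the source does not specify them.
    """
    ce_term = cross_entropy(logits, target_index)
    return ce_weight * ce_term + triplet_weight * triplet_term


# Two-class toy example with uninformative logits and a triplet term of 0.5.
print(joint_loss([0.0, 0.0], 0, triplet_term=0.5))
```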

Training Data

The training data of the Retail Item Embedding model was cropped from the images used for training and fine-tuning the Retail Item Detection model (see Retail Object Detection - TRAINING DATA). It is therefore made up of both synthetic data and real data. Mixing synthetic and real images in the training set helps the model bridge the gap between simulation and reality, because it learns feature representations from both image sources.

The training data encompasses multiple angles of the retail items, which equips the model to identify a retail item from any given angle.

Version 1.0

Specifically, the model was trained on a mixture of 0.6 million synthetic images and 50k real images. During the training phase, the triplet loss is optimized on this mixture. During the validation phase, the accuracy of the similarity search is calculated. The reference data for the validation set are synthetic images, while the query data are real images.

| Dataset | Total # of images | Training images | Testing images |
|---|---|---|---|
| Synthetic data | 600,000 | 600,000 | - |
| Real data | 56,898 | 53,476 | 3,422 |

Version 2.0

The Retail Item Embedding model (version 2.0) was trained with a new retail dataset that has 315 distinct categories in total.

Specifically, the model was trained on a combined dataset of more than 80,000 synthetic images and 48,000 real images. In the training phase, both the cross-entropy loss and the triplet loss are optimized on this composite dataset. In the testing phase, the probabilities for each category are computed. In parallel, a similarity search is conducted, which uses synthetic data as the reference and real data as the query.

| Dataset | Total # of images | Training images | Testing images |
|---|---|---|---|
| Synthetic Data | 80,872 | 80,872 | - |
| Real Data | 129,012 | 48,140 | 58,699 |

Inference Data Ground-truth Labeling Guidelines

The real training images were cropped from Retail Item Detection datasets with ground-truth bounding-boxes and categories by human labellers. To run inference on your own datasets, you may follow the guidelines below.

Reference Data Guidelines

Reference data is the database used for similarity search during the inference stage of the Retail Item Embedding model. The prediction for each inference image is decided by the L2 distances between the extracted features. Specifically, the algorithm uses Kmeans to select the reference object with the smallest L2 distance to the query object in the reference database, and the predicted class is the class of the selected reference object.
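The core of this lookup can be sketched in a few lines of pure Python. This is a simplified brute-force nearest-neighbor search, not the shipped implementation, and it does not reproduce the Kmeans step mentioned above. The function names, the layout of `reference_db` as a list of `(embedding, label)` pairs, and the class labels are all hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_label(query, reference_db):
    """Return the label of the reference embedding closest to `query`.

    `reference_db` is a list of (embedding, label) pairs.
    """
    _, label = min(reference_db, key=lambda item: l2_distance(query, item[0]))
    return label


# Toy 2-D reference database with made-up class names.
reference_db = [
    ([0.0, 0.0], "soda_can"),
    ([0.1, 0.0], "soda_can"),
    ([1.0, 1.0], "cereal_box"),
]

print(predict_label([0.9, 0.8], reference_db))  # prints "cereal_box"
```

Because the prediction is always the class of some reference entry, a query whose true class is missing from the database can only be matched to a wrong class.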

Therefore, to achieve the highest accuracy for retail item recognition, the reference data needs to be as close to the inference data as possible, regarding the background, occlusion, object orientations, etc.

For instance, if you only want to recognize retail items shown from the front, you can collect only front-side images of the items as reference data. On the other hand, if you want the Retail Item Embedding model to recognize items from any angle, the reference dataset needs to cover more orientations of each item.

Generally, 20-30 reference images per class is the most efficient choice. Collecting more reference examples, for example 100 images per class, can improve accuracy further.

Below are the guidelines for the specific conditions of the images:

  1. Each object should occupy at least 70% of the frame.
  2. Noisy background: images with noisy backgrounds are fine.
  3. Occlusion: objects occluded by distractions such as hands are fine, but at least 60% of the object should be visible.
  4. Truncation: objects truncated by the edge of the frame are fine if at least 60% of the object is visible.
  5. Each image should be assigned to a specific class. This model does not accommodate the case where multiple classes are grouped into an "other"/"unknown" class.

Query Data Guidelines

Same as the reference data guidelines.

Notice that the Retail Item Embedding model can never correctly classify the retail item if the class is not in the reference dataset.

To get the most accurate predictions, you should avoid challenging the Retail Item Embedding model with some bad views, such as the top of a soda can (as this view can be the same across many different retail items).

Performance

Evaluation Data

Here, we present the evaluation results of the Retail Item Embedding models, including versions 1.0, 2.0, and an unreleased version 1.1. Version 1.1 serves as a benchmark for comparing the potential of the v1.0 and v2.0 model architectures. Note that all test Key Performance Indicator (KPI) data are proprietary and derived from the test KPI data of the Retail Object Detection model. For more information on the test dataset, please check Retail Object Detection - TRAINING AND TEST DATA.

Methodology and KPI

The performance of the Retail Item Embedding models is mainly measured by accuracy, which is the proportion of correct predictions (over all classes) out of all predictions made by the model.

Accuracy of the Classification Head

| Model | Model Architecture | Training dataset description | Test dataset description | Accuracy |
|---|---|---|---|---|
| Retail Item Embedding - v1.0 | ResNet 101 | A mixture of 600k synthetic images and 53k real images, 100 retail subjects. Real images are obtained from 6 scenes. | 3,422 images, 100 retail subjects, 6 scenes in total | 0.8453 |
| Retail Item Embedding - v1.1 (not released) | ResNet 101 | A mixture of 80k synthetic images and 48k real images, 315 retail subjects. Real images are obtained from 7 scenes (one more scene than the v1.0 model training dataset). | 58,699 images, 315 retail subjects, 7 scenes in total (one more scene than the v1.0 model test dataset) | 0.7038 |
| Retail Item Embedding - v2.0 | FAN-Base-Hybrid | Same as v1.1 | Same as v1.1 | 0.8797 |

Accuracy of the Embedding Head

| # of test images/class | # of images/class in reference database | FAN-Base-Hybrid Accuracy |
|---|---|---|
| 180 | 1 | 0.7718 |
| 180 | 5 | 0.8432 |
| 180 | 10 | 0.8485 |
| 180 | 20 | 0.8541 |
| 180 | 30 | 0.8561 |
| 180 | 40 | 0.8557 |
| 180 | 50 | 0.8577 |
| 180 | 60 | 0.8567 |
| 180 | 70 | 0.8581 |
| 180 | 80 | 0.8591 |
| 180 | 90 | 0.8605 |
| 180 | 100 | 0.8587 |

Real-time Inference Performance

Inference is run on the provided unpruned model at FP16 precision with an input resolution of 224x224. Inference performance is measured using trtexec on Jetson AGX Orin 64GB and A10. The numbers shown here are inference-only performance. End-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.

Version 1.0

| Model | Device | Batch size | Latency (ms) | Images per second |
|---|---|---|---|---|
| Retail Item Embedding - v1.0 | Jetson AGX Orin 64GB | 1 | 1.59 | 627 |
| Retail Item Embedding - v1.0 | Jetson AGX Orin 64GB | 16 | 12.83 | 1247 |
| Retail Item Embedding - v1.0 | Jetson AGX Orin 64GB | 32 | 23.61 | 1356 |
| Retail Item Embedding - v1.0 | Tesla A10 | 1 | 0.98 | 1018 |
| Retail Item Embedding - v1.0 | Tesla A10 | 16 | 5.95 | 2690 |
| Retail Item Embedding - v1.0 | Tesla A10 | 64 | 20.61 | 3106 |

Version 2.0

| Model | Device | Batch size | Latency (ms) | Images per second |
|---|---|---|---|---|
| Retail Item Embedding - v2.0 | Jetson Orin Nano | 4 | 99.50 | 40.2 |
| Retail Item Embedding - v2.0 | Orin NX 16GB | 4 | 66.89 | 59.8 |
| Retail Item Embedding - v2.0 | AGX Orin 64GB | 8 | 49.69 | 161 |
| Retail Item Embedding - v2.0 | A2 | 16 | 99.38 | 161 |
| Retail Item Embedding - v2.0 | T4 | 8 | 31.62 | 253 |
| Retail Item Embedding - v2.0 | A30 | 16 | 21.80 | 734 |
| Retail Item Embedding - v2.0 | L4 | 4 | 6.63 | 603 |
| Retail Item Embedding - v2.0 | L40 | 8 | 4.85 | 1648 |
| Retail Item Embedding - v2.0 | A100 | 64 | 39.48 | 1621 |
| Retail Item Embedding - v2.0 | H100 | 64 | 23.83 | 2686 |
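
The throughput figures above appear to follow from the batch size and the per-batch latency. The short sketch below checks two rows under that assumption. The helper name is hypothetical, and most other rows agree with the formula only approximately, presumably because of rounding in the reported latencies.

```python
def images_per_second(batch_size, latency_ms):
    """Throughput implied by processing one batch every `latency_ms` milliseconds."""
    return batch_size / latency_ms * 1000.0


# v1.0 on Jetson AGX Orin 64GB, batch size 16, latency 12.83 ms
print(round(images_per_second(16, 12.83)))    # 1247

# v2.0 on Jetson Orin Nano, batch size 4, latency 99.50 ms
print(round(images_per_second(4, 99.50), 1))  # 40.2
```

Larger batches raise throughput at the cost of higher per-batch latency, which matters when choosing a batch size for a real-time pipeline.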

How to use this model

Instructions to use unpruned model with TAO

The trainable version of this model can be used for transfer learning via TAO Toolkit, while the deployable versions can be used for inference deployment via DeepStream/TensorRT.

Instructions for transfer learning with TAO Toolkit

Currently, transfer learning is supported only for Retail Item Embedding version 1.0. To proceed, follow the steps below.

  1. Download and install TAO Toolkit.

  2. To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file when training a Metric Learning Recognition model. The snippet configures the pretrained weights of Retail Item Embedding model v1.0 for Metric Learning Recognition transfer learning in TAO.

model:
  backbone: resnet_101
  pretrain_choice: ""
  pretrained_model_path: /path/to/retail_object_recognition.pth
  input_width: 224
  input_height: 224
  feat_dim: 2048

train:
  optim:
    ...
  smooth_loss: True
  num_epochs: 100
  checkpoint_interval: 5
  resume_training_checkpoint_path: null
  batch_size: 16
  val_batch_size: 16
  report_accuracy_per_class: True

dataset:
  train_dataset: /path/to/training/dataset
  val_dataset: 
    reference: /path/to/reference/dataset
    query: /path/to/query/dataset

  3. To start training the Metric Learning Recognition model, use the following command:

tao model ml_recog train -e=<train config file>

  4. For more information on the experiment spec file and training with TAO Toolkit, please refer to the notebook example at TAO Toolkit - Jupyter notebooks - Retail Object Recognition.

Instructions to deploy Retail Item Embedding Model (version 1.0) with TAO Toolkit

The TAO Toolkit also supports deploying Retail Item Embedding model v1.0 through TAO-Deploy. The ONNX model is first converted into a TensorRT engine and then used for inference with a reference database. To proceed, follow the steps below after downloading and installing the TAO Toolkit as instructed in the previous section.

  1. Convert the deployable (onnx file) Retail Item Embedding model v1.0 to a TensorRT engine. Below is the spec file snippet for this conversion.
gen_trt_engine:
  gpu_id: 0
  onnx_file: /path/to/deployable/retail/object/recognition/v1.0
  trt_engine: /path/to/converted/trt/engine/file
  tensorrt:
    data_type: fp32
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 10
    max_batch_size: 10

To launch the TensorRT conversion, use

tao deploy ml_recog gen_trt_engine -e=<trt conversion config file>

  2. Run inference with the converted TensorRT engine file. Below is a snippet of the experiment spec file that configures inference with the engine and a reference database.
inference:
  trt_engine: /path/to/converted/trt/engine/file
  input_path: /path/to/test/images
  inference_input_type: image_folder
  topk: 5
  ...

dataset:
  train_dataset: ""
  val_dataset: 
    reference: /path/to/reference/dataset
    query: ""
  ...

To run inference, use the following command:

tao deploy ml_recog inference -e=<inference config file>

  3. For more information on experiment spec files and running inference with TAO Toolkit, please refer to the notebook example at TAO Toolkit - Jupyter notebooks - Retail Object Recognition.

Instructions to deploy Retail Item Embedding Models with DeepStream

We present examples of utilizing the Retail Item Embedding in conjunction with the Retail Item Detection for an end-to-end video analytic application. To implement this, deploy the models using the DeepStream SDK 6.2, a streaming analytic toolkit to accelerate building AI-based video analytic applications. It supports direct integration of these models into the deepstream sample app.

Note that, due to the DeepStream SDK 6.2 update, the instructions below can only produce PGIE output, so you cannot get Retail Item Embedding outputs from DeepStream 6.2 at this point. Complete instructions will be announced once the DeepStream 6.2 patch is added.

  1. Download and install DeepStream SDK. The installation instructions for DeepStream are provided in DeepStream development guide. The config files for the purpose-built models are located in:

  2. /opt/nvidia/deepstream is the default DeepStream installation directory. This path will be different if you are installing in a different directory.

  3. The primary GIE config files are in $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_detection_tao,

# meta-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_binary_dino_tao_config.yaml
pgie_retail_object_detection_binary_dino_tao_config.txt
# Binary-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_binary_dino_tao_config.yaml
pgie_retail_object_detection_binary_dino_tao_config.txt
pgie_retail_object_detection_binary_effdet_tao_config.yaml
pgie_retail_object_detection_binary_effdet_tao_config.txt

For more information, please refer to Retail Item Detection - INSTRUCTIONS TO DEPLOY THESE MODELS WITH DEEPSTREAM.

  1. For the secondary GIE part of both version 1.0 and 2.0 models, please refer to the subsequent subsections.

  2. Go to $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app and run:

cd $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app
deepstream-mdx-perception-app -m 3 -c ../../../configs/app/retail_object_detection_recognition.yml

Version 1.0 Secondary GIE

For version 1.0 model secondary GIE, you will need config files from two folders. These files are provided in NVIDIA-AI-IOT. Assume the repo is cloned under $DS_TAO_APPS_HOME, in $DS_TAO_APPS_HOME/configs/retail_object_recognition_tao,

# Embedder model (the secondary GIE module) inference settings
sgie_retail_object_recognition_tao_config.yml

Key Parameters in sgie_retail_object_recognition_tao_config.yml

property:
  net-scale-factor: 0.003921568627451
  offsets: 0;0;0
  model-color-format: 0
  tlt-model-key: nvidia_tlt
  tlt-encoded-model: ../../models/retailEmbedder/retailEmbedder.etlt # switch to onnx-file if the input file is an onnx model; the tlt-model-key field is then no longer needed
  model-engine-file: ../../models/retailEmbedder/retailEmbedder.etlt_b16_gpu0_fp16.engine
  infer-dims: 3;224;224
  batch-size: 16
  ## 0=FP32, 1=INT8, 2=FP16 mode
  network-mode: 2
  network-type: 100
  interval: 0
  ## Infer Processing Mode 1=Primary Mode 2=Secondary Mode
  process-mode: 2
  output-tensor-meta: 1

Version 2.0 Secondary GIE

You will need config files from these folders. These files are provided in NVIDIA-AI-IOT. Assume the repo is cloned under $DS_TAO_APPS_HOME, in $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_recognition_tao

# Embedder model (the secondary GIE module) inference settings
sgie_retail_object_recognition_tao_config.yml

Key Parameters in sgie_retail_object_recognition_tao_config.yml

property:
  gpu-id: 0
  net-scale-factor: 0.01735207357
  offsets: 123.657;116.28;103.53
  onnx-file: onnx_model.onnx
  model-engine-file: trt_model.engine
  tlt-model-key: nvidia_tlt
  infer-dims: 3;224;224
  batch-size: 16
  # 0=FP32 and 1=INT8 mode
  network-mode: 0
  network-type: 100
  interval: 0
  process-mode: 2
  gie-unique-id: 3
  classifier-threshold: 0.0
  operate-on-gie-id: 1
  output-tensor-meta: 1
  model-color-format: 0
  maintain-aspect-ratio: 0
  output-blob-names: probs;embeddings
  operate-on-class-ids: 0;1;2;3
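
The `net-scale-factor` and `offsets` values in the two secondary GIE configs define the pixel normalization applied before inference. The Gst-nvinfer documentation describes this as y = net-scale-factor * (x - mean), and the sketch below applies that formula to a single RGB pixel using the values from both configs. The function and variable names are hypothetical, and the code only illustrates the arithmetic, not the DeepStream implementation.

```python
def normalize_pixel(rgb, net_scale_factor, offsets):
    """Apply y = net_scale_factor * (x - offset) to each color channel."""
    return [net_scale_factor * (value - offset)
            for value, offset in zip(rgb, offsets)]


# Values copied from the version 1.0 and version 2.0 configs above.
V1 = dict(net_scale_factor=0.003921568627451, offsets=(0.0, 0.0, 0.0))
V2 = dict(net_scale_factor=0.01735207357, offsets=(123.657, 116.28, 103.53))

white = (255, 255, 255)
print(normalize_pixel(white, **V1))  # v1.0 scales 0..255 to roughly 0..1
print(normalize_pixel(white, **V2))  # v2.0 subtracts per-channel means first
```

The version 1.0 factor is approximately 1/255, so pixels are only rescaled, while version 2.0 also centers each channel around its offset. Using the wrong pair of values for a model changes the input distribution it sees.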

Limitations

Very Small Objects

NVIDIA Retail Item Embedding models are trained to classify objects larger than 10x10 pixels. Therefore, they may produce poor results when classifying objects smaller than 10x10 pixels.

Occluded Objects

When objects are occluded or truncated such that less than 40% of the object is visible, they may not be correctly classified by the Retail Item Detection model. Partial occlusion by hand is acceptable as the model was trained with examples having random occlusions.

Monochrome or Infrared Camera Images

The Retail Item Embedding models are trained on RGB images. Therefore, images captured by monochrome or IR cameras may not yield good recognition results.

Warped and Blurry Images

The Retail Item Embedding models were not trained on images from fish-eye lens cameras or moving cameras. Therefore, the models may not perform well on warped images or on images with motion-induced or other blur.

Model versions

  • Deployable_v1.0: The encrypted onnx file and encrypted etlt file for the Retail Item Embedding model, usable for inference with DeepStream and TAO Toolkit.
  • Trainable_v1.0: The checkpoint for the Retail Item Embedding model, trainable with TAO Toolkit.
  • Deployable_v2.0: The decrypted onnx file for the Retail Recognition model, usable for inference with DeepStream.

References

Citations

  • Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International workshop on similarity-based pattern recognition. Springer, Cham, 2015.

  • Na, Shi, Liu Xumin, and Guan Yong. "Research on k-means clustering algorithm: An improved k-means clustering algorithm." 2010 Third International Symposium on Intelligent Information Technology and Security Informatics. IEEE, 2010.

  • Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng and Jose M. Alvarez. "Understanding The Robustness in Vision Transformers". International Conference on Machine Learning (ICML). 2022

License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA Retail Item Embedding model classifies retail items. No additional information, such as people and other distractors in the background, is inferred. The training and evaluation datasets consist mostly of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA's platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model's developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.