
Retail Object Detection

Description
Object detection network to detect retail objects on a checkout counter.
Publisher
-
Latest Version
deployable_100_unencrypted_v1.0
Modified
January 25, 2024
Size
129.59 MB

Retail Object Detection

Model Overview

The models described in this card detect retail items within an image and return a bounding box around each detected item. The retail items are typically packaged commercial goods with barcodes and ingredient labels on them.

Three types of retail object detection models are provided here:

  • Binary-class detection model: detects general retail items and returns a single category.

  • 100-class detection model: detects 100 specific retail subjects and returns their subject names.

    The list of 100 retail subjects is in the 100 retail subjects table.

  • Meta-class detection model: detects retail items and returns their geometry shapes. There are 10 geometry shape classes:

    • oval container
    • cylindrical container
    • bottle container
    • round container (*)
    • box container
    • rectangular prism with protrusion
    • shallow rectangular prism
    • modified cylindrical container with short neck
    • bag container
    • miscellaneous container

    The descriptions of each of the 10 meta classes can be found in this Google sheet.

    * The round container class functions as a placeholder, because neither the training set nor the fine-tune set covers round container retail subjects.

Model Version Summary

The following table chronicles the progression of Retail Object Detection models across various versions:

Model Version | Model Architecture | Input Resolution | Model Size | Number of Classes | Task | Training Data | Fine-Tune Data | Decryption Code
Retail Object Detection - binary v2.1.1 | DINO-FAN_base | 960x544 | 73.0 M | 1 | Detects retail items and returns a uniform category. | 226k synthetic images with quality improved over the v2.0 model training datasets | 642 real images, 627 retail subjects, 40 scenes. Compared to the v2.0 model fine-tune dataset, 33 scenes are added. | None
Retail Object Detection - binary v2.1 | DINO-FAN_small | 960x544 | 48.3 M | 1 | Detects retail items and returns a uniform category. | 226k synthetic images with quality improved over the v2.0 model training datasets | 642 real images, 627 retail subjects, 40 scenes. Compared to the v2.0 model fine-tune dataset, 33 scenes are added. | None
Retail Object Detection - meta v2.0 | DINO-FAN_base | 960x544 | 73.0 M | 10 | Detects retail items and returns their geometry shapes. The geometry meta-class list can be found in `class_map.txt` in the model files. | 320k synthetic images with texture and complexity improved over the v1.0 model training datasets | 1123 real images, 315 classes, 7 scenes. Compared to the v1.0 model fine-tune dataset, one more scene is added. | None
Retail Object Detection - binary v2.0 | DINO-FAN_base | 960x544 | 73.0 M | 1 | Detects retail items and returns a uniform category. | 320k synthetic images with texture and complexity improved over the v1.0 model training datasets | 1123 real images, 315 classes, 7 scenes. Compared to the v1.0 model fine-tune dataset, one more scene is added. | None
Retail Object Detection - binary v1.1 | Efficientdet-D5 | 960x544 | 33.7M | 1 | Detects retail items and returns a uniform category. | 320k synthetic images with texture and complexity improved over the v1.0 model training datasets | 1123 real images, 315 classes, 7 scenes. Compared to the v1.0 model fine-tune dataset, one more scene is added. | nvidia_tao
Retail Object Detection - 100-class v1.0 | Efficientdet-D5 | 416x416 | 33.9M | 100 | Detects 100 specific retail subjects. The subject list can be found in `class_map.txt` in the model files. | 1.5M synthetic images | 518 real images, 100 classes, 6 scenes. | nvidia_tlt
Retail Object Detection - binary v1.0 | Efficientdet-D5 | 416x416 | 33.7M | 1 | Detects retail items and returns a uniform category. | 1.5M synthetic images | 518 real images, 100 classes, 6 scenes. | nvidia_tlt

Model Architecture

Two distinct model architectures are used across the different versions: EfficientDet-D5 and DINO-FAN (with a FAN_base or FAN_small backbone).

EfficientDet-D5

The following applies to models based on the EfficientDet-D5 architecture.

EfficientDet is a one-stage detector with the following architecture components:

  • NvImageNetV2 pretrained EfficientNet-B5 backbone
  • Weighted bi-directional feature pyramid network (BiFPN)
  • Bounding-box and classification head
  • A compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time
  • For more information on the model architecture and features, see the papers listed in the Citations section

DINO-FAN

The following applies to models based on the DINO-FAN architecture:

  • The DINO (DETR with Improved deNoising anchOr boxes) model is a state-of-the-art end-to-end object detector that improves on previous DETR-like models in performance and efficiency. It introduces several novel techniques, including contrastive DN training, mixed query selection, and looking forward twice. DINO scales well in both model size and data size.

  • Either FAN_base or FAN_small is used as the backbone. FAN (Fully Attentional Network) enhances the traditional transformer: each FAN block applies both token self-attention and channel attention, making the entire network fully attentional, and the linear projection layer after the channel attention is removed. The FAN-base model has 18 FAN blocks with a channel dimension of 448 and 8 heads; the FAN-small model has 12 FAN blocks with a channel dimension of 384 and 8 heads.

  • For more information on the model architecture and features, see the papers listed in the Citations section.

Training

The models are trained using the efficientdet_tf2 or dino entrypoints in TAO. Training is carried out in two phases. In the first phase, the networks, initialized with pretrained weights, are trained on a large amount of synthetic data. In the second phase, the networks are fine-tuned on a small number of real samples.

Freezing some modules during training and fine-tuning can improve model adaptation to new datasets as well as training efficiency. The following table lists the suggested modules to freeze during training and fine-tuning (see the configuration sketch after the table):

Model Version | Model Architecture | Pretrained Weights | Training Frozen Modules | Fine-Tune Frozen Modules
Retail Object Detection - binary v2.1.1 | DINO-FAN_base | OpenImage pretrained DINO-FAN_base | backbone, transformer encoder | -
Retail Object Detection - binary v2.1 | DINO-FAN_small | OpenImage pretrained DINO-FAN_small | backbone, transformer encoder | -
Retail Object Detection - meta v2.0 | DINO-FAN_base | NVImageNet pretrained FAN_base | - | -
Retail Object Detection - binary v2.0 | DINO-FAN_base | NVImageNet pretrained FAN_base | - | -
Retail Object Detection - binary v1.1 | Efficientdet-D5 | NVImageNet pretrained EfficientNet-B5 | - | backbone
Retail Object Detection - 100-class v1.0 | Efficientdet-D5 | NVImageNet pretrained EfficientNet-B5 | - | backbone
Retail Object Detection - binary v1.0 | Efficientdet-D5 | NVImageNet pretrained EfficientNet-B5 | - | backbone
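
As a minimal sketch of how this freezing might be expressed for the DINO-based models, assuming the train.freeze list supported by recent TAO Toolkit releases (verify the exact field and module names against the TAO Toolkit User Manual for your version):

% spec file (sketch, not a complete configuration)
train:
  ...
  freeze: ["backbone", "transformer.encoder"]  # assumed module names; modules listed here stay frozen in this phase
  ...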

Overview

This training dataset has several advantages over publicly available retail datasets, including:

  • The training data encompasses an extensive array of retail subjects, featuring 315 distinct subjects in the training set and an additional 52 subjects in the test set.
  • It contains 3D scanned models collected for all the retail items present in the training dataset, which facilitates the generation of synthetic images. With appropriate simulation, you can augment the training dataset size considerably while maintaining high-quality labeling and comprehensive control over data distribution.
  • It contains real images that feature the same set of retail subjects. These images, captured in seven different scenes with people strolling around or interacting with the target objects, serve as valuable resources for fine-tuning and testing.
  • For both synthetic and real data, the frequency of each object's appearance in the dataset is approximately equal.

Dataset | # Training Images | # Retail Subjects | # Scenes | Average Object Density / Frame | # Classes
synthetic data - v1.0 train | 1,425,000 | 100 | 1,425,000 | 1 | 1 (binary), 100
synthetic data - v2.0 train | 265,806 | 100+215 | 265,806 | 7.18 | 1 (binary), 10
synthetic data - v2.1 and v2.1.1 train | 226,730 | 100+215 | 226,730 | 5.26 | 1 (binary)
real data - v1.0 finetune | 518 | 100 | 6 | 1 | 1 (binary), 100
real data - v2.0 finetune | 952 | 100+215 | 6+1 | 1 | 1 (binary), 10
real data - v2.1 and v2.1.1 finetune | 642 | 100+215+98(*) | 6+1+33 | 1.35 | 1 (binary)

(*) The 98 retail objects added in real data - v2.1 and v2.1.1 fine-tune do not have 3D scanned files collected.

The versions mentioned in the table correspond to the versions of the models.

Synthetic Data - v1.0 Train

Synthetic data - v1.0 train contains images with the following features:

  • Each frame is composed of a 2D image background, with a 3D retail object inserted. The background textures are real images sampled from proprietary real images.

  • The synthetic data randomizes several simulation domains, including:

    • light types, light intensities
    • object sizes, orientations, and locations
    • camera locations
    • background textures

Synthetic Data - v2.0 Train

Synthetic data - v2.0 train was improved over synthetic data - v1.0 in the following ways:

  • Each synthetic image contains 1-20 target retail items.
  • Flying distractors are added to each image.
  • Added 215 retail subjects to the training data.

Synthetic Data - v2.1 and v2.1.1 Train

Synthetic data - v2.1 and v2.1.1 train was improved over synthetic data - v2.0 in the following ways:

  • Added gravity simulation so that retail subjects lie on the table. This better simulates the real world, where retail objects lie on shelves, tables, or conveyor belts.
  • Fixed some bugs in v2.0 train data:
    • Removed annotations of highly occluded objects.
    • Removed the flying distractors. This prevents ambiguity in the definition of retail objects from affecting model performance.

Real Data - v1.0 Fine-Tune

The v1.0 models are fine-tuned on real proprietary images from six different real environments in Voyager Cafe. In each environment, only a few images per item are collected.

The fine-tuning data is captured at random camera heights and fields of view. All fine-tuning data was collected indoors, with retail items placed on checkout counters, shelves, baskets, conveyor belts, or at home. The camera is typically set up at a height of approximately 10 feet, at 45-degree, 90-degree, and 180-degree angles off the vertical axis, with a close field of view. This content was chosen to decrease the simulation-to-reality gap of the model trained on synthetic data and to improve the accuracy and robustness of the model. The logos on retail items were smudged.

Real Data - v2.0 Fine-Tune

Real data - v2.0 fine-tune added a new scene with additional 215 retail subjects based on real data - v1.0 fine-tune.

In the added scene, items are placed on a conveyor belt and the camera faces the top or the side of the belt.

Fine-Tune Dataset | # Train Images
Voyager Cafe - checkout counter 45 overhead | 85
Voyager Cafe - shelf | 85
Voyager Cafe - conveyor belt 1 | 84
Voyager Cafe - basket | 84
Voyager Cafe - checkout counter barcode scanner view | 100
Voyager Cafe - checkout counter overhead | 80
Voyager Cafe - conveyor belt 2 (new scene added in v2.0 finetune) | 434
Total | 952

Because the Retail Object Detection - meta v2.0 model is offered, the following table lists the meta-class distributions of the train and fine-tune data:

Meta Class | Train # Instances | Finetune # Instances | Finetune Percentage
oval container | 36549 | 14 | 1%
cylindrical container | 122583 | 73 | 8%
bottle container | 92514 | 44 | 5%
round container | 0 | 0 | 0%
box container | 264973 | 627 | 66%
rectangular prism with protrusion | 64063 | 26 | 3%
shallow rectangular prism | 75571 | 32 | 3%
modified cylindrical container with short neck | 11335 | 4 | 0%
bag container | 99008 | 60 | 6%
miscellaneous container | 113803 | 72 | 8%
Total | 880399 | 952 | 100%

Real Data - v2.1 and v2.1.1 Fine-Tune

Real data - v2.1 and v2.1.1 fine-tune is improved from real data - v2.0 fine-tune in the following ways:

  • Added 33 new scenes. The new scenes come from various environments, mainly homes and grocery stores. The images were collected by many people using phones; because of this, the raw images have different resolutions and views. If you keep the original image sizes as the input, the model resizes the inputs to 960x544 uniformly.

  • 98 new retail subjects are included in the newly added 33 scenes.

  • To improve the dataset, which had a skewed distribution, the distribution of meta-classes is rebalanced, ensuring a more equitable representation across retail subjects and reducing the model's bias towards predominantly detecting box containers. However, for some classes that have very few examples in the database, the proportion is only increased slightly.

    This modification allows for fewer fine-tune images to achieve a better performance when compared to the Retail Object Detection - binary v2.0 model.

    Training of the meta-class detection model is no longer done, to avoid the following issues:

    • The definition of retail subjects is subjective.
    • The covered meta-classes are derived from only 315 retail subjects. It is hard to categorize other retail subjects, such as shampoo containers and edged cans, using this framework.
    • The adjusted fine-tune set cannot fully balance the meta classes.

Meta-Class | Finetune # Instances | Percentage
oval container | 35 | 4%
cylindrical container | 168 | 20%
bottle container | 165 | 20%
round container | 0 | 0%
box container | 108 | 13%
rectangular prism with protrusion | 147 | 17%
shallow rectangular prism | 40 | 5%
modified cylindrical container with short neck | 20 | 2%
bag container | 88 | 10%
miscellaneous container | 71 | 8%
Total | 842(*) | 100%

* The total #instances is >= total #images because each frame is likely to contain more than one retail instance.

The following is the scene distribution of the v2.1 and v2.1.1 fine-tune data:

Data | # Images
Voyager Cafe - 6 scenes from v1.0 finetune | 103
Voyager Cafe - conveyor belt 2 (new scene added in v2.0 finetune) | 206
Crowdsourcing data (new scenes added in v2.1 & v2.1.1 finetune) | 333
Total | 642

The following are some examples of the added images; the trademarks on retail subjects are smudged.

Fine-Tuning Data Ground-Truth Labeling Guidelines

The fine-tuning data is created by human labelers who annotate ground-truth bounding boxes and categories. The following guidelines were used while labeling the training data for the NVIDIA Retail Object Detection models. If you are looking to transfer-learn or fine-tune the models to adapt them to your target environment and classes, use the following guidelines for better model accuracy (a minimal example annotation follows this list):

  • All objects that fall under the definition of retail items and are larger than the smallest bounding-box limit for the corresponding class (height >= 10px OR width >= 10px) are labeled with the appropriate class label.
  • Occlusion: Partially occluded objects that are approximately 60% or more visible are labeled with a bounding box around the visible part of the object and are marked as partially occluded. Objects with less than 60% visibility are not annotated.
  • Truncation: An object, at the edge of the frame, which is 60% or more visible, is marked with the truncation flag.
  • Categories: The target objects are primarily retail items that are distinguishable from distractors in the images and are typically commercially packaged with barcode labels attached. In this context, avoid labeling appliances or produce as target objects.
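
For reference, the DINO entrypoint consumes labels in COCO-format JSON (see the train_data_sources settings later in this card). The following is a minimal, hypothetical annotation for a single labeled image; the file name, IDs, and coordinates are illustrative, and any occlusion or truncation flags would have to be stored in custom fields outside the standard COCO schema:

{
  "images": [
    {"id": 1, "file_name": "checkout_0001.jpg", "width": 960, "height": 544}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1,
     "bbox": [412.0, 187.0, 96.0, 140.0],
     "area": 13440.0, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "retail item"}
  ]
}

In COCO format, bbox is [x, y, width, height] in pixels, measured from the top-left corner of the image.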

Performance

Evaluation Data

The performance of the Retail Object Detection models was gauged using the following sets of KPI data.

The details of each set of KPI data are listed in the following table:

Evaluation Dataset | Scenes | Retail Objects | # Test Images
Voyager Cafe | same scenes as in v2.0 fine-tune data | same 315 retail subjects as in v2.0 fine-tune data | 6,042
Retail Product Checkout dataset (*) | overhead views of retail subjects on table | new 200 retail subjects | 24,000

* The Retail Product Checkout dataset (RPC dataset) is a public dataset. It contains 200 retail subjects in its test data. TAO tests the binary Retail Object Detection models on it by updating its test data annotation file to binary class.

There are a few retail subjects in the RPC dataset that do not match the TAO definition of retail items, such as pencil cases (which do not have commercial packaging or barcodes attached). The zero-shot test of the binary Retail Object Detection models is expected to miss such items.

The Voyager Cafe evaluation can be regarded as a few-shot learning evaluation, while the evaluation on the Retail Product Checkout dataset can be regarded as zero-shot.

Methodology and KPI

AP50 is calculated using an intersection-over-union (IoU) threshold of 0.5. The KPIs for the evaluation data are reported in the following tables. The models are evaluated based on AP50 and AR0.5:0.95; both AP and AR numbers are based on a maximum of 100 detections per image.
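
For reference, the IoU between a predicted box B_p and a ground-truth box B_gt is defined as follows; a prediction counts as a true positive for AP50 when this value exceeds 0.5, and AR0.5:0.95 averages recall over IoU thresholds from 0.5 to 0.95 in steps of 0.05:

\mathrm{IoU}(B_p, B_{gt}) = \frac{\operatorname{area}(B_p \cap B_{gt})}{\operatorname{area}(B_p \cup B_{gt})}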

Binary-Class Retail Object Detection Models

Model | Voyager Cafe (Few Shot) AP0.5 | Voyager Cafe (Few Shot) AR0.5:0.95 | RPC Dataset (Zero Shot) AP0.5 | RPC Dataset (Zero Shot) AR0.5:0.95
Retail Object Detection - binary v2.1.1 | 0.969 | 0.955 | 0.972 | 0.845
Retail Object Detection - binary v2.1 | 0.967 | 0.941 | 0.971 | 0.847
Retail Object Detection - binary v2.0 | 0.967 | 0.898 | 0.94 | 0.719
Retail Object Detection - binary v1.1 | 0.956 | 0.889 | 0.733 | 0.631
Retail Object Detection - binary v1.0 | 0.94 | 0.86 | 0.168 | 0.266
Retail Object Detection - Meta v2.0

The meta v2.0 model is evaluated only on the internal Voyager Cafe data, as a few-shot evaluation.

The following is the classwise accuracy of each meta class:

Meta Class | AP0.5 | AR0.5:0.95
oval container | 0.949 | 0.966
cylindrical container | 0.913 | 0.902
bottle container | 0.857 | 0.937
box container | 0.972 | 0.967
rectangular prism with protrusion | 0.98 | 0.994
shallow rectangular prism | 1.00 | 1.00
modified cylindrical container with short neck | 0.604 | 0.56
bag container | 0.927 | 0.916
miscellaneous container | 0.922 | 0.945
Overall | 0.903 | 0.910

* Because the round container is not included in our KPI data, the accuracy table does not include it.

Retail Object Detection - 100-Class v1.0

For the v1.0 models, there are only six scenes in the Voyager Cafe data. The following is the breakdown of the model performance in each scene for the 100-class v1.0 model:

Scene | Seen Items AP50 | Seen Items AR (MaxDets=100)
Voyager Cafe - checkout counter 45 overhead | 0.564 | 0.741
Voyager Cafe - shelf | 0.933 | 0.860
Voyager Cafe - conveyor belt | 0.872 | 0.888
Voyager Cafe - basket | 0.536 | 0.722
Voyager Cafe - checkout counter barcode scanner view | 0.845 | 0.758
Voyager Cafe - checkout counter overhead | 0.926 | 0.859
Average | 0.779 | 0.805

Real-Time Inference Performance

The inference is run on the provided unpruned models at FP16 precision. The model input resolution is 960x544 (416x416 for the v1.0 models). The inference performance is measured using trtexec on the hardware listed in the following table. The table only captures the latency of the model's forward pass; the end-to-end performance with streaming video data might vary depending on other bottlenecks in the hardware and software.
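
As a minimal sketch of how such a measurement can be reproduced, assuming a TensorRT engine has already been generated with the tao deploy gen_trt_engine commands shown later in this card (the engine path is a placeholder, and dynamic-shape engines may additionally require a --shapes argument whose input tensor name depends on the exported model):

trtexec --loadEngine=/path/to/converted/trt/engine --iterations=100 --avgRuns=100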

Model Performance Results

Model | Device | TensorRT Version | Batch Size | Latency (ms) | Images per Second
Retail Object Detection - binary v2.1.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 1 | 85.48 | 11.7
Retail Object Detection - binary v2.1.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 4 | 334.23 | 11.97
Retail Object Detection - binary v2.1.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 8 | 673.31 | 11.88
Retail Object Detection - binary v2.1.1 | DGX H100 80GB | 8.6.2.3 | 1 | 9.08 | 110.14
Retail Object Detection - binary v2.1.1 | DGX H100 80GB | 8.6.2.3 | 16 | 99.54 | 160.74
Retail Object Detection - binary v2.1.1 | DGX H100 80GB | 8.6.2.3 | 32 | 193.61 | 165.28
Retail Object Detection - binary v2.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 1 | 64.28 | 15.56
Retail Object Detection - binary v2.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 4 | 130.49 | 15.44
Retail Object Detection - binary v2.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 8 | 517.43 | 15.46
Retail Object Detection - binary v2.1 | DGX H100 80GB | 8.6.2.3 | 1 | 6.96 | 143.61
Retail Object Detection - binary v2.1 | DGX H100 80GB | 8.6.2.3 | 16 | 76.03 | 210.45
Retail Object Detection - binary v2.1 | DGX H100 80GB | 8.6.2.3 | 32 | 148.13 | 216.03
Retail Object Detection - binary v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 1 | 123.63 | 8.09
Retail Object Detection - binary v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 4 | 468.38 | 8.54
Retail Object Detection - binary v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 8 | 981.46 | 8.15
Retail Object Detection - binary v2.0 | Tesla A30 | 8.5.2.2 | 1 | 25.25 | 39.61
Retail Object Detection - binary v2.0 | Tesla A30 | 8.5.2.2 | 4 | 91.94 | 43.51
Retail Object Detection - binary v2.0 | Tesla A30 | 8.5.2.2 | 8 | 180.81 | 44.25
Retail Object Detection - meta v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 1 | 123.52 | 8.1
Retail Object Detection - meta v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 4 | 468.19 | 8.54
Retail Object Detection - meta v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 8 | 930.6 | 8.6
Retail Object Detection - meta v2.0 | Tesla A30 | 8.5.2.2 | 1 | 25.34 | 39.47
Retail Object Detection - meta v2.0 | Tesla A30 | 8.5.2.2 | 4 | 48.59 | 43.78
Retail Object Detection - meta v2.0 | Tesla A30 | 8.5.2.2 | 8 | 181.91 | 43.98
Retail Object Detection - binary v1.1 | Jetson AGX Orin 64GB | 8.5.2.2 | 1 | 27.12 | 36.88
Retail Object Detection - binary v1.1 | Jetson AGX Orin 64GB | 8.5.2.2 | 4 | 93.59 | 42.74
Retail Object Detection - binary v1.1 | Jetson AGX Orin 64GB | 8.5.2.2 | 8 | 184.68 | 43.32
Retail Object Detection - binary v1.1 | Tesla A30 | 8.5.2.2 | 1 | 8.29 | 120.6
Retail Object Detection - binary v1.1 | Tesla A30 | 8.5.2.2 | 4 | 26.4 | 151.54
Retail Object Detection - binary v1.1 | Tesla A30 | 8.5.2.2 | 8 | 50.01 | 159.98
Retail Object Detection - binary v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 1 | 10.43 | 96
Retail Object Detection - binary v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 16 | 131.79 | 121
Retail Object Detection - binary v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 32 | 258.44 | 124
Retail Object Detection - binary v1.0 | Tesla A10 | 8.4.0.1 | 1 | 4.27 | 234
Retail Object Detection - binary v1.0 | Tesla A10 | 8.4.0.1 | 16 | 44.94 | 356
Retail Object Detection - binary v1.0 | Tesla A10 | 8.4.0.1 | 64 | 174.46 | 367
Retail Object Detection - 100-class v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 1 | 10.94 | 91
Retail Object Detection - 100-class v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 16 | 140.94 | 114
Retail Object Detection - 100-class v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 32 | 279.59 | 114
Retail Object Detection - 100-class v1.0 | Tesla A10 | 8.4.0.1 | 1 | 4.46 | 224
Retail Object Detection - 100-class v1.0 | Tesla A10 | 8.4.0.1 | 16 | 47.81 | 335
Retail Object Detection - 100-class v1.0 | Tesla A10 | 8.4.0.1 | 64 | 187.54 | 338

How to Use This Model

Instructions for Using an Unpruned Model with TAO

To use these models as pretrained weights for transfer learning, use the following template. This snippet pertains to the model component of the experiment spec file and is designed to train an EfficientDet or DINO model. For a more comprehensive understanding of the experiment spec file, review the TAO Toolkit User Manual:

Efficientdet-D5 Models
% spec file
model:
  name: 'efficientdet-d5'
  input_height: 544
  input_width: 960
dataset:
  loader:
    prefetch_size: 4
    shuffle_file: False
  max_instances_per_image: 100
  num_classes: 2
  train_tfrecords:
    - /path/to/train/tfrecords/files # can be prepared by TAO entrypoints
  val_tfrecords:
    - /path/to/validation/tfrecords/files # can be prepared by TAO entrypoints
  val_json_file: /path/to/validation/coco/json/file
  augmentation:
    ...
train:
  ...
  checkpoint: /path/to/tlt/checkpoint
  num_examples_per_epoch: # number of training examples each cycle
  num_epochs: 150
  ...
evaluate:
  ...
  num_samples: # number of validation examples each cycle
results_dir: /path/to/output/dir
encryption_key: 'nvidia_tao'

To train the Efficientdet model, run the following:

tao model efficientdet_tf2 train -e=<experiment spec file>
DINO-FAN_base Models
% spec file
model:
  pretrained_backbone_path: /path/to/pth/weights
  backbone: fan_base
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048

dataset:
  train_sampler: default_sampler
  train_data_sources:
    - image_dir: /path/to/train/image/dir
      json_file: /path/to/train/coco/json/file
  val_data_sources:
    - image_dir: /path/to/validation/image/dir
      json_file: /path/to/validation/coco/json/file
  num_classes: 2
  ...

train: 
  ...
  distributed_strategy: ddp
  resume_training_checkpoint_path: /path/to/pth/checkpoint

To train the DINO model, run the following:

tao model dino train -e=<experiment spec file>
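
If you fine-tune the models yourself, the trained checkpoint is typically exported to ONNX before deployment. The following is a minimal sketch of the export step for the DINO models, assuming the standard TAO export subtask; the field names are abbreviated, so check the TAO Toolkit User Manual for the full export specification:

% spec file (export section sketch)
export:
  checkpoint: /path/to/trained/pth/checkpoint
  onnx_file: /path/to/output/onnx/file
  input_channel: 3
  input_width: 960
  input_height: 544

To export the trained DINO model, run:

tao model dino export -e=<experiment spec file>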

Instructions for Deploying these Models with TAO Deploy

The ONNX files of the Retail Object Detection models can be consumed by TensorRT using tao deploy. This snippet pertains to the model component of the experiment spec file, aimed at deploying an EfficientDet or DINO model. For a more comprehensive understanding of the experiment spec file, review Deploying with TAO Deploy:

Efficientdet-D5 Models
  1. Convert the ONNX file to a TensorRT engine file. Prepare a gen_trt_engine configuration file similar to the following example:
% experiment spec file
model:
  name: 'efficientdet-d5'
  input_height: 544
  input_width: 960
gen_trt_engine:
  trt_engine: /path/to/converted/trt/engine/file
  onnx_file: /path/to/onnx/file
  tensorrt:
    data_type: 'fp16' 
    min_batch_size: 1
    opt_batch_size: 4
    max_batch_size: 4
results_dir: /path/to/output/dir
encryption_key: 'nvidia_tao'

Run:

tao deploy efficientdet_tf2 gen_trt_engine -e=<experiment spec file>
  2. Run inference with the converted TensorRT engine. Prepare the inference configuration file similar to the following example:

The label_map file can be found at File Browser/deployable_binary_v1.1/class_map.txt.

% experiment spec file
model:
  name: 'efficientdet-d5'
  input_height: 544
  input_width: 960
dataset:
  loader:
    prefetch_size: 4
    shuffle_file: False
  max_instances_per_image: 100
  num_classes: 2
  train_tfrecords:
    - /path/to/train/tfrecords/files # not used during inference; a placeholder value is acceptable
  val_tfrecords:
    - /path/to/test/tfrecords/files # can be prepared by TAO entrypoints
  val_json_file: /path/to/test/coco/json/file
inference:
  output_dir: /path/to/inference/output
  checkpoint: /path/to/converted/trt/engine/file
  image_dir: /path/to/inference/input
  batch_size: 4
  label_map: /path/to/class/map
results_dir: /path/to/output/dir
encryption_key: 'nvidia_tao'

To run inference, run the following:

tao deploy efficientdet_tf2 inference -e=<experiment spec file>
DINO-FAN_base Models
  1. Convert the ONNX file to a TensorRT engine file. Prepare the gen_trt_engine configuration file using the following example as a guideline:
% spec file
model:
  pretrained_backbone_path: ""
  backbone: fan_base
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048

gen_trt_engine:
  gpu_id: 0
  onnx_file: /path/to/onnx/file
  trt_engine: /path/to/converted/trt/engine
  input_channel: 3
  input_width: 960
  input_height: 544
  tensorrt:
    data_type: fp32
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 10
    max_batch_size: 10

Run:

tao deploy dino gen_trt_engine -e=<experiment spec file>
  2. Run inference with the converted TensorRT engine. The following is an example of an inference configuration file for Retail Object Detection - binary v2.0.

The class_map can be found at File Browser/deployable_binary_v2.0/class_map.txt, but you must remove the background class line for TAO Deploy inference. In the case of Retail Object Detection - binary v2.0, the resulting class_map.txt is:

retail item

The following is the example inference configuration file for a DINO TensorRT engine file:

% spec file
model:
  pretrained_backbone_path: ""
  backbone: fan_base
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
dataset:
  train_sampler: default_sampler
  infer_data_sources:
    image_dir: 
      - /path/to/inference/input
    classmap: /path/to/class/map
  num_classes: 2
  batch_size: 10
  ...
inference:
  trt_engine: /path/to/converted/trt/engine
  conf_threshold: 0.5
  input_width: 960
  input_height: 544
  color_map:
    retail item: green
results_dir: /path/to/output/dir

To run inference with the DINO model, run:

tao deploy dino inference -e=<experiment spec file>

Deploying these Models with DeepStream

This section describes the steps for deploying v1.1+ models on the DeepStream platform.

Input

RGB Image Dimensions: 960 X 544 X 3 (W x H x C)

Channel Ordering of the Input: NCHW

Where N = Batch Size, C = number of channels (3), H = Height of images (544), W = Width of the images (960).

Output

Category labels and bounding-box coordinates for each detected retail item in the input image.

You can use the Retail Item Detector as the primary GIE (PGIE) in a video analytics application. This section provides instructions for deploying these models with the DeepStream SDK, a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the deepstream sample app.

To deploy these models with DeepStream 6.2:

  1. Download and install the DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide.

The config files for the purpose-built models are located in the default DeepStream installation directory:

/opt/nvidia/deepstream 

This path might vary if you installed DeepStream in a different directory.

  2. Review the sample config files provided in the NVIDIA-AI-IOT deepstream_tao_apps repository. Assuming the repo is cloned under $DS_TAO_APPS_HOME, the config files are in $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_detection_tao:
# Efficientdet binary-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_binary_effdet_tao_config.txt
pgie_retail_object_detection_binary_effdet_tao_config.yml

# DINO-FAN binary-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_binary_dino_tao_config.txt
pgie_retail_object_detection_binary_dino_tao_config.yaml

# DINO-FAN meta-class detector (the primary GIE) inference setting 
pgie_retail_object_detection_meta_dino_tao_config.txt
pgie_retail_object_detection_meta_dino_tao_config.yaml
EfficientDet

Key Parameters in pgie_retail_object_detection_binary_effdet_tao_config.txt are:

[property]
gpu-id=0
net-scale-factor=1.0
offsets=0;0;0
model-color-format=1
tlt-model-key=nvidia_tao
tlt-encoded-model=/path/to/binary/retail/detector/effdet/etlt/file
model-engine-file=/path/to/binary/retail/detector/effdet/trt/engine # only one of model-engine-file and tlt-encoded-model is needed
labelfile-path=/path/to/binary/retail/detector/label/file
network-input-order=0
infer-dims=3;544;960
maintain-aspect-ratio=1
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
network-type=0
network-input-order=1
num-detected-classes=1
interval=0
gie-unique-id=1
cluster-mode=4
output-blob-names=num_detections;detection_boxes;detection_scores;detection_classes
parse-bbox-func-name=NvDsInferParseCustomEfficientDetTAO
custom-lib-path=$DS_TAO_APPS_HOME/post_processor/libnvds_infercustomparser_tao.so

#Use the config params below for NMS clustering mode
[class-attrs-all]
pre-cluster-threshold=0.8
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0 
detected-max-w=0
detected-max-h=0
DINO

Key Parameters in pgie_retail_object_detection_binary_dino_tao_config.txt are:

[property]
net-scale-factor=0.01735207357279195
offsets=123.675;116.28;103.53
model-color-format=1
onnx-file=/path/to/binary/retail/detector/dino/onnx/file
model-engine-file=/path/to/binary/retail/detector/dino/trt/engine # only one of onnx-file and model-engine-file is needed
labelfile-path=/path/to/binary/retail/detector/label/file
network-input-order=0
infer-dims=3;544;960
maintain-aspect-ratio=1
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=2
interval=0
gie-unique-id=1
cluster-mode=4
output-blob-names=pred_boxes;pred_logits
parse-bbox-func-name=NvDsInferParseCustomDDETRTAO
custom-lib-path=../../../post_processor/libnvds_infercustomparser_tao.so


#Use the config params below for NMS clustering mode
[class-attrs-all]
pre-cluster-threshold=0.5
topk=300
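
For context, nvinfer preprocesses each pixel channel c as shown below. In the DINO config above, the offsets are the ImageNet channel means scaled by 255 (0.485, 0.456, 0.406), and the net-scale-factor appears to be the reciprocal of the ImageNet channel standard deviations scaled by 255 and averaged across channels, since a single scalar scale is applied to all channels:

y_c = \text{net-scale-factor} \times (x_c - \text{offsets}_c), \qquad \frac{1}{255 \times (0.229 + 0.224 + 0.225)/3} \approx 0.01735
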
Decode

To decode the bounding box information from the DINO or EfficientDet output tensor, the custom bounding-box parser function and its library must be specified (parse-bbox-func-name and custom-lib-path in the configs above).

To run inference with the model:

cd $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_detection_tao
$DS_TAO_APPS_HOME/apps/tao_detection/ds-tao-detection -c <config file path> -i file://$DS_TAO_APPS_HOME/samples/streams/retail_object_h264.mp4

For more information, see Integrating TAO Models into DeepStream.

Input Image

The logos on retail items were smudged.

Output Image

The logos on retail items were smudged.

Limitations

Very Small and Crowded Objects

The NVIDIA Retail Object Detection models were trained to detect objects larger than 10x10 pixels. Therefore, they might not be able to detect objects smaller than 10x10 pixels. Having your target objects take up more than 10% of the frame is suggested, so that the model is less likely to be distracted by noise in the background.

The Retail Object Detection model was trained on images with one target item per frame, mimicking the retail checkout scene in the real world. Having one object per frame is suggested; having multiple objects in one frame might challenge the Retail Object Detection model.

Occluded Objects

When objects are occluded or truncated, such that less than 40% of the object is visible, they might not be detected by the Retail Object Detection model. Partial occlusion by hand is acceptable, because the model was trained with hand occlusion examples.

Monochrome or Infrared Camera Images

The Retail Object Detection models were trained on RGB images. Images captured by monochrome or IR cameras might not provide good detection results.

Warped and Blurry Images

The Retail Object Detection models were not trained on fish-eye lens cameras or moving cameras. The models might not perform well for warped images and images that have motion-induced blurs.

Noisy Backgrounds

Although the Retail Object Detection models were trained in diverse environments, without the fine-tuning on the target environment, the models perform better on images with a clean background. Examples of a clean background are a checkout plate or a conveyor belt. Reducing the complex textures in the background as much as possible is recommended.

Model Versions

  • trainable_100_v1.0
  • deployable_100_v1.0
  • trainable_binary_v1.0
  • deployable_binary_v1.0
  • trainable_binary_eff-d5_v2.0
  • deployable_binary_eff-d5_v2.0
  • trainable_binary_v2.1
  • deployable_binary_v2.1
  • trainable_binary_v2.1.1
  • deployable_binary_v2.1.1

References

Citations

  • Tan, Mingxing, Ruoming Pang, and Quoc V. Le. "Efficientdet: Scalable and efficient object detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

  • Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Cham: Springer International Publishing, 2020.

  • Zhang, Hao, et al. "Dino: Detr with improved denoising anchor boxes for end-to-end object detection." arXiv preprint arXiv:2203.03605 (2022).

  • Zhou, Daquan, et al. "Understanding the robustness in vision transformers." International Conference on Machine Learning. PMLR, 2022.

License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA Retail Object Detection models detect retail items; no additional information, such as people or other distractors in the background, is inferred. The training and evaluation datasets mostly consist of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.