The models described in this card detect retail items within an image and return a bounding box around each detected item. The retail items are typically packaged commercial goods with barcodes and ingredients labels on them.
Three types of retail object detection models are provided here: binary detection, meta-class detection, and 100-class detection.
The models are trained on synthetic data and finetuned on a small number of real images. Therefore, this model card also gives a recipe for leveraging the power of synthetic data in task-specific model development.
Model Version | Detection tasks | Few-shot KPI AP50 | Zero-shot KPI AP50 | FPS on Jetson AGX Orin 64GB, batch size = 1 |
---|---|---|---|---|
Retail Object Detection - binary v2.2.2.3 | Binary detection | 0.971 | 0.986 | 15.75 |
Retail Object Detection - binary v2.2.2.2 | Binary detection | NA | 0.986 | 15.72 |
Retail Object Detection - binary v2.2.2.1 | Binary detection | NA | 0.982 | 15.82 |
Retail Object Detection - binary v2.2.1.3 | Binary detection | 0.973 | 0.986 | 11.9 |
Retail Object Detection - binary v2.2.1.2 | Binary detection | NA | 0.986 | 11.89 |
Retail Object Detection - binary v2.2.1.1 | Binary detection | NA | 0.978 | 11.87 |
Retail Object Detection - binary v2.1.2 | Binary detection | 0.969 | 0.972 | 11.7 |
Retail Object Detection - binary v2.1.1 | Binary detection | 0.968 | 0.971 | 15.56 |
Retail Object Detection - meta v2.0 | Meta class detection | 0.903 | NA | 8.1 |
Retail Object Detection - binary v2.0 | Binary detection | 0.967 | 0.94 | 8.09 |
Retail Object Detection - binary v1.1 | Binary detection | 0.956 | 0.733 | 36.88 |
Retail Object Detection - 100-class v1.0 | 100-class detection | 0.779 | NA | 91 |
Retail Object Detection - binary v1.0 | Binary detection | 0.940 | 0.168 | 96 |
The following table chronicles the progression of Retail Object Detection models across various versions:
Model Version | Model Architecture | Input Resolution | Model Size | Number of Classes | Training Data | Fine-Tune Data | Decryption Code |
---|---|---|---|---|---|---|---|
Retail Object Detection - binary v2.2.2.3 | DINO-FAN_base | Dynamic | 73.0 M | 1 | 226k synthetic data | 2211 real images | None |
Retail Object Detection - binary v2.2.2.2 | DINO-FAN_base | Dynamic | 73.0 M | 1 | 226k synthetic data | 2862 synthetic images | None |
Retail Object Detection - binary v2.2.2.1 | DINO-FAN_base | Dynamic | 73.0 M | 1 | 226k synthetic data | NA | None |
Retail Object Detection - binary v2.2.1.3 | DINO-FAN_small | Dynamic | 48.3 M | 1 | 226k synthetic data | 2211 real images | None |
Retail Object Detection - binary v2.2.1.2 | DINO-FAN_small | Dynamic | 48.3 M | 1 | 226k synthetic data | 2862 synthetic images | None |
Retail Object Detection - binary v2.2.1.1 | DINO-FAN_small | Dynamic | 48.3 M | 1 | 226k synthetic data | NA | None |
Retail Object Detection - binary v2.1.2 | DINO-FAN_base | Dynamic | 73.0 M | 1 | 226k synthetic data | 642 real images | None |
Retail Object Detection - binary v2.1.1 | DINO-FAN_small | Dynamic | 48.3 M | 1 | 226k synthetic data | 642 real images | None |
Retail Object Detection - meta v2.0 | DINO-FAN_base | Dynamic | 73.0 M | 10 | 320k synthetic data | 1123 real images | None |
Retail Object Detection - binary v2.0 | DINO-FAN_base | Dynamic | 73.0 M | 1 | 320k synthetic data | 1123 real images | None |
Retail Object Detection - binary v1.1 | Efficientdet-D5 | 960x544 | 33.7M | 1 | 320k synthetic data | 1123 real images | nvidia_tao |
Retail Object Detection - 100-class v1.0 | Efficientdet-D5 | 416x416 | 33.9M | 100 | 1.5M synthetic data | 518 real images | nvidia_tlt |
Retail Object Detection - binary v1.0 | Efficientdet-D5 | 416x416 | 33.7M | 1 | 1.5M synthetic data | 518 real images | nvidia_tlt |
Two distinct model architecture families are used across the versions: EfficientDet-D5 and DINO with FAN backbones.
For models based on the EfficientDet architecture:
EfficientDet is a one-stage detector with the following architecture components:
For models based on DINO-FAN architecture:
The DINO (DETR with Improved deNoising anchOrboxes) model is a state-of-the-art end-to-end object detector that improves on previous DETR-like models in performance and efficiency. It introduces several novel techniques, including contrastive DN training, mixed query selection, and looking forward twice. DINO scales well for model size and data size.
Either FAN_base or FAN_small is used as the backbone. FAN (Fully Attentional Network) enhances the traditional transformer: each FAN block applies both token self-attention and channel attention, making the entire network fully attentional. The linear projection layer after the channel attention is removed.
A FAN-base model has 18 FAN blocks with a channel dimension of 448 and 8 attention heads. A FAN-small model has 12 FAN blocks with a channel dimension of 384 and 8 attention heads.
For more information on the model architectures and features, see the papers listed in the Citation section.
The models are trained using the efficientdet_tf2 or dino entrypoints in TAO. Training is carried out in two phases: in the first phase, networks initialized with pretrained weights are trained on a large amount of synthetic data; in the second phase, the networks are fine-tuned on a small number of task-specific samples.
This training dataset has several advantages over publicly available retail datasets, including:
Model Version | Training: # images | Training: # retail subjects | Training: # scenes | Training: avg object density / frame | Finetune: # images | Finetune: # retail subjects | Finetune: # scenes | Finetune: avg object density / frame | Finetune: image type | # classes |
---|---|---|---|---|---|---|---|---|---|---|
Retail Object Detection - binary v2.2.2.3 | 226,730 | 315 | 226,730 | 5.26 | 2211 | 891 | 25 | 1.32 | real | 1 |
Retail Object Detection - binary v2.2.2.2 | 226,730 | 315 | 226,730 | 5.26 | 2862 | 315 | 4 | 1.20 | synthetic | 1 |
Retail Object Detection - binary v2.2.2.1 | 226,730 | 315 | 226,730 | 5.26 | NA | NA | NA | NA | NA | 1 |
Retail Object Detection - binary v2.2.1.3 | 226,730 | 315 | 226,730 | 5.26 | 2211 | 891 | 25 | 1.32 | real | 1 |
Retail Object Detection - binary v2.2.1.2 | 226,730 | 315 | 226,730 | 5.26 | 2862 | 315 | 4 | 1.20 | synthetic | 1 |
Retail Object Detection - binary v2.2.1.1 | 226,730 | 315 | 226,730 | 5.26 | NA | NA | NA | NA | NA | 1 |
Retail Object Detection - binary v2.1.2 | 226,730 | 315 | 226,730 | 5.26 | 642 | 408 | 40 | 1.35 | real | 1 |
Retail Object Detection - binary v2.1.1 | 226,730 | 315 | 226,730 | 5.26 | 642 | 408 | 40 | 1.35 | real | 1 |
Retail Object Detection - meta v2.0 | 265,806 | 315 | 265,806 | 7.18 | 952 | 315 | 7 | 1 | real | 10 |
Retail Object Detection - binary v2.0 | 265,806 | 315 | 265,806 | 7.18 | 952 | 315 | 7 | 1 | real | 1 |
Retail Object Detection - binary v1.1 | 1,425,000 | 100 | 1,425,000 | 1 | 518 | 100 | 6 | 1 | real | 1 |
Retail Object Detection - 100-class v1.0 | 1,425,000 | 100 | 1,425,000 | 1 | 518 | 100 | 6 | 1 | real | 100 |
Retail Object Detection - binary v1.0 | 1,425,000 | 100 | 1,425,000 | 1 | 518 | 100 | 6 | 1 | real | 1 |
We did not fine-tune v2.2.1.1 and v2.2.2.1; these models are provided so users can prepare fine-tune data for their own downstream tasks. v2.2.1.2 and v2.2.2.2 are fine-tuned on synthetic data tailored for downstream tasks, and v2.2.1.3 and v2.2.2.3 are fine-tuned on real images.
Below we introduce the synthetic and real data used in the latest released v2.2.x.x models.
We used Omniverse Replicator Object to generate synthetic data.
Each frame is composed of a 2D image background, with a 3D retail object inserted. The background textures are real images sampled from proprietary real images.
Each synthetic image contains 1 to 20 target retail items. The synthetic data randomizes several simulation domains, including:
We also found good results when using purely synthetic data for both training and fine-tuning. The fine-tuning data is tailored to the downstream task, such as detecting retail objects on a table or on crowded grocery-store shelves.
The fine-tuning data is captured under random camera heights and fields of view. All fine-tuning data was collected indoors, with retail items placed on checkout counters, shelves, baskets, and conveyor belts, in grocery stores, or at home. For scenes shot by fixed cameras, the camera is typically set up at a height of approximately 10 feet, at 45-, 90-, or 180-degree angles off the vertical axis, with a close field of view. The other scenes are shot by phone cameras at random angles. This content was chosen to decrease the simulation-to-reality gap of a model trained on synthetic data and to improve the model's accuracy and robustness. The logos on retail items were smudged.
The 100-class detection model detects 100 specific retail subjects and returns their subject names.
The list of the 100 retail subjects is in the 100 retail subject table.
The Retail Object Detection - meta v2.0 model is also offered. The following table shows the meta-class distribution of the training and fine-tuning data:
Meta Class | Train #instances | Finetune #instances | Finetune percentage |
---|---|---|---|
oval container | 36549 | 14 | 1% |
cylindrical container | 122583 | 73 | 8% |
bottle container | 92514 | 44 | 5% |
round container | 0 | 0 | 0% |
box container | 264973 | 627 | 66% |
Rectangular prism with protrusion | 64063 | 26 | 3% |
shallow rectangular prism | 75571 | 32 | 3% |
modified cylindrical container with short neck | 11335 | 4 | 0% |
bag container | 99008 | 60 | 6% |
miscellaneous container | 113803 | 72 | 8% |
Total | 880399 | 952 | 100% |
* The total #instances is greater than or equal to the total #images because each frame can contain more than one retail instance.
The fine-tuning data was created by human labelers annotating ground-truth bounding boxes and categories. The following guidelines were used when labeling the training data for the NVIDIA Retail Object Detection models. If you want to transfer-learn or fine-tune the models to adapt to your target environment and classes, follow the same guidelines for better model accuracy:
The performance of the Retail Object Detection model was gauged using three sets of KPI data.
The details of each set of KPI data are listed in the following table:
KPI Set | # Scenes | # Retail Objects | # Test Images |
---|---|---|---|
Voyager Cafe | 7 | 315 | 6,042 |
Retail Product Checkout data | 1 | 200 | 24,000 |
Crowdsourcing | 15 | 510 | 1,988 |
The Voyager Cafe KPI data was collected at the cafeteria of NVIDIA's headquarters. Its 7 scenes are also included in the fine-tune set; therefore, the Voyager Cafe evaluation results are few-shot.
* The Retail Product Checkout dataset (RPC dataset) is a public dataset. It contains 200 retail subjects in its test data. TAO tests the binary Retail Object Detection models on it by updating its test data annotation file to binary class.
A few retail subjects in the RPC dataset, such as pencil cases, do not match the TAO definition of retail items because they have no commercial packaging or barcode. The zero-shot test of the binary Retail Object Detection models is expected to miss such items. The evaluation results on the Retail Product Checkout dataset can therefore be regarded as zero-shot.
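The binary-class conversion described above amounts to collapsing every category in a COCO-style annotation file into a single class. A minimal sketch (the field layout follows the COCO format; the class id and name chosen here are assumptions, not TAO's exact conversion script):

```python
def to_binary_coco(coco):
    """Collapse all categories in a COCO-style annotation dict to one 'retail item' class.

    Sketch only: TAO's actual conversion of the RPC test annotations may differ.
    """
    binary = dict(coco)
    binary["categories"] = [{"id": 1, "name": "retail item"}]
    binary["annotations"] = [
        {**ann, "category_id": 1} for ann in coco["annotations"]
    ]
    return binary

# Hypothetical two-category annotation fragment in RPC/COCO style.
rpc_like = {
    "images": [{"id": 0, "file_name": "frame.jpg"}],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 17, "bbox": [10, 10, 50, 80]},
        {"id": 1, "image_id": 0, "category_id": 142, "bbox": [200, 40, 60, 60]},
    ],
    "categories": [{"id": 17, "name": "soda_can"}, {"id": 142, "name": "cereal_box"}],
}
print({a["category_id"] for a in to_binary_coco(rpc_like)["annotations"]})  # {1}
```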
The Crowdsourcing KPI data was collected from 15 different scenes, including grocery stores, home kitchens, and other arbitrary places. This KPI data has much larger variance than the Voyager Cafe and RPC datasets. We made sure no scenes overlap with the fine-tune data, so the Crowdsourcing evaluation results are zero-shot.
AP50 is calculated with an intersection-over-union (IoU) threshold of 0.5. The KPIs for the evaluation data are reported in the following tables. The models are evaluated on AP50 and AR0.5:0.95; both AP and AR numbers are based on a maximum of 100 detections per image.
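As a reference for the AP50 matching criterion, a detection counts as a true positive when its IoU with a ground-truth box exceeds the 0.5 threshold. A minimal IoU sketch (illustrative; TAO's evaluation follows the full COCO protocol):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Half-overlapping boxes: IoU = 1/3, so not a match under the AP50 criterion.
print(iou([0, 0, 100, 100], [50, 0, 150, 100]))
```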
Model | Voyager Cafe AP0.5 | Voyager Cafe AR0.5:0.95 |
---|---|---|
Retail Object Detection - binary v2.2.2.3 | 0.971 | 0.953 |
Retail Object Detection - binary v2.2.1.3 | 0.966 | 0.949 |
Retail Object Detection - binary v2.1.2 | 0.969 | 0.955 |
Retail Object Detection - binary v2.1.1 | 0.968 | 0.959 |
Retail Object Detection - binary v2.0 | 0.967 | 0.898 |
Retail Object Detection - binary v1.1 | 0.956 | 0.889 |
Retail Object Detection - binary v1.0 | 0.94 | 0.86 |
Model | RPC Dataset AP0.5 | RPC Dataset AR0.5:0.95 | Crowdsource KPI AP0.5 | Crowdsource KPI AR0.5:0.95 |
---|---|---|---|---|
Retail Object Detection - binary v2.2.2.3 | 0.986 | 0.847 | 0.969 | 0.99 |
Retail Object Detection - binary v2.2.2.2 | 0.986 | 0.841 | 0.932 | 0.983 |
Retail Object Detection - binary v2.2.2.1 | 0.982 | 0.846 | 0.82 | 0.969 |
Retail Object Detection - binary v2.2.1.3 | 0.981 | 0.852 | 0.956 | 0.984 |
Retail Object Detection - binary v2.2.1.2 | 0.986 | 0.847 | 0.921 | 0.979 |
Retail Object Detection - binary v2.2.1.1 | 0.978 | 0.853 | 0.834 | 0.965 |
Retail Object Detection - binary v2.1.2 | 0.972 | 0.845 | 0.956 | 0.982 |
Retail Object Detection - binary v2.1.1 | 0.971 | 0.847 | 0.943 | 0.97 |
Retail Object Detection - binary v2.0 | 0.94 | 0.719 | 0.803 | 0.883 |
Retail Object Detection - binary v1.1 | 0.733 | 0.631 | 0.532 | 0.598 |
Retail Object Detection - binary v1.0 | 0.168 | 0.266 | 0.578 | 0.597 |
The meta v2.0 model is evaluated on our internal Voyager Cafe data, which is a few-shot evaluation.
The following is the classwise accuracy for each meta class:
meta class | AP0.5 | AR0.5:0.95 |
---|---|---|
oval container | 0.949 | 0.966 |
cylindrical container | 0.913 | 0.902 |
bottle container | 0.857 | 0.937 |
box container | 0.972 | 0.967 |
Rectangular prism with protrusion | 0.98 | 0.994 |
shallow rectangular prism | 1.00 | 1.00 |
modified cylindrical container with short neck | 0.604 | 0.56 |
bag container | 0.927 | 0.916 |
miscellaneous container | 0.922 | 0.945 |
overall | 0.903 | 0.910 |
* The round container class is not present in our KPI data, so the accuracy table omits it.
The Retail Object Detection - 100-class v1.0 model was tested on 6 scenes of the Voyager Cafe KPI data. The following is the breakdown of model performance in each scene:
Scene | Seen Items Result (AP50) | Seen Items Result (AR MaxDets=100) |
---|---|---|
Voyager Cafe - checkout counter 45 overhead | 0.564 | 0.741 |
Voyager Cafe - shelf | 0.933 | 0.860 |
Voyager Cafe - conveyor belt | 0.872 | 0.888 |
Voyager Cafe - basket | 0.536 | 0.722 |
Voyager Cafe - checkout counter barcode scanner view | 0.845 | 0.758 |
Voyager Cafe - checkout counter overhead | 0.926 | 0.859 |
average | 0.779 | 0.805 |
Inference is run on the provided unpruned models at FP16 precision, with a model input resolution of 960x544. Inference performance is measured using trtexec on the listed devices. The following table captures only the latency of the model's forward pass; end-to-end performance with streaming video data can vary depending on other bottlenecks in the hardware and software.
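The images-per-second column is essentially batch size divided by per-batch latency; a quick sanity check (small differences from the table come from how trtexec averages its measurements):

```python
def throughput(batch_size, latency_ms):
    """Images per second given per-batch latency in milliseconds."""
    return batch_size / (latency_ms / 1000.0)

# Figures from the table: batch 32 at 189.18 ms on DGX H100 gives roughly
# 169 images/sec, close to the reported 169.29.
print(round(throughput(32, 189.18), 2))
```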
model | device | TensorRT Version | batch size | latency (ms) | images per second |
---|---|---|---|---|---|
Retail Object Detection - binary v2.2.2.1-3 | DGX H100 80GB | 8.6.3 | 1 | 8.78 | 114.18 |
Retail Object Detection - binary v2.2.2.1-3 | DGX H100 80GB | 8.6.3 | 32 | 189.18 | 169.29 |
Retail Object Detection - binary v2.2.2.1-3 | DGX H100 80GB | 8.6.3 | 64 | 374.58 | 170.93 |
Retail Object Detection - binary v2.2.2.1-3 | Jetson AGX Orin 64GB | 8.6.2.3 | 1 | 84.26 | 11.87 |
Retail Object Detection - binary v2.2.2.1-3 | Jetson AGX Orin 64GB | 8.6.2.3 | 4 | 326.05 | 12.27 |
Retail Object Detection - binary v2.2.2.1-3 | Jetson AGX Orin 64GB | 8.6.2.3 | 8 | 655.77 | 12.20 |
Retail Object Detection - binary v2.2.1.1-3 | DGX H100 80GB | 8.6.3 | 1 | 6.73 | 149.01 |
Retail Object Detection - binary v2.2.1.1-3 | DGX H100 80GB | 8.6.3 | 32 | 143.93 | 222.48 |
Retail Object Detection - binary v2.2.1.1-3 | DGX H100 80GB | 8.6.3 | 64 | 284.65 | 224.93 |
Retail Object Detection - binary v2.2.1.1-3 | Jetson AGX Orin 64GB | 8.6.2.3 | 1 | 63.21 | 15.82 |
Retail Object Detection - binary v2.2.1.1-3 | Jetson AGX Orin 64GB | 8.6.2.3 | 4 | 251.79 | 15.89 |
Retail Object Detection - binary v2.2.1.1-3 | Jetson AGX Orin 64GB | 8.6.2.3 | 8 | 500.96 | 15.97 |
Retail Object Detection - binary v2.1.2 | Jetson AGX Orin 64GB | 8.6.2.3 | 1 | 85.48 | 11.7 |
Retail Object Detection - binary v2.1.2 | Jetson AGX Orin 64GB | 8.6.2.3 | 4 | 334.23 | 11.97 |
Retail Object Detection - binary v2.1.2 | Jetson AGX Orin 64GB | 8.6.2.3 | 8 | 673.31 | 11.88 |
Retail Object Detection - binary v2.1.2 | DGX H100 80GB | 8.6.2.3 | 1 | 9.08 | 110.14 |
Retail Object Detection - binary v2.1.2 | DGX H100 80GB | 8.6.2.3 | 16 | 99.54 | 160.74 |
Retail Object Detection - binary v2.1.2 | DGX H100 80GB | 8.6.2.3 | 32 | 193.61 | 165.28 |
Retail Object Detection - binary v2.1.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 1 | 64.28 | 15.56 |
Retail Object Detection - binary v2.1.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 4 | 130.49 | 15.44 |
Retail Object Detection - binary v2.1.1 | Jetson AGX Orin 64GB | 8.6.2.3 | 8 | 517.43 | 15.46 |
Retail Object Detection - binary v2.1.1 | DGX H100 80GB | 8.6.2.3 | 1 | 6.96 | 143.61 |
Retail Object Detection - binary v2.1.1 | DGX H100 80GB | 8.6.2.3 | 16 | 76.03 | 210.45 |
Retail Object Detection - binary v2.1.1 | DGX H100 80GB | 8.6.2.3 | 32 | 148.13 | 216.03 |
Retail Object Detection - binary v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 1 | 123.63 | 8.09 |
Retail Object Detection - binary v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 4 | 468.38 | 8.54 |
Retail Object Detection - binary v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 8 | 981.46 | 8.15 |
Retail Object Detection - binary v2.0 | Tesla A30 | 8.5.2.2 | 1 | 25.25 | 39.61 |
Retail Object Detection - binary v2.0 | Tesla A30 | 8.5.2.2 | 4 | 91.94 | 43.51 |
Retail Object Detection - binary v2.0 | Tesla A30 | 8.5.2.2 | 8 | 180.81 | 44.25 |
Retail Object Detection - meta v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 1 | 123.52 | 8.1 |
Retail Object Detection - meta v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 4 | 468.19 | 8.54 |
Retail Object Detection - meta v2.0 | Jetson AGX Orin 64GB | 8.5.2.2 | 8 | 930.6 | 8.6 |
Retail Object Detection - meta v2.0 | Tesla A30 | 8.5.2.2 | 1 | 25.34 | 39.47 |
Retail Object Detection - meta v2.0 | Tesla A30 | 8.5.2.2 | 4 | 48.59 | 43.78 |
Retail Object Detection - meta v2.0 | Tesla A30 | 8.5.2.2 | 8 | 181.91 | 43.98 |
Retail Object Detection - binary v1.1 | Jetson AGX Orin 64GB | 8.5.2.2 | 1 | 27.12 | 36.88 |
Retail Object Detection - binary v1.1 | Jetson AGX Orin 64GB | 8.5.2.2 | 4 | 93.59 | 42.74 |
Retail Object Detection - binary v1.1 | Jetson AGX Orin 64GB | 8.5.2.2 | 8 | 184.68 | 43.32 |
Retail Object Detection - binary v1.1 | Tesla A30 | 8.5.2.2 | 1 | 8.29 | 120.6 |
Retail Object Detection - binary v1.1 | Tesla A30 | 8.5.2.2 | 4 | 26.4 | 151.54 |
Retail Object Detection - binary v1.1 | Tesla A30 | 8.5.2.2 | 8 | 50.01 | 159.98 |
Retail Object Detection - binary v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 1 | 10.43 | 96 |
Retail Object Detection - binary v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 16 | 131.79 | 121 |
Retail Object Detection - binary v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 32 | 258.44 | 124 |
Retail Object Detection - binary v1.0 | Tesla A10 | 8.4.0.1 | 1 | 4.27 | 234 |
Retail Object Detection - binary v1.0 | Tesla A10 | 8.4.0.1 | 16 | 44.94 | 356 |
Retail Object Detection - binary v1.0 | Tesla A10 | 8.4.0.1 | 64 | 174.46 | 367 |
Retail Object Detection - 100-class v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 1 | 10.94 | 91 |
Retail Object Detection - 100-class v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 16 | 140.94 | 114 |
Retail Object Detection - 100-class v1.0 | Jetson AGX Orin 64GB | 8.4.0.1 | 32 | 279.59 | 114 |
Retail Object Detection - 100-class v1.0 | Tesla A10 | 8.4.0.1 | 1 | 4.46 | 224 |
Retail Object Detection - 100-class v1.0 | Tesla A10 | 8.4.0.1 | 16 | 47.81 | 335 |
Retail Object Detection - 100-class v1.0 | Tesla A10 | 8.4.0.1 | 64 | 187.54 | 338 |
To use these models as pretrained weights for transfer learning, use the following template. This snippet pertains to the model component of the training configuration file and is designed to train an EfficientDet or DINO model. For a more comprehensive understanding of the experiment spec file, review the TAO Toolkit User Manual:
model:
name: 'efficientdet-d5'
input_height: 544
input_width: 960
dataset:
loader:
prefetch_size: 4
shuffle_file: False
max_instances_per_image: 100
num_classes: 2
train_tfrecords:
- /path/to/train/tfrecords/files # can be prepared by TAO entrypoints
val_tfrecords:
- /path/to/validation/tfrecords/files # can be prepared by TAO entrypoints
val_json_file: /path/to/validation/coco/json/file
augmentation:
...
train:
...
checkpoint: /path/to/tlt/checkpoint
num_examples_per_epoch: # number of training examples each cycle
num_epochs: 150
...
evaluate:
...
num_samples: # number of validation examples each cycle
results_dir: /path/to/output/dir
encryption_key: 'nvidia_tao'
To train the EfficientDet model, run the following command:
tao model efficientdet_tf2 train -e=<experiment spec file>
model:
pretrained_backbone_path: /path/to/pth/weights
backbone: fan_base
train_backbone: True
num_feature_levels: 4
dec_layers: 6
enc_layers: 6
num_queries: 900
dropout_ratio: 0.0
dim_feedforward: 2048
dataset:
train_sampler: default_sampler
train_data_sources:
- image_dir: /path/to/train/image/dir
json_file: /path/to/train/coco/json/file
val_data_sources:
- image_dir: /path/to/validation/image/dir
json_file: /path/to/validation/coco/json/file
num_classes: 2
...
train:
...
distributed_strategy: ddp
resume_training_checkpoint_path: /path/to/pth/checkpoint
To train the DINO model, run the following:
tao model dino train -e=<experiment spec file>
The Retail Object Detection model ONNX files can be consumed by TensorRT using tao deploy. This snippet pertains to the model component of the experiment spec file, aimed at deploying an EfficientDet or DINO model. For a more comprehensive understanding of the experiment spec file, review Deploying with TAO Deploy:
Generate a TensorRT engine file. Prepare a gen_trt_engine configuration file similar to the following example:

% experiment spec file
model:
name: 'efficientdet-d5'
input_height: 544
input_width: 960
gen_trt_engine:
trt_engine: /path/to/converted/trt/engine/file
onnx_file: /path/to/onnx/file
tensorrt:
data_type: 'fp16'
min_batch_size: 1
opt_batch_size: 4
max_batch_size: 4
results_dir: /path/to/output/dir
encryption_key: 'nvidia_tao'
Run:

tao deploy efficientdet_tf2 gen_trt_engine -e=<experiment spec file>

Then prepare an inference configuration file similar to the following example. The label_map file can be found at File Browser/deployable_binary_v1.1/class_map.txt.
% experiment spec file
model:
name: 'efficientdet-d5'
input_height: 544
input_width: 960
dataset:
loader:
prefetch_size: 4
shuffle_file: False
max_instances_per_image: 100
num_classes: 2
train_tfrecords:
- /path/to/train/tfrecords/files # this entry is not used during inference; a placeholder path is acceptable
val_tfrecords:
- /path/to/test/tfrecords/files # can be prepared by TAO entrypoints
val_json_file: /path/to/test/coco/json/file
inference:
output_dir: /path/to/inference/output
checkpoint: /path/to/converted/trt/engine/file
image_dir: /path/to/inference/input
batch_size: 4
label_map: /path/to/class/map
results_dir: /path/to/output/dir
encryption_key: 'nvidia_tao'
To run inference, run the following:
tao deploy efficientdet_tf2 inference -e=<experiment spec file>
Generate a TensorRT engine file. Prepare the gen_trt_engine configuration file using the following example as a guideline:

model:
pretrained_backbone_path: ""
backbone: fan_base
train_backbone: True
num_feature_levels: 4
dec_layers: 6
enc_layers: 6
num_queries: 900
dropout_ratio: 0.0
dim_feedforward: 2048
gen_trt_engine:
gpu_id: 0
onnx_file: /path/to/onnx/file
trt_engine: /path/to/converted/trt/engine
input_channel: 3
input_width: 960
input_height: 544
tensorrt:
data_type: fp32
workspace_size: 1024
min_batch_size: 1
opt_batch_size: 10
max_batch_size: 10
Run:
tao deploy dino gen_trt_engine -e=<experiment spec file>
Prepare an inference configuration file for Retail Object Detection - binary v2.0. The class_map can be found at File Browser/deployable_binary_v2.0/class_map.txt, but you must remove the background class line for TAO Deploy inference. For Retail Object Detection - binary v2.0, the final class_map.txt is:
retail item
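Removing the background class line can be scripted. A small helper, assuming the background class is literally named `background` in the shipped class_map.txt (adjust the name to whatever the file actually contains):

```python
def strip_background(class_map_text):
    """Drop any background-class line from a class_map file's contents.

    Assumption: the background class is named 'background'; edit the
    comparison if the shipped class_map.txt names it differently.
    """
    kept = [line for line in class_map_text.splitlines()
            if line.strip() and line.strip().lower() != "background"]
    return "\n".join(kept) + "\n"

print(strip_background("background\nretail item\n"))  # prints: retail item
```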
The following is the example inference
configuration file for a DINO TensorRT engine file:
model:
pretrained_backbone_path: ""
backbone: fan_base
train_backbone: True
num_feature_levels: 4
dec_layers: 6
enc_layers: 6
num_queries: 900
dropout_ratio: 0.0
dim_feedforward: 2048
dataset:
train_sampler: default_sampler
infer_data_sources:
image_dir:
- /path/to/inference/input
classmap: /path/to/class/map
num_classes: 2
batch_size: 10
...
inference:
trt_engine: /path/to/converted/trt/engine
conf_threshold: 0.5
input_width: 960
input_height: 544
color_map:
retail item: green
results_dir: /path/to/output/dir
To run inference with the DINO model, run:
tao deploy dino inference -e=<experiment spec file>
This section describes the steps for deploying v1.1+ models on the DeepStream platform.
RGB Image Dimensions: 960 x 544 x 3 (W x H x C)
Channel Ordering of the Input: NCHW, where N = batch size, C = number of channels (3), H = image height (544), and W = image width (960).
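As an illustration of this input layout, a 544x960 RGB frame can be normalized and reordered to NCHW as below. The mean offsets and scale factor are taken from the DINO nvinfer config on this page; EfficientDet models use different normalization:

```python
import numpy as np

# Normalization constants from the DINO nvinfer config on this page
# (net-scale-factor and per-channel offsets).
SCALE = 0.01735207357279195
OFFSETS = np.array([123.675, 116.28, 103.53], dtype=np.float32)

def to_nchw(frame_hwc):
    """Convert one 544x960x3 RGB uint8 frame to a 1x3x544x960 float32 tensor."""
    x = (frame_hwc.astype(np.float32) - OFFSETS) * SCALE  # per-channel normalize
    x = np.transpose(x, (2, 0, 1))                        # HWC -> CHW
    return x[np.newaxis, ...]                             # add batch dim -> NCHW

frame = np.zeros((544, 960, 3), dtype=np.uint8)
print(to_nchw(frame).shape)  # (1, 3, 544, 960)
```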
Category labels and bounding-box coordinates for each detected retail item in the input image.
You can use the Retail Object Detection models as the primary GIE (PGIE) in a video analytics application. This section provides instructions for deploying these models with the DeepStream SDK, a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample apps.
To deploy these models with DeepStream 6.2:
The config files for the purpose-built models are located in the default DeepStream installation directory:
/opt/nvidia/deepstream
This path might vary if you installed DeepStream in a different directory.
The inference configuration files are located under $DS_TAO_APPS_HOME, in $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_detection_tao:

# EfficientDet binary-class detector (the primary GIE) inference setting
pgie_retail_object_detection_binary_effdet_tao_config.txt
pgie_retail_object_detection_binary_effdet_tao_config.yml
# DINO-FAN binary-class detector (the primary GIE) inference setting
pgie_retail_object_detection_binary_dino_tao_config.txt
pgie_retail_object_detection_binary_dino_tao_config.yaml
# DINO-FAN meta-class detector (the primary GIE) inference setting
pgie_retail_object_detection_meta_dino_tao_config.txt
pgie_retail_object_detection_meta_dino_tao_config.yaml
Key Parameters in pgie_retail_object_detection_binary_effdet_tao_config.txt
are:
[property]
gpu-id=0
net-scale-factor=1.0
offsets=0;0;0
model-color-format=1
tlt-model-key=nvidia_tao
tlt-encoded-model=/path/to/binary/retail/detector/effdet/etlt/file
model-engine-file=/path/to/binary/retail/detector/effdet/trt/engine # only one of model-engine-file and tlt-encoded-model is needed
labelfile-path=/path/to/binary/retail/detector/label/file
network-input-order=0
infer-dims=3;544;960
maintain-aspect-ratio=1
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
network-type=0
network-input-order=1
num-detected-classes=1
interval=0
gie-unique-id=1
cluster-mode=4
output-blob-names=num_detections;detection_boxes;detection_scores;detection_classes
parse-bbox-func-name=NvDsInferParseCustomEfficientDetTAO
custom-lib-path=$DS_TAO_APPS_HOME/post_processor/libnvds_infercustomparser_tao.so
#Use the config params below for NMS clustering mode
[class-attrs-all]
pre-cluster-threshold=0.8
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
Key Parameters in pgie_retail_object_detection_binary_dino_tao_config.txt
are:
[property]
net-scale-factor=0.01735207357279195
offsets=123.675;116.28;103.53
model-color-format=1
onnx-file=/path/to/binary/retail/detector/dino/onnx/file
model-engine-file=/path/to/binary/retail/detector/dino/trt/engine # only one of onnx-file and model-engine-file is needed
labelfile-path=/path/to/binary/retail/detector/label/file
network-input-order=0
infer-dims=3;544;960
maintain-aspect-ratio=1
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=2
interval=0
gie-unique-id=1
cluster-mode=4
output-blob-names=pred_boxes;pred_logits
parse-bbox-func-name=NvDsInferParseCustomDDETRTAO
custom-lib-path=../../../post_processor/libnvds_infercustomparser_tao.so
#Use the config params below for NMS clustering mode
[class-attrs-all]
pre-cluster-threshold=0.5
topk=300
To decode the bounding box information from the DINO or EfficientDet output tensor, the custom parser function and library have to be specified.
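For intuition, the DINO parser roughly performs the following decoding: a sigmoid over the query logits, a confidence threshold, and a center/size-to-corners conversion. This Python sketch is illustrative only; the shipped C++ parser (NvDsInferParseCustomDDETRTAO) may differ in its details:

```python
import numpy as np

def decode_dino(pred_logits, pred_boxes, conf_threshold=0.5):
    """Rough sketch of DINO output decoding (the shipped C++ parser may differ).

    pred_logits: (num_queries, num_classes) raw class scores
    pred_boxes:  (num_queries, 4) normalized (cx, cy, w, h)
    Returns a list of (class_id, score, x1, y1, x2, y2) in normalized coords.
    """
    scores = 1.0 / (1.0 + np.exp(-pred_logits))  # sigmoid over class logits
    detections = []
    for q in range(scores.shape[0]):
        cls = int(np.argmax(scores[q]))
        score = float(scores[q, cls])
        if score < conf_threshold:
            continue
        cx, cy, w, h = pred_boxes[q]
        detections.append((cls, score, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return detections

# Two hypothetical queries: one confident, one below the threshold.
logits = np.array([[4.0], [-4.0]])
boxes = np.array([[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.05, 0.05]])
print(len(decode_dino(logits, boxes)))  # 1
```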
To run inference with the model:
cd $DS_TAO_APPS_HOME/configs/nvinfer/retail_object_detection_tao
$DS_TAO_APPS_HOME/apps/tao_detection/ds-tao-detection -c <config file path> -i file://$DS_TAO_APPS_HOME/samples/streams/retail_object_h264.mp4
For more information, see Integrating TAO Models into DeepStream.
The NVIDIA Retail Object Detection models were trained to detect objects larger than 10x10 pixels, so they might not detect objects smaller than that. Having your target objects take up more than 10% of the frame is suggested, so that the model is less likely to be distracted by noise in the background.
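The size guidance above can be checked programmatically. This helper is hypothetical, and it reads the ">10% of the frame" suggestion as an area ratio, which is an assumption:

```python
def meets_size_guidance(box_w, box_h, frame_w, frame_h):
    """Check a box against the guidance above: larger than 10x10 pixels,
    and (assumed) covering more than 10% of the frame area."""
    detectable = box_w > 10 and box_h > 10
    recommended = (box_w * box_h) / (frame_w * frame_h) > 0.10
    return detectable, recommended

# A 300x300 object in a 960x544 frame covers ~17% of the area.
print(meets_size_guidance(300, 300, 960, 544))  # (True, True)
```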
The Retail Object Detection model was trained on images having one target item per frame, mimicking the retail checkout scene in the real world. Having one object per frame is suggested. Having multiple objects in one frame might challenge the Retail Object Detection model.
When objects are occluded or truncated, such that less than 40% of the object is visible, they might not be detected by the Retail Object Detection model. Partial occlusion by hand is acceptable, because the model was trained with hand occlusion examples.
The Retail Object Detection models were trained on RGB images. Images captured by monochrome or IR cameras might not provide good detection results.
The Retail Object Detection models were not trained on fish-eye lens cameras or moving cameras. The models might not perform well for warped images and images that have motion-induced blurs.
Although the Retail Object Detection models were trained in diverse environments, without the fine-tuning on the target environment, the models perform better on images with a clean background. Examples of a clean background are a checkout plate or a conveyor belt. Reducing the complex textures in the background as much as possible is recommended.
Tan, Mingxing, Ruoming Pang, and Quoc V. Le. "Efficientdet: Scalable and efficient object detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Cham: Springer International Publishing, 2020.
Zhang, Hao, et al. "Dino: Detr with improved denoising anchor boxes for end-to-end object detection." arXiv preprint arXiv:2203.03605 (2022).
Zhou, Daquan, et al. "Understanding the robustness in vision transformers." International Conference on Machine Learning. PMLR, 2022.
License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.
The NVIDIA Retail Object Detection models detect retail items; they do not infer additional information, such as people or other distractors in the background. The training and evaluation datasets mostly consist of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA's platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model's developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.