This model encodes retail items into embedding vectors and predicts their labels by comparing those vectors against the embedding vectors in a reference space.
The model consists of a trunk and an embedder. The trunk uses the ResNet101 architecture with its fully connected layer removed. The embedder is a one-layer perceptron with an input size of 2048 (the output size of the average pool in ResNet101) and an output size of 2048, so the embedding dimension of the Retail Item Embedding model is 2048.
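For illustration, here is a minimal PyTorch sketch of this trunk-plus-embedder layout (assuming a recent torchvision; this is not the released implementation or its weights):

```python
import torch
import torch.nn as nn
import torchvision

class RetailItemEmbedder(nn.Module):
    """Illustrative sketch: ResNet101 trunk (fc removed) + one-layer embedder."""
    def __init__(self, embedding_dim: int = 2048):
        super().__init__()
        trunk = torchvision.models.resnet101(weights=None)  # untrained backbone (torchvision >= 0.13)
        trunk.fc = nn.Identity()                        # drop the fully connected layer
        self.trunk = trunk                              # outputs 2048-d features after average pool
        self.embedder = nn.Linear(2048, embedding_dim)  # one-layer perceptron, 2048 -> 2048

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.embedder(self.trunk(x))

# Example: embed a batch of four 224x224 RGB crops
model = RetailItemEmbedder().eval()
with torch.no_grad():
    embeddings = model(torch.randn(4, 3, 224, 224))     # shape (4, 2048)
```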
This model was trained with the triplet loss algorithm. Training optimizes the network to minimize the embedding distance (based on cosine similarity) between the positive images and the anchor image while maximizing the distance between the negative images and the anchor image.
The trunk and the embedder use different learning rates during training; the embedder uses a smaller learning rate than the trunk for better fine-tuning.
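A hedged sketch of this training setup, continuing from the RetailItemEmbedder sketch above (the learning rates, margin, and optimizer choice are illustrative placeholders, not the values used to train the released model):

```python
import torch
import torch.nn.functional as F

model = RetailItemEmbedder()  # from the sketch above

# Two parameter groups: a smaller learning rate for the embedder than for the trunk.
optimizer = torch.optim.Adam([
    {"params": model.trunk.parameters(),    "lr": 1e-4},   # placeholder value
    {"params": model.embedder.parameters(), "lr": 1e-5},   # placeholder value, smaller than the trunk's
])

# Cosine-distance triplet loss: pull anchor/positive together, push anchor/negative apart.
triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=0.2,   # illustrative margin
)

# Dummy anchor/positive/negative batches standing in for real triplets.
anchor_imgs, positive_imgs, negative_imgs = (torch.randn(8, 3, 224, 224) for _ in range(3))
loss = triplet_loss(model(anchor_imgs), model(positive_imgs), model(negative_imgs))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```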
The training data for the Retail Item Embedding model was cropped from the training and fine-tuning data of the Retail Item Detection model [add link here], so it consists of both synthetic data and real data.
Specifically, the model was trained on a mixture of 0.6 million synthetic images and roughly 50k real images. During training, the triplet loss is optimized on this mixture; during validation, the accuracy of the similarity search is measured, with synthetic reference data and real query data. This setup helps the model overcome the simulation-to-reality gap by learning class features from both synthetic and real image sources.
The training data contains multiple viewing angles of each retail item, so the model is trained to recognize a retail item from an arbitrary angle.
dataset | total #images | train #images | val #images |
---|---|---|---|
Synthetic data | 600,000 | 570,000 | 30,000 |
Real data | 53,476 | 47,569 | 5,907 |
The real training images were cropped from the Retail Item Detection datasets using ground-truth bounding boxes and categories annotated by human labellers. Note that this model does not currently support re-training; it supports inference only, on both seen and unseen classes. To run inference on your own datasets, follow the guidelines below.
Reference data is the database used for similarity search during the inference stage of the Retail Item Embedding model. The prediction for a query image is determined by the L2 distances between extracted features: the algorithm selects, via a k-means-based search over the reference database, the reference object with the smallest L2 distance to the query object, and the predicted class is the class of that reference object.
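As a minimal illustration of this lookup (a plain nearest-neighbor search over L2 distances in NumPy, without the k-means acceleration; the function and data below are hypothetical):

```python
import numpy as np

def classify_queries(query_embs: np.ndarray,
                     ref_embs: np.ndarray,
                     ref_labels: np.ndarray) -> np.ndarray:
    """Assign each query the label of its nearest reference embedding (L2 distance)."""
    # Pairwise L2 distances with shape (num_queries, num_refs)
    dists = np.linalg.norm(query_embs[:, None, :] - ref_embs[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return ref_labels[nearest]

# Toy example: 3 queries against 10 reference embeddings spanning 4 classes
rng = np.random.default_rng(0)
ref_embs = rng.normal(size=(10, 2048)).astype(np.float32)
ref_labels = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
query_embs = rng.normal(size=(3, 2048)).astype(np.float32)
print(classify_queries(query_embs, ref_embs, ref_labels))
```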
Therefore, to achieve the highest accuracy for retail item recognition, the reference data should match the inference data as closely as possible in terms of background, occlusion, object orientation, and so on.
For instance, if you only want to recognize retail items facing the camera, you can collect only the front side of each item as reference data. If, on the other hand, you want the Retail Item Embedding model to recognize items at whatever angle they are presented, collect more orientations of each item in the reference dataset.
Generally, 20-30 images per class of reference data gives the best trade-off between accuracy and collection effort; collecting more reference examples, say 100 images per class, improves accuracy further.
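One possible way to assemble such a reference database, reusing the RetailItemEmbedder sketch above (the directory layout, file extension, and helper name are assumptions for illustration, not part of any released tooling):

```python
from pathlib import Path
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image

# Hypothetical layout: reference_root/<class_name>/<image>.jpg
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def build_reference_db(reference_root: str, model) -> tuple[np.ndarray, list[str]]:
    """Embed every reference image; the class label is taken from its folder name."""
    embs, labels = [], []
    with torch.no_grad():
        for img_path in sorted(Path(reference_root).glob("*/*.jpg")):
            x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
            embs.append(model(x).squeeze(0).numpy())
            labels.append(img_path.parent.name)
    return np.stack(embs), labels

# Usage sketch:
# ref_embs, ref_labels = build_reference_db("reference_root", RetailItemEmbedder().eval())
```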
Below are the guidelines for the specific conditions of the images:
Same as the reference data guidelines.
Note that the Retail Item Embedding model can never correctly classify a retail item whose class is not present in the reference dataset.
To get the most accurate predictions, avoid challenging the Retail Item Embedding model with uninformative views, such as the top of a soda can (this view can look identical across many different retail items).
The evaluation of the Retail Item Embedding model was measured on 100k images spanning 2,000 classes from an Aliproducts subset. The 2,000 classes were selected on the criterion that each has more than 150 training images. Note that the evaluation images come from the Aliproducts training split, because the validation split contains only 2-4 images per class, which is not enough for this test. None of the test classes were seen by the Retail Item Embedding model during training.
Aliproducts subset classes: 2000 Aliproducts classes list
Mean accuracy across the classes is calculated; a short sketch of the metric follows the table. The KPIs for the evaluation data are reported in the table below.
#test images/class | #images/class in reference database | overall mean class accuracy (%) |
---|---|---|
50 | 1 | 44.21 |
50 | 2 | 53.91 |
50 | 3 | 59.00 |
50 | 4 | 62.44 |
50 | 5 | 64.95 |
50 | 6 | 66.81 |
50 | 7 | 68.31 |
50 | 8 | 69.49 |
50 | 9 | 70.43 |
50 | 10 | 71.30 |
50 | 20 | 76.31 |
50 | 30 | 78.50 |
50 | 40 | 79.95 |
50 | 50 | 80.93 |
50 | 60 | 81.63 |
50 | 70 | 82.22 |
50 | 80 | 82.81 |
50 | 90 | 83.22 |
50 | 100 | 83.66 |
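Here, mean class accuracy is taken to be the unweighted average of per-class accuracies, so every class counts equally regardless of how many test images it has. A minimal NumPy sketch (the function name is ours, not part of any released tooling):

```python
import numpy as np

def mean_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted average of per-class accuracies."""
    classes = np.unique(y_true)
    per_class = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(per_class))

# Toy example: class 0 is classified perfectly, class 1 at 50%, so the mean is 0.75
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0])
print(mean_class_accuracy(y_true, y_pred))   # 0.75
```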
Inference is run on the provided unpruned model at FP16 precision with a model input resolution of 224x224. The inference performance is measured with trtexec on Jetson AGX Orin 64GB and A10. The numbers shown are inference-only; end-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software. Throughput follows directly from batch size and latency, as the short check after the table illustrates.
model | device | batch size | Latency (ms) | Images per second |
---|---|---|---|---|
Retail Item Embedding | Jetson AGX Orin 64GB | 1 | 1.59 | 627 |
Retail Item Embedding | Jetson AGX Orin 64GB | 16 | 12.83 | 1247 |
Retail Item Embedding | Jetson AGX Orin 64GB | 32 | 23.61 | 1356 |
Retail Item Embedding | Tesla A10 | 1 | 0.98 | 1018 |
Retail Item Embedding | Tesla A10 | 16 | 5.95 | 2690 |
Retail Item Embedding | Tesla A10 | 64 | 20.61 | 3106 |
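A quick consistency check of the images-per-second column from batch size and per-batch latency, using the Jetson AGX Orin rows above (small differences come from rounding in the reported latencies):

```python
# images/sec = batch size / (latency in seconds)
measurements = [(1, 1.59), (16, 12.83), (32, 23.61)]   # (batch size, latency in ms), Jetson AGX Orin 64GB
for batch, latency_ms in measurements:
    print(f"batch {batch:2d}: {batch / (latency_ms / 1000.0):.0f} images/sec")
# Prints roughly 629, 1247, 1355 images/sec, matching the table up to rounding.
```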
This model does not currently support use as pretrained weights for transfer learning.
Here we give an example of using the Retail Item Embedding model together with Retail Item Detection [TODO: add url here] for an end-to-end video analytics application. To do so, deploy these models with DeepStream SDK 6.2. DeepStream SDK is a streaming analytics toolkit for building AI-based video analytics applications, and it supports direct integration of these models into the deepstream sample app.
Download and install DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide. The config files for the purpose-built models are located under /opt/nvidia/deepstream, the default DeepStream installation directory. This path will be different if you installed DeepStream in a different directory.
You will need config files from two folders. These files are provided in NVIDIA-AI-IOT (TODO: Update the URL when deepstream_tao_apps are merged with ???). Assume the repo is cloned under $DS_TAO_APPS_HOME. In $DS_TAO_APPS_HOME/configs/retailEmbedder_tao:
# Main config file driven by deepstream-mdx-perception
config.yml
# Header data for the metadata sent to a message broker
msgconv_sample_config.txt
# Embedder model (the secondary GIE module) inference settings
sgie_retailEmbedder_tao_config.yml
# Defines the video sources
sources.csv
Key Parameters in sgie_retailEmbedder_tao_config.yml
property:
net-scale-factor: 0.003921568627451
offsets: 0;0;0
model-color-format: 0
tlt-model-key: nvidia_tlt
tlt-encoded-model: ../../models/retailEmbedder/retailEmbedder.etlt
model-engine-file: ../../models/retailEmbedder/retailEmbedder.etlt_b16_gpu0_fp16.engine
infer-dims: 3;224;224
batch-size: 16
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode: 2
network-type: 100
interval: 0
## Infer Processing Mode 1=Primary Mode 2=Secondary Mode
process-mode: 2
output-tensor-meta: 1
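For context on net-scale-factor and offsets: Gst-nvinfer scales input pixels as y = net-scale-factor * (x - offsets), so with zero offsets and a scale of 1/255 the 8-bit input is normalized to [0, 1]. A small NumPy illustration of that arithmetic (not DeepStream code):

```python
import numpy as np

NET_SCALE_FACTOR = 0.003921568627451   # = 1/255
OFFSETS = np.array([0.0, 0.0, 0.0])    # per-channel offsets (means); zero here

def nvinfer_style_preprocess(frame_rgb: np.ndarray) -> np.ndarray:
    """Mimics the Gst-nvinfer scaling: y = net-scale-factor * (x - offsets)."""
    return NET_SCALE_FACTOR * (frame_rgb.astype(np.float32) - OFFSETS)

# An 8-bit pixel value of 255 maps to ~1.0, and 0 maps to 0.0.
print(nvinfer_style_preprocess(np.array([[[0, 128, 255]]], dtype=np.uint8)))
```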
And in $DS_TAO_APPS_HOME/configs/retailDetector_tao:
# 100-class detector (the primary GIE) inference setting
pgie_retailDetector_100_config.yml
pgie_retailDetector_100_config.txt
# Binary-class detector (the primary GIE) inference setting
pgie_retailDetector_binary_config.yml
pgie_retailDetector_binary_config.txt
Go to $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app and run:
cd $DS_TAO_APPS_HOME/apps/tao_others/deepstream-mdx-perception-app
deepstream-mdx-perception-app -c <retailEmbedder directory>/config.yml
The "Deploying to DeepStream" chapter of TAO User Guide provides detailed documentation.
The NVIDIA Retail Item Embedding model was trained to classify objects larger than 10x10 pixels. Therefore, it may produce poor results when classifying objects smaller than 10x10 pixels.
When objects are occluded or truncated such that less than 40% of the object is visible, they may not be correctly classified by the Retail Item Embedding model. Partial occlusion by a hand is acceptable, as the model was trained with examples having random occlusions.
The Retail Item Embedding model was trained on RGB images. Therefore, monochrome or IR camera images may not produce good results.
The Retail Item Embedding model was not trained on fish-eye lens cameras or moving cameras. Therefore, the model may not perform well on warped images or images with motion-induced or other blur.
Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International workshop on similarity-based pattern recognition. Springer, Cham, 2015.
Na, Shi, Liu Xumin, and Guan Yong. "Research on k-means clustering algorithm: An improved k-means clustering algorithm." 2010 Third International Symposium on Intelligent Information Technology and Security Informatics. IEEE, 2010.
License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.
The NVIDIA Retail Item Embedding model classifies retail items. However, no additional information, such as people and other distractors in the background, is inferred. The training and evaluation datasets consist mostly of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.