NVIDIA
NVIDIA
SyntheticaDETR
Model
NVIDIA
NVIDIA
SyntheticaDETR

SytheticaDETR is a real-time object detection model based on a transformer architecture trained entirely in simulation and works on real images zero-shot.

Model Overview

SyntheticaDETR model is an efficient real-time object detection transformer based neural network, inspired by RT-DETR [1]. It is designed to take RGB images as input and detect objects within the scene with corresponding detection confidence. The output of the network are 2D bounding boxes on the detected objects. SyntheticaDETR is named for its training data, which is 100% synthetically generated providing detection accuracies that exceed similar models which include real data in their training set. The SyntheticaDETR model is fully trained for a set of objects, separated into models for robot grasping, and non-grasp commercial applications. Samples of both grasp and non-grasp objects are seen in Figure. 1 and 2 below.

Figure 1 (Left) SyntheticaDETR detects grocery objects for robot manipulation.
Figure 2 (Right) SyntheticaDETR detects a mobile robot charging station dock.

Architecture

SyntheticaDETR pipeline consists of four modules: A backbone module, an efficient hybrid encoder, a query selector and a decoder head. We assess two backbone architectures: a ConvNeXt-small architecture [2] and a ResNet-50 architecture. The backbone extracts multi-scale features from the image space for downstream feedforward to the hybrid encoder. The encoder tokenizes the features into embeddings and uses attention as a filter to obtain the most important features. The query selector is further used to select some image features to serve as initial object queries for the decoder. Lastly, the decoder uses auxiliary prediction heads to iteratively optimize the object queries and generate bounding boxes and associated confidence scores. We threshold the confidence value at 0.9 for the model trained on grasp objects and 0.7 for the model trained on non-grasp (AMR) objects.

How to Use this Model

Our model can be deployed in a wide variety of scenarios including robotic arm manipulation tasks, autonomous robot navigation, target tracking in delivery robots and so on. The model is intended to be run using NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. The model is intended for Ubuntu 22.04 with CUDA 11.6 or Jetpack 6.0 DP and later. This model can only be used with TensorRT. The two models in this card are named sdetr_grasp.etlt, named to identify objects that can be grasped and sdetr_amr.etlt, which refers to the model trained on AMR - non-grasp objects.

Instructions to Convert Model to TensorRT Engine Plan

In order to use this model, convert the etlt files to TensorRT engine plans using TAO. The model is encrypted and will only operate with the key rtdetr. Please make sure to use this key with the TAO command to convert it into a TensorRT engine plan.

Convert the SyntheticaDETR pose model to an engine plan:

./tao-converter -k sdetr -t fp16 -e sdetr_grasp.plan -p images,1x3x640x640,2x3x640x640,4x3x640x640 -p orig_target_sizes,1x2,2x2,4x2  sdetr_grasp.etlt

./tao-converter -k sdetr -t fp16 -e sdetr_amr.plan -p images,1x3x640x640,2x3x640x640,4x3x640x640 -p orig_target_sizes,1x2,2x2,4x2  sdetr_amr.etlt

SyntheticaDETR models can be run in ROS 2 using NVIDIA’s Isaac ROS RT-DETR package. The Isaac ROS SyntheticaDETR quickstart includes all required preprocessing and postprocessing to deliver accurate pose estimation in standard ROS message formats.

sample inference results

Figure 3 & 4. Object detection on objects in the YCB dataset.

Intended Use

Our model can be deployed in a wide variety of scenarios including robotic arm manipulation tasks, autonomous robot navigation, target tracking in delivery robots and so on.

Training Dataset

Figure 5 (left) Training synthetic data example with RGB, and object label generated in Isaac Sim for manipulation objects. Figure 6 (right) We included non-grasp objects in our training dataset.

We generated 13 million SDG image samples using Isaac Sim Replicator for training. Each scene is procedurally generated with random domain parameters to diversify the color, lighting, shape and other factors, leading to robust generalization of the network during training. To render the scenes in Omniverse, we select 3D assets from the YCB-V [3] datasets, Google Scanned Objects (GSO)[4], random 3D assets in the wild and other custom grocery assets.

Training

During training, we use experiment with 2 backbone models: Resnet-50 and ConvNeXt-Small. We re-train the end-to-end pipeline for about 20 epochs on a large synthetically generated dataset (SDG). We used the classification loss and localization loss to formulate the training objectives. The classification loss is the log-likelihood loss and it is used to compute the difference in the predicted target object classes from corresponding groundtruth. The localization loss is averaged Intersection over Union (IoU). It is used to optimize the accuracy of predictions on bounding box coordinates. Lastly, we introduce various augmentation techniques to enhance regularization during training.

Key Performance Indicators KPIs

The key performance indicators are mean average precision and mean average recall. They are standard metrics defined for object detection and used to rank different methods on the BOP challenge dataset.

DataModel NamemAPmAR
SyntheticSyntheticaDETR0.8390.860
Synthetic and RealSyntheticaDETR0.8850.903

Table 1.0 Showing results of 2D amodal detection on the YCB-V BOP benchmark for 21 classes at 640x480 resolution. Result is shown for ConvNeXt-Small backbone.

The metrics shown in Table 1.0 indicates SyntheticaDETR sets new state-of-the-art records on the BOP Leaderboard at the time of publication.

Realtime Inference Performance

In addition, we assess the real-time inference performance of two version models at FP16 precision. The following profiling numbers are obtained using tensorrt’s trtexec tool on Jetson Orin and other NVIDIA GPUs.

PlatformThoughput (FPS, Resnet50)Thoughput (FPS, ConvNeXt-small)
Jetson Orin86.10731.6705
RTX 4060Ti278.151104.536
RTX 4090540.533219.724

Table 2.0 Showing Inference performance numbers on NVIDIA Jetson AGX Orin, RTX 4090, and RTX 4060Ti.

Limitations

NVIDIA SyntheticaDETR model was trained to be used as an object detector for objects. The detected objects in poor lighting conditions could be inaccurate.

References

[1] Lv, Wenyu, et al. "Detrs beat yolos on real-time object detection." arXiv preprint arXiv:2304.08069 (2023).
[2] Liu, Zhuang, et al. "A convnet for the 2020s." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[3] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan and Dieter Fox. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In Robotics: Science and Systems (RSS), 2018.
[4] Downs, Laura, et al. "Google scanned objects: A high-quality dataset of 3d scanned household items." 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022.

Ethical Considerations

We trained SyntheticaDETR using only synthetically generated data. The network learns geometry, and semantic understanding from SDG without additional need for real data. Thus, there are no ethical concerns for this work.

Acknowledgements

We acknowledge the sources of assets used to render scenes and objects featured in our datasets. The source of our training dataset is YCB dataset. The assets were scanned from home goods and items purchased online. Please see the source list here.

APPENDIX

Object Classes

For reference, we make a list of all the target classes we used to train the SyntheticaDETR model.

Grasp ModelObject ClassesPurchase Link
002_master_chef_canN/A. Available Online
003_cracker_boxN/A. Available Online
004_sugar_boxN/A. Available Online
005_tomato_soup_canN/A. Available Online
YCB Models006_mustard_bottleN/A. Available Online
007_tuna_fish_canN/A. Available Online
008_pudding_boxN/A. Available Online
009_gelatin_boxN/A. Available Online
010_potted_meat_canN/A. Available Online
011_bananaLink
019_pitcher_baseLink
021_bleach_cleanserN/A. Available Online
024_bowlLink
025_mugN/A.
035_power_drillLink
036_wood_blockN/A. Available Online
037_scissorsLink
040_large_markerLink
051_large_clampLink
052_extra_large_clampLink
061_foam_brickLink
Non-grasp ModelObject Classes
Nova Carter Dock
Nova Carter
Pallets
Forklifts
Publisher
NVIDIA
NVIDIA
Latest Version1.0.0
UpdatedDecember 3, 2024 UTC
Compressed Size322.68 MB

NVIDIA uses cookies to improve your experience on our web site. We and our third-party partners also use cookies and other tools to collect and record information you provide as well as information about your interactions with our websites for performance improvement, analytics, and to assist in marketing efforts. By clicking "Accept All", you consent to our use of cookies and other tools as described in our Cookie Policy. You can manage your cookie settings by clicking on "Manage Settings." By continuing to use this site or by clicking one of the buttons below, you agree to our Terms of Service (which contains important waivers). Please see our Privacy Policy for more information on our privacy practices.