The SyntheticaDETR model is an efficient, real-time, transformer-based object detection network inspired by RT-DETR [1]. It takes RGB images as input and detects objects within the scene with corresponding detection confidences. The outputs of the network are 2D bounding boxes for the detected objects. SyntheticaDETR is named for its training data, which is 100% synthetically generated, yet it achieves detection accuracies that exceed those of similar models whose training sets include real data. SyntheticaDETR is fully trained for a fixed set of objects, split into one model for robot grasping and one for non-grasp commercial applications. Samples of both grasp and non-grasp objects are shown in Figures 1 and 2 below.
The SyntheticaDETR pipeline consists of four modules: a backbone, an efficient hybrid encoder, a query selector, and a decoder head. We assess two backbone architectures: ConvNeXt-Small [2] and ResNet-50. The backbone extracts multi-scale features from the image, which are fed forward to the hybrid encoder. The encoder tokenizes the features into embeddings and uses attention as a filter to retain the most important features. The query selector then selects a subset of image features to serve as initial object queries for the decoder. Lastly, the decoder uses auxiliary prediction heads to iteratively refine the object queries and generate bounding boxes with associated confidence scores. We threshold the confidence value at 0.9 for the model trained on grasp objects and 0.7 for the model trained on non-grasp (AMR) objects, as sketched below.
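The thresholding step can be pictured with the minimal sketch below. It is a hypothetical post-processing snippet, not the shipped pipeline; the tensor shapes and the `filter_detections` helper are assumptions for illustration only.

```python
import numpy as np

# Hypothetical post-processing sketch (not the shipped pipeline): keep only
# the decoder outputs whose best class score clears the per-model threshold.
CONF_THRESHOLD = {"grasp": 0.9, "amr": 0.7}

def filter_detections(scores, boxes, model_variant="grasp"):
    """scores: (num_queries, num_classes); boxes: (num_queries, 4) in pixels."""
    labels = scores.argmax(axis=-1)        # best class per query
    confidences = scores.max(axis=-1)      # confidence of that class
    keep = confidences >= CONF_THRESHOLD[model_variant]
    return boxes[keep], labels[keep], confidences[keep]
```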
Our model can be deployed in a wide variety of scenarios, including robotic arm manipulation, autonomous robot navigation, target tracking in delivery robots, and more. The model is intended to be run using NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. The model is intended for Ubuntu 22.04 with CUDA 11.6, or JetPack 6.0 DP and later. This model can only be used with TensorRT. The two models in this card are sdetr_grasp.etlt, trained to identify objects that can be grasped, and sdetr_amr.etlt, trained on AMR (non-grasp) objects.
In order to use this model, convert the .etlt files to TensorRT engine plans using TAO. The model is encrypted and will only operate with the key sdetr, as used in the commands below. Please make sure to use this key with the TAO command to convert it into a TensorRT engine plan.
```bash
./tao-converter -k sdetr -t fp16 -e sdetr_grasp.plan -p images,1x3x640x640,2x3x640x640,4x3x640x640 -p orig_target_sizes,1x2,2x2,4x2 sdetr_grasp.etlt
./tao-converter -k sdetr -t fp16 -e sdetr_amr.plan -p images,1x3x640x640,2x3x640x640,4x3x640x640 -p orig_target_sizes,1x2,2x2,4x2 sdetr_amr.etlt
```
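To illustrate how the optimization profiles above map to the engine's two named inputs, the following Python sketch prepares an "images" tensor (Nx3x640x640) and an "orig_target_sizes" tensor (Nx2). The resize/normalization and the (width, height) ordering are assumptions made for illustration; verify them against the official preprocessing before relying on them.

```python
import cv2
import numpy as np

# Sketch of input preparation matching the tao-converter profiles above.
# NOTE: the /255 normalization and the (width, height) order of
# orig_target_sizes are assumptions, not the documented preprocessing.
def prepare_inputs(image_path):
    bgr = cv2.imread(image_path)
    orig_h, orig_w = bgr.shape[:2]
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (640, 640)).astype(np.float32) / 255.0
    images = np.transpose(resized, (2, 0, 1))[None, ...]              # 1x3x640x640
    orig_target_sizes = np.array([[orig_w, orig_h]], dtype=np.int64)  # 1x2
    return images, orig_target_sizes
```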
SyntheticaDETR models can be run in ROS 2 using NVIDIA's Isaac ROS RT-DETR package. The Isaac ROS SyntheticaDETR quickstart includes all required preprocessing and postprocessing to deliver accurate 2D object detections in standard ROS message formats.
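For example, a downstream node can consume the detections as standard vision_msgs messages. In this sketch the topic name "detections_output" and the exact field layout are assumptions; check the Isaac ROS RT-DETR quickstart and your vision_msgs version for the actual interface.

```python
import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection2DArray

# Hedged example of consuming SyntheticaDETR detections in ROS 2.
# The topic name "detections_output" is an assumption; field names such as
# bbox.center.position may differ across vision_msgs versions.
class DetectionListener(Node):
    def __init__(self):
        super().__init__("sdetr_listener")
        self.create_subscription(
            Detection2DArray, "detections_output", self.on_detections, 10)

    def on_detections(self, msg):
        for det in msg.detections:
            best = max(det.results, key=lambda r: r.hypothesis.score)
            self.get_logger().info(
                f"class={best.hypothesis.class_id} score={best.hypothesis.score:.2f} "
                f"center=({det.bbox.center.position.x:.0f}, {det.bbox.center.position.y:.0f})")

def main():
    rclpy.init()
    rclpy.spin(DetectionListener())

if __name__ == "__main__":
    main()
```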
We generated 13 million SDG (synthetic data generation) image samples using Isaac Sim Replicator for training. Each scene is procedurally generated with randomized domain parameters to diversify color, lighting, shape, and other factors, leading to robust generalization of the network during training. To render the scenes in Omniverse, we select 3D assets from the YCB-V dataset [3], Google Scanned Objects (GSO) [4], random 3D assets in the wild, and other custom grocery assets.
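The flavor of this domain randomization can be sketched with Omniverse Replicator, as below. Asset paths, parameter ranges, and frame counts are placeholders, not the configuration used to produce the 13M-image dataset.

```python
import omni.replicator.core as rep

# Illustrative Replicator sketch of the randomization described above.
# All paths and ranges are placeholders, not the actual generation config.
with rep.new_layer():
    camera = rep.create.camera(position=(0, 0, 100))
    render_product = rep.create.render_product(camera, (640, 480))

    objects = rep.create.from_usd("/path/to/assets/003_cracker_box.usd")  # placeholder asset
    lights = rep.create.light(light_type="Sphere", count=2)

    with rep.trigger.on_frame(num_frames=1000):
        with objects:
            rep.modify.pose(
                position=rep.distribution.uniform((-30, -30, 0), (30, 30, 20)),
                rotation=rep.distribution.uniform((0, 0, 0), (360, 360, 360)))
        with lights:
            rep.modify.attribute("intensity", rep.distribution.uniform(1000, 30000))

    # Write RGB frames plus tight 2D bounding-box labels for detector training.
    writer = rep.WriterRegistry.get("BasicWriter")
    writer.initialize(output_dir="_out_sdg", rgb=True, bounding_box_2d_tight=True)
    writer.attach([render_product])
```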
During training, we experiment with two backbone models: ResNet-50 and ConvNeXt-Small. We re-train the end-to-end pipeline for about 20 epochs on the large synthetically generated (SDG) dataset. We use a classification loss and a localization loss to formulate the training objective. The classification loss is a log-likelihood loss that measures the difference between the predicted object classes and the corresponding ground truth. The localization loss is an averaged Intersection over Union (IoU) loss that optimizes the accuracy of the predicted bounding-box coordinates; a sketch is shown below. Lastly, we introduce various augmentation techniques to enhance regularization during training.
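As a hedged illustration of the localization term, the snippet below computes a plain averaged IoU loss for axis-aligned boxes; the actual training code may use a variant (for example, GIoU) and a different box parameterization.

```python
import torch

# Simplified averaged-IoU localization loss for boxes in (x1, y1, x2, y2)
# format. Illustrative only; the real objective may use a GIoU-style variant.
def iou_loss(pred, target, eps=1e-7):
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_target = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_pred + area_target - inter + eps)
    return (1.0 - iou).mean()  # averaged over matched predictions
```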
The key performance indicators are mean average precision (mAP) and mean average recall (mAR). They are standard object detection metrics and are used to rank methods on the BOP challenge datasets; a sketch of computing COCO-style values appears after Table 1.0.
| Data | Model Name | mAP | mAR |
|---|---|---|---|
| Synthetic | SyntheticaDETR | 0.839 | 0.860 |
| Synthetic and Real | SyntheticaDETR | 0.885 | 0.903 |
Table 1.0: Results of 2D amodal detection on the YCB-V BOP benchmark for 21 classes at 640x480 resolution, using the ConvNeXt-Small backbone.
The metrics in Table 1.0 indicate that SyntheticaDETR set a new state-of-the-art record on the BOP leaderboard at the time of publication.
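For reference, mAP and mAR of the kind reported above can be computed with COCO-style tooling, as in the hedged sketch below; the BOP challenge has its own evaluation toolkit, and the file names here are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# COCO-style mAP/mAR computation; placeholder file names, and not the
# official BOP evaluation toolkit.
coco_gt = COCO("annotations.json")               # ground-truth boxes
coco_dt = coco_gt.loadRes("detections.json")     # model detections
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
mean_ap = evaluator.stats[0]   # AP @ IoU=0.50:0.95, all areas, 100 dets
mean_ar = evaluator.stats[8]   # AR @ IoU=0.50:0.95, all areas, 100 dets
```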
In addition, we assess the real-time inference performance of both model variants at FP16 precision. The following profiling numbers were obtained using TensorRT's trtexec tool on Jetson AGX Orin and other NVIDIA GPUs.
| Platform | Throughput (FPS, ResNet-50) | Throughput (FPS, ConvNeXt-Small) |
|---|---|---|
| Jetson AGX Orin | 86.107 | 31.6705 |
| RTX 4060 Ti | 278.151 | 104.536 |
| RTX 4090 | 540.533 | 219.724 |
Table 2.0: Inference performance on NVIDIA Jetson AGX Orin, RTX 4060 Ti, and RTX 4090.
The NVIDIA SyntheticaDETR model was trained to detect the specific object classes listed below. Detections may be inaccurate under poor lighting conditions.
[1] Lv, Wenyu, et al. "DETRs Beat YOLOs on Real-Time Object Detection." arXiv preprint arXiv:2304.08069 (2023).
[2] Liu, Zhuang, et al. "A ConvNet for the 2020s." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[3] Xiang, Yu, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes." Robotics: Science and Systems (RSS), 2018.
[4] Downs, Laura, et al. "Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items." 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022.
We trained SyntheticaDETR using only synthetically generated data. The network learns geometric and semantic understanding from SDG without any need for real data. Thus, there are no ethical concerns for this work.
We acknowledge the sources of the assets used to render the scenes and objects featured in our datasets. The primary source of our training assets is the YCB dataset, whose assets were scanned from home goods and items purchased online. Please see the source list here.
For reference, we list all the target classes used to train the SyntheticaDETR models.
| Grasp Model | Object Classes | Purchase Link |
|---|---|---|
| YCB Models | 002_master_chef_can | N/A. Available Online |
| | 003_cracker_box | N/A. Available Online |
| | 004_sugar_box | N/A. Available Online |
| | 005_tomato_soup_can | N/A. Available Online |
| | 006_mustard_bottle | N/A. Available Online |
| | 007_tuna_fish_can | N/A. Available Online |
| | 008_pudding_box | N/A. Available Online |
| | 009_gelatin_box | N/A. Available Online |
| | 010_potted_meat_can | N/A. Available Online |
| | 011_banana | Link |
| | 019_pitcher_base | Link |
| | 021_bleach_cleanser | N/A. Available Online |
| | 024_bowl | Link |
| | 025_mug | N/A. |
| | 035_power_drill | Link |
| | 036_wood_block | N/A. Available Online |
| | 037_scissors | Link |
| | 040_large_marker | Link |
| | 051_large_clamp | Link |
| | 052_extra_large_clamp | Link |
| | 061_foam_brick | Link |
| Non-grasp Model | Object Classes |
|---|---|
| | Nova Carter Dock |
| | Nova Carter |
| | Pallets |
| | Forklifts |
| Grocery objects | Object Classes |
|---|---|
| Grocery Objects | 365 Mac & Cheese |
| | 365 Facial Tissue |
| | 365 Peanut Butter |
| | 365 Olive Can |
| | 365 Hand Sanitizing Wipes |
| | Skipjack Tuna in water |