SyntheticaDETR
Description
SyntheticaDETR is a real-time, transformer-based object detection model trained entirely in simulation that works on real images zero-shot.
Publisher
NVIDIA
Latest Version
1.0.0
Modified
December 3, 2024
Size
322.68 MB

Model Overview

SyntheticaDETR is an efficient, real-time, transformer-based object detection neural network inspired by RT-DETR [1]. It takes RGB images as input and detects objects within the scene, outputting 2D bounding boxes with corresponding detection confidences. SyntheticaDETR is named for its training data, which is 100% synthetically generated, yet its detection accuracy exceeds that of similar models which include real data in their training sets. SyntheticaDETR is fully trained for a set of objects and released as two models: one for robot grasping and one for non-grasp commercial applications. Samples of both grasp and non-grasp objects are shown in Figures 1 and 2 below.

Figure 1 (Left) SyntheticaDETR detects grocery objects for robot manipulation.
Figure 2 (Right) SyntheticaDETR detects a mobile robot charging station dock.

Architecture

The SyntheticaDETR pipeline consists of four modules: a backbone, an efficient hybrid encoder, a query selector, and a decoder head. We assess two backbone architectures: ConvNeXt-Small [2] and ResNet-50. The backbone extracts multi-scale features from the image, which are fed forward to the hybrid encoder. The encoder tokenizes the features into embeddings and uses attention as a filter to retain the most important ones. The query selector then selects a subset of image features to serve as initial object queries for the decoder. Lastly, the decoder uses auxiliary prediction heads to iteratively refine the object queries and generate bounding boxes with associated confidence scores. We threshold the confidence value at 0.9 for the model trained on grasp objects and at 0.7 for the model trained on non-grasp (AMR) objects.
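In formula form, the final detection set keeps only detections d_i whose confidence score s_i clears the per-model threshold (notation is ours, for illustration):

\hat{D} = \{\, d_i \mid s_i \ge \tau \,\}, \qquad \tau_{\mathrm{grasp}} = 0.9, \quad \tau_{\mathrm{AMR}} = 0.7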

How to Use this Model

The model is intended to be run using NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, and targets Ubuntu 22.04 with CUDA 11.6, or JetPack 6.0 DP and later. The model can only be used with TensorRT. This card includes two models: sdetr_grasp.etlt, trained to detect objects that can be grasped, and sdetr_amr.etlt, trained on AMR (non-grasp) objects.

Instructions to Convert Model to TensorRT Engine Plan

To use this model, convert the .etlt files to TensorRT engine plans using tao-converter. The models are encrypted and will only operate with the key sdetr; pass this key via the -k flag when converting. The -p flags specify the optimization profile for each dynamic input as comma-separated minimum, optimal, and maximum shapes.

Convert the SyntheticaDETR detection models to engine plans:

./tao-converter -k sdetr -t fp16 -e sdetr_grasp.plan -p images,1x3x640x640,2x3x640x640,4x3x640x640 -p orig_target_sizes,1x2,2x2,4x2  sdetr_grasp.etlt

./tao-converter -k sdetr -t fp16 -e sdetr_amr.plan -p images,1x3x640x640,2x3x640x640,4x3x640x640 -p orig_target_sizes,1x2,2x2,4x2  sdetr_amr.etlt
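After conversion, you can sanity-check each engine plan by loading it with TensorRT's trtexec tool. The command below is a minimal sketch, assuming trtexec is on your PATH (on JetPack installs it is typically found at /usr/src/tensorrt/bin/trtexec):

# Load the generated plan and run inference at batch size 1
trtexec --loadEngine=sdetr_grasp.plan --shapes=images:1x3x640x640,orig_target_sizes:1x2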

SyntheticaDETR models can be run in ROS 2 using NVIDIA's Isaac ROS RT-DETR package. The Isaac ROS SyntheticaDETR quickstart includes all required preprocessing and postprocessing to deliver accurate object detections in standard ROS message formats.
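As a rough sketch, running the grasp model through the quickstart might look like the following; the launch file name and argument are illustrative assumptions, so consult the Isaac ROS RT-DETR quickstart documentation for the exact interface of your release:

# Hypothetical invocation: the launch file and argument names are assumptions
ros2 launch isaac_ros_rtdetr isaac_ros_rtdetr.launch.py \
    engine_file_path:=/path/to/sdetr_grasp.plan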

Sample Inference Results

Figures 3 & 4. Object detection on objects from the YCB dataset.

Intended Use

Our model can be deployed in a wide variety of scenarios, including robotic arm manipulation, autonomous robot navigation, and target tracking in delivery robots.

Training Dataset

Figure 5 (left) Example synthetic training data with RGB image and object labels, generated in Isaac Sim for manipulation objects. Figure 6 (right) Non-grasp objects included in our training dataset.

We generated 13 million synthetic data generation (SDG) image samples for training using Isaac Sim Replicator. Each scene is procedurally generated with random domain parameters that diversify color, lighting, shape, and other factors, leading to robust generalization of the network during training. To render the scenes in Omniverse, we select 3D assets from the YCB-V dataset [3], Google Scanned Objects (GSO) [4], random 3D assets in the wild, and other custom grocery assets.

Training

During training, we experiment with two backbone models: ResNet-50 and ConvNeXt-Small. We re-train the end-to-end pipeline for about 20 epochs on a large synthetically generated (SDG) dataset. The training objective combines a classification loss and a localization loss. The classification loss is a log-likelihood loss that measures the difference between the predicted object classes and the corresponding ground truth. The localization loss is an averaged Intersection over Union (IoU) loss that optimizes the accuracy of the predicted bounding-box coordinates. Lastly, we apply various augmentation techniques to enhance regularization during training.
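A minimal sketch of this objective, assuming the two terms are combined with weights \lambda_{\mathrm{cls}} and \lambda_{\mathrm{box}} (the exact weighting is not specified here):

\mathcal{L} = \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{box}}, \qquad \mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{N}\log p_i(c_i^{*}), \qquad \mathcal{L}_{\mathrm{box}} = \frac{1}{N}\sum_{i=1}^{N}\bigl(1 - \mathrm{IoU}(b_i, b_i^{*})\bigr)

where p_i(c_i^{*}) is the predicted probability of the ground-truth class c_i^{*}, and b_i, b_i^{*} are the predicted and ground-truth bounding boxes.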

Key Performance Indicators (KPIs)

The key performance indicators are mean average precision and mean average recall. They are standard metrics defined for object detection and used to rank different methods on the BOP challenge dataset.
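Both metrics average per-class average precision and recall over the set of object classes C (the BOP protocol additionally averages over a range of IoU thresholds):

\mathrm{mAP} = \frac{1}{|C|}\sum_{c \in C} \mathrm{AP}_c, \qquad \mathrm{mAR} = \frac{1}{|C|}\sum_{c \in C} \mathrm{AR}_c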

Data                 Model Name       mAP     mAR
Synthetic            SyntheticaDETR   0.839   0.860
Synthetic and Real   SyntheticaDETR   0.885   0.903

Table 1.0: Results of 2D amodal detection on the YCB-V BOP benchmark for 21 classes at 640x480 resolution, using the ConvNeXt-Small backbone.

The metrics shown in Table 1.0 indicate that SyntheticaDETR set a new state-of-the-art record on the BOP leaderboard at the time of publication.

Realtime Inference Performance

In addition, we assess the real-time inference performance of both model variants at FP16 precision. The following profiling numbers are obtained using TensorRT's trtexec tool on Jetson Orin and other NVIDIA GPUs.

Platform      Throughput (FPS, ResNet-50)   Throughput (FPS, ConvNeXt-Small)
Jetson Orin   86.107                        31.6705
RTX 4060 Ti   278.151                       104.536
RTX 4090      540.533                       219.724

Table 2.0: Inference performance on NVIDIA Jetson AGX Orin, RTX 4060 Ti, and RTX 4090.
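These throughput figures can be reproduced with a trtexec command along the following lines; the warm-up and iteration settings are illustrative choices, not necessarily the exact ones used above:

# Profile the grasp model plan at batch size 1 (example settings)
trtexec --loadEngine=sdetr_grasp.plan \
    --shapes=images:1x3x640x640,orig_target_sizes:1x2 \
    --warmUp=500 --iterations=200 --avgRuns=100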

Limitations

The NVIDIA SyntheticaDETR model was trained as an object detector for a fixed set of target objects. Detections under poor lighting conditions may be inaccurate.

References

[1] Lv, Wenyu, et al. "DETRs Beat YOLOs on Real-time Object Detection." arXiv preprint arXiv:2304.08069 (2023).
[2] Liu, Zhuang, et al. "A ConvNet for the 2020s." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[3] Xiang, Yu, et al. "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes." Robotics: Science and Systems (RSS), 2018.
[4] Downs, Laura, et al. "Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items." 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022.

Ethical Considerations

We trained SyntheticaDETR using only synthetically generated data. The network learns geometry and semantic understanding from SDG without the need for additional real data. Thus, there are no ethical concerns for this work.

Acknowledgements

We acknowledge the sources of the assets used to render the scenes and objects featured in our datasets. The source of our training dataset is the YCB dataset, whose assets were scanned from home goods and items purchased online. Please see the source list here.

APPENDIX

Object Classes

For reference, we list all the target classes used to train the SyntheticaDETR models.

Grasp Model Object Classes (YCB Models)    Purchase Link
002_master_chef_can                        N/A. Available Online
003_cracker_box                            N/A. Available Online
004_sugar_box                              N/A. Available Online
005_tomato_soup_can                        N/A. Available Online
006_mustard_bottle                         N/A. Available Online
007_tuna_fish_can                          N/A. Available Online
008_pudding_box                            N/A. Available Online
009_gelatin_box                            N/A. Available Online
010_potted_meat_can                        N/A. Available Online
011_banana                                 Link
019_pitcher_base                           Link
021_bleach_cleanser                        N/A. Available Online
024_bowl                                   Link
025_mug                                    N/A.
035_power_drill                            Link
036_wood_block                             N/A. Available Online
037_scissors                               Link
040_large_marker                           Link
051_large_clamp                            Link
052_extra_large_clamp                      Link
061_foam_brick                             Link
Non-grasp Model Object Classes
Nova Carter Dock
Nova Carter
Pallets
Forklifts
Grocery Objects Object Classes
365 Mac & Cheese
365 Facial Tissue
365 Peanut Butter
365 Olive Can
365 Hand Sanitizing Wipes
Skipjack Tuna in water