EfficientDet For PyTorch

Description: A convolution-based neural network for the task of object detection
Publisher: NVIDIA
Use Case: Object Detection
Framework: PyTorch
Latest Version: 21.06.1
Modified: March 2, 2022
Compressed Size: 449.48 KB

Benchmarking

Benchmarking can be performed for both training and inference, and both benchmark scripts run the EfficientDet model. The precision (AMP, TF32, or FP32) and target GPU are selected by choosing the corresponding variant of the benchmarking script.

Training performance benchmark

Training benchmarking can be performed by running the script:

scripts/D0/train-benchmark_{AMP, TF32, FP32}_{V100-32G, A100-80G}.sh
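
For example, to benchmark mixed-precision training on an A100 80GB, the pattern above expands to the following command (a sketch; it assumes you run it from the repository root inside the NGC container):

bash ./scripts/D0/train-benchmark_AMP_A100-80G.sh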

Inference performance benchmark

Inference benchmarking can be performed by running the script:

scripts/D0/inference_{AMP, FP32, TF32}_{A100-80G, V100-32G}.sh
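
Likewise for inference, e.g. FP32 on a V100 32GB (again assuming you run from the repository root inside the container):

bash ./scripts/D0/inference_FP32_V100-32G.sh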

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training Accuracy Results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/D0/train_{AMP, TF32}_8xA100-80G.sh training script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs, with no intermediate evaluation.
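
For example, the mixed-precision variant of that pattern is launched as follows (a sketch assuming the repository root inside the container):

bash ./scripts/D0/train_AMP_8xA100-80G.sh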

GPUs | BBOX mAP - TF32 | BBOX mAP - FP16 | Time to train - TF32 (hours) | Time to train - mixed precision (hours) | Speedup (TF32 to mixed precision)
8 | 0.3399 | 0.3407 | 8.57 | 6.5 | 1.318
Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the scripts/D0/train_{AMP, FP32}_8xV100-32G.sh training script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs, with no intermediate evaluation.

GPUs | BBOX mAP - FP32 | BBOX mAP - FP16 | Time to train - FP32 (hours) | Time to train - mixed precision (hours) | Speedup (FP32 to mixed precision)
8 | 0.3410 | 0.3413 | 16 | 10.5 | 1.52
Training accuracy: NVIDIA DGX-1 (32x V100 32GB)

Our results were obtained by running the scripts/D0/train_{AMP, FP32}_32xV100-32G.sh training script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX-1 with 32x V100 32GB GPUs, with no intermediate evaluation.

GPUs | BBOX mAP - FP32 | BBOX mAP - FP16 | Time to train - FP32 (hours) | Time to train - mixed precision (hours) | Speedup (FP32 to mixed precision)
32 | 0.3418 | 0.3373 | 6 | 4.95 | 1.22
Training accuracy on Waymo dataset: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/waymo/train_waymo_AMP_8xA100-80G.sh training script on the Waymo dataset, in the PyTorch 21.06-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs, with no intermediate evaluation. They were obtained by training the EfficientDet-D0 model with a frozen backbone.

category | metric | value
L2_ALL_NS | mAP | 50.377
Vehicle | AP @ IoU 0.7 | 50.271
Pedestrian | AP @ IoU 0.5 | 61.788
Cyclist | AP @ IoU 0.5 | 39.072

The following results were obtained by training the EfficientDet-D0 model without freezing any part of the architecture. This can be done by removing the --freeze_layer argument from the script.
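
One way to remove the flag in place is shown below (a sketch; it assumes --freeze_layer appears in the script as a standalone switch, and you can just as well edit the file by hand):

sed -i 's/ --freeze_layer//' scripts/waymo/train_waymo_AMP_8xA100-80G.sh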

category | metric | value
L2_ALL_NS | mAP | 51.249
Vehicle | AP @ IoU 0.7 | 51.091
Pedestrian | AP @ IoU 0.5 | 62.816
Cyclist | AP @ IoU 0.5 | 39.841
Training loss curves

[Figure: training loss curve]

Here, multihead loss is simply the weighted sum of losses on the classification head and the bounding box head.
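
In symbols (the weights below are illustrative assumptions, not the repository's exact hyperparameters):

multihead_loss = w_cls * classification_loss + w_box * box_regression_loss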

Training Stability Test

The following table compares mAP scores across five training runs with different seeds. The runs show consistent convergence across all five seeds, with very little deviation.

Config | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Mean | Standard deviation
8 GPUs, final AP BBox | 0.3422 | 0.3379 | 0.3437 | 0.3424 | 0.3402 | 0.3412 | 0.002

Training Performance Results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/D0/train_benchmark_{AMP, TF32}_8xA100-80G.sh training script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers, in images per second, were averaged over an entire training epoch.

GPUs | Throughput - TF32 (images/s) | Throughput - mixed precision (images/s) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision
1 | 170 | 255 | 1.5 | 1 | 1
4 | 616 | 866 | 1.4 | 3.62 | 3.39
8 | 1213 | 1835 | 1.5 | 7.05 | 7.05
Training performance: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the scripts/D0/train_benchmark_{AMP, FP32}_8xV100-32G.sh training script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32GB GPUs. Performance numbers, in images per second, were averaged over an entire training epoch.

GPUs | Throughput - FP32 (images/s) | Throughput - mixed precision (images/s) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
1 | 110 | 186 | 1.69 | 1 | 1
4 | 367 | 610 | 1.66 | 3.33 | 3.28
8 | 613 | 1040 | 1.69 | 5.57 | 5.59

To achieve similar results, follow the steps in the Quick Start Guide.

Inference Performance Results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the scripts/D0/inference_{AMP, TF32}_A100-80G.sh inference script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.

GPUs | Batch size / GPU | Throughput - TF32 (images/s) | Throughput - mixed precision (images/s) | Throughput speedup (TF32 to mixed precision)
1 | 8 | 45.61 | 50.23 | 1.101

To achieve similar results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

Our results were obtained by running the scripts/D0/inference_{AMP, FP32}_V100-32G.sh inference script in the PyTorch 21.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32GB GPU. Performance numbers, in images per second, were averaged over the full inference run.

GPUs | Batch size / GPU | Throughput - FP32 (images/s) | Throughput - mixed precision (images/s) | Throughput speedup (FP32 to mixed precision)
1 | 8 | 38.81 | 42.25 | 1.08

To achieve the same results, follow the steps in the Quick Start Guide.