EfficientDet For PyTorch

A convolution-based neural network for the task of object detection
Latest Version: April 4, 2023
Compressed Size: 449.48 KB

EfficientDet is a convolution-based neural network for the task of object detection. This model is based on the paper EfficientDet: Scalable and Efficient Object Detection. NVIDIA's PyTorch implementation of EfficientDet is an optimized version of the TensorFlow Model Garden implementation. It leverages mixed precision arithmetic on the NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere GPU architectures for faster training times while maintaining target accuracy.

The repository also contains scripts to launch training, benchmarking, and inference routines in a Docker container interactively.

The major differences between the official implementation of the paper and our version of EfficientDet are as follows:

  • Mixed precision support with PyTorch AMP.
  • Multi-node training support.
  • Custom fused CUDA kernels for faster computations.
  • Lightweight logging using dllogger.
  • PyTorch multi-tensor ops for faster computation.

These optimizations improve model performance and reduce training time by a factor of 1.3x, so you can perform more efficient object detection with no additional effort.

Other publicly available implementations of EfficientDet include:

Model architecture

EfficientDet is a one-stage detector with the following architecture components:

  • ImageNet-pretrained EfficientNet backbone
  • Weighted bi-directional feature pyramid network (BiFPN)
  • Bounding box and classification heads
  • A compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time
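As a rough illustration of compound scaling, the sketch below evaluates the scaling equations from the EfficientDet paper for a compound coefficient phi. The names ScaledConfig and compound_scale are hypothetical and do not come from this repository. The published configurations round the BiFPN width and deviate from these formulas for the largest models, so treat the numbers as a guide, not as the values used in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    """Hypothetical container for the quantities scaled by phi."""
    phi: int
    input_resolution: int  # square input size in pixels
    bifpn_width: float     # BiFPN channels before rounding
    bifpn_depth: int       # number of BiFPN layers
    head_depth: int        # layers in the box and class prediction nets


def compound_scale(phi: int) -> ScaledConfig:
    """Evaluate the paper's scaling equations for a coefficient phi."""
    return ScaledConfig(
        phi=phi,
        input_resolution=512 + 128 * phi,
        bifpn_width=64 * (1.35 ** phi),
        bifpn_depth=3 + phi,
        head_depth=3 + phi // 3,
    )


if __name__ == "__main__":
    for phi in range(4):
        print(compound_scale(phi))
```

A single coefficient drives every dimension, which is what lets one family of models cover a wide range of compute budgets.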

Default Configuration

The default configuration of this model can be found in train.py. The default hyperparameters are as follows:

  • General:

    • Base Global Learning Rate set to 0.01
    • Epochs set to 300
    • Local train batch size set to 32
    • Local test batch size set to 32
  • Backbone:

    • Backend network set to EfficientNet-B0
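The defaults above can be pictured as a small configuration object. The following is only a sketch: TrainDefaults, its field names, and global_batch_size are hypothetical and are not the argument names used in train.py. It also shows how a per-GPU ("local") batch size relates to the global batch size in data-parallel training.

```python
from dataclasses import dataclass


@dataclass
class TrainDefaults:
    """Hypothetical mirror of the default hyperparameters listed above."""
    base_lr: float = 0.01          # base global learning rate
    epochs: int = 300
    train_batch_size: int = 32     # local (per-GPU) train batch size
    eval_batch_size: int = 32      # local (per-GPU) test batch size
    backbone: str = "EfficientNet-B0"


def global_batch_size(cfg: TrainDefaults, num_gpus: int) -> int:
    """Each GPU processes its own local batch, so the global batch is the product."""
    return cfg.train_batch_size * num_gpus


if __name__ == "__main__":
    cfg = TrainDefaults()
    print(cfg)
    print("global batch on 8 GPUs:", global_batch_size(cfg, 8))
```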

This repository implements multi-GPU training to support larger batches, along with mixed precision. This implementation also includes the following optimizations.

  • Custom CUDA kernels for Focal Loss and NMS.

  • Custom optimized implementation of EMA.

    The source files can be found under effdet/csrc.
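For readers unfamiliar with the two operations named above, here is a minimal pure-Python reference of what they compute. It is not the repository's CUDA code. The function names are hypothetical, and the alpha, gamma, and decay values are illustrative placeholders, not the repository's settings.

```python
import math


def sigmoid_focal_loss(logit: float, target: int,
                       alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for one binary prediction: -alpha_t * (1 - p_t)**gamma * log(p_t).

    Not numerically hardened; extreme logits are out of scope for this sketch.
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def ema_update(ema_weights, weights, decay: float = 0.999):
    """Exponential moving average: ema = decay * ema + (1 - decay) * weight."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


if __name__ == "__main__":
    # A confident correct prediction contributes far less than a confident wrong one.
    print("easy example:", sigmoid_focal_loss(4.0, 1))
    print("hard example:", sigmoid_focal_loss(-4.0, 1))
    print("ema step:", ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9))
```

The (1 - p_t)**gamma factor is what down-weights easy examples, which matters for one-stage detectors where most anchors are easy background.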

Feature support matrix

The model supports the following features.

Feature                      EfficientDet
PyTorch native AMP           Yes
PyTorch native DDP           Yes
Custom fused CUDA kernels    Yes


PyTorch native AMP is part of PyTorch, which provides convenience methods for mixed precision.

DDP stands for DistributedDataParallel and is used for multi-GPU training.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in NVIDIA Volta, and continuing with the NVIDIA Turing and NVIDIA Ampere architectures, switching to mixed precision has yielded significant training speedups, up to 3x overall on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

  1. Porting the model to use the FP16 data type where appropriate.

  2. Adding loss scaling to preserve small gradient values.
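The need for loss scaling in step 2 can be seen with a small standard-library experiment. The sketch below uses Python's struct module to round values to IEEE half precision; the gradient value and the scale factor are arbitrary illustrative choices, not values used by this repository.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8          # illustrative gradient below the FP16 subnormal range
loss_scale = 65536.0      # illustrative power-of-two scale factor (2**16)

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled = to_fp16(tiny_grad)

# With scaling, the value survives FP16 storage and is unscaled in higher precision.
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

if __name__ == "__main__":
    print("stored without scaling:", unscaled)
    print("recovered with scaling:", recovered)
```

Scaling the loss scales every gradient by the same factor, so dividing by that factor before the optimizer step restores the original magnitudes.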

For information about NVIDIA Apex tools for mixed precision training, refer to NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.

Enabling mixed precision

In this repository, mixed precision training is enabled by the PyTorch native AMP library. PyTorch has an automatic mixed precision module that allows mixed precision to be enabled with minimal code changes.

Automatic mixed precision can be enabled with the following code changes:

  # Create gradient scaler
  scaler = torch.cuda.amp.GradScaler(enabled=args.amp)
  # Wrap the forward pass and loss in torch.cuda.amp.autocast
  with torch.cuda.amp.autocast(enabled=args.amp):
    output = model(input, target)
    loss = output['loss']

Here, args.amp is the flag that turns AMP on or off. All shell scripts accept an --amp argument to enable mixed precision training.
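The snippet above covers only the forward pass. A complete training step also routes the backward pass and the optimizer step through the gradient scaler. The following is a minimal sketch of that pattern using a toy linear model and an MSE loss as stand-ins; train_step, the model, and the data are hypothetical and are not the repository's training loop. AMP is enabled only when a CUDA device is present, so the sketch also runs on CPU.

```python
import torch
from torch import nn


def train_step(model, optimizer, scaler, inputs, targets, use_amp):
    """One optimization step with optional automatic mixed precision."""
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and the loss under autocast.
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(inputs)
        loss = nn.functional.mse_loss(outputs, targets)
    # Scale the loss before backward so small gradients survive FP16.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    # Adjust the scale factor for the next iteration.
    scaler.update()
    return loss.item()


device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # stand-in for args.amp

model = nn.Linear(8, 2).to(device)  # toy model, not EfficientDet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(4, 8, device=device)
targets = torch.randn(4, 2, device=device)
loss_value = train_step(model, optimizer, scaler, inputs, targets, use_amp)
```

When the scaler is created with enabled=False, its methods pass values through unchanged, so the same loop serves both the FP32 and mixed precision paths.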

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models that require a high dynamic range for weights or activations.
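The dynamic-range point can be illustrated numerically. TF32 keeps the 8-bit exponent of FP32 but only 10 mantissa bits, while FP16 has a 5-bit exponent. The sketch below emulates TF32 by truncating an FP32 bit pattern, which is an approximation because hardware conversion may round instead of truncate. The helper names are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE half precision; values beyond its range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(float("inf"), x)


def to_tf32(x: float) -> float:
    """Approximate TF32 by zeroing the low 13 mantissa bits of the FP32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1e-8, 1.0, 1e10):
        print(value, "fp16:", to_fp16(value), "tf32:", to_tf32(value))
```

Very small and very large values that underflow or overflow in FP16 remain representable in TF32, at roughly three decimal digits of precision.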

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
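PyTorch exposes TF32 through two backend flags, which can be inspected or set explicitly. The sketch below is a general PyTorch illustration and does not reproduce anything from this repository's scripts. The values printed before the assignment depend on the PyTorch version and container in use, so no particular default is assumed here.

```python
import torch

# Inspect the current settings; defaults vary by PyTorch version and container.
print("matmul TF32 allowed:", torch.backends.cuda.matmul.allow_tf32)
print("cuDNN TF32 allowed:", torch.backends.cudnn.allow_tf32)

# Explicitly allow TF32 for matrix multiplications and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

Setting both flags to False instead forces full FP32 math, which can be useful when comparing accuracy between the two modes.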