This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and is available for download from NGC.
EfficientDet is a convolution-based neural network for the task of object detection. This model is based on EfficientDet: Scalable and Efficient Object Detection. NVIDIA's PyTorch implementation of EfficientDet is an optimized version of the TensorFlow Model Garden implementation, leveraging mixed precision arithmetic on the NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere GPU architectures for faster training times while maintaining target accuracy.
The repository also contains scripts to launch training, benchmarking, and inference routines in a Docker container interactively.
The major differences between the official implementation of the paper and our version of EfficientDet are as follows:
These techniques/optimizations improve model performance and reduce training time by a factor of 1.3x, allowing you to perform more efficient object detection with no additional effort.
Other publicly available implementations of EfficientDet include:
EfficientDet is a one-stage detector with the following architecture components:
The default configuration of this model can be found at train.py. The default hyper-parameters are as follows:
General:
Backbone:
This repository implements multi-GPU training to support larger batch sizes, as well as mixed precision training. This implementation also includes the following optimizations:
Custom CUDA kernels for Focal Loss and NMS.
Custom optimized implementation of EMA.
The source files can be found under effdet/csrc.
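As a rough illustration of what an EMA (exponential moving average) of model weights computes, the following is a minimal pure-Python sketch. It is not the custom optimized implementation in this repository; the function name `ema_update` and the decay value are hypothetical and chosen only for the example.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return the updated average: ema = decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Hypothetical toy run: two "weights" held constant for three steps.
ema = [0.0, 0.0]
for step in range(3):
    weights = [1.0, 2.0]
    ema = ema_update(ema, weights, decay=0.5)  # small decay so the effect is visible

print(ema)
```

With each step the averaged weights move toward the current weights; a decay closer to 1 makes the average change more slowly.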
The model supports the following features.
Feature | EfficientDet |
---|---|
PyTorch native AMP | Yes |
PyTorch native DDP | Yes |
Custom Fused CUDA kernels | Yes |
PyTorch native AMP is part of PyTorch, which provides convenience methods for mixed precision.
DDP stands for DistributedDataParallel and is used for multi-GPU training.
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of tensor cores in NVIDIA Volta, and following with both the NVIDIA Turing and NVIDIA Ampere Architectures, significant training speedups are observed by switching to mixed precision—up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
Porting the model to use the FP16 data type where appropriate.
Adding loss scaling to preserve small gradient values.
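To make the second step concrete, the sketch below shows why loss scaling preserves small gradient values. It uses only the Python standard library to round numbers to half precision. The gradient value and the static scale factor are hypothetical; frameworks typically choose and adjust the scale dynamically.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # hypothetical small gradient value
loss_scale = 1024.0  # hypothetical static loss scale

unscaled = to_fp16(grad)             # too small for FP16, rounds to zero
scaled = to_fp16(grad * loss_scale)  # scaled value survives FP16 rounding
recovered = scaled / loss_scale      # unscale in higher precision

print(unscaled, recovered)
```

Scaling the loss multiplies every gradient by the same factor, so values that would otherwise round to zero in FP16 stay representable and can be divided back down before the weight update.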
For information about:
How to train using mixed precision, refer to the Mixed Precision Training paper and Training With Mixed Precision documentation.
Techniques used for mixed precision training, refer to the Mixed-Precision Training of Deep Neural Networks blog.
NVIDIA Apex tools for mixed precision training, refer to the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.
In this repository, mixed precision training is enabled by the PyTorch native AMP library. PyTorch has an automatic mixed precision module that allows mixed precision to be enabled with minimal code changes.
Automatic mixed precision can be enabled with the following code changes:
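
    # Create the gradient scaler
    scaler = torch.cuda.amp.GradScaler(enabled=args.amp)

    # Wrap the forward pass and loss computation in torch.cuda.amp.autocast
    with torch.cuda.amp.autocast(enabled=args.amp):
        output = model(input, target)
        loss = output['loss']

    # Assumed continuation (standard GradScaler usage; optimizer is the training optimizer)
    scaler.scale(loss).backward()  # backpropagate the scaled loss
    scaler.step(optimizer)         # unscale gradients; skip the step if they contain inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration
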
Here, args.amp is the flag that turns AMP on or off. All shell scripts accept an --amp argument to enable mixed precision training.
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models that require a high dynamic range for weights or activations.
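The dynamic-range point can be illustrated with a small standard-library sketch. It assumes the commonly described TF32 layout (the 8-bit exponent of FP32 with a 10-bit mantissa) and simulates it by rounding an FP32 bit pattern; actual hardware rounding may differ. The helper names and the sample value are hypothetical.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Simulate TF32 storage: FP32 exponent range, mantissa cut to 10 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest at the 13 dropped mantissa bits, then clear them.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as an IEEE 754 half-precision value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


value = 70000.0  # hypothetical activation above the FP16 maximum of 65504
tf32_value = round_to_tf32(value)
print(fits_fp16(value), tf32_value)
```

The value overflows FP16 but stays finite in the simulated TF32 format, at the cost of reduced mantissa precision relative to full FP32.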
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.