This resource is using open-source code maintained in github (see the quick-start-guide section) and available for download from NGC
EfficientNet is an image classification model family. It was first described in EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. The scripts provided enable you to train the EfficientNet-B0, EfficientNet-B4, EfficientNet-WideSE-B0 and, EfficientNet-WideSE-B4 models.
EfficientNet-WideSE models use Squeeze-and-Excitation layers wider than original EfficientNet models, the width of SE module is proportional to the width of Depthwise Separable Convolutions instead of block width.
WideSE models are slightly more accurate than original models.
This model is trained with mixed precision using Tensor Cores on Volta and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results over 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
We use NHWC data layout when training using Mixed Precision.
The following sections highlight the default configurations for the EfficientNet models.
Optimizer
This model uses RMSprop with the following hyperparameters:
Optimizer for QAT
This model uses SGD optimizer for B0 models and RMSPROP optimizer alpha=0.853 epsilon=0.00422 for B4 models. Other hyperparameters we used are:
Data augmentation
This model uses the following data augmentation:
For training:
For inference:
The following features are supported by this model:
Feature | EfficientNet |
---|---|
DALI | Yes (without autoaugmentation) |
APEX AMP | Yes |
QAT | Yes |
NVIDIA DALI
DALI is a library accelerating data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader with the DALI library. For more information about DALI, refer to the DALI product documentation.
We use NVIDIA DALI, which speeds up data loading when CPU becomes a bottleneck. DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
Run training with --data-backends dali-gpu
or --data-backends dali-cpu
to enable DALI.
For DGXA100 and DGX1 we recommend --data-backends dali-cpu
.
DALI currently does not support Autoaugmentation, so for best accuracy it has to be disabled.
A PyTorch extension that contains utility libraries, such as Automatic Mixed Precision (AMP), which require minimal network code changes to leverage Tensor Cores performance. Refer to the Enabling mixed precision section for more details.
Quantization aware training (QAT) is a method for changing precision to INT8 which speeds up the inference process at the price of a slight decrease of network accuracy. Refer to the Quantization section for more details.
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about:
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP), a library from APEX that casts variables to half-precision upon retrieval,
while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a loss scaling step must be included when applying gradients.
In PyTorch, loss scaling can be easily applied by using scale_loss()
method provided by AMP. The scaling value to be used can be dynamic or fixed.
For an in-depth walk through on AMP, check out sample usage here. APEX is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Cores performance.
To enable mixed precision, you can:
Import AMP from APEX:
from apex import amp
Wrap model and optimizer in amp.initialize
:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")
Scale loss before backpropagation:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
Quantization is the process of transforming deep learning models to use parameters and computations at a lower precision. Traditionally, DNN training and inference have relied on the IEEE single-precision floating-point format, using 32 bits to represent the floating-point model weights and activation tensors.
This compute budget may be acceptable at training as most DNNs are trained in data centers or in the cloud with NVIDIA V100 or A100 GPUs that have significantly large compute capability and much larger power budgets. However, during deployment, these models are most often required to run on devices with much smaller computing resources and lower power budgets at the edge. Running a DNN inference using the full 32-bit representation is not practical for real-time analysis given the compute, memory, and power constraints of the edge.
To help reduce the compute budget, while not compromising on the structure and number of parameters in the model, you can run inference at a lower precision. Initially, quantized inferences were run at half-point precision with tensors and weights represented as 16-bit floating-point numbers. While this resulted in compute savings of about 1.2–1.5x, there was still some compute budget and memory bandwidth that could be leveraged. In lieu of this, models are now quantized to an even lower precision, with an 8-bit integer representation for weights and tensors. This results in a model that is 4x smaller in memory and about 2–4x faster in throughput.
While 8-bit quantization is appealing to save compute and memory budgets, it is a lossy process. During quantization, a small range of floating-point numbers are squeezed to a fixed number of information buckets. This results in loss of information.
The minute differences which could originally be resolved using 32-bit representations are now lost because they are quantized to the same bucket in 8-bit representations. This is similar to rounding errors that one encounters when representing fractional numbers as integers. To maintain accuracy during inferences at a lower precision, it is important to try and mitigate errors arising due to this loss of information.
In QAT, the quantization error is considered when training the model. The training graph is modified to simulate the lower precision behavior in the forward pass of the training process. This introduces the quantization errors as part of the training loss, which the optimizer tries to minimize during the training. Thus, QAT helps in modeling the quantization errors during training and mitigates its effects on the accuracy of the model at deployment.
However, the process of modifying the training graph to simulate lower precision behavior is intricate. To run QAT, it is necessary to insert FakeQuantization nodes for the weights of the DNN Layers and Quantize-Dequantize (QDQ) nodes to the intermediate activation tensors to compute their dynamic ranges.
For more information, see this Quantization paper and Quantization-Aware Training documentation.
Tutorial for pytoch-quantization
library can be found here pytorch-quantization
tutorial.
It is important to mention that EfficientNet is NN, which is hard to quantize because the activation function all across the network is the SiLU (called also the Swish), whose negative values lie in very short range, which introduce a large quantization error. More details can be found in Appendix D of the Quantization paper.