
ResNet50 v1.5 for MXNet

Description: With modified architecture and initialization, this ResNet50 version gives ~0.5% better accuracy than the original.
Publisher: NVIDIA Deep Learning Examples
Latest Version: 22.10.0
Modified: December 13, 2022
Compressed Size: 41.76 KB

This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and is available for download from NGC.

The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model.

The difference between v1 and v1.5 is in the bottleneck blocks which require downsampling. ResNet v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.

This difference makes ResNet-50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
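
As a rough illustration of the stride placement, the sketch below traces the spatial size and the multiply-accumulate (MAC) count along the main path of one downsampling bottleneck block for both variants. The function names are hypothetical. The 56x56 input with 256/128/512 channels is an assumption chosen to resemble an early downsampling block of ResNet-50; the shortcut branch, batch norm, and activations are ignored.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_trace(size, c_in, c_mid, c_out, version):
    """Return (spatial sizes after each conv, total MACs) for the main path."""
    # v1 downsamples in the first 1x1 conv, v1.5 in the 3x3 conv.
    stride_1x1, stride_3x3 = (2, 1) if version == "v1" else (1, 2)
    layers = [
        # (kernel, stride, padding, input channels, output channels)
        (1, stride_1x1, 0, c_in, c_mid),
        (3, stride_3x3, 1, c_mid, c_mid),
        (1, 1, 0, c_mid, c_out),
    ]
    sizes, macs = [], 0
    for kernel, stride, padding, ch_in, ch_out in layers:
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
        macs += size * size * kernel * kernel * ch_in * ch_out
    return sizes, macs


v1_sizes, v1_macs = bottleneck_trace(56, 256, 128, 512, "v1")
v15_sizes, v15_macs = bottleneck_trace(56, 256, 128, 512, "v1.5")
print("v1   sizes:", v1_sizes, "MACs:", v1_macs)
print("v1.5 sizes:", v15_sizes, "MACs:", v15_macs)
```

Both variants end at the same output resolution. In v1.5 the first 1x1 convolution runs at full input resolution, which is where the extra compute in this block comes from, while the strided 1x1 convolution in v1 reads only one in four input positions.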

This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 3.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

The model architecture was presented in the Deep Residual Learning for Image Recognition paper. The main advantage of the model is its use of residual layers as building blocks, which helps gradient propagation during training.

Figure: Residual layer

Image source: Deep Residual Learning for Image Recognition
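
To make the gradient-propagation argument concrete, here is a toy sketch (not the model's code) that applies the chain rule to a stack of scalar layers. Each layer is assumed to compute f(x) = slope * x, and a residual layer outputs f(x) + x. The function name and the numbers are hypothetical and only meant to show the trend.

```python
def chain_gradient(num_layers, layer_slope, residual):
    """Gradient d(output)/d(input) of a stack of identical scalar layers.

    Each layer computes f(x) = layer_slope * x. With residual=True the
    layer output is f(x) + x, so its local derivative is layer_slope + 1.
    """
    local = layer_slope + 1.0 if residual else layer_slope
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= local  # chain rule: multiply the local derivatives
    return gradient


plain = chain_gradient(20, 0.01, residual=False)
skip = chain_gradient(20, 0.01, residual=True)
print(f"plain stack: {plain:.3e}, residual stack: {skip:.3f}")
```

In this toy setting the gradient through the plain stack shrinks towards zero, while the identity path in the residual stack keeps it close to 1.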

Default configuration

Optimizer

  • SGD with momentum (0.875)
  • Learning rate = 0.256 for a batch size of 256; for other batch sizes, we scale the learning rate linearly
  • Learning rate schedule - we use a cosine LR schedule
  • Linear warmup of the learning rate during the first 5 epochs according to Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
  • Weight decay: 3.0517578125e-05 (1/32768)
  • We do not apply WD on batch norm trainable parameters (gamma/bias)
  • Label Smoothing: 0.1
  • We train for:
    • 50 Epochs - configuration that reaches 75.9% top1 accuracy
    • 90 Epochs - the standard training length for ResNet-50
    • 250 Epochs - best possible accuracy. For 250 epoch training we also use MixUp regularization.
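
The learning rate settings above can be sketched as a single per-epoch function. This is a minimal sketch under stated assumptions, not the training script's code: the function name is hypothetical, the rate is updated once per epoch rather than per iteration, the warmup ramps as (epoch + 1) / warmup_epochs, and the cosine decays towards zero over the remaining epochs.

```python
import math


def learning_rate(epoch, total_epochs, batch_size,
                  base_lr=0.256, base_batch=256, warmup_epochs=5):
    """Per-epoch learning rate: linear scaling, linear warmup, cosine decay."""
    peak_lr = base_lr * batch_size / base_batch  # linear scaling rule
    if epoch < warmup_epochs:
        # Assumed warmup form: ramp linearly up to peak_lr.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 47, 89):
    lr = learning_rate(epoch, total_epochs=90, batch_size=1024)
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

With a batch size of 1024, the peak rate in this sketch is four times the 0.256 reference value and is reached at the end of warmup.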

Data augmentation

For training:

  • Normalization
  • Random resized crop to 224x224
    • Scale from 8% to 100%
    • Aspect ratio from 3/4 to 4/3
  • Random horizontal flip
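
The random resized crop parameters can be illustrated with a short sampling routine. This is a sketch of one common way to sample such a crop, not the DALI implementation used by the model: the function name, the log-uniform aspect ratio sampling, the retry count, and the full-image fallback are all assumptions.

```python
import math
import random


def sample_crop(width, height, rng,
                scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), attempts=10):
    """Sample a crop box (x, y, w, h) to be resized to 224x224 afterwards."""
    area = width * height
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)          # 8% to 100% of the image
        log_ratio = rng.uniform(math.log(ratio[0]), math.log(ratio[1]))
        aspect = math.exp(log_ratio)                      # w / h in [3/4, 4/3]
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Assumed fallback when no sampled box fits: use the whole image.
    return 0, 0, width, height


rng = random.Random(0)
flip = rng.random() < 0.5  # random horizontal flip with probability 0.5
print(sample_crop(500, 375, rng), "flip:", flip)
```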

For inference:

  • Normalization
  • Scale to 256x256
  • Center crop to 224x224
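
For the inference path, the center crop box follows directly from the two sizes. The helper below is a hypothetical illustration of that arithmetic, assuming a square image and integer offsets.

```python
def center_crop_box(image_size, crop_size):
    """Return (left, top, right, bottom) of a centered square crop."""
    if crop_size > image_size:
        raise ValueError("crop must not exceed the image size")
    offset = (image_size - crop_size) // 2
    return offset, offset, offset + crop_size, offset + crop_size


# Image scaled to 256x256, then cropped to 224x224.
print(center_crop_box(256, 224))
```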

Feature support matrix

Feature              ResNet-50 MXNet
DALI                 yes
Horovod Multi-GPU    yes

Features

The following features are supported by this model.

NVIDIA DALI - NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications.

Horovod Multi-GPU - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the official Horovod repository.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single precision, to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and continuing with the Turing and Ampere architectures, switching to mixed precision has yielded significant training speedups -- up to 3x overall on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

  1. Porting the model to use the FP16 data type where appropriate.
  2. Adding loss scaling to preserve small gradient values.
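
The second step can be illustrated without any framework. The sketch below uses Python's struct module to round values to half precision and shows a small gradient that vanishes in FP16 unless it is scaled first. The loss scale of 1024 and the gradient value are hypothetical, and real training applies the scale to the loss before the backward pass rather than to a single number.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


loss_scale = 1024.0     # hypothetical static loss scale
true_gradient = 1e-8    # smaller than anything FP16 can represent

unscaled = to_fp16(true_gradient)              # underflows in FP16
scaled = to_fp16(true_gradient * loss_scale)   # representable after scaling
recovered = scaled / loss_scale                # unscale in higher precision

print("without loss scaling:", unscaled)
print("with loss scaling:   ", recovered)
```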

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.


Enabling mixed precision

When using the Gluon API, perform the following steps to convert a model so that it supports computation in float16.

  1. Cast Gluon Block's parameters and expected input type to float16 by calling the cast method of the Block representing the network.

    net = net.cast('float16')
    
  2. Ensure the data input to the network is of float16 type. If your DataLoader or Iterator produces output in another datatype, then you have to cast your data. There are different ways you can do this. The easiest way is to use the astype method of NDArrays.

    data = data.astype('float16', copy=False)
    
  3. If you are using images and DataLoader, you can also use a Cast transform to convert the data.

     It is preferable to use the multi_precision mode of the optimizer when training in float16. In this mode, the optimizer maintains a master copy of the weights in float32 even when training (the forward and backward passes) runs in float16. This increases the precision of the weight updates and can lead to faster convergence in some scenarios.

    # The optimizer takes the learning rate as `learning_rate`, not `lr`.
    optimizer = mx.optimizer.create('sgd', multi_precision=True, learning_rate=0.01)
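
Why a float32 master copy helps can be shown with plain Python. This is a sketch of the idea behind multi_precision, not MXNet's implementation: the helper names and the update size of 1e-4 are hypothetical, and the rounding is emulated with the struct module.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


step = 1e-4  # hypothetical learning_rate * gradient for every update
fp16_weight = to_fp16(1.0)
master_weight = to_fp32(1.0)

for _ in range(100):
    # FP16-only update: the step is below FP16 resolution near 1.0 and is lost.
    fp16_weight = to_fp16(fp16_weight - step)
    # Master-copy update: accumulate in FP32, cast to FP16 only for compute.
    master_weight = to_fp32(master_weight - step)

print("fp16-only weight:     ", fp16_weight)
print("fp32 master weight:   ", master_weight)
print("master cast to fp16:  ", to_fp16(master_weight))
```

In this example the FP16-only weight never moves, while the master copy accumulates all 100 updates.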
    

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
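
The dynamic range point can be illustrated with a small emulation. This sketch assumes the TF32 layout described by NVIDIA (the FP32 sign and 8-bit exponent with a 10-bit mantissa), which is not spelled out in this document, and it truncates the mantissa instead of reproducing the hardware's rounding. The function names are hypothetical.

```python
import struct


def truncate_to_tf32(value):
    """Keep the FP32 sign and exponent but only the top 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFE000  # clear the low 13 of the 23 FP32 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_in_fp16(value):
    """Return True if the value can be packed as a finite half-precision float."""
    try:
        struct.pack("e", value)
        return True
    except OverflowError:
        return False


large_activation = 1.0e5
print("TF32 (emulated):", truncate_to_tf32(large_activation))
print("fits in FP16:   ", fits_in_fp16(large_activation))
```

The value is outside the FP16 range but stays representable in the emulated TF32 format with a small relative error, because TF32 keeps the FP32 exponent width.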

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.