EfficientNet V2 For TensorFlow2

NVIDIA Deep Learning Examples

EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

The training benchmark for EfficientNet v2-S was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB. In the command below, the braces list alternatives: choose one precision and one platform.

bash ./scripts/S/training/{AMP, FP32, TF32}/train_benchmark_8x{A100-80G, V100-32G}.sh
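As a minimal sketch, the helper below builds the concrete script path for one precision and one platform from the command template above. The function name `benchmark_script` is hypothetical and not part of the repository, and not every combination is guaranteed to exist in a given checkout, so check that the file is present before running it.

```python
from pathlib import PurePosixPath

# Alternatives listed inside the braces of the command template.
PRECISIONS = ("AMP", "FP32", "TF32")
PLATFORMS = ("A100-80G", "V100-32G")


def benchmark_script(precision: str, platform: str) -> str:
    """Return the training benchmark script path for one precision/platform pair.

    Hypothetical helper: it only fills in the template, it does not check
    that the script exists on disk.
    """
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    path = PurePosixPath(
        "scripts", "S", "training", precision, f"train_benchmark_8x{platform}.sh"
    )
    return f"./{path}"


# Print the command for mixed precision (AMP) on 8x A100 80GB.
print("bash", benchmark_script("AMP", "A100-80G"))
```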

Inference performance benchmark

The inference benchmark for EfficientNet v2-S was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.
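To illustrate what an inference benchmark of this kind measures, here is a minimal sketch of a timing loop that reports average throughput, average latency, and latency percentiles. It is not the repository's benchmarking script: the names `benchmark` and `percentile`, the warm-up count, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def benchmark(fn, batch_size, warmup=5, iters=50):
    """Time `fn` (one forward pass on one batch) and summarize the latencies.

    Assumption: `fn` blocks until its result is ready. For a GPU model this
    means forcing synchronization inside `fn`, for example by copying the
    output back to host memory.
    """
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        fn()
    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    avg_ms = sum(latencies_ms) / len(latencies_ms)
    return {
        "throughput_avg": batch_size * 1000.0 / avg_ms,  # items per second
        "latency_avg_ms": avg_ms,
        "latency_90_ms": percentile(latencies_ms, 90),
        "latency_95_ms": percentile(latencies_ms, 95),
        "latency_99_ms": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call to the real model.
    stats = benchmark(lambda: sum(range(100_000)), batch_size=8)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Excluding warm-up iterations matters in practice because the first few calls often include one-time costs such as graph tracing and memory allocation.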

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training results for EfficientNet v2-S

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the training scripts in the tensorflow:21.09-tf2-py3 NGC container on multi-node NVIDIA DGX A100 (8x A100 80GB) GPUs. We evaluated the models using both the original and EMA weights and selected the higher accuracy to report.

GPUs	Accuracy - TF32	Accuracy - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (TF32 to mixed precision)
8	83.87%	83.93%	32hrs	14hrs	2.28
16	83.89%	83.83%	16hrs	7hrs	2.28

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the training scripts in the tensorflow:21.09-tf2-py3 NGC container on multi-node NVIDIA DGX-1 (8x V100 32GB) GPUs. We evaluated the models using both the original and EMA weights and selected the higher accuracy to report.

GPUs	Accuracy - FP32	Accuracy - mixed precision	Time to train - FP32	Time to train - mixed precision	Time to train speedup (FP32 to mixed precision)
8	83.86%	84.0%	90.3hrs	55hrs	1.64
16	83.75%	83.87%	60.5hrs	28.5hrs	2.12
32	83.81%	83.82%	30.2hrs	15.5hrs	1.95
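The speedup column in the accuracy tables above is the ratio of the baseline (TF32 or FP32) training time to the mixed-precision training time. The short sketch below recomputes it from the reported hours; the function name is illustrative. Small differences from the tables, such as 2.29 versus the reported 2.28, are to be expected because the hours in the tables are themselves rounded.

```python
def time_to_train_speedup(baseline_hours: float, mixed_hours: float) -> float:
    """Speedup of mixed precision over the baseline precision."""
    return baseline_hours / mixed_hours


# (label, baseline hours, mixed-precision hours) copied from the tables above.
ROWS = [
    ("A100 TF32,  8 GPUs", 32.0, 14.0),
    ("A100 TF32, 16 GPUs", 16.0, 7.0),
    ("V100 FP32,  8 GPUs", 90.3, 55.0),
    ("V100 FP32, 16 GPUs", 60.5, 28.5),
    ("V100 FP32, 32 GPUs", 30.2, 15.5),
]

for label, baseline, mixed in ROWS:
    print(f"{label}: {time_to_train_speedup(baseline, mixed):.2f}x")
```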

Training performance results for EfficientNet v2-S

Training performance: NVIDIA DGX A100 (8x A100 80GB)

EfficientNet v2-S uses images of increasing resolution during training. Because throughput depends on image size, we measured throughput at the image size used in the last stage of training (300x300).

GPUs	Throughput - TF32 (img/s)	Throughput - mixed precision (img/s)	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
1	390	950	2.43	1	1
8	2800	6600	2.35	7.17	6.94
16	5950	14517	2.43	15.25	15.28

Training performance: NVIDIA DGX-1 (8x V100 32GB)

GPUs	Throughput - FP32 (img/s)	Throughput - mixed precision (img/s)	Throughput speedup (FP32 to mixed precision)	Weak scaling - FP32	Weak scaling - mixed precision
1	156	380	2.43	1	1
8	952	1774	1.86	6.10	4.66
16	1668	3750	2.25	10.69	9.86
32	3270	7250	2.2	20.96	19.07
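The derived columns in the throughput tables above follow from the raw throughput values: the speedup is mixed-precision throughput divided by baseline throughput, and weak scaling is the N-GPU throughput divided by the single-GPU throughput. The sketch below recomputes them for the DGX-1 table and also prints a scaling efficiency (weak scaling divided by GPU count), which is an extra derived figure not reported in the source. Function names are illustrative.

```python
def weak_scaling(throughput_n_gpus: float, throughput_1_gpu: float) -> float:
    """Throughput on N GPUs relative to throughput on a single GPU."""
    return throughput_n_gpus / throughput_1_gpu


def scaling_efficiency(throughput_n_gpus: float, throughput_1_gpu: float,
                       n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return weak_scaling(throughput_n_gpus, throughput_1_gpu) / n_gpus


# (GPUs, FP32 throughput, mixed-precision throughput) from the DGX-1 table.
V100_ROWS = [(1, 156, 380), (8, 952, 1774), (16, 1668, 3750), (32, 3270, 7250)]

_, base_fp32, base_amp = V100_ROWS[0]
for gpus, fp32, amp in V100_ROWS:
    print(
        f"{gpus:>2} GPUs: speedup {amp / fp32:.2f}, "
        f"weak scaling {weak_scaling(fp32, base_fp32):.2f} (FP32) / "
        f"{weak_scaling(amp, base_amp):.2f} (mixed), "
        f"mixed-precision efficiency {scaling_efficiency(amp, base_amp, gpus):.0%}"
    )
```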

Training EfficientNet v2-S at scale

10x NVIDIA DGX-1 V100 (8x V100 32GB)

We trained EfficientNet v2-S at scale using 10 DGX-1 machines, each with 8x V100 32GB GPUs. We used the same hyperparameters and NGC container as before. As before, throughput was measured in the last stage of training, and the reported accuracy is the better of the original-weights and EMA-weights results.

# Nodes	GPUs	Optimizer	Accuracy - mixed precision	Time to train - mixed precision	Time to train speedup	Throughput - mixed precision (img/s)	Throughput scaling
1	8	RMSPROP	84.0%	55hrs	1	1774	1
10	80	RMSPROP	83.76%	6.5hrs	8.46	16039	9.04

10x NVIDIA DGX A100 (8x A100 80GB)

We trained EfficientNet v2-S at scale using 10 DGX A100 machines, each with 8x A100 80GB GPUs. This setting has an effective batch size of 36800 (460x8x10), which calls for an optimizer designed for large-batch training. For this purpose, we used the nvLAMB optimizer with the following hyperparameters: lr_warmup_epochs=10, beta_1=0.9, beta_2=0.999, epsilon=0.000001, grad_global_clip_norm=1, lr_init=0.00005, weight_decay=0.00001. As before, we used the tensorflow:21.09-tf2-py3 NGC container and measured throughput in the last stage of training. The reported accuracy is the better of the original-weights and EMA-weights results.

# Nodes	GPUs	Optimizer	Accuracy - mixed precision	Time to train - mixed precision	Time to train speedup	Throughput - mixed precision (img/s)	Throughput scaling
1	8	RMSPROP	83.93%	14hrs	1	6600	1
10	80	nvLAMB	82.84%	1.84hrs	7.60	62130	9.41
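The arithmetic behind the 10-node A100 run can be sketched as follows: the effective batch size is the per-GPU batch multiplied by the number of GPUs per node and the number of nodes, and the two scaling columns are ratios against the single-node row. The dictionary simply gathers the nvLAMB hyperparameters quoted in the text; it is not the repository's configuration format, and the names `global_batch_size` and `NVLAMB_HPARAMS` are illustrative.

```python
def global_batch_size(per_gpu_batch: int, gpus_per_node: int, nodes: int) -> int:
    """Effective batch size across all GPUs in a data-parallel run."""
    return per_gpu_batch * gpus_per_node * nodes


# nvLAMB hyperparameters as quoted in the text, for reference only.
NVLAMB_HPARAMS = {
    "lr_warmup_epochs": 10,
    "beta_1": 0.9,
    "beta_2": 0.999,
    "epsilon": 1e-6,
    "grad_global_clip_norm": 1,
    "lr_init": 5e-5,
    "weight_decay": 1e-5,
}

batch = global_batch_size(per_gpu_batch=460, gpus_per_node=8, nodes=10)
print(f"effective batch size: {batch}")

# Ratios against the single-node (8 GPU) row of the table above.
print(f"time-to-train speedup: {14 / 1.84:.2f}")
print(f"throughput scaling:    {62130 / 6600:.2f}")
```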

Inference performance results for EfficientNet v2-S

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the inference benchmarking script in the tensorflow:21.09-tf2-py3 NGC container on an NVIDIA DGX A100 (1x A100 80GB) GPU.

FP16 Inference Latency

Batch size	Resolution	Throughput Avg (img/s)	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	384x384	29	33.99	33.49	33.69	33.89
8	384x384	204	39.14	38.61	38.82	39.03
32	384x384	772	41.35	40.64	40.90	41.15
128	384x384	1674	76.45	74.20	74.70	75.80
256	384x384	1960	130.57	127.34	128.74	130.27
512	384x384	2062	248.18	226.80	232.86	248.18
1024	384x384	2032	503.73	461.78	481.50	503.73

TF32 Inference Latency

Batch size	Resolution	Throughput Avg (img/s)	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	384x384	39	25.55	25.05	25.26	25.47
8	384x384	244	32.75	32.16	32.40	32.64
32	384x384	777	41.13	40.69	40.84	41.00
128	384x384	1000	127.94	126.71	127.12	127.64
256	384x384	1070	239.08	235.45	236.79	238.39
512	384x384	1130	452.71	444.64	448.18	452.71

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

Our results were obtained by running the inference benchmarking script in the tensorflow:21.09-tf2-py3 NGC container on an NVIDIA DGX-1 (1x V100 32GB) GPU.

FP16 Inference Latency

Batch size	Resolution	Throughput Avg (img/s)	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	384x384	29	33.99	33.49	33.69	33.89
8	384x384	184	43.37	42.80	43.01	43.26
32	384x384	592	52.96	53.20	53.45	53.72
128	384x384	933	136.98	134.44	134.79	136.05
256	384x384	988	258.94	251.56	252.86	257.92

FP32 Inference Latency

Batch size	Resolution	Throughput Avg (img/s)	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	384x384	45	22.02	21.87	21.93	21.99
8	384x384	260	30.73	30.33	30.51	30.67
32	384x384	416	76.89	76.57	76.65	76.74
128	384x384	460	278.24	276.56	276.93	277.74
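In the inference tables above, average throughput and average latency are tied together: throughput in images per second is approximately the batch size divided by the average latency in seconds. The sketch below checks this against a few rows copied from the tables; the function name is illustrative, and small deviations are expected from rounding in the reported values.

```python
def throughput_from_latency(batch_size: int, latency_avg_ms: float) -> float:
    """Images per second implied by a batch size and an average latency."""
    return batch_size / (latency_avg_ms / 1000.0)


# (label, batch size, latency avg in ms, reported throughput) from the tables.
ROWS = [
    ("A100 FP16", 8, 39.14, 204),
    ("A100 FP16", 128, 76.45, 1674),
    ("A100 TF32", 256, 239.08, 1070),
    ("V100 FP32", 8, 30.73, 260),
    ("V100 FP32", 128, 278.24, 460),
]

for label, batch, latency_ms, reported in ROWS:
    implied = throughput_from_latency(batch, latency_ms)
    print(f"{label} bs={batch:>3}: implied {implied:7.1f} img/s, reported {reported}")
```

Reading the tables this way also shows where larger batches stop paying off: once throughput flattens, doubling the batch size roughly doubles latency.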