EfficientNet V1 For TensorFlow2

NVIDIA Deep Learning Examples

EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

The training benchmark for EfficientNet v1-B0 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.

To benchmark training performance with other parameters, run:

bash ./scripts/B0/training/{AMP, FP32, TF32}/train_benchmark_8x{A100-80G, V100-32G}.sh

The training benchmark for EfficientNet v1-B4 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.

To benchmark training performance with other parameters, run:

bash ./scripts/B4/training/{AMP, FP32, TF32}/train_benchmark_8x{A100-80G, V100-32G}.sh
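The brace notation in the commands above is shorthand for a set of alternatives, not a literal path. The following sketch enumerates the paths the notation could expand to. It assumes every precision can be paired with every system, which may not hold: TF32 is specific to the A100, so check the scripts directory for the files that actually exist. The helper name `candidate_scripts` is hypothetical.

```python
from itertools import product

# Values copied from the placeholder notation in the commands above.
PRECISIONS = ["AMP", "FP32", "TF32"]
SYSTEMS = ["A100-80G", "V100-32G"]


def candidate_scripts(model):
    """Return every path the brace notation could expand to for one model.

    Not every combination is guaranteed to exist in the repository.
    """
    return [
        f"./scripts/{model}/training/{precision}/train_benchmark_8x{system}.sh"
        for precision, system in product(PRECISIONS, SYSTEMS)
    ]


for path in candidate_scripts("B0"):
    print("bash", path)
```

Calling `candidate_scripts("B4")` produces the same list for the B4 model.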

Inference performance benchmark

The inference benchmark for EfficientNet v1-B0 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.

The inference benchmark for EfficientNet v1-B4 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results for EfficientNet v1-B0

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the training scripts in the tensorflow:21.09-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. We evaluated the models using both the original and EMA weights and selected the higher accuracy to report.

GPUs	Accuracy - TF32	Accuracy - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (TF32 to mixed precision)
8	77.60%	77.59%	19.5hrs	8.5hrs	2.29
16	77.51%	77.48%	10hrs	4.5hrs	2.22
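The speedup column in the table above is the ratio of the two time-to-train columns. A minimal sketch of that arithmetic, using the hours reported in the table (the function name is illustrative):

```python
def time_to_train_speedup(baseline_hours, mixed_precision_hours):
    """Ratio of baseline (TF32 or FP32) training time to mixed-precision time."""
    return baseline_hours / mixed_precision_hours


# (TF32 hours, mixed-precision hours) per GPU count, copied from the table above.
rows = {8: (19.5, 8.5), 16: (10.0, 4.5)}

for gpus, (tf32_hours, amp_hours) in rows.items():
    speedup = time_to_train_speedup(tf32_hours, amp_hours)
    print(f"{gpus} GPUs: {speedup:.2f}x")
```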

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the training scripts in the tensorflow:21.09-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB) GPUs. We evaluated the models using both the original and EMA weights and selected the higher accuracy to report.

GPUs	Accuracy - FP32	Accuracy - mixed precision	Time to train - FP32	Time to train - mixed precision	Time to train speedup (FP32 to mixed precision)
8	77.67%	77.69%	49.0hrs	38.0hrs	1.29
32	77.55%	77.53%	11.5hrs	10hrs	1.15
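The accuracy figures above are the higher of two evaluations, one with the original weights and one with the EMA (exponential moving average) weights. As a rough illustration of the idea, the sketch below keeps a moving average of a toy weight vector and then reports the better of two accuracies. The decay value, the toy weights, and both function names are hypothetical. The EMA configuration used by the training scripts is not described in this document.

```python
def ema_update(ema_weights, weights, decay):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def reported_accuracy(acc_original, acc_ema):
    """Report whichever evaluation scored higher."""
    return max(acc_original, acc_ema)


# Hypothetical weights after three training steps (not real model weights).
weights_per_step = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

ema = list(weights_per_step[0])  # initialize the average from the first step
for step_weights in weights_per_step[1:]:
    ema = ema_update(ema, step_weights, decay=0.5)

print(ema)
```

In practice the decay is usually close to 1, so the averaged weights change slowly and smooth out step-to-step noise.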

Training accuracy results for EfficientNet v1-B4

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the training scripts in the tensorflow:21.09-tf2-py3 NGC container on multi-node NVIDIA DGX A100 (8x A100 80GB) GPUs. We evaluated the models using both the original and EMA weights and selected the higher accuracy to report.

GPUs	Accuracy - TF32	Accuracy - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (TF32 to mixed precision)
32	82.98%	83.13%	38hrs	14hrs	2.71
64	83.14%	83.05%	19hrs	7hrs	2.71

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the training scripts in the tensorflow:21.09-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB) GPUs. We evaluated the models using both the original and EMA weights and selected the higher accuracy to report.

GPUs	Accuracy - FP32	Accuracy - mixed precision	Time to train - FP32	Time to train - mixed precision	Time to train speedup (FP32 to mixed precision)
32	82.64%	82.88%	97.0hrs	41.0hrs	2.37
64	82.74%	83.16%	50.0hrs	20.5hrs	2.43

Training performance results for EfficientNet v1-B0

Training performance: NVIDIA DGX A100 (8x A100 80GB)

GPUs	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
1	1209	3454	2.85	1	1
8	9119	20647	2.26	7.54	5.98
16	17815	40644	2.28	14.74	11.77
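The weak scaling columns above are the multi-GPU throughput divided by the single-GPU throughput at the same precision. The sketch below reproduces the TF32 column from the reported throughputs. The scaling efficiency (weak scaling divided by GPU count) is an extra derived quantity added for illustration and does not appear in the tables.

```python
def weak_scaling(throughput_n, throughput_1):
    """Throughput on N GPUs relative to throughput on one GPU."""
    return throughput_n / throughput_1


def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return weak_scaling(throughput_n, throughput_1) / n_gpus


# TF32 throughput per GPU count, copied from the table above.
tf32_throughput = {1: 1209, 8: 9119, 16: 17815}

for gpus, throughput in tf32_throughput.items():
    scaling = weak_scaling(throughput, tf32_throughput[1])
    efficiency = scaling_efficiency(throughput, tf32_throughput[1], gpus)
    print(f"{gpus} GPUs: weak scaling {scaling:.2f}, efficiency {efficiency:.0%}")
```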

Training performance: NVIDIA DGX-1 (8x V100 32GB)

GPUs	Throughput - FP32	Throughput - mixed precision	Throughput speedup (FP32 to mixed precision)	Weak scaling - FP32	Weak scaling - mixed precision
1	752	868	1.15	1	1
8	4504	4880	1.08	5.99	5.62
32	15309	18424	1.20	20.36	21.23

Training performance results for EfficientNet v1-B4

Training performance: NVIDIA DGX A100 (8x A100 80GB)

GPUs	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
1	165	470	2.85	1	1
8	1308	3550	2.71	7.93	7.55
32	4782	12908	2.70	28.98	27.46
64	9473	25455	2.69	57.41	54.16

Training performance: NVIDIA DGX-1 (8x V100 32GB)

GPUs	Throughput - FP32	Throughput - mixed precision	Throughput speedup (FP32 to mixed precision)	Weak scaling - FP32	Weak scaling - mixed precision
1	79	211	2.67	1	1
8	570	1258	2.21	7.22	5.96
32	1855	4325	2.33	23.48	20.50
64	3568	8643	2.42	45.16	40.96

Inference performance results for EfficientNet v1-B0

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the inference benchmarking script in the tensorflow:21.09-tf2-py3 NGC container on the NVIDIA DGX A100 (1x A100 80GB) GPU.

FP16 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	224x224	110.97	9.09	9.02	9.04	9.09
8	224x224	874.91	9.12	9.04	9.08	9.12
32	224x224	2188.84	14.62	14.35	14.43	14.52
1024	224x224	9729.85	105.24	101.50	103.20	105.24
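The throughput and average latency columns in the table above are closely related. Assuming throughput is measured in images per second, it should be roughly the batch size divided by the average latency per batch. The sketch below checks that assumption against the reported rows. How the benchmarking script actually computes each column is not stated in this document, so treat this as a consistency check and not as the script's method.

```python
def estimated_throughput(batch_size, latency_ms):
    """Images per second implied by one batch taking latency_ms milliseconds."""
    return batch_size / (latency_ms / 1000.0)


# (batch size, average latency in ms, reported throughput), from the table above.
rows = [
    (1, 9.09, 110.97),
    (8, 9.12, 874.91),
    (32, 14.62, 2188.84),
    (1024, 105.24, 9729.85),
]

for batch_size, latency_ms, reported in rows:
    estimate = estimated_throughput(batch_size, latency_ms)
    relative_gap = abs(estimate - reported) / reported
    print(f"batch {batch_size}: estimated {estimate:.2f}, "
          f"reported {reported:.2f}, gap {relative_gap:.2%}")
```

For these rows the estimate stays within one percent of the reported throughput, which supports the images-per-second reading of the column.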

TF32 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	224x224	127.95	7.88	7.83	7.84	7.87
8	224x224	892.27	8.97	8.88	8.91	8.94
32	224x224	2185.02	14.65	14.33	14.43	14.54
512	224x224	5253.19	97.46	96.57	97.03	97.46

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

FP16 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	224x224	97.53	10.25	10.11	10.13	10.21
8	224x224	752.72	10.63	10.49	10.54	10.59
32	224x224	1768.05	18.10	17.88	17.96	18.04
512	224x224	5399.88	94.82	92.85	93.89	94.82

FP32 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	224x224	97.01	10.31	10.17	10.22	10.28
8	224x224	649.79	12.31	12.16	12.22	12.28
32	224x224	1861.65	17.19	16.98	17.03	17.10
256	224x224	2829.34	90.48	89.80	90.13	90.43
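The latency tables above report an average together with the 90th, 95th, and 99th percentiles. The sketch below shows one way such a summary could be computed from per-batch timings, using the nearest-rank percentile definition. The sample latencies are hypothetical, and the percentile method used by the benchmarking script is not stated in this document, so this is an assumption.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average and tail latencies in the same layout as the tables above."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical per-batch latencies in milliseconds, not measured data.
sample = [9.0 + 0.01 * i for i in range(100)]

for name, value in summarize(sample).items():
    print(f"{name}: {value:.2f} ms")
```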

Inference performance results for EfficientNet v1-B4

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the inference benchmarking script in the tensorflow:21.09-tf2-py3 NGC container on the NVIDIA DGX A100 (1x A100 80GB) GPU.

FP16 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	380x380	61.36	16.30	16.20	16.24	16.28
8	380x380	338.60	23.63	23.34	23.46	23.58
32	380x380	971.68	32.93	32.46	32.61	32.76
128	380x380	1497.21	85.28	83.01	83.68	84.70

TF32 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	380x380	60.54	16.52	16.34	16.41	16.49
8	380x380	366.82	21.81	21.48	21.61	21.75
32	380x380	642.78	49.78	49.41	49.53	49.65
64	380x380	714.55	89.54	89.00	89.17	89.34

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

FP16 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	380x380	55.71	17.95	17.68	17.93	17.86
8	380x380	256.72	31.16	30.92	31.02	31.12
16	380x380	350.14	45.75	45.44	45.57	45.68
64	380x380	805.21	79.46	78.74	78.86	79.01

FP32 Inference Latency

Batch size	Resolution	Throughput Avg	Latency Avg (ms)	Latency 90% (ms)	Latency 95% (ms)	Latency 99% (ms)
1	380x380	49.03	20.40	20.03	20.18	20.34
8	380x380	258.21	30.98	30.83	30.89	30.95
16	380x380	310.84	51.47	51.26	51.34	51.42
32	380x380	372.23	85.97	85.70	85.79	85.89