SSD v1.2 for TensorFlow1

NVIDIA Deep Learning Examples

Resource

NVIDIA Deep Learning Examples

SSD v1.2 for TensorFlow1

With a ResNet-50 backbone and a number of architectural modifications, this version provides better accuracy and performance.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

Training benchmark was run in various scenarios on V100 16G GPU. For each scenario, batch size was set to 32.

To benchmark training, run:

bash examples/SSD320_{PREC}_{NGPU}GPU_BENCHMARK.sh

Where the {NGPU} defines number of GPUs used in benchmark, and the {PREC} defines precision. The benchmark runs training with only 1200 steps and computes average training speed of last 300 steps.

Inference performance benchmark

Inference benchmark was run with various batch-sizes on V100 16G GPU. For inference we are using single GPU setting. Examples are taken from the validation dataset.

To benchmark inference, run:

bash examples/SSD320_FP{16,32}_inference.sh --batch_size <batch size> --checkpoint_dir <path to checkpoint>

Batch size for the inference benchmark is controlled by the --batch_size argument, while the checkpoint is provided to the script with the --checkpoint_dir argument.

The benchmark script provides extra arguments for extra control over the experiment. We were using default values for the extra arguments during the experiments. For more details about them, please run:

bash examples/SSD320_FP16_inference.sh --help

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the ./examples/SSD320_FP{16,32}_{1,4,8}GPU.sh script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.

All the results are obtained with batch size set to 32.

Number of GPUs	Mixed precision mAP	Training time with mixed precision	TF32 mAP	Training time with TF32
1	0.279	4h 48min	0.280	6h 40min
4	0.280	1h 20min	0.279	1h 53min
8	0.281	0h 53min	0.282	1h 05min

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the ./examples/SSD320_FP{16,32}_{1,4,8}GPU.sh script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. All the results are obtained with batch size set to 32.

Number of GPUs	Mixed precision mAP	Training time with mixed precision	FP32 mAP	Training time with FP32
1	0.279	7h 36min	0.278	10h 38min
4	0.277	2h 18min	0.279	2h 58min
8	0.280	1h 28min	0.282	1h 55min

Here are example graphs of TF32, FP32 and FP16 training on 8 GPU configuration:

TrainingLoss

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running:

python bash examples/SSD320_FP*GPU_BENCHMARK.sh

scripts in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.

Number of GPUs	Batch size per GPU	Mixed precision img/s	TF32 img/s	Speed-up with mixed precision	Multi-gpu weak scaling with mixed precision	Multi-gpu weak scaling with TF32
1	32	180.55	123.48	1.46	1.00	1.00
4	32	624.35	449.17	1.39	3.46	3.64
8	32	1008.46	779.96	1.29	5.59	6.32

To achieve same results, follow the Quick start guide outlined above.

Those results can be improved when XLA is used in conjunction with mixed precision, delivering up to 2x speedup over FP32 on a single GPU (~179 img/s). However XLA is still considered experimental.

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running:

python bash examples/SSD320_FP*GPU_BENCHMARK.sh

scripts in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX-1 with V100 16G GPUs.

Number of GPUs	Batch size per GPU	Mixed precision img/s	FP32 img/s	Speed-up with mixed precision	Multi-gpu weak scaling with mixed precision	Multi-gpu weak scaling with FP32
1	32	127.96	84.96	1.51	1.00	1.00
4	32	396.38	283.30	1.40	3.10	3.33
8	32	676.83	501.30	1.35	5.29	5.90

To achieve same results, follow the Quick start guide outlined above.

Those results can be improved when XLA is used in conjunction with mixed precision, delivering up to 2x speedup over FP32 on a single GPU (~179 img/s). However XLA is still considered experimental.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 40GB)

Our results were obtained by running the examples/SSD320_FP{16,32}_inference.sh script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.

FP16

Batch size	Throughput Avg	Latency Avg	Latency 90%	Latency 95%	Latency 99%
1	40.88	24.46	25.76	26.47	27.91
2	49.26	40.60	42.09	42.61	45.26
4	58.81	68.01	73.12	76.02	80.38
8	69.13	115.73	121.58	123.87	129.00
16	78.10	204.85	212.40	216.38	225.80
32	76.19	420.00	437.24	443.21	479.80
64	77.92	821.37	840.82	867.62	1204.64

TF32

Batch size	Throughput Avg	Latency Avg	Latency 90%	Latency 95%	Latency 99%
1	36.93	27.08	29.10	29.89	32.24
2	44.03	45.42	48.67	49.56	51.12
4	54.65	73.20	77.50	78.89	85.81
8	62.96	127.06	137.04	141.64	152.92
16	71.48	223.83	231.36	233.35	247.51
32	73.11	437.71	450.86	455.14	467.11
64	73.74	867.88	898.99	912.07	1077.13

To achieve same results, follow the Quick start guide outlined above.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Our results were obtained by running the examples/SSD320_FP{16,32}_inference.sh script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU.

FP16

Batch size	Throughput Avg	Latency Avg	Latency 90%	Latency 95%	Latency 99%
1	28.34	35.29	38.09	39.06	41.07
2	41.21	48.54	52.77	54.45	57.10
4	55.41	72.19	75.44	76.99	84.15
8	61.83	129.39	133.37	136.89	145.69
16	66.36	241.12	246.05	249.47	259.79
32	65.01	492.21	510.01	516.45	526.83
64	64.75	988.47	1012.11	1026.19	1290.54

FP32

Batch size	Throughput Avg	Latency Avg	Latency 90%	Latency 95%	Latency 99%
1	29.15	34.31	36.26	37.63	39.95
2	41.20	48.54	53.08	54.47	57.32
4	50.72	78.86	82.49	84.08	92.15
8	55.72	143.57	147.20	148.92	152.44
16	59.41	269.32	278.30	281.06	286.54
32	59.81	534.99	542.49	551.58	572.16
64	58.93	1085.96	1111.20	1118.21	1253.74

To achieve same results, follow the Quick start guide outlined above.

Inference performance: NVIDIA T4

Our results were obtained by running the examples/SSD320_FP{16,32}_inference.sh script in the TensorFlow-20.06-py3 NGC container on NVIDIA T4.

FP16

Batch size	Throughput Avg	Latency Avg	Latency 90%	Latency 95%	Latency 99%
1	19.29	51.90	53.77	54.95	59.21
2	30.36	66.04	70.13	71.49	73.97
4	37.71	106.21	111.32	113.04	118.03
8	40.95	195.49	201.66	204.00	210.32
16	41.04	390.05	399.73	402.88	410.02
32	40.36	794.48	815.81	825.39	841.45
64	40.27	1590.98	1631.00	1642.22	1838.95

FP32

Batch size	Throughput Avg	Latency Avg	Latency 90%	Latency 95%	Latency 99%
1	14.30	69.99	72.30	73.29	76.35
2	20.04	99.87	104.50	106.03	108.15
4	25.01	159.99	163.00	164.13	168.63
8	28.42	281.58	286.57	289.01	294.37
16	32.56	492.08	501.98	505.29	509.95
32	34.14	939.11	961.35	968.26	983.77
64	33.47	1915.36	1971.90	1992.24	2030.54

To achieve same results, follow the Quick start guide outlined above.