Mask R-CNN for TensorFlow2

NVIDIA Deep Learning Examples

Mask R-CNN is a convolution-based network for object instance segmentation. This implementation provides 1.3x faster training while maintaining target accuracy.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To run training benchmarking on a selected number of GPUs with either AMP or TF32/FP32 precision, run the following script:

python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]

Inference performance benchmark

To run inference benchmarking on a single GPU with either AMP or TF32/FP32 precision, run the following script:

python scripts/benchmark_inference.py --batch_size {2,4,8} [--amp]
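
The two benchmark commands above take a small grid of options, so a sweep over all combinations can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: the helper name benchmark_commands is hypothetical, and it uses only the script paths and flags (--gpus, --batch_size, --amp) quoted above. It builds the command lines without executing them.

```python
import itertools
import shlex


def benchmark_commands(script, batch_sizes, gpus=None, precisions=(False, True)):
    """Build one command per (GPU count, batch size, precision) combination.

    Hypothetical helper: `gpus=None` omits the --gpus flag, which matches the
    single-GPU inference command shown above.
    """
    gpu_options = gpus if gpus is not None else [None]
    commands = []
    for n_gpus, batch_size, use_amp in itertools.product(
        gpu_options, batch_sizes, precisions
    ):
        cmd = ["python", script]
        if n_gpus is not None:
            cmd += ["--gpus", str(n_gpus)]
        cmd += ["--batch_size", str(batch_size)]
        if use_amp:
            cmd.append("--amp")  # omit the flag for TF32/FP32 runs
        commands.append(cmd)
    return commands


training = benchmark_commands("scripts/benchmark_training.py", [2, 4], gpus=[1, 8])
inference = benchmark_commands("scripts/benchmark_inference.py", [2, 4, 8])

for cmd in training + inference:
    # To actually run a benchmark, pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems if the list is later handed to subprocess.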

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the python scripts/train.py --gpus 8 --batch_size 4 [--amp] training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

GPUs	Batch size / GPU	Precision	Final AP BBox	Final AP Segm	Time to train [h]	Time to train speedup
8	2	TF32	0.3796	0.3444	4.81	-
8	2	AMP	0.3795	0.3443	3.77	1.27

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the python scripts/train.py --gpus 8 --batch_size 2 [--amp] training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.

GPUs	Batch size / GPU	Precision	Final AP BBox	Final AP Segm	Time to train [h]	Time to train speedup
8	2	FP32	0.3793	0.3442	11.37	-
8	2	AMP	0.3792	0.3444	9.01	1.26

Learning curves

The following image shows the training loss as a function of iteration for training using DGX A100 (TF32 and TF-AMP) and DGX-1 V100 (FP32 and TF-AMP).

[Figure: learning curves, training loss versus iteration]

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp] benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in images per second) were averaged over 200 steps, omitting the first 100 warm-up steps.

GPUs	Batch size / GPU	Throughput - TF32 [img/s]	Throughput - mixed precision [img/s]	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
1	2	13.44	18.26	1.35	-	-
1	4	18.41	28.58	1.55	-	-
8	2	84.29	87.31	1.03	6.27	4.78
8	4	103.80	114.45	1.10	5.63	4.04
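
The derived columns in the table above can be recomputed from the throughput columns. The sketch below assumes the usual definitions, which the source does not spell out: weak scaling is the 8-GPU throughput divided by the 1-GPU throughput at the same per-GPU batch size, and throughput speedup is mixed-precision throughput divided by TF32 throughput. The function names are illustrative. Under these assumptions the batch-size-2 row reproduces the published weak scaling values; other cells may differ in the last digit because the published throughputs are already rounded.

```python
# Throughput in img/s, copied from the DGX A100 table (batch size 2 per GPU).
tf32 = {1: 13.44, 8: 84.29}
mixed = {1: 18.26, 8: 87.31}


def weak_scaling(throughput, gpus):
    """Multi-GPU throughput relative to 1 GPU at the same per-GPU batch size."""
    return throughput[gpus] / throughput[1]


def precision_speedup(baseline, accelerated, gpus):
    """Throughput gain of the accelerated precision over the baseline."""
    return accelerated[gpus] / baseline[gpus]


print(f"Weak scaling, TF32:            {weak_scaling(tf32, 8):.2f}")   # 6.27
print(f"Weak scaling, mixed precision: {weak_scaling(mixed, 8):.2f}")  # 4.78
print(f"Speedup on 8 GPUs:             {precision_speedup(tf32, mixed, 8):.2f}")
```

A weak scaling value of 8.00 would mean perfect linear scaling from 1 to 8 GPUs.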

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp] benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in images per second) were averaged over 200 steps, omitting the first 100 warm-up steps.

GPUs	Batch size / GPU	Throughput - FP32 [img/s]	Throughput - mixed precision [img/s]	Throughput speedup (FP32 to mixed precision)	Weak scaling - FP32	Weak scaling - mixed precision
1	2	7.57	14.47	1.91	-	-
1	4	8.51	19.35	2.27	-	-
8	2	44.55	53.40	1.37	5.26	3.69
8	4	50.56	58.33	1.15	6.67	4.03

To achieve these same results, follow the steps in the Quick Start Guide.
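
The training throughput figures are described as averages that exclude the first 100 warm-up steps. One way such an average could be computed from per-step wall-clock times is sketched below. This is an illustration, not the benchmark script's actual code: the function name, the choice to divide total images by total measured time, and the reading of "200 steps" as 200 measured steps after warm-up are all assumptions. The timings in the usage example are synthetic.

```python
def average_throughput(step_times, batch_size, num_gpus=1, warmup_steps=100):
    """Mean images per second over the measured steps, skipping warm-up.

    step_times: wall-clock duration of each training step, in seconds.
    batch_size: per-GPU batch size; the global batch is batch_size * num_gpus.
    """
    measured = step_times[warmup_steps:]
    if not measured:
        raise ValueError("need more steps than warmup_steps")
    images_per_step = batch_size * num_gpus
    return images_per_step * len(measured) / sum(measured)


# Synthetic timings: 100 slow warm-up steps, then 200 steady-state steps.
step_times = [1.0] * 100 + [0.25] * 200
print(average_throughput(step_times, batch_size=4))  # 16.0
```

Excluding the warm-up steps keeps one-time costs such as graph tracing and autotuning from dragging down the reported steady-state throughput.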

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the python scripts/benchmark_inference.py --batch_size {6,12,24} [--amp] benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.

TF32

Batch size	Throughput Avg [img/s]	Latency Avg [s]	Latency 90% [s]	Latency 95% [s]	Latency 99% [s]
6	39.23	0.1530	0.1540	0.1542	0.1546
12	42.55	0.2654	0.2840	0.2875	0.2945
24	47.92	0.5007	0.5248	0.5294	0.5384

FP16

Batch size	Throughput Avg [img/s]	Latency Avg [s]	Latency 90% [s]	Latency 95% [s]	Latency 99% [s]
6	60.79	0.0987	0.0988	0.1000	0.1005
12	76.23	0.1574	0.1614	0.1621	0.1636
24	80.67	0.2975	0.3025	0.3035	0.3054
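
The latency columns in the tables above are consistent with per-batch latency measured in seconds: for the TF32 batch-size-6 row, 6 / 0.1530 s is about 39.2 img/s, which matches the reported throughput. The sketch below shows one way to derive such a summary from a list of per-batch latencies. It is an assumption-laden illustration rather than the benchmark script's method: the function name is hypothetical, percentiles use the nearest-rank definition, and throughput is taken as batch size divided by mean latency. The input data is synthetic.

```python
import statistics


def latency_summary(latencies_s, batch_size):
    """Summarize per-batch latencies (in seconds) for one batch size."""
    if not latencies_s:
        raise ValueError("latencies_s must not be empty")
    ordered = sorted(latencies_s)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile with integer arithmetic: the smallest
        # sample such that at least p percent of samples are <= it.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    mean = statistics.fmean(ordered)
    return {
        "throughput_img_s": batch_size / mean,
        "latency_avg_s": mean,
        "latency_p90_s": percentile(90),
        "latency_p95_s": percentile(95),
        "latency_p99_s": percentile(99),
    }


# Synthetic latencies: 0.001 s, 0.002 s, ..., 0.100 s.
latencies = [i / 1000 for i in range(1, 101)]
summary = latency_summary(latencies, batch_size=6)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Tail percentiles such as the 99th matter for serving workloads, where a small fraction of slow batches can dominate user-visible delay even when the average looks good.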

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Our results were obtained by running the python scripts/benchmark_inference.py --batch_size {6,12,24} [--amp] benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 (1x V100 16GB) GPU.

FP32

Batch size	Throughput Avg [img/s]	Latency Avg [s]	Latency 90% [s]	Latency 95% [s]	Latency 99% [s]
6	18.56	0.3234	0.3263	0.3269	0.3280
12	20.50	0.5854	0.5920	0.5933	0.5958
24	OOM	-	-	-	-

FP16

Batch size	Throughput Avg [img/s]	Latency Avg [s]	Latency 90% [s]	Latency 95% [s]	Latency 99% [s]
6	35.46	0.1692	0.1705	0.1707	0.1712
12	41.44	0.2896	0.2937	0.2945	0.2960
24	42.53	0.5643	0.5718	0.5733	0.5761

To achieve these same results, follow the steps in the Quick Start Guide.