
Mask R-CNN for TensorFlow2


Description

Mask R-CNN is a convolution-based network for object instance segmentation.

Publisher

NVIDIA

Latest Version

21.02.4

Modified

April 4, 2023

Compressed Size

575.16 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To run training benchmarking on a selected number of GPUs with either AMP or TF32/FP32 precision, run the following command:

```bash
python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]
```

Inference performance benchmark

To run inference benchmarking on a single GPU with either AMP or TF32/FP32 precision, run the following command:

```bash
python scripts/benchmark_inference.py --batch_size {2,4,8} [--amp]
```
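To sweep every combination the two commands above describe, a small driver can expand the `{…}` option sets into concrete invocations. A minimal sketch (it only builds the argument lists; each list could be passed to `subprocess.run` from the repository root):

```python
import itertools

def build_training_cmds(gpus=(1, 8), batch_sizes=(2, 4), amp_modes=(False, True)):
    """Expand the training benchmark options into one command per configuration."""
    cmds = []
    for g, bs, amp in itertools.product(gpus, batch_sizes, amp_modes):
        cmd = ["python", "scripts/benchmark_training.py",
               "--gpus", str(g), "--batch_size", str(bs)]
        if amp:
            cmd.append("--amp")  # mixed precision; omit for TF32/FP32
        cmds.append(cmd)
    return cmds

def build_inference_cmds(batch_sizes=(2, 4, 8), amp_modes=(False, True)):
    """Expand the inference benchmark options the same way (single GPU)."""
    cmds = []
    for bs, amp in itertools.product(batch_sizes, amp_modes):
        cmd = ["python", "scripts/benchmark_inference.py", "--batch_size", str(bs)]
        if amp:
            cmd.append("--amp")
        cmds.append(cmd)
    return cmds

for cmd in build_training_cmds() + build_inference_cmds():
    print(" ".join(cmd))
```

To execute a sweep point instead of printing it, replace the `print` with `subprocess.run(cmd, check=True)`.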

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the `python scripts/train.py --gpus 8 --batch_size 4 [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

| GPUs | Batch size / GPU | Precision | Final AP BBox | Final AP Segm | Time to train [h] | Time to train speedup |
|------|------------------|-----------|---------------|---------------|-------------------|-----------------------|
| 8    | 2                | TF32      | 0.3796        | 0.3444        | 4.81              | -                     |
| 8    | 2                | AMP       | 0.3795        | 0.3443        | 3.77              | 1.27                  |

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the `python scripts/train.py --gpus 8 --batch_size 2 [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.

| GPUs | Batch size / GPU | Precision | Final AP BBox | Final AP Segm | Time to train [h] | Time to train speedup |
|------|------------------|-----------|---------------|---------------|-------------------|-----------------------|
| 8    | 2                | FP32      | 0.3793        | 0.3442        | 11.37             | -                     |
| 8    | 2                | AMP       | 0.3792        | 0.3444        | 9.01              | 1.26                  |
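The time-to-train speedup column is the ratio of the baseline precision training time to the mixed precision training time. A quick check against the DGX-1 row above:

```python
def time_to_train_speedup(baseline_hours, amp_hours):
    """Speedup = baseline precision time / mixed precision time."""
    return baseline_hours / amp_hours

# DGX-1 (8x V100): FP32 takes 11.37 h, AMP takes 9.01 h (values from the table)
speedup = time_to_train_speedup(11.37, 9.01)
print(round(speedup, 2))  # 1.26, matching the table
```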

Learning curves

The following image shows the training loss as a function of iteration for training using DGX A100 (TF32 and TF-AMP) and DGX-1 V100 (FP32 and TF-AMP).

(Figure: learning curves, training loss vs. iteration.)

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the `python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in images per second) were averaged over 200 steps, omitting the first 100 warm-up steps.

| GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|------|------------------|---------------------------|--------------------------------------|----------------------------------------------|---------------------|--------------------------------|
| 1    | 2                | 13.44                     | 18.26                                | 1.35                                         | -                   | -                              |
| 1    | 4                | 18.41                     | 28.58                                | 1.55                                         | -                   | -                              |
| 8    | 2                | 84.29                     | 87.31                                | 1.03                                         | 6.27                | 4.78                           |
| 8    | 4                | 103.80                    | 114.45                               | 1.10                                         | 5.63                | 4.04                           |

To achieve these same results, follow the steps in the Quick Start Guide.
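The derived columns in these tables follow directly from the throughput columns: throughput speedup is mixed precision throughput over TF32/FP32 throughput at the same GPU count and per-GPU batch size, and weak scaling is 8-GPU throughput over 1-GPU throughput at the same per-GPU batch size. A quick check against the batch size 2 rows above (small differences from the table come from rounding of the published figures):

```python
def throughput_speedup(base_imgs_per_s, mp_imgs_per_s):
    """Mixed precision throughput relative to the baseline precision."""
    return mp_imgs_per_s / base_imgs_per_s

def weak_scaling(one_gpu_imgs_per_s, eight_gpu_imgs_per_s):
    """8-GPU throughput relative to 1-GPU throughput, same per-GPU batch size."""
    return eight_gpu_imgs_per_s / one_gpu_imgs_per_s

# DGX A100, batch size 2 per GPU (values from the table above)
print(round(throughput_speedup(13.44, 18.26), 2))  # speedup at 1 GPU
print(round(weak_scaling(13.44, 84.29), 2))        # TF32 weak scaling, about 6.27
```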

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the `python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in images per second) were averaged over 200 steps, omitting the first 100 warm-up steps.

| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|------|------------------|---------------------------|--------------------------------------|----------------------------------------------|---------------------|--------------------------------|
| 1    | 2                | 7.57                      | 14.47                                | 1.91                                         | -                   | -                              |
| 1    | 4                | 8.51                      | 19.35                                | 2.27                                         | -                   | -                              |
| 8    | 2                | 44.55                     | 53.40                                | 1.37                                         | 5.26                | 3.69                           |
| 8    | 4                | 50.56                     | 58.33                                | 1.15                                         | 6.67                | 4.03                           |

To achieve these same results, follow the steps in the Quick Start Guide.
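The reported throughput is an average that deliberately skips start-up overhead: step timings are collected, the first 100 warm-up steps are discarded, and images per second is computed over the remaining 200 measured steps. A sketch of that averaging, using a hypothetical timing trace (the benchmark scripts' exact bookkeeping may differ):

```python
def averaged_throughput(step_times_s, batch_size, warmup_steps=100):
    """Images/s averaged over the steps that follow the warm-up period."""
    measured = step_times_s[warmup_steps:]
    return batch_size * len(measured) / sum(measured)

# Hypothetical trace: 100 slow warm-up steps, then 200 steady-state steps
trace = [1.0] * 100 + [0.25] * 200
print(averaged_throughput(trace, batch_size=4))  # 16.0 img/s
```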

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the `python scripts/benchmark_inference.py --batch_size {6,12,24} [--amp]` benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on an NVIDIA DGX A100 (1x A100 80GB) GPU.

TF32

| Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
|------------|------------------------|-----------------|-----------------|-----------------|-----------------|
| 6          | 39.23                  | 0.1530          | 0.1540          | 0.1542          | 0.1546          |
| 12         | 42.55                  | 0.2654          | 0.2840          | 0.2875          | 0.2945          |
| 24         | 47.92                  | 0.5007          | 0.5248          | 0.5294          | 0.5384          |

FP16

| Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
|------------|------------------------|-----------------|-----------------|-----------------|-----------------|
| 6          | 60.79                  | 0.0987          | 0.0988          | 0.1000          | 0.1005          |
| 12         | 76.23                  | 0.1574          | 0.1614          | 0.1621          | 0.1636          |
| 24         | 80.67                  | 0.2975          | 0.3025          | 0.3035          | 0.3054          |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Our results were obtained by running the `python scripts/benchmark_inference.py --batch_size {6,12,24} [--amp]` benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on an NVIDIA DGX-1 (1x V100 16GB) GPU.

FP32

| Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
|------------|------------------------|-----------------|-----------------|-----------------|-----------------|
| 6          | 18.56                  | 0.3234          | 0.3263          | 0.3269          | 0.3280          |
| 12         | 20.50                  | 0.5854          | 0.5920          | 0.5933          | 0.5958          |
| 24         | OOM (out of memory)    | -               | -               | -               | -               |

FP16

| Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
|------------|------------------------|-----------------|-----------------|-----------------|-----------------|
| 6          | 35.46                  | 0.1692          | 0.1705          | 0.1707          | 0.1712          |
| 12         | 41.44                  | 0.2896          | 0.2937          | 0.2945          | 0.2960          |
| 24         | 42.53                  | 0.5643          | 0.5718          | 0.5733          | 0.5761          |
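The latency columns in these inference tables are tail statistics over per-batch inference times: the 90%, 95%, and 99% values are percentiles of the measured sample. One common rank-based way to compute such a percentile (a sketch; the benchmark script's exact interpolation method may differ):

```python
import math

def latency_percentile(latencies_s, q):
    """q-th percentile (0 < q <= 1) taken by rank on the sorted sample."""
    s = sorted(latencies_s)
    idx = min(len(s) - 1, math.ceil(q * len(s)) - 1)
    return s[idx]

# Hypothetical sample: 100 per-batch latencies of 1 ms .. 100 ms
lat = [x / 1000 for x in range(1, 101)]
print(latency_percentile(lat, 0.90))  # 0.09
print(latency_percentile(lat, 0.99))  # 0.099
print(sum(lat) / len(lat))            # the "Latency Avg" column is the plain mean
```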

To achieve these same results, follow the steps in the Quick Start Guide.