The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).

### Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

#### Training performance benchmark

To benchmark the training performance on a specific batch size, run:

* For 1 GPU
    * FP32 / TF32

        `python ./main.py --arch=se-resnext101-32x4d --mode=training_benchmark --warmup_steps 200 --batch_size --data_dir= --results_dir=`
    * AMP

        `python ./main.py --arch=se-resnext101-32x4d --mode=training_benchmark --amp --warmup_steps 200 --batch_size --data_dir= --results_dir=`
* For multiple GPUs
    * FP32 / TF32

        `mpiexec --allow-run-as-root --bind-to socket -np python ./main.py --arch=se-resnext101-32x4d --mode=training_benchmark --batch_size --data_dir= --results_dir=`
    * AMP

        `mpiexec --allow-run-as-root --bind-to socket -np python ./main.py --arch=se-resnext101-32x4d --mode=training_benchmark --amp --batch_size --data_dir= --results_dir=`

Each of these scripts runs 200 warm-up iterations and measures the first epoch.

To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags. Features like XLA or DALI can be controlled with the `--xla` and `--dali` flags. For proper throughput reporting, the value of `--num_iter` must be greater than the value of `--warmup_steps`.
Suggested batch sizes for training are 96 for mixed precision training and 64 for single precision training per single V100 16 GB.

If no `--data_dir=` flag is specified, the benchmarks use a synthetic dataset. The resolution of the synthetic images can be controlled with the `--synthetic_data_size` flag.
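The requirement that `--num_iter` exceed `--warmup_steps` follows from how a warm-up-aware average works: if every iteration is discarded as warm-up, nothing is left to average. The following is a minimal sketch of that idea, not the code used by `main.py`; the function name `steady_state_throughput` and the timing values are hypothetical.

```python
def steady_state_throughput(iter_times_s, batch_size, warmup_steps):
    """Return images/second averaged over the iterations after warm-up.

    iter_times_s is a list of per-iteration wall-clock durations in seconds
    (hypothetical input, not something main.py exposes in this form).
    """
    if len(iter_times_s) <= warmup_steps:
        # Mirrors the documented constraint: num_iter must exceed warmup_steps.
        raise ValueError("num_iter must be greater than warmup_steps")
    measured = iter_times_s[warmup_steps:]
    return batch_size * len(measured) / sum(measured)


# Hypothetical timings: two slow warm-up iterations, then three steady ones.
times = [0.9, 0.6, 0.25, 0.25, 0.25]
print(steady_state_throughput(times, batch_size=64, warmup_steps=2))
```

Including the two warm-up iterations in the average would understate the steady-state throughput, which is why they are dropped before dividing.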
#### Inference performance benchmark

To benchmark the inference performance on a specific batch size, run:

* FP32 / TF32

    `python ./main.py --arch=se-resnext101-32x4d --mode=inference_benchmark --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size --data_dir= --results_dir=`
* AMP

    `python ./main.py --arch=se-resnext101-32x4d --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size --data_dir= --results_dir=`

By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations.
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
If no `--data_dir=` flag is specified, the benchmarks use a synthetic dataset.

The benchmark can be automated with the `inference_benchmark.sh` script provided in `se-resnext101-32x4d`, by running:

`bash ./se-resnext101-32x4d/inference_benchmark.sh `

The script takes the input data directory as an argument (by default `/data/tfrecords` inside the container). By default, the benchmark tests the following configurations: **FP32**, **AMP**, **AMP + XLA** with different batch sizes. When the optional directory with the DALI index files is specified, the benchmark executes an additional **DALI + AMP + XLA** configuration.
For proper throughput reporting, the value of `--num_iter` must be greater than the value of `--warmup_steps`.

To benchmark the raw model, a synthetic dataset can be used: pass the `--synthetic_data_size` flag instead of `--data_dir` to specify the input image size.

### Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.
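As a reading aid for the training performance tables in this section: the derived columns appear to follow directly from the measured throughput. The speedup is the mixed-precision throughput divided by the FP32/TF32 throughput at the same GPU count, and weak scaling is the multi-GPU throughput divided by the single-GPU throughput. The sketch below recomputes both from the DGX-1 (8x V100 16G) figures; the function names are hypothetical, and the interpretation of the columns is an assumption inferred from the published numbers.

```python
def speedup(baseline_img_s, mixed_img_s):
    """Mixed-precision throughput relative to the FP32/TF32 baseline."""
    return mixed_img_s / baseline_img_s


def weak_scaling(single_gpu_img_s, multi_gpu_img_s):
    """Multi-GPU throughput relative to the single-GPU run (same batch size per GPU)."""
    return multi_gpu_img_s / single_gpu_img_s


# DGX-1 (8x V100 16G) throughput in img/s, keyed by GPU count.
fp32 = {1: 152, 8: 1120}
amp = {1: 475, 8: 3360}

for gpus in (1, 8):
    print(
        f"{gpus} GPU(s): speedup {speedup(fp32[gpus], amp[gpus]):.2f}x, "
        f"FP32 scaling {weak_scaling(fp32[1], fp32[gpus]):.2f}x, "
        f"AMP scaling {weak_scaling(amp[1], amp[gpus]):.2f}x"
    )
```

Because the per-GPU batch size is held constant as GPUs are added, the scaling figure measures weak scaling; a value close to the GPU count indicates little communication overhead.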
#### Training accuracy results

##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the `/se-resnet50v1.5/training/DGXA100_RN50_{PRECISION}_90E.sh` training script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX A100 (8x A100 40GB) GPUs.

| Epochs | Batch Size / GPU | Accuracy - TF32 (top1) | Accuracy - mixed precision (top1) |
|--------|------------------|-----------------|----------------------------|
| 90 | 128 (TF32) / 256 (AMP) | 79.73 | 79.60 |

##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the `/se-resnext101-32x4d/training/DGX1_RNxt101-32x4d_{PRECISION}_{EPOCHS}E.sh` training script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX-1 (8x V100 16G) GPUs.

| Epochs | Batch Size / GPU | Accuracy - FP32 | Accuracy - mixed precision |
|--------|------------------|-----------------|----------------------------|
| 90 | 64 (FP32) / 96 (AMP) | 79.69 | 79.81 |
| 250 | 64 (FP32) / 96 (AMP) | 80.87 | 80.84 |

**Example training loss plot**

![TrainingLoss](https://github.com/NVIDIA/DeepLearningExamples/raw/master/TensorFlow/Classification/ConvNets/se-resnext101-32x4d/imgs/train_loss.png)

#### Training performance results

##### Training performance: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the `se-resnext101-32x4d/training/training_perf.sh` benchmark script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.
| GPUs | Batch Size / GPU | Throughput - TF32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 + XLA | Weak scaling - mixed precision + XLA |
|----|---------------|---------------|------------------------|-----------------|-----------|-------------------|
| 1 | 128 (TF32) / 256 (AMP) | 342 img/s | 975 img/s | 2.86x | 1.00x | 1.00x |
| 8 | 128 (TF32) / 256 (AMP) | 2610 img/s | 7230 img/s | 2.77x | 7.63x | 7.41x |

##### Training performance: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the `se-resnext101-32x4d/training/training_perf.sh` benchmark script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX-1 (8x V100 16G) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.

| GPUs | Batch Size / GPU | Throughput - FP32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 + XLA | Weak scaling - mixed precision + XLA |
|----|---------------|---------------|-----------------------|---------------|-----------|-------|
| 1 | 64 (FP32) / 96 (AMP) | 152 img/s | 475 img/s | 3.12x | 1.00x | 1.00x |
| 8 | 64 (FP32) / 96 (AMP) | 1120 img/s | 3360 img/s | 3.00x | 7.37x | 7.07x |

##### Training performance: NVIDIA DGX-2 (16x V100 32G)

Our results were obtained by running the `se-resnext101-32x4d/training/training_perf.sh` benchmark script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX-2 (16x V100 32G) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.
| GPUs | Batch Size / GPU | Throughput - FP32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 + XLA | Weak scaling - mixed precision + XLA |
|----|---------------|---------------|-------------------------|-------|--------|--------|
| 1 | 64 (FP32) / 96 (AMP) | 158 img/s | 472 img/s | 2.98x | 1.00x | 1.00x |
| 16 | 64 (FP32) / 96 (AMP) | 2270 img/s | 6580 img/s | 2.89x | 14.36x | 13.94x |

#### Training Time for 90 Epochs

##### Training time: NVIDIA DGX A100 (8x A100 40GB)

Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-a100-8x-a100-40gb) on NVIDIA DGX A100 (8x A100 40GB) GPUs.

| GPUs | Time to train - mixed precision + XLA | Time to train - TF32 + XLA |
|---|--------|---------|
| 1 | ~36h | ~102h |
| 8 | ~5h | ~14h |

##### Training time: NVIDIA DGX-1 (8x V100 16G)

Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-1-8x-v100-16g) on NVIDIA DGX-1 (8x V100 16G) GPUs.

| GPUs | Time to train - mixed precision + XLA | Time to train - FP32 + XLA |
|---|--------|---------|
| 1 | ~68h | ~210h |
| 8 | ~10h | ~29h |

##### Training time: NVIDIA DGX-2 (16x V100 32G)

Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-2-16x-v100-32g) on NVIDIA DGX-2 (16x V100 32G) GPUs.

| GPUs | Time to train - mixed precision + XLA | Time to train - FP32 + XLA |
|----|-------|-------|
| 1 | ~68h | ~202h |
| 16 | ~5h | ~14h |

#### Inference performance results

##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)

Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX A100 (1x A100 40GB) GPU.
**TF32 Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 95.32 img/s | 10.52 ms | 10.52 ms | 10.55 ms | 11.10 ms |
| 2 | 169.59 img/s | 11.82 ms | 11.83 ms | 11.92 ms | 12.56 ms |
| 4 | 258.97 img/s | 15.45 ms | 15.70 ms | 15.78 ms | 16.22 ms |
| 8 | 355.09 img/s | 22.53 ms | 22.74 ms | 22.84 ms | 23.17 ms |
| 16 | 561.11 img/s | 28.52 ms | 28.85 ms | 29.09 ms | 29.50 ms |
| 32 | 698.94 img/s | 45.78 ms | 46.36 ms | 46.56 ms | 46.87 ms |
| 64 | 751.17 img/s | 85.21 ms | 86.74 ms | 87.27 ms | 87.95 ms |
| 128 | 802.64 img/s | 159.47 ms | 160.01 ms | 160.35 ms | 161.42 ms |
| 256 | 840.72 img/s | 304.50 ms | 305.87 ms | 306.11 ms | 306.57 ms |

**TF32 Inference Latency + XLA**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 92.46 img/s | 10.84 ms | 10.90 ms | 10.96 ms | 11.14 ms |
| 2 | 161.55 img/s | 12.40 ms | 12.44 ms | 12.51 ms | 12.62 ms |
| 4 | 237.41 img/s | 16.88 ms | 17.54 ms | 17.79 ms | 18.25 ms |
| 8 | 358.39 img/s | 22.35 ms | 23.56 ms | 24.29 ms | 25.53 ms |
| 16 | 577.33 img/s | 27.72 ms | 28.64 ms | 28.92 ms | 29.22 ms |
| 32 | 800.81 img/s | 39.97 ms | 40.93 ms | 41.42 ms | 41.87 ms |
| 64 | 921.00 img/s | 69.64 ms | 70.44 ms | 70.90 ms | 79.54 ms |
| 128 | 1024.70 img/s | 124.99 ms | 125.70 ms | 126.10 ms | 138.57 ms |
| 256 | 1089.80 img/s | 234.90 ms | 236.02 ms | 236.37 ms | 237.26 ms |

**Mixed Precision Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 84.06 img/s | 11.92 ms | 11.94 ms | 11.96 ms | 12.08 ms |
| 2 | 170.38 img/s | 11.76 ms | 11.82 ms | 11.87 ms | 11.94 ms |
| 4 | 336.09 img/s | 11.93 ms | 12.06 ms | 12.17 ms | 12.62 ms |
| 8 | 669.91 img/s | 11.94 ms | 12.33 ms | 12.47 ms | 12.88 ms |
| 16 | 1119.49 img/s | 14.36 ms | 14.86 ms | 15.11 ms | 16.11 ms |
| 32 | 1482.46 img/s | 21.66 ms | 22.04 ms | 22.38 ms | 23.72 ms |
| 64 | 1680.85 img/s | 38.09 ms | 39.02 ms | 39.34 ms | 41.02 ms |
| 128 | 1728.27 img/s | 74.30 ms | 74.92 ms | 75.22 ms | 75.60 ms |
| 256 | 1761.56 img/s | 145.33 ms | 146.54 ms | 146.83 ms | 147.34 ms |

**Mixed Precision Inference Latency + XLA**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 74.83 img/s | 13.39 ms | 13.45 ms | 13.49 ms | 13.57 ms |
| 2 | 135.28 img/s | 14.81 ms | 14.98 ms | 15.10 ms | 16.19 ms |
| 4 | 272.18 img/s | 14.70 ms | 15.07 ms | 15.30 ms | 15.80 ms |
| 8 | 517.69 img/s | 15.50 ms | 16.63 ms | 17.05 ms | 18.10 ms |
| 16 | 1050.03 img/s | 15.38 ms | 16.84 ms | 17.49 ms | 17.97 ms |
| 32 | 1781.06 img/s | 18.27 ms | 19.54 ms | 20.00 ms | 25.94 ms |
| 64 | 2551.55 img/s | 25.26 ms | 26.03 ms | 26.62 ms | 29.67 ms |
| 128 | 2834.59 img/s | 45.50 ms | 46.85 ms | 47.72 ms | 54.91 ms |
| 256 | 3367.18 img/s | 76.03 ms | 77.06 ms | 77.36 ms | 78.13 ms |

##### Inference performance: NVIDIA DGX-1 (1x V100 16G)

Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX-1 (1x V100 16G) GPU.
**FP32 Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 75.72 img/s | 13.25 ms | 13.38 ms | 13.50 ms | 13.66 ms |
| 2 | 112.58 img/s | 17.90 ms | 20.74 ms | 20.91 ms | 21.87 ms |
| 4 | 191.09 img/s | 20.93 ms | 21.05 ms | 21.09 ms | 21.27 ms |
| 8 | 235.39 img/s | 33.98 ms | 34.14 ms | 34.19 ms | 34.28 ms |
| 16 | 315.24 img/s | 50.76 ms | 50.96 ms | 51.01 ms | 51.32 ms |
| 32 | 376.05 img/s | 85.09 ms | 85.56 ms | 85.71 ms | 86.40 ms |
| 64 | 427.39 img/s | 149.84 ms | 150.08 ms | 150.37 ms | 161.87 ms |
| 128 | 460.82 img/s | 277.76 ms | 278.97 ms | 279.48 ms | 280.95 ms |

**Mixed Precision Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 66.44 img/s | 15.10 ms | 15.17 ms | 15.25 ms | 16.01 ms |
| 2 | 132.33 img/s | 15.16 ms | 15.32 ms | 15.37 ms | 15.50 ms |
| 4 | 273.84 img/s | 14.63 ms | 15.14 ms | 15.83 ms | 17.38 ms |
| 8 | 509.35 img/s | 15.71 ms | 16.10 ms | 16.21 ms | 16.55 ms |
| 16 | 770.02 img/s | 20.78 ms | 20.96 ms | 21.03 ms | 21.24 ms |
| 32 | 926.46 img/s | 34.55 ms | 34.88 ms | 35.05 ms | 36.32 ms |
| 64 | 1039.74 img/s | 61.55 ms | 61.82 ms | 61.99 ms | 62.32 ms |
| 128 | 1102.00 img/s | 116.15 ms | 116.62 ms | 116.80 ms | 116.97 ms |

**Mixed Precision Inference Latency + XLA**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 58.55 img/s | 17.12 ms | 17.21 ms | 17.28 ms | 17.42 ms |
| 2 | 105.00 img/s | 19.10 ms | 19.29 ms | 19.36 ms | 19.67 ms |
| 4 | 207.60 img/s | 19.31 ms | 19.59 ms | 19.67 ms | 19.84 ms |
| 8 | 413.16 img/s | 19.37 ms | 19.77 ms | 19.87 ms | 20.24 ms |
| 16 | 739.12 img/s | 21.80 ms | 24.48 ms | 24.71 ms | 26.93 ms |
| 32 | 1196.83 img/s | 26.99 ms | 27.10 ms | 27.49 ms | 28.80 ms |
| 64 | 1470.31 img/s | 43.74 ms | 44.02 ms | 44.18 ms | 46.28 ms |
| 128 | 1683.63 img/s | 76.03 ms | 77.00 ms | 77.23 ms | 78.15 ms |

##### Inference performance: NVIDIA DGX-2 (1x V100 32G)

Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA DGX-2 (1x V100 32G) GPU.

**FP32 Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 71.44 img/s | 14.07 ms | 14.22 ms | 14.43 ms | 16.44 ms |
| 2 | 149.68 img/s | 13.43 ms | 13.79 ms | 13.94 ms | 16.63 ms |
| 4 | 183.01 img/s | 21.85 ms | 22.12 ms | 22.18 ms | 22.44 ms |
| 8 | 220.67 img/s | 36.25 ms | 36.84 ms | 37.17 ms | 37.43 ms |
| 16 | 310.27 img/s | 51.57 ms | 51.88 ms | 52.09 ms | 53.37 ms |
| 32 | 381.41 img/s | 83.89 ms | 84.30 ms | 84.66 ms | 85.04 ms |
| 64 | 440.37 img/s | 145.45 ms | 145.49 ms | 145.86 ms | 147.53 ms |
| 128 | 483.84 img/s | 264.54 ms | 265.04 ms | 265.46 ms | 266.43 ms |

**Mixed Precision Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 73.06 img/s | 13.74 ms | 14.07 ms | 14.20 ms | 14.35 ms |
| 2 | 155.23 img/s | 12.95 ms | 13.13 ms | 13.33 ms | 15.49 ms |
| 4 | 303.68 img/s | 13.23 ms | 13.38 ms | 13.46 ms | 14.34 ms |
| 8 | 583.43 img/s | 13.72 ms | 13.90 ms | 14.08 ms | 15.47 ms |
| 16 | 783.30 img/s | 20.43 ms | 20.66 ms | 21.31 ms | 21.97 ms |
| 32 | 932.10 img/s | 34.34 ms | 34.71 ms | 34.81 ms | 35.70 ms |
| 64 | 1058.07 img/s | 60.48 ms | 60.75 ms | 60.94 ms | 62.49 ms |
| 128 | 1129.65 img/s | 113.30 ms | 113.53 ms | 113.66 ms | 114.81 ms |

**Mixed Precision Inference Latency + XLA**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 66.43 img/s | 15.14 ms | 15.24 ms | 15.31 ms | 19.18 ms |
| 2 | 122.85 img/s | 16.39 ms | 18.28 ms | 18.45 ms | 20.33 ms |
| 4 | 247.80 img/s | 16.14 ms | 16.44 ms | 16.57 ms | 17.24 ms |
| 8 | 498.19 img/s | 16.07 ms | 16.26 ms | 16.66 ms | 17.70 ms |
| 16 | 831.20 img/s | 19.40 ms | 19.30 ms | 19.39 ms | 25.41 ms |
| 32 | 1223.75 img/s | 26.42 ms | 26.31 ms | 26.70 ms | 29.88 ms |
| 64 | 1520.64 img/s | 42.09 ms | 42.45 ms | 42.57 ms | 42.84 ms |
| 128 | 1739.61 img/s | 73.58 ms | 73.98 ms | 74.17 ms | 74.72 ms |

##### Inference performance: NVIDIA T4 (1x T4 16G)

Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the [TensorFlow 20.06-tf1-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) on NVIDIA T4 (1x T4 16G) GPU.
**FP32 Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 27.39 img/s | 36.68 ms | 38.85 ms | 39.01 ms | 40.40 ms |
| 2 | 44.56 img/s | 44.96 ms | 46.25 ms | 46.92 ms | 48.92 ms |
| 4 | 65.11 img/s | 61.43 ms | 62.22 ms | 62.93 ms | 65.01 ms |
| 8 | 80.09 img/s | 99.88 ms | 100.34 ms | 100.85 ms | 101.79 ms |
| 16 | 93.98 img/s | 170.24 ms | 170.72 ms | 171.27 ms | 171.98 ms |
| 32 | 99.86 img/s | 320.42 ms | 320.99 ms | 321.37 ms | 322.28 ms |
| 64 | 103.31 img/s | 619.44 ms | 620.08 ms | 620.55 ms | 622.19 ms |
| 128 | 105.16 img/s | 1217.18 ms | 1218.09 ms | 1218.59 ms | 1221.16 ms |

**Mixed Precision Inference Latency**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 57.21 img/s | 17.57 ms | 18.06 ms | 18.15 ms | 20.74 ms |
| 2 | 80.34 img/s | 24.97 ms | 25.38 ms | 25.69 ms | 27.12 ms |
| 4 | 115.12 img/s | 34.77 ms | 35.61 ms | 36.74 ms | 37.61 ms |
| 8 | 147.51 img/s | 54.24 ms | 54.79 ms | 55.28 ms | 58.25 ms |
| 16 | 173.83 img/s | 92.04 ms | 92.50 ms | 93.26 ms | 94.72 ms |
| 32 | 182.19 img/s | 175.64 ms | 176.51 ms | 177.44 ms | 178.52 ms |
| 64 | 193.20 img/s | 331.25 ms | 332.56 ms | 333.34 ms | 334.58 ms |
| 128 | 195.17 img/s | 655.82 ms | 657.24 ms | 658.79 ms | 661.76 ms |

**Mixed Precision Inference Latency + XLA**

|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
| 1 | 46.19 img/s | 21.72 ms | 21.90 ms | 21.93 ms | 23.64 ms |
| 2 | 80.98 img/s | 24.77 ms | 24.99 ms | 25.15 ms | 25.63 ms |
| 4 | 129.49 img/s | 30.89 ms | 31.26 ms | 31.34 ms | 32.31 ms |
| 8 | 156.91 img/s | 51.00 ms | 52.17 ms | 52.51 ms | 53.32 ms |
| 16 | 204.45 img/s | 78.26 ms | 79.58 ms | 79.96 ms | 80.44 ms |
| 32 | 215.22 img/s | 148.68 ms | 149.63 ms | 150.41 ms | 151.62 ms |
| 64 | 235.36 img/s | 272.05 ms | 273.56 ms | 274.33 ms | 275.86 ms |
| 128 | 244.45 img/s | 523.62 ms | 525.12 ms | 525.89 ms | 528.42 ms |
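
The throughput and average-latency columns in the inference tables above are closely linked: when batches are processed one after another, throughput is roughly the batch size divided by the average latency. The sketch below checks that relation against the batch-size-128 rows of the T4 tables and shows a nearest-rank percentile as one common way to obtain 90%/95%/99% latencies. The helper names and the sample latencies are hypothetical, and this is not necessarily how `inference_benchmark.sh` computes its statistics.

```python
import math


def throughput_from_latency(batch_size, avg_latency_ms):
    """Approximate images/second for sequentially processed batches."""
    return batch_size / (avg_latency_ms / 1000.0)


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Batch size 128 on T4: average latencies (ms) taken from the tables above.
for label, latency_ms in [("FP32", 1217.18), ("AMP", 655.82), ("AMP + XLA", 523.62)]:
    print(f"{label}: {throughput_from_latency(128, latency_ms):.2f} img/s")

# Hypothetical per-batch latencies in milliseconds.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile_nearest_rank(latencies_ms, 90))
```

At small batch sizes the relation is looser, since per-batch overhead outside the timed region can weigh more heavily on the reported throughput.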