The SSD300 v1.1 model is based on the SSD: Single Shot MultiBox Detector paper, which describes SSD as "a method for detecting objects in images using a single deep neural network". The input size is fixed to 300x300.
The main difference between this model and the one described in the paper is in the backbone. Specifically, the VGG model is obsolete and is replaced by the ResNet-50 model.
From the Speed/accuracy trade-offs for modern convolutional object detectors paper, the following enhancements were made to the backbone:
* The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.
* All strides in conv4_x are set to 1x1.
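A minimal sketch of this backbone surgery in PyTorch, assuming a torchvision ResNet-50 (in torchvision's naming, layer3 corresponds to conv4_x and layer4 to conv5_x); this is an illustration, not the repository's exact code:

```python
import torch.nn as nn
import torchvision.models as models

# Start from a pre-trained ResNet-50 classifier.
resnet = models.resnet50(pretrained=True)

# Keep everything up to and including layer3 (conv4_x); this drops
# layer4 (conv5_x), avgpool, and fc.
backbone = nn.Sequential(*list(resnet.children())[:7])

# Set every stride in conv4_x to 1x1 so the feature map resolution is kept.
for module in backbone[-1].modules():
    if isinstance(module, nn.Conv2d):
        module.stride = (1, 1)
```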
Detector heads are similar to the ones referenced in the paper, however, they are enhanced by additional BatchNorm layers after each convolution.
Additionally, we removed weight decay on every bias parameter and all the BatchNorm layer parameters as described in the Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes paper.
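One way to express this in PyTorch is to put biases and BatchNorm parameters into a separate optimizer parameter group with zero weight decay; a minimal sketch, where the toy model and hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

# Split parameters so that biases and BatchNorm parameters get no weight decay.
def build_param_groups(model, weight_decay=5e-4):
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if isinstance(module, nn.BatchNorm2d) or name == "bias":
                no_decay.append(param)  # no weight decay for BN and biases
            else:
                decay.append(param)     # regular weight decay elsewhere
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16))
optimizer = torch.optim.SGD(build_param_groups(model), lr=2.6e-3, momentum=0.9)
```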
Training of SSD requires computationally costly augmentations. To fully utilize GPUs during training, we use the NVIDIA DALI library to accelerate the data preparation pipeline.
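A minimal sketch of what a DALI-accelerated input pipeline looks like; the repository's actual pipeline uses a COCO reader and SSD-specific augmentations, so the reader and operators below are simplified assumptions:

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def ssd_train_pipeline(file_root):
    # Read files on the host, decode JPEGs on the GPU ("mixed" device).
    jpegs, labels = fn.readers.file(
        file_root=file_root, random_shuffle=True, name="Reader"
    )
    images = fn.decoders.image(jpegs, device="mixed")
    # Resize to the fixed 300x300 SSD input and normalize on the GPU.
    images = fn.resize(images, resize_x=300, resize_y=300)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels
```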
This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
Despite the changes described in the previous section, the overall architecture, as described in the following diagram, has not changed.
Figure 1. The architecture of a Single Shot MultiBox Detector model. Image has been taken from the Single Shot MultiBox Detector paper.
The backbone is followed by 5 additional convolutional layers. In addition to the convolutional layers, we attached 6 detection heads:
* The first detection head is attached to the last conv4_x layer.
* The other five detection heads are attached to the corresponding 5 additional layers.
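As a hypothetical illustration of one such head pair, with the BatchNorm enhancement mentioned earlier; the channel and anchor counts vary per head and are not the repository's exact values:

```python
import torch.nn as nn

# One SSD detection head pair per feature map: a localization branch predicting
# 4 box offsets per anchor and a classification branch predicting class scores.
# Per the enhancement described above, each convolution is followed by BatchNorm.
def make_head(in_channels, num_anchors, num_classes):
    loc = nn.Sequential(
        nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1),
        nn.BatchNorm2d(num_anchors * 4),
    )
    conf = nn.Sequential(
        nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=3, padding=1),
        nn.BatchNorm2d(num_anchors * num_classes),
    )
    return loc, conf
```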
We trained the model for 65 epochs with the following setup:
* SGD with momentum (0.9)
* Learning rate = 2.6e-3 * number of GPUs * (batch_size / 32)
* Learning rate decay: multiply by 0.1 before the 43rd and 54th epochs
* Linear warmup of the learning rate during the first epoch
For more information, see the Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour paper.
To enable warmup, provide the argument `--warmup 300`.
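A hypothetical sketch of what such a linear warmup amounts to (not the repository's exact implementation): the learning rate ramps from zero to its target over the first 300 iterations.

```python
# Linearly ramp the learning rate over the first `warmup_iters` iterations.
def apply_warmup(optimizer, target_lr, iteration, warmup_iters=300):
    if iteration < warmup_iters:
        lr = target_lr * (iteration + 1) / warmup_iters
        for group in optimizer.param_groups:
            group["lr"] = lr
```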
Weight decay:
* 0 for BatchNorms and biases
* 5e-4 for other layers
Note: The learning rate is automatically scaled (in other words, multiplied by the number of GPUs and multiplied by the batch size divided by 32).
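As a concrete example of this scaling rule, using the 2.6e-3 base learning rate from the setup above and hypothetical values for the GPU count and per-GPU batch size:

```python
# Illustrative arithmetic for the automatic learning-rate scaling:
# scaled_lr = base_lr * num_gpus * (batch_size / 32)
base_lr = 2.6e-3   # base learning rate from the training setup above
num_gpus = 8       # hypothetical number of GPUs
batch_size = 64    # hypothetical per-GPU batch size
scaled_lr = base_lr * num_gpus * (batch_size / 32)
print(scaled_lr)   # ~0.0416
```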
The following features are supported by this model.
| Feature | SSD300 v1.1 PyTorch |
|---------|---------------------|
| AMP | Yes |
| APEX DDP | Yes |
| NVIDIA DALI | Yes |
AMP is an abbreviation used for automatic mixed precision training.
DDP stands for DistributedDataParallel and is used for multi-GPU training.
NVIDIA DALI - DALI is a library that accelerates data preparation pipelines. To accelerate your input pipeline, you only need to define your data loader with the DALI library. For details, see the example sources in this repository or the DALI documentation.
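For instance, assuming the hypothetical `ssd_train_pipeline` sketched earlier, the pipeline can be wrapped so it is consumed like a regular PyTorch data loader:

```python
from nvidia.dali.plugin.pytorch import DALIGenericIterator

# Build the pipeline and expose it through a PyTorch-style iterator.
pipe = ssd_train_pipeline(file_root="/path/to/images")  # illustrative path
pipe.build()
loader = DALIGenericIterator(pipe, ["images", "labels"], reader_name="Reader")

for batch in loader:
    images = batch[0]["images"]  # already a CUDA tensor
    labels = batch[0]["labels"]
```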
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about:
* How to train using mixed precision, see the Mixed Precision Training paper and the Training With Mixed Precision documentation.
* Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog post.
* APEX tools for mixed precision training, see the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch blog post.
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) autocast, torch.cuda.amp.autocast, which casts variables to half-precision upon retrieval while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a gradient scaling step must be included.
For an in-depth walkthrough of AMP, check out the sample usage here.
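A minimal sketch of the autocast-plus-gradient-scaling pattern described above, using a toy model and random data in place of SSD300 and the COCO data loader:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model, optimizer, and data loader.
model = nn.Linear(300, 81).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=2.6e-3, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 300, device="cuda")
    targets = torch.randint(0, 81, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in FP16 where safe
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()     # scale loss to keep small gradients
    scaler.step(optimizer)            # unscale gradients, then step
    scaler.update()                   # adjust the scale factor for next step
```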
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
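In PyTorch, the use of TF32 can be controlled explicitly through the following backend flags (setting them to False forces full FP32 math instead):

```python
import torch

# Allow TF32 math on Ampere GPUs for the two main operation classes.
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```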
backbone : a part of many object detection architectures, usually pre-trained for a different, simpler task, like classification.
input pipeline : the set of operations performed on every item of input data before it is fed to the neural network. Especially for the object detection task, the input pipeline can be complex and computationally significant. For that reason, solutions like NVIDIA DALI emerged.
object detection : a class of Computer Vision problems. The task of object detection is to localize possibly multiple objects in an image and classify them. The differences between object detection, image classification, and localization are clearly explained in the video published as part of the C4W3L01 course.
SSD (Single Shot MultiBox Detector) : a name for the detection model described in a paper authored by Liu et al.
ResNet (ResNet-50) : a name for the classification model described in a paper authored by He et al. In this repo, it is used as a backbone for SSD.