### Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

#### Training performance benchmark

To benchmark the training performance on a specific batch size, source length, target length and dataset for one epoch, run:

```bash
bash scripts/run_training_benchmark.sh
```

The resulting `NUM_GPU` and `PRECISION` vs. throughput results are stored in `results/bart_pyt_training_benchmark_${DATESTAMP}/inference_benchmark.log`.

#### Inference performance benchmark

To benchmark the inference performance on a specific batch size, source length, target length and dataset, run:

```bash
bash scripts/run_inference_benchmark.sh
```

The resulting `NUM_GPU` and `PRECISION` vs. throughput results are stored in `results/bart_pyt_inference_benchmark_${DATESTAMP}/inference_benchmark.log`.

### Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

#### Training accuracy results

##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results for the XSum dataset were obtained by running the `run_summarization.sh` training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. The accuracy columns list ROUGE-1, ROUGE-2 and ROUGE-LSum scores.

| GPUs | Batch size (TF32, mixed precision) | Accuracy - TF32 | Accuracy - mixed precision | Time to train (hrs) - TF32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (TF32 to mixed precision) |
|------|------------------|---------------------|---------------------|------|------|------|
| 1    | 24, 40           | 44.41, 21.02, 35.66 | 44.87, 21.49, 36.17 | 3.10 | 2.43 | 1.27 |
| 8    | 192, 320         | 45.34, 21.93, 36.61 | 45.31, 21.83, 36.60 | 0.58 | 0.45 | 1.27 |

In addition, results for the CNN-DM dataset are:

| GPUs | Batch size (TF32, mixed precision) | Accuracy - TF32 | Accuracy - mixed precision | Time to train (hrs) - TF32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (TF32 to mixed precision) |
|------|------------------|---------------------|---------------------|------|------|------|
| 1    | 24, 40           | 44.37, 21.36, 41.17 | 44.43, 21.43, 41.22 | 4.88 | 3.61 | 1.35 |
| 8    | 192, 320         | 44.49, 21.48, 41.28 | 44.19, 21.26, 40.97 | 0.73 | 0.56 | 1.30 |

##### Training accuracy: NVIDIA DGX-1 V100 (8x V100 32GB)

Our results were obtained by running the `run_summarization.sh` training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX-2 with (16x V100 32GB) GPUs. The accuracy columns list ROUGE-1, ROUGE-2 and ROUGE-LSum scores.

| GPUs | Batch size (FP32, mixed precision) | Accuracy - FP32 | Accuracy - mixed precision | Time to train (hrs) - FP32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (FP32 to mixed precision) |
|------|------------------|---------------------|---------------------|-------|------|------|
| 1    | 8, 14            | 44.16, 20.66, 35.24 | 44.86, 21.41, 36.02 | 17.23 | 6.12 | 2.82 |
| 8    | 64, 112          | 45.42, 21.91, 36.62 | 45.58, 22.01, 36.79 | 2.56  | 1.09 | 2.36 |

In addition, results for the CNN-DM dataset are:

| GPUs | Batch size (FP32, mixed precision) | Accuracy - FP32 | Accuracy - mixed precision | Time to train (hrs) - FP32 | Time to train (hrs) - mixed precision | Time to train (hrs) speedup (FP32 to mixed precision) |
|------|------------------|---------------------|---------------------|-------|------|------|
| 1    | 8, 14            | 44.49, 21.48, 41.26 | 44.55, 21.47, 41.32 | 26.17 | 9.74 | 2.69 |
| 8    | 64, 112          | 44.34, 21.42, 41.12 | 44.27, 21.30, 41.06 | 3.58  | 1.45 | 2.46 |

##### Training stability test

Our results for the XSum dataset were obtained by running the `run_summarization.sh` training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. The table lists ROUGE-1 scores across 5 training runs with different seeds on DGX A100.

| **FP16, 8x GPUs** | **seed 1** | **seed 2** | **seed 3** | **seed 4** | **seed 5** | **mean** | **std** |
|:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| ROUGE-1     | 45.34 | 45.34 | 45.21 | 45.33 | 45.34 | 45.31 | 0.055 |

#### Training performance results

##### Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the `run_summarization.sh` training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
Performance numbers (in tokens per second) were averaged over an entire training epoch.

| GPUs | Batch size / GPU (TF32, mixed precision) | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|------|------------------|--------|--------|------|------|------|
| 1    | 24, 40           | 31607  | 42076  | 1.33 | 1.00 | 1.00 |
| 8    | 24, 40           | 163054 | 217514 | 1.33 | 5.16 | 5.17 |

To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

The performance metric is tokens per second, computed by iterating through an entire epoch of the XSum dataset with source length = 1024 and target length = 60.

##### Training performance: NVIDIA DGX-1 V100 (8x V100 32GB)

Our results were obtained by running the `run_summarization.sh` training script in the PyTorch 21.02-py3 NGC container on NVIDIA DGX-2 with (16x V100 32GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.

| GPUs | Batch size / GPU (FP32, mixed precision) | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|------|------------------|-------|--------|------|------|------|
| 1    | 8, 14            | 7527  | 19356  | 2.57 | 1.00 | 1.00 |
| 8    | 8, 14            | 42024 | 111720 | 2.65 | 5.58 | 5.77 |

To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

The performance metric is tokens per second, computed by iterating through an entire epoch of the XSum dataset with source length = 1024 and target length = 60.
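
The derived columns in the training performance tables can be recomputed from the raw throughput values. The following is a minimal sketch using the DGX A100 numbers; the function names are illustrative and are not part of the repository's scripts. It assumes that weak scaling means throughput on N GPUs divided by single-GPU throughput at the same per-GPU batch size, and that speedup means mixed-precision throughput divided by TF32 throughput.

```python
# Throughput (tokens per second) copied from the DGX A100 training table,
# keyed by precision and then by number of GPUs.
throughput = {
    "TF32": {1: 31607, 8: 163054},
    "mixed precision": {1: 42076, 8: 217514},
}


def weak_scaling(tput_by_gpus, n_gpus):
    """Throughput on n_gpus relative to the single-GPU throughput."""
    return tput_by_gpus[n_gpus] / tput_by_gpus[1]


def precision_speedup(tput, n_gpus):
    """Mixed-precision throughput relative to TF32 on the same GPU count."""
    return tput["mixed precision"][n_gpus] / tput["TF32"][n_gpus]


if __name__ == "__main__":
    for precision, by_gpus in throughput.items():
        print(f"{precision}: weak scaling on 8 GPUs = {weak_scaling(by_gpus, 8):.2f}")
    for n_gpus in (1, 8):
        print(f"{n_gpus} GPU(s): speedup = {precision_speedup(throughput, n_gpus):.2f}")
```

With the values above, this reproduces the 5.16 and 5.17 weak scaling factors and the 1.33 speedup reported for DGX A100.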
#### Inference performance results

##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the `run_eval_summarization.sh` inference benchmarking script in the PyTorch 20.11-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.

FP16

| Batch size | Latency Avg | Latency 90% | Latency 95% | Latency 99% | Throughput |
|------------|-------------|:-----------:|:-----------:|:-----------:|------------|
| 1          | 0.43        | 0.53        | 0.57        | 0.67        | 2.34       |
| 4          | 0.64        | 0.75        | 0.81        | 0.95        | 6.28       |
| 8          | 0.86        | 1.01        | 1.09        | 1.20        | 9.35       |
| 16         | 1.29        | 1.56        | 1.65        | 1.76        | 12.44      |
| 32         | 2.38        | 3.06        | 3.23        | 3.33        | 13.42      |
| 64         | 4.70        | 6.06        | 6.25        | 6.35        | 13.55      |
| 128        | 10.10       | 12.22       | 12.32       | 12.96       | 12.61      |

To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

The inference performance metrics used are milliseconds per iteration. They are computed by iterating through the XSum test data with source length = 1024, target length = 60 and beam search = 6.

##### Inference performance: NVIDIA DGX-1 V100 (1x V100 32GB)

Our results were obtained by running the `run_eval_summarization.sh` inference benchmarking script in the PyTorch 20.11-py3 NGC container on NVIDIA DGX-2 with (1x V100 32GB) GPU.

FP16

| Batch size | Latency Avg | Latency 90% | Latency 95% | Latency 99% | Throughput |
|------------|-------------|:-----------:|:-----------:|:-----------:|------------|
| 1          | 0.67        | 0.84        | 0.89        | 1.04        | 1.49       |
| 4          | 0.96        | 1.14        | 1.24        | 1.43        | 4.16       |
| 8          | 1.33        | 1.59        | 1.72        | 1.90        | 6.01       |
| 16         | 1.99        | 2.39        | 2.57        | 2.69        | 8.04       |
| 32         | 3.41        | 4.31        | 4.53        | 4.63        | 9.36       |
| 64         | 6.66        | 8.61        | 8.75        | 8.92        | 9.55       |

To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

The inference performance metrics used are milliseconds per iteration. They are computed by iterating through the XSum test data with source length = 1024, target length = 60 and beam search = 6.
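
The throughput column in the inference tables carries no unit. As a rough consistency check, the reported values are close to batch size divided by average latency. The sketch below tests that relation against the V100 rows; whether the benchmark script computes throughput exactly this way is an assumption, and the function names are illustrative.

```python
# (batch size, average latency, reported throughput) copied from the
# V100 FP16 inference table.
rows = [
    (1, 0.67, 1.49),
    (4, 0.96, 4.16),
    (8, 1.33, 6.01),
    (16, 1.99, 8.04),
    (32, 3.41, 9.36),
    (64, 6.66, 9.55),
]


def estimated_throughput(batch_size, avg_latency):
    """Items processed per unit of latency time (assumed relation)."""
    return batch_size / avg_latency


def relative_gap(estimate, reported):
    """Relative difference between the estimate and the reported value."""
    return abs(estimate - reported) / reported


if __name__ == "__main__":
    for batch_size, latency, reported in rows:
        estimate = estimated_throughput(batch_size, latency)
        gap = relative_gap(estimate, reported)
        print(f"batch {batch_size:>3}: estimate {estimate:.2f}, "
              f"reported {reported:.2f}, gap {gap:.2%}")
```

For these rows the estimate stays within 1% of the reported throughput, and the remaining gap is plausibly due to rounding of the tabulated latencies.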
##### Inference performance: NVIDIA T4

Our results were obtained by running the `run_eval_summarization.sh` inference benchmarking script in the PyTorch 21.02-py3 NGC container on an NVIDIA T4 GPU.

FP16

| Batch size | Latency Avg | Latency 90% | Latency 95% | Latency 99% | Throughput |
|------------|-------------|:-----------:|:-----------:|:-----------:|------------|
| 1          | 0.42        | 0.52        | 0.56        | 0.66        | 2.40       |
| 4          | 0.72        | 0.89        | 0.96        | 1.09        | 5.58       |
| 8          | 1.13        | 1.60        | 1.73        | 1.96        | 7.08       |
| 16         | 2.25        | 3.19        | 3.38        | 3.58        | 7.11       |
| 32         | 4.44        | 6.53        | 6.96        | 7.21        | 7.19       |

To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

The inference performance metrics used are milliseconds per iteration. They are computed by iterating through the XSum test data with source length = 1024, target length = 60 and beam search = 6.
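
The latency columns in the inference tables report an average plus the 90th, 95th and 99th percentiles. The sketch below shows one way such statistics can be derived from per-iteration timings. It uses the nearest-rank percentile definition and made-up timings; the actual benchmark script may use a different percentile method (for example, interpolation), so treat this as an illustration only.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_latency(samples):
    """Return the average and the 90th/95th/99th percentile latencies."""
    return {
        "avg": statistics.fmean(samples),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


if __name__ == "__main__":
    # Hypothetical per-iteration timings, not measured data.
    timings = [0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.47, 0.50, 0.55, 0.66]
    for name, value in summarize_latency(timings).items():
        print(f"{name}: {value:.2f}")
```

Tail percentiles are reported alongside the average because a few slow iterations can dominate user-visible latency while barely moving the mean.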