The following section shows how to run benchmarks measuring model performance in training and inference modes.
To run training benchmarks, use the `scripts/benchmark.sh` script.
To benchmark inference performance on a specific batch size and dataset, run the `inference.py` script.
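Under the hood, an inference benchmark of this kind boils down to timing repeated forward passes and converting the timings into throughput and latency. A minimal sketch of the pattern (the `predict` callable and the toy batch are stand-ins for illustration, not the repository's TFT model):

```python
import statistics
import time

def benchmark_inference(predict, batch, n_warmup=10, n_iters=100):
    """Time repeated forward passes; report throughput and mean latency."""
    for _ in range(n_warmup):       # warm-up runs are excluded from timing
        predict(batch)
    latencies_ms = []
    for _ in range(n_iters):
        start = time.perf_counter()
        predict(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    throughput = len(batch) * n_iters / (sum(latencies_ms) / 1000.0)  # items/s
    return throughput, statistics.mean(latencies_ms)

# Stand-in "model": sums each item, just to exercise the loop.
throughput, avg_ms = benchmark_inference(
    lambda batch: [sum(item) for item in batch],
    [[1.0] * 64 for _ in range(8)],
)
```

The warm-up iterations matter in practice: they keep one-time initialization costs out of the reported latencies.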
The following sections provide details on how we achieved our performance and accuracy in training and inference.
We conducted an extensive hyperparameter search along with stability tests. The results presented are averages over hundreds of runs.
Our results were obtained by running the `train.sh` training script in the PyTorch 21.06 NGC container on NVIDIA A100 (8x A100 80GB) GPUs.
Dataset | GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|---|
Electricity | 8 | 1024 | 0.027 / 0.057 / 0.029 | 0.028 / 0.057 / 0.029 | 216s | 176s | 1.227x |
Traffic | 8 | 1024 | 0.043 / 0.108 / 0.079 | 0.042 / 0.107 / 0.078 | 151s | 126s | 1.198x |
Our results were obtained by running the `train.sh` training script in the PyTorch 21.06 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
Dataset | GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
Electricity | 8 | 1024 | 0.028 / 0.057 / 0.029 | 0.027 / 0.057 / 0.029 | 381s | 261s | 1.460x |
Traffic | 8 | 1024 | 0.042 / 0.106 / 0.076 | 0.040 / 0.103 / 0.074 | 256s | 176s | 1.455x |
To get a fuller picture of the model's accuracy, we performed a hyperparameter search along with stability tests on 100 random seeds for each configuration. Then, for each benchmark dataset, we chose the architecture with the lowest mean test q-risk. The table below summarizes the best configurations.
Dataset | GPUs | Hidden size | #Heads | Local batch size | Learning rate | Gradient clipping | Dropout | Mean q-risk | Std q-risk | Min q-risk | Max q-risk |
---|---|---|---|---|---|---|---|---|---|---|---|
Electricity | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.1 | 0.1131 | 0.0025 | 0.1080 | 0.1200 |
Traffic | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.3 | 0.2180 | 0.0049 | 0.2069 | 0.2336 |
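The q-risk used to rank configurations is the normalized quantile (pinball) loss from the TFT paper: twice the summed quantile losses over the horizon, divided by the summed absolute targets. A pure-Python sketch of that computation (the toy targets and predictions below are illustrative only):

```python
def quantile_loss(y, y_hat, q):
    # Pinball loss: penalizes under- and over-prediction asymmetrically.
    diff = y - y_hat
    return max(q * diff, (q - 1) * diff)

def q_risk(targets, predictions, q):
    """Normalized quantile risk: 2 * sum of quantile losses / sum |y|."""
    num = sum(quantile_loss(y, y_hat, q)
              for y, y_hat in zip(targets, predictions))
    den = sum(abs(y) for y in targets)
    return 2.0 * num / den

# Toy example with the median quantile (q = 0.5):
risk = q_risk([10.0, 12.0, 9.0], [9.0, 13.0, 9.5], q=0.5)  # ≈ 0.0806
```

Because q-risk is normalized by the scale of the targets, it is comparable across datasets, which is what makes it usable as a single model-selection criterion here.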
Our results were obtained by running the `train.sh` training script in the PyTorch 21.06 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in items per second) were averaged over an entire training epoch.
Dataset | GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|
Electricity | 1 | 1024 | 10173 | 13703 | 1.35x | 1 | 1 |
Electricity | 8 | 1024 | 80596 | 107761 | 1.34x | 7.92x | 7.86x |
Traffic | 1 | 1024 | 10197 | 13779 | 1.35x | 1 | 1 |
Traffic | 8 | 1024 | 80692 | 107979 | 1.34x | 7.91x | 7.84x |
To reproduce these results, follow the steps in the Quick Start Guide.
The performance metric used was items per second.
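The speedup and weak-scaling columns are plain ratios of the throughput numbers. For example, recomputing them from the A100 Electricity rows above:

```python
# Throughputs (items/s) taken from the A100 Electricity rows above.
tf32_1gpu, amp_1gpu = 10173, 13703
tf32_8gpu, amp_8gpu = 80596, 107761

speedup_1gpu = amp_1gpu / tf32_1gpu        # mixed precision vs. TF32 on 1 GPU
weak_scaling_tf32 = tf32_8gpu / tf32_1gpu  # 8-GPU throughput vs. 1-GPU, TF32
weak_scaling_amp = amp_8gpu / amp_1gpu     # same ratio under mixed precision

print(round(speedup_1gpu, 2),
      round(weak_scaling_tf32, 2),
      round(weak_scaling_amp, 2))
# → 1.35 7.92 7.86
```

A weak scaling close to 8x on 8 GPUs indicates near-linear scaling when the per-GPU batch size is held constant.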
Our results were obtained by running the `train.sh` training script in the PyTorch 21.06 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items per second) were averaged over an entire training epoch.
Dataset | GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|---|
Electricity | 1 | 1024 | 5580 | 9148 | 1.64x | 1 | 1 |
Electricity | 8 | 1024 | 43351 | 69855 | 1.61x | 7.77x | 7.64x |
Traffic | 1 | 1024 | 5593 | 9194 | 1.64x | 1 | 1 |
Traffic | 8 | 1024 | 43426 | 69983 | 1.61x | 7.76x | 7.61x |
To reproduce these results, follow the steps in the Quick Start Guide.
The performance metric used was items per second.
Our results were obtained by running the `inference.py` script in the PyTorch 21.06 NGC container on NVIDIA DGX A100. Throughput is measured in items per second and latency is measured in milliseconds.
To benchmark inference performance on a specific batch size and dataset, run the `inference.py` script.
Dataset | GPUs | Batch size / GPU | Throughput - mixed precision (items/s) | Average latency (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) |
---|---|---|---|---|---|---|---|
Electricity | 1 | 1 | 152.179 | 6.571 | 6.658 | 6.828 | 8.234 |
Electricity | 1 | 2 | 295.82 | 6.76 | 6.776 | 6.967 | 8.595 |
Electricity | 1 | 4 | 596.93 | 6.7 | 6.7 | 6.802 | 8.627 |
Electricity | 1 | 8 | 1464.526 | 5.461 | 5.467 | 5.638 | 7.432 |
Traffic | 1 | 1 | 152.462 | 6.559 | 6.649 | 6.832 | 7.393 |
Traffic | 1 | 2 | 297.852 | 6.715 | 6.738 | 6.878 | 8.233 |
Traffic | 1 | 4 | 598.016 | 6.688 | 6.71 | 6.814 | 7.915 |
Traffic | 1 | 8 | 1455.163 | 5.497 | 5.54 | 5.832 | 7.571 |
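The latency percentiles in these tables are order statistics over the per-batch timings. A small sketch using the nearest-rank definition (the actual script may use a different interpolation; the sample latencies below are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least p%
    of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Made-up per-batch latencies in ms; p99 picks out the tail, which is
# why it sits well above the average latency in the tables.
lat = [5.4, 5.5, 5.5, 5.6, 5.6, 5.7, 5.8, 5.9, 6.1, 7.4]
p90, p95, p99 = (percentile(lat, p) for p in (90, 95, 99))  # 6.1, 7.4, 7.4
```

Tail percentiles are reported alongside the mean because a handful of slow batches can matter for latency-sensitive serving even when the average looks fine.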
Our results were obtained by running the `inference.py` script in the PyTorch 21.06 NGC container on NVIDIA DGX-1 V100. Throughput is measured in items per second and latency is measured in milliseconds.
To benchmark inference performance on a specific batch size and dataset, run the `inference.py` script.
Dataset | GPUs | Batch size / GPU | Throughput - mixed precision (items/s) | Average latency (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) |
---|---|---|---|---|---|---|---|
Electricity | 1 | 1 | 113.613 | 8.801 | 9.055 | 10.015 | 10.764 |
Electricity | 1 | 2 | 227.097 | 8.812 | 9.065 | 9.825 | 10.983 |
Electricity | 1 | 4 | 464.545 | 8.611 | 8.696 | 8.815 | 11.105 |
Electricity | 1 | 8 | 1040.154 | 7.689 | 7.819 | 7.908 | 10.38 |
Traffic | 1 | 1 | 115.724 | 8.643 | 8.855 | 9.693 | 9.966 |
Traffic | 1 | 2 | 218.775 | 9.147 | 10.778 | 10.93 | 11.176 |
Traffic | 1 | 4 | 447.603 | 8.936 | 9.149 | 9.233 | 11.316 |
Traffic | 1 | 8 | 1042.663 | 7.673 | 7.962 | 8.04 | 9.988 |