Temporal Fusion Transformer for PyTorch

Description

Temporal Fusion Transformer is a state-of-the-art architecture for interpretable, multi-horizon time-series prediction.

Publisher

NVIDIA

Use Case

Other

Framework

PyTorch

Latest Version

21.06.1

Modified

February 3, 2022

Compressed Size

1.09 MB

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To run training benchmarks, use the scripts/benchmark.sh script.
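
For example, from the repository root (an assumed invocation; any optional arguments are documented in the repository):

```bash
# Assumed invocation from the repository root; see the repository docs for optional arguments.
bash scripts/benchmark.sh
```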

Inference performance benchmark

To benchmark the inference performance on a specific batch size and dataset, run the inference.py script.
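
The sketch below illustrates the kind of timing loop such a benchmark typically performs; the actual measurement lives in inference.py, and the function, arguments, and batch handling here are illustrative assumptions only:

```python
import time
import numpy as np
import torch

def measure(model, batches, warmup=10):
    """Illustrative timing loop: per-batch latency (ms) plus percentile statistics.
    Not the repository's code; it only shows the general idea."""
    latencies = []
    model.eval()
    with torch.no_grad():
        for i, batch in enumerate(batches):
            torch.cuda.synchronize()            # make sure previous GPU work has finished
            start = time.perf_counter()
            model(batch)
            torch.cuda.synchronize()            # wait for this batch to finish
            if i >= warmup:                     # discard warm-up iterations
                latencies.append((time.perf_counter() - start) * 1000.0)
    lat = np.asarray(latencies)
    throughput = len(batches[0]) / (lat.mean() / 1000.0)   # items per second
    return throughput, lat.mean(), np.percentile(lat, [90, 95, 99])
```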

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

We conducted an extensive hyperparameter search along with stability tests. The presented results are averages over hundreds of runs.

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the train.sh training script in the PyTorch 21.06 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

| Dataset | GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| Electricity | 8 | 1024 | 0.027 / 0.057 / 0.029 | 0.028 / 0.057 / 0.029 | 216s | 176s | 1.227x |
| Traffic | 8 | 1024 | 0.043 / 0.108 / 0.079 | 0.042 / 0.107 / 0.078 | 151s | 126s | 1.198x |

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the train.sh training script in the PyTorch 21.06 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.

| Dataset | GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| Electricity | 8 | 1024 | 0.028 / 0.057 / 0.029 | 0.027 / 0.057 / 0.029 | 381s | 261s | 1.460x |
| Traffic | 8 | 1024 | 0.042 / 0.106 / 0.076 | 0.040 / 0.103 / 0.074 | 256s | 176s | 1.455x |

Training stability test

To get a broader picture of the model's accuracy, we performed a hyperparameter search along with stability tests on 100 random seeds for each configuration. For each benchmark dataset, we then chose the configuration with the lowest mean test q-risk. The table below summarizes the best configurations.
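
q-risk is the normalized quantile loss from the TFT paper; the three slash-separated values in the accuracy columns above appear to be q-risks at the 0.1, 0.5 and 0.9 quantiles (this interpretation is an assumption). A minimal sketch of how the metric is typically computed:

```python
import torch

def quantile_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for a single quantile q."""
    diff = y_true - y_pred
    return torch.max(q * diff, (q - 1) * diff)

def q_risk(y_true, y_pred, q):
    """Normalized quantile risk from the TFT paper:
    2 * summed quantile loss / summed |target|."""
    return 2 * quantile_loss(y_true, y_pred, q).sum() / y_true.abs().sum()

# Toy example: per-quantile risks for a batch of 24-step forecasts.
y_true = torch.randn(64, 24)          # (batch, horizon) targets
y_pred = torch.randn(64, 24, 3)       # one prediction per quantile
risks = [q_risk(y_true, y_pred[..., i], q).item()
         for i, q in enumerate((0.1, 0.5, 0.9))]
print(risks)
```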

| Dataset | #GPU | Hidden size | #Heads | Local BS | LR | Gradient clipping | Dropout | Mean q-risk | Std q-risk | Min q-risk | Max q-risk |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Electricity | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.1 | 0.1131 | 0.0025 | 0.1080 | 0.1200 |
| Traffic | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.3 | 0.2180 | 0.0049 | 0.2069 | 0.2336 |

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the train.sh training script in the PyTorch 21.06 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in items per second) were averaged over an entire training epoch.
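
The mixed-precision figures below are typically obtained with PyTorch automatic mixed precision (torch.cuda.amp). A minimal sketch of such a training step, with stand-in model and data (not the repository's actual code):

```python
import torch

# Stand-ins for the real TFT model, optimizer, and data; train.sh wires up the actual ones.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()    # scales the loss to avoid FP16 gradient underflow

for step in range(10):
    x = torch.randn(1024, 10, device="cuda")
    y = torch.randn(1024, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass runs in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```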

| Dataset | GPUs | Batch size / GPU | Throughput - TF32 (items/s) | Throughput - mixed precision (items/s) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|
| Electricity | 1 | 1024 | 10173 | 13703 | 1.35x | 1 | 1 |
| Electricity | 8 | 1024 | 80596 | 107761 | 1.34x | 7.92x | 7.86x |
| Traffic | 1 | 1024 | 10197 | 13779 | 1.35x | 1 | 1 |
| Traffic | 8 | 1024 | 80692 | 107979 | 1.34x | 7.91x | 7.84x |

To achieve these same results, follow the steps in the Quick Start Guide.

The performance metric used was items per second.
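
The speedup and weak-scaling columns are simple ratios of the throughput columns; for example, for the Electricity rows above:

```python
# Throughput ratios for the Electricity rows of the table above (items/s).
tf32_1gpu, amp_1gpu = 10173, 13703
tf32_8gpu, amp_8gpu = 80596, 107761

speedup_1gpu      = amp_1gpu / tf32_1gpu    # ~1.35x
weak_scaling_tf32 = tf32_8gpu / tf32_1gpu   # ~7.92x
weak_scaling_amp  = amp_8gpu / amp_1gpu     # ~7.86x
```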

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the train.sh training script in the PyTorch 21.06 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items per second) were averaged over an entire training epoch.

| Dataset | GPUs | Batch size / GPU | Throughput - FP32 (items/s) | Throughput - mixed precision (items/s) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|---|
| Electricity | 1 | 1024 | 5580 | 9148 | 1.64x | 1 | 1 |
| Electricity | 8 | 1024 | 43351 | 69855 | 1.61x | 7.77x | 7.64x |
| Traffic | 1 | 1024 | 5593 | 9194 | 1.64x | 1 | 1 |
| Traffic | 8 | 1024 | 43426 | 69983 | 1.61x | 7.76x | 7.61x |

To achieve these same results, follow the steps in the Quick Start Guide.

The performance metric used was items per second.

Inference performance results

Inference performance: NVIDIA DGX A100

Our results were obtained by running the inference.py script in the PyTorch 21.06 NGC container on NVIDIA DGX A100. Throughput is measured in items per second and latency is measured in milliseconds. To benchmark the inference performance on a specific batch size and dataset, run the inference.py script.

| Dataset | GPUs | Batch size / GPU | Throughput - mixed precision (items/s) | Average latency (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) |
|---|---|---|---|---|---|---|---|
| Electricity | 1 | 1 | 152.179 | 6.571 | 6.658 | 6.828 | 8.234 |
| Electricity | 1 | 2 | 295.82 | 6.76 | 6.776 | 6.967 | 8.595 |
| Electricity | 1 | 4 | 596.93 | 6.7 | 6.7 | 6.802 | 8.627 |
| Electricity | 1 | 8 | 1464.526 | 5.461 | 5.467 | 5.638 | 7.432 |
| Traffic | 1 | 1 | 152.462 | 6.559 | 6.649 | 6.832 | 7.393 |
| Traffic | 1 | 2 | 297.852 | 6.715 | 6.738 | 6.878 | 8.233 |
| Traffic | 1 | 4 | 598.016 | 6.688 | 6.71 | 6.814 | 7.915 |
| Traffic | 1 | 8 | 1455.163 | 5.497 | 5.54 | 5.832 | 7.571 |

Inference performance: NVIDIA DGX-1 V100

Our results were obtained by running the inference.py script in the PyTorch 21.06 NGC container on NVIDIA DGX-1 V100. Throughput is measured in items per second and latency is measured in milliseconds. To benchmark the inference performance on a specific batch size and dataset, run the inference.py script.

| Dataset | GPUs | Batch size / GPU | Throughput - mixed precision (items/s) | Average latency (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) |
|---|---|---|---|---|---|---|---|
| Electricity | 1 | 1 | 113.613 | 8.801 | 9.055 | 10.015 | 10.764 |
| Electricity | 1 | 2 | 227.097 | 8.812 | 9.065 | 9.825 | 10.983 |
| Electricity | 1 | 4 | 464.545 | 8.611 | 8.696 | 8.815 | 11.105 |
| Electricity | 1 | 8 | 1040.154 | 7.689 | 7.819 | 7.908 | 10.38 |
| Traffic | 1 | 1 | 115.724 | 8.643 | 8.855 | 9.693 | 9.966 |
| Traffic | 1 | 2 | 218.775 | 9.147 | 10.778 | 10.93 | 11.176 |
| Traffic | 1 | 4 | 447.603 | 8.936 | 9.149 | 9.233 | 11.316 |
| Traffic | 1 | 8 | 1042.663 | 7.673 | 7.962 | 8.04 | 9.988 |