Temporal Fusion Transformer for PyTorch

NVIDIA Deep Learning Examples

Temporal Fusion Transformer is a state-of-the-art architecture for interpretable, multi-horizon time-series prediction.

Benchmarking

The following section shows how to run benchmarks that measure model performance in training and inference modes. The first 3 steps of each epoch are excluded from the throughput and latency calculations, because nvFuser performs its optimizations on the 3rd step of the first epoch, which causes a multi-second pause.
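The exclusion of warm-up steps can be sketched as follows. This is a minimal illustration, not the repository's measurement code: the function name `steady_state_throughput`, the per-step timing list, and the sample values are all hypothetical.

```python
from typing import Sequence

WARMUP_STEPS = 3  # steps skipped at the start of each epoch, as described above


def steady_state_throughput(
    step_times_s: Sequence[float],
    batch_size: int,
    warmup_steps: int = WARMUP_STEPS,
) -> float:
    """Items per second over one epoch, ignoring the first warm-up steps."""
    measured = step_times_s[warmup_steps:]
    if not measured:
        raise ValueError("no steps left after discarding warm-up steps")
    return batch_size * len(measured) / sum(measured)


# Hypothetical epoch: the 3rd step contains a multi-second compilation pause.
epoch = [0.09, 0.08, 4.50, 0.08, 0.08, 0.08, 0.08]
naive = 1024 * len(epoch) / sum(epoch)
steady = steady_state_throughput(epoch, batch_size=1024)
print(f"all steps: {naive:.0f} items/s, warm-up excluded: {steady:.0f} items/s")
```

With these made-up timings, including the pause would understate throughput by almost an order of magnitude, which is why the slow early steps are left out.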

Training performance benchmark

To run training benchmarks, use the scripts/benchmark.sh script.

Inference performance benchmark

To benchmark the inference performance on a specific batch size and dataset, run the inference.py script.

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

We conducted an extensive hyperparameter search along with stability tests. The presented results are averages over hundreds of runs.

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the train.sh training script in the PyTorch 22.11 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

Dataset	GPUs	Batch size / GPU	Accuracy - TF32	Accuracy - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (TF32 to mixed precision)
Electricity	8	1024	0.026 / 0.056 / 0.029	0.028 / 0.058 / 0.029	200s	176s	1.136x
Traffic	8	1024	0.044 / 0.108 / 0.078	0.044 / 0.109 / 0.079	140s	129s	1.085x
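The speedup column can be reproduced from the two time-to-train columns. The sketch below assumes the speedup is the TF32 time divided by the mixed-precision time, which matches the reported values. The helper name `speedup` is illustrative.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Ratio of baseline time to optimized time (higher is better)."""
    return baseline_s / optimized_s


# Time to train in seconds, copied from the table above: (TF32, mixed precision).
time_to_train = {"Electricity": (200, 176), "Traffic": (140, 129)}

for dataset, (tf32_s, mixed_s) in time_to_train.items():
    print(f"{dataset}: {speedup(tf32_s, mixed_s):.3f}x")
# Electricity: 1.136x
# Traffic: 1.085x
```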

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the train.sh training script in the PyTorch 22.11 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.

Dataset	GPUs	Batch size / GPU	Accuracy - FP32	Accuracy - mixed precision	Time to train - FP32	Time to train - mixed precision	Time to train speedup (FP32 to mixed precision)
Electricity	8	1024	0.028 / 0.057 / 0.028	0.027 / 0.059 / 0.030	371s	269s	1.379x
Traffic	8	1024	0.042 / 0.110 / 0.080	0.043 / 0.109 / 0.080	251s	191s	1.314x

Training stability test

To get a fuller picture of the model's accuracy, we performed a hyperparameter search along with stability tests on 100 random seeds for each configuration. Then, for each benchmark dataset, we chose the configuration with the lowest mean test q-risk. The table below summarizes the best configurations.

Dataset	#GPU	Hidden size	#Heads	Local BS	LR	Gradient clipping	Dropout	Mean q-risk	Std q-risk	Min q-risk	Max q-risk
Electricity	8	128	4	1024	1e-3	0.0	0.1	0.1129	0.0025	0.1074	0.1244
Traffic	8	128	4	1024	1e-3	0.0	0.3	0.2262	0.0027	0.2207	0.2331
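The selection procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the configuration names and q-risk values are made up, the helper names are hypothetical, and the sample standard deviation is used because the source does not say which estimator produced the "Std q-risk" column.

```python
import statistics
from typing import Dict, List


def summarize(q_risks: List[float]) -> Dict[str, float]:
    """Mean, std, min and max of test q-risk over the seeds of one configuration."""
    return {
        "mean": statistics.mean(q_risks),
        "std": statistics.stdev(q_risks),  # sample std; an assumption
        "min": min(q_risks),
        "max": max(q_risks),
    }


def best_configuration(results: Dict[str, List[float]]) -> str:
    """Name of the configuration with the lowest mean test q-risk."""
    return min(results, key=lambda name: statistics.mean(results[name]))


# Hypothetical per-seed q-risk values for two configurations.
results = {
    "hidden128_dropout0.1": [0.1120, 0.1135, 0.1128],
    "hidden256_dropout0.3": [0.1180, 0.1165, 0.1172],
}
winner = best_configuration(results)
print(winner, summarize(results[winner]))
```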

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the train.sh training script in the PyTorch 22.11 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in items per second) were averaged over an entire training epoch.

Dataset	GPUs	Batch size / GPU	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
Electricity	1	1024	12435	17608	1.42x	1	1
Electricity	8	1024	94389	130769	1.39x	7.59x	7.42x
Traffic	1	1024	12509	17591	1.40x	1	1
Traffic	8	1024	94476	130992	1.39x	7.55x	7.45x

To achieve these same results, follow the steps in the Quick Start Guide.

The performance metrics used were items per second.
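The weak scaling figures appear to be the multi-GPU throughput divided by the single-GPU throughput at the same per-GPU batch size. This reading is inferred from the numbers, not stated in the source. A minimal sketch under that assumption, with an illustrative efficiency helper added:

```python
def weak_scaling(multi_gpu_items_per_s: float, single_gpu_items_per_s: float) -> float:
    """Throughput ratio between the multi-GPU and single-GPU runs."""
    return multi_gpu_items_per_s / single_gpu_items_per_s


def scaling_efficiency(scaling: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return scaling / num_gpus


# Electricity, TF32, values copied from the DGX A100 table above.
single_gpu, eight_gpu = 12435, 94389
scaling = weak_scaling(eight_gpu, single_gpu)
print(f"weak scaling: {scaling:.2f}x")
print(f"efficiency on 8 GPUs: {scaling_efficiency(scaling, 8):.1%}")
```

An efficiency close to 100% means that adding GPUs increases throughput almost linearly.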

Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the train.sh training script in the PyTorch 22.11 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items per second) were averaged over an entire training epoch.

Dataset	GPUs	Batch size / GPU	Throughput - FP32	Throughput - mixed precision	Throughput speedup (FP32 to mixed precision)	Weak scaling - FP32	Weak scaling - mixed precision
Electricity	1	1024	5932	10163	1.71x	1	1
Electricity	8	1024	45566	75660	1.66x	7.68x	7.44x
Traffic	1	1024	5971	10166	1.70x	1	1
Traffic	8	1024	45925	75640	1.64x	7.69x	7.44x

To achieve these same results, follow the steps in the Quick Start Guide.

The performance metrics used were items per second.

Inference performance results

Inference performance: NVIDIA DGX A100

Our results were obtained by running the inference.py script in the PyTorch 22.11 NGC container on NVIDIA DGX A100, using a single GPU. Throughput is measured in items per second and latency in milliseconds.

Dataset	GPUs	Batch size / GPU	Throughput - mixed precision (items/s)	Average latency (ms)	Latency p90 (ms)	Latency p95 (ms)	Latency p99 (ms)
Electricity	1	1	272.43	3.67	3.70	3.87	4.18
Electricity	1	2	518.13	3.86	3.88	3.93	4.19
Electricity	1	4	1039.31	3.85	3.89	3.97	4.15
Electricity	1	8	2039.54	3.92	3.93	3.95	4.32
Traffic	1	1	269.59	3.71	3.74	3.79	4.30
Traffic	1	2	518.73	3.86	3.78	3.91	4.66
Traffic	1	4	1021.49	3.92	3.94	3.95	4.25
Traffic	1	8	2005.54	3.99	4.01	4.03	4.39
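The columns of this table can be derived from a list of per-batch latencies. The sketch below is one way to do it, not the method inference.py necessarily uses: the function name, the percentile interpolation method, and the sample latencies are assumptions.

```python
import statistics
from typing import Dict, List


def latency_report(latencies_ms: List[float], batch_size: int) -> Dict[str, float]:
    """Average latency, tail percentiles and the throughput they imply."""
    avg_ms = statistics.mean(latencies_ms)
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "throughput_items_per_s": batch_size / (avg_ms / 1000.0),
        "avg_ms": avg_ms,
        "p90_ms": cuts[89],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


# Hypothetical per-batch latencies in milliseconds.
sample = [3.6, 3.7, 3.7, 3.8, 4.2]
for name, value in latency_report(sample, batch_size=1).items():
    print(f"{name}: {value:.2f}")
```

As a rough consistency check on the table, a batch size of 1 at 3.67 ms average latency implies about 1000 / 3.67 ≈ 272 items per second, in line with the reported 272.43.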

Inference performance: NVIDIA DGX-1 V100

Our results were obtained by running the inference.py script in the PyTorch 22.11 NGC container on NVIDIA DGX-1 V100, using a single GPU. Throughput is measured in items per second and latency in milliseconds.

Dataset	GPUs	Batch size / GPU	Throughput - mixed precision (items/s)	Average latency (ms)	Latency p90 (ms)	Latency p95 (ms)	Latency p99 (ms)
Electricity	1	1	171.68	5.82	5.99	6.17	7.00
Electricity	1	2	318.92	6.27	6.43	6.60	7.51
Electricity	1	4	684.79	5.84	6.02	6.08	6.47
Electricity	1	8	1275.54	6.27	7.31	7.36	7.51
Traffic	1	1	183.39	5.45	5.64	5.86	6.73
Traffic	1	2	340.73	5.87	6.07	6.77	7.25
Traffic	1	4	647.33	6.18	6.35	7.99	8.07
Traffic	1	8	1364.39	5.86	6.07	6.40	7.31