Wide & Deep for TensorFlow2

Description: Wide & Deep Recommender model.
Publisher: NVIDIA Deep Learning Examples
Use Case: Recommender
Framework: TensorFlow2
Latest Version: 22.03.0
Modified: July 8, 2022
Compressed Size: 2.26 MB

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and evaluation modes.

Training and evaluation performance benchmark

The benchmark script measures the performance of the model during training (the default configuration) and during evaluation (--evaluation). The benchmark runs training or evaluation for --benchmark_steps batches; however, performance measurement starts only after --benchmark_warmup_steps. The benchmark can be run on a single GPU or 8 GPUs, and with any combination of XLA (--xla), AMP (--amp), batch sizes (--global_batch_size, --eval_batch_size), and affinity (--affinity).

To run the benchmark, follow these steps:

Run the Wide & Deep container (${HOST_OUTBRAIN_PATH} is the path to the Outbrain dataset):

docker run --runtime=nvidia --gpus=all --rm -it --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2 bash
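
For example, if the preprocessed Outbrain dataset is stored under /data/outbrain on the host (a hypothetical path used only for illustration), set the variable before launching the container:

export HOST_OUTBRAIN_PATH=/data/outbrain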

Run the benchmark script (${GPU} is the number of GPUs):

horovodrun -np ${GPU} sh hvd_wrapper.sh python main.py --benchmark
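
For example, to benchmark evaluation with both AMP and XLA enabled on 8 GPUs, combining the flags described above:

horovodrun -np 8 sh hvd_wrapper.sh python main.py --benchmark --evaluation --amp --xla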

Results

The following sections provide details on how we achieved our performance and accuracy in training.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the main.py training script in the TensorFlow2 NGC container on an NVIDIA DGX A100 (8x A100 80GB).

| GPUs | Batch size / GPU | XLA | Accuracy - TF32 (MAP@12) | Accuracy - mixed precision (MAP@12) | Time to train - TF32 (minutes) | Time to train - mixed precision (minutes) | Time-to-train speedup (TF32 to mixed precision) |
|------|------------------|-----|--------------------------|-------------------------------------|--------------------------------|-------------------------------------------|--------------------------------------------------|
| 1 | 131072 | Yes | 0.65728 | 0.65728 | 17.05 | 13.12 | 1.30 |
| 1 | 131072 | No | 0.65734 | 0.65732 | 21.75 | 17.50 | 1.24 |
| 8 | 16384 | Yes | 0.65754 | 0.65751 | 6.48 | 6.33 | 1.02 |
| 8 | 16384 | No | 0.65750 | 0.65754 | 8.07 | 7.87 | 1.03 |
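
The speedup column is the ratio of TF32 to mixed-precision training time; for example, 17.05 / 13.12 ≈ 1.30 for the single-GPU XLA configuration. The same definition applies to the DGX-1 table below, with FP32 in place of TF32.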

To achieve the same results, follow the steps in the Quick Start Guide.

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the main.py training script in the TensorFlow2 NGC container on an NVIDIA DGX-1 (8x V100 32GB).

| GPUs | Batch size / GPU | XLA | Accuracy - FP32 (MAP@12) | Accuracy - mixed precision (MAP@12) | Time to train - FP32 (minutes) | Time to train - mixed precision (minutes) | Time-to-train speedup (FP32 to mixed precision) |
|------|------------------|-----|--------------------------|-------------------------------------|--------------------------------|-------------------------------------------|--------------------------------------------------|
| 1 | 131072 | Yes | 0.65736 | 0.65731 | 72.38 | 24.60 | 2.94 |
| 1 | 131072 | No | 0.65736 | 0.65735 | 80.53 | 31.60 | 2.55 |
| 8 | 16384 | Yes | 0.65751 | 0.65752 | 15.62 | 10.13 | 1.54 |
| 8 | 16384 | No | 0.65749 | 0.65752 | 18.37 | 12.45 | 1.48 |

To achieve the same results, follow the steps in the Quick Start Guide.

Training accuracy plots

Models trained with FP32, TF32, and Automatic Mixed Precision (AMP), with and without XLA enabled, achieve similar accuracy.

The plot shows MAP@12 as a function of training steps (one step is a single batch) for the default precision (FP32 for the Volta architecture (DGX-1) and TF32 for the Ampere architecture (DGX A100)) and for AMP, each with and without XLA, on the NVTabular dataset. All other training parameters are left at their defaults.


Figure 2. Learning curves for different configurations on a single GPU.

Training stability test

Training of the model is stable for multiple configurations, with a standard deviation of MAP@12 on the order of 1e-4. The model achieves similar MAP@12 scores across A100 and V100 hardware, training precisions, XLA usage, and single/multi-GPU training. The Wide & Deep model was trained for 9,140 training steps (20 epochs of 457 batches each, every batch containing 131,072 samples), starting from 20 different initial random seeds for each setup. Training was performed in the 22.03 Merlin TensorFlow Training NGC container on NVIDIA DGX A100 80GB and DGX-1 32GB machines, with and without mixed precision enabled, and with and without XLA enabled, on the NVTabular-generated dataset. The provided charts and numbers cover single- and 8-GPU training. After training, the models were evaluated on the validation set. The following plots compare the distributions of MAP@12 on the evaluation set; columns show single- vs. 8-GPU training, and rows show DGX A100 vs. DGX-1 V100.


Figure 3. Training stability plot: distribution of MAP@12 across different configurations. 'All configurations' refers to the distribution of MAP@12 over the Cartesian product of architecture, training precision, XLA usage, and single/multi GPU.

Training stability was also compared in terms of point statistics of the MAP@12 distribution for multiple configurations. Refer to the table below.

Full tabular data for training stability tests
| Hardware | GPUs | Precision | XLA | Mean | Std | Min | Max |
|----------|------|-----------|-----|------|-----|-----|-----|
| DGX A100 | 1 | TF32 | Yes | 0.65728 | 0.00014 | 0.6571 | 0.6575 |
| DGX A100 | 1 | TF32 | No | 0.65734 | 0.00007 | 0.6572 | 0.6575 |
| DGX A100 | 1 | AMP | Yes | 0.65728 | 0.00011 | 0.6571 | 0.6575 |
| DGX A100 | 1 | AMP | No | 0.65732 | 0.00009 | 0.6572 | 0.6575 |
| DGX A100 | 8 | TF32 | Yes | 0.65754 | 0.00014 | 0.6573 | 0.6579 |
| DGX A100 | 8 | TF32 | No | 0.65750 | 0.00011 | 0.6573 | 0.6577 |
| DGX A100 | 8 | AMP | Yes | 0.65751 | 0.00013 | 0.6573 | 0.6577 |
| DGX A100 | 8 | AMP | No | 0.65754 | 0.00013 | 0.6573 | 0.6578 |
| DGX-1 V100 | 1 | FP32 | Yes | 0.65736 | 0.00011 | 0.6572 | 0.6576 |
| DGX-1 V100 | 1 | FP32 | No | 0.65736 | 0.00009 | 0.6572 | 0.6575 |
| DGX-1 V100 | 1 | AMP | Yes | 0.65731 | 0.00013 | 0.6571 | 0.6576 |
| DGX-1 V100 | 1 | AMP | No | 0.65735 | 0.00011 | 0.6571 | 0.6575 |
| DGX-1 V100 | 8 | FP32 | Yes | 0.65751 | 0.00011 | 0.6574 | 0.6578 |
| DGX-1 V100 | 8 | FP32 | No | 0.65749 | 0.00014 | 0.6572 | 0.6577 |
| DGX-1 V100 | 8 | AMP | Yes | 0.65752 | 0.00012 | 0.6573 | 0.6578 |
| DGX-1 V100 | 8 | AMP | No | 0.65752 | 0.00013 | 0.6573 | 0.6577 |

Impact of mixed precision on training accuracy

The accuracy of training, measured as MAP@12 on the evaluation set after the final epoch, was not impacted by enabling mixed precision. The obtained results were statistically similar. The similarity was measured according to the following procedure:

The model was trained 20 times with default settings (FP32 for the Volta architecture, TF32 for the Ampere architecture) and 20 times with AMP. After the last epoch, the MAP@12 accuracy score was calculated on the evaluation set.

Distributions for four configurations, architecture (A100, V100) crossed with single/multi GPU, on the NVTabular dataset are presented below.


Figure 4. Influence of AMP on the MAP@12 distribution for DGX A100 and DGX-1 V100, for single- and multi-GPU training.

Distributions of scores for full-precision and AMP training were compared in terms of mean, variance, and a two-sample Kolmogorov–Smirnov test to assess whether full-precision and AMP results differ statistically. Refer to the table below.

Full tabular data for AMP influence on MAP@12
| Hardware | GPUs | XLA | Mean MAP@12 - full precision (TF32 for A100, FP32 for V100) | Std - full precision | Mean MAP@12 - AMP | Std - AMP | KS test: statistic (p-value) |
|----------|------|-----|--------------------------------------------------------------|----------------------|--------------------|-----------|-------------------------------|
| DGX A100 | 1 | Yes | 0.65728 | 0.00014 | 0.65728 | 0.00011 | 0.15000 (0.98314) |
| DGX A100 | 8 | Yes | 0.65754 | 0.00014 | 0.65751 | 0.00013 | 0.10000 (0.99999) |
| DGX A100 | 1 | No | 0.65734 | 0.00007 | 0.65732 | 0.00009 | 0.20000 (0.83197) |
| DGX A100 | 8 | No | 0.65750 | 0.00011 | 0.65754 | 0.00013 | 0.15000 (0.98314) |
| DGX-1 V100 | 1 | Yes | 0.65736 | 0.00011 | 0.65731 | 0.00013 | 0.20000 (0.83197) |
| DGX-1 V100 | 8 | Yes | 0.65751 | 0.00011 | 0.65752 | 0.00012 | 0.10000 (0.99999) |
| DGX-1 V100 | 1 | No | 0.65736 | 0.00009 | 0.65735 | 0.00011 | 0.05000 (1.00000) |
| DGX-1 V100 | 8 | No | 0.65749 | 0.00014 | 0.65752 | 0.00013 | 0.15000 (0.98314) |

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the benchmark script (main.py --benchmark) in the TensorFlow2 NGC container on an NVIDIA DGX A100 (8x A100 80GB).

| GPUs | Batch size / GPU | XLA | Throughput - TF32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (TF32 to mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
|------|------------------|-----|-------------------------------|-------------------------------------------|-----------------------------------------------|------------------------|-----------------------------------|
| 1 | 131072 | Yes | 1655113 | 2346864 | 1.42 | 1.00 | 1.00 |
| 1 | 131072 | No | 1198447 | 1568767 | 1.31 | 1.00 | 1.00 |
| 8 | 16384 | Yes | 5364411 | 5852297 | 1.09 | 3.24 | 2.49 |
| 8 | 16384 | No | 3955617 | 4048638 | 1.02 | 3.30 | 2.58 |
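
Strong scaling is the ratio of 8-GPU throughput to single-GPU throughput at the same precision and XLA setting; for example, for TF32 with XLA, 5364411 / 1655113 ≈ 3.24.
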
Training performance: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the benchmark script (main.py --benchmark) in the TensorFlow2 NGC container on an NVIDIA DGX-1 (8x V100 32GB).

| GPUs | Batch size / GPU | XLA | Throughput - FP32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (FP32 to mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
|------|------------------|-----|-------------------------------|-------------------------------------------|-----------------------------------------------|------------------------|-----------------------------------|
| 1 | 131072 | Yes | 338245 | 1111894 | 3.29 | 1.00 | 1.00 |
| 1 | 131072 | No | 293062 | 814952 | 2.78 | 1.00 | 1.00 |
| 8 | 16384 | Yes | 1869462 | 3549165 | 1.90 | 5.53 | 3.19 |
| 8 | 16384 | No | 1489016 | 2491795 | 1.67 | 5.08 | 3.06 |

Evaluation performance results

Evaluation performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the benchmark script (main.py --evaluate --benchmark) in the TensorFlow2 NGC container on an NVIDIA DGX A100 (8x A100 80GB).

| GPUs | Batch size / GPU | XLA | Throughput - TF32 (samples/s) | Throughput - AMP (samples/s) | Throughput speedup (AMP to TF32) |
|------|------------------|-----|-------------------------------|-------------------------------|-----------------------------------|
| 1 | 4096 | No | 631542 | 605132 | 0.96 |
| 1 | 8192 | No | 1003923 | 1025958 | 1.02 |
| 1 | 16384 | No | 1436331 | 1465785 | 1.02 |
| 1 | 32768 | No | 1807615 | 1965822 | 1.09 |
| 1 | 65536 | No | 2114939 | 2320347 | 1.10 |
| 1 | 131072 | No | 2343520 | 2638773 | 1.13 |
| 8 | 4096 | No | 4474162 | 4129841 | 0.92 |
| 8 | 8192 | No | 6984567 | 6977303 | 1.00 |
| 8 | 16384 | No | 10398419 | 10872412 | 1.05 |
| 8 | 32768 | No | 13896799 | 13704361 | 0.99 |
| 8 | 65536 | No | 15933755 | 17760589 | 1.11 |

For more results, refer to the full table below.

Full tabular data for evaluation performance results for DGX A100
| GPUs | Batch size / GPU | XLA | Throughput - TF32 (samples/s) | Throughput - AMP (samples/s) | Throughput speedup (AMP to TF32) |
|------|------------------|-----|-------------------------------|-------------------------------|-----------------------------------|
| 1 | 4096 | Yes | 765213 | 802188 | 1.05 |
| 1 | 4096 | No | 631542 | 605132 | 0.96 |
| 1 | 8192 | Yes | 1162267 | 1233427 | 1.06 |
| 1 | 8192 | No | 1003923 | 1025958 | 1.02 |
| 1 | 16384 | Yes | 1643782 | 1824973 | 1.11 |
| 1 | 16384 | No | 1436331 | 1465785 | 1.02 |
| 1 | 32768 | Yes | 2014538 | 2248111 | 1.12 |
| 1 | 32768 | No | 1807615 | 1965822 | 1.09 |
| 1 | 65536 | Yes | 2308737 | 2666944 | 1.16 |
| 1 | 65536 | No | 2114939 | 2320347 | 1.10 |
| 1 | 131072 | Yes | 2515197 | 2944289 | 1.17 |
| 1 | 131072 | No | 2343520 | 2638773 | 1.13 |
| 8 | 4096 | Yes | 5235260 | 5386308 | 1.03 |
| 8 | 4096 | No | 4474162 | 4129841 | 0.92 |
| 8 | 8192 | Yes | 8438479 | 8625083 | 1.02 |
| 8 | 8192 | No | 6984567 | 6977303 | 1.00 |
| 8 | 16384 | Yes | 12629246 | 12146912 | 0.96 |
| 8 | 16384 | No | 10398419 | 10872412 | 1.05 |
| 8 | 32768 | Yes | 14908125 | 17372751 | 1.17 |
| 8 | 32768 | No | 13896799 | 13704361 | 0.99 |
| 8 | 65536 | Yes | 17899139 | 19909649 | 1.11 |
| 8 | 65536 | No | 15933755 | 17760589 | 1.11 |

Evaluation performance: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the benchmark script (main.py --evaluate --benchmark) in the TensorFlow2 NGC container on an NVIDIA DGX-1 (8x V100 32GB).

| GPUs | Batch size / GPU | XLA | Throughput - FP32 (samples/s) | Throughput - AMP (samples/s) | Throughput speedup (AMP to FP32) |
|------|------------------|-----|-------------------------------|-------------------------------|-----------------------------------|
| 1 | 4096 | No | 311886 | 363685 | 1.17 |
| 1 | 8192 | No | 454822 | 639173 | 1.41 |
| 1 | 16384 | No | 594582 | 959301 | 1.61 |
| 1 | 32768 | No | 705038 | 1279068 | 1.81 |
| 1 | 65536 | No | 748398 | 1510412 | 2.02 |
| 1 | 131072 | No | 787982 | 1677366 | 2.13 |
| 8 | 4096 | No | 2210862 | 2548723 | 1.15 |
| 8 | 8192 | No | 3408621 | 4474287 | 1.31 |
| 8 | 16384 | No | 4368245 | 6518982 | 1.49 |
| 8 | 32768 | No | 5153906 | 8689990 | 1.69 |
| 8 | 65536 | No | 5393286 | 11071794 | 2.05 |

For more results, refer to the full table below.

Full tabular data for evaluation performance results for DGX-1 V100
| GPUs | Batch size / GPU | XLA | Throughput - FP32 (samples/s) | Throughput - AMP (samples/s) | Throughput speedup (AMP to FP32) |
|------|------------------|-----|-------------------------------|-------------------------------|-----------------------------------|
| 1 | 4096 | Yes | 349110 | 419470 | 1.20 |
| 1 | 4096 | No | 311886 | 363685 | 1.17 |
| 1 | 8192 | Yes | 495663 | 738806 | 1.49 |
| 1 | 8192 | No | 454822 | 639173 | 1.41 |
| 1 | 16384 | Yes | 641953 | 1112849 | 1.73 |
| 1 | 16384 | No | 594582 | 959301 | 1.61 |
| 1 | 32768 | Yes | 737395 | 1442387 | 1.96 |
| 1 | 32768 | No | 705038 | 1279068 | 1.81 |
| 1 | 65536 | Yes | 794009 | 1693861 | 2.13 |
| 1 | 65536 | No | 748398 | 1510412 | 2.02 |
| 1 | 131072 | Yes | 819904 | 1887338 | 2.30 |
| 1 | 131072 | No | 787982 | 1677366 | 2.13 |
| 8 | 4096 | Yes | 2505902 | 3165730 | 1.26 |
| 8 | 4096 | No | 2210862 | 2548723 | 1.15 |
| 8 | 8192 | Yes | 3759356 | 5289218 | 1.41 |
| 8 | 8192 | No | 3408621 | 4474287 | 1.31 |
| 8 | 16384 | Yes | 4686372 | 7551041 | 1.61 |
| 8 | 16384 | No | 4368245 | 6518982 | 1.49 |
| 8 | 32768 | Yes | 5398782 | 9615114 | 1.78 |
| 8 | 32768 | No | 5153906 | 8689990 | 1.69 |
| 8 | 65536 | Yes | 5642629 | 11907666 | 2.11 |
| 8 | 65536 | No | 5393286 | 11071794 | 2.05 |