Wide & Deep for TensorFlow1

Description: Wide & Deep Recommender model.
Publisher: NVIDIA Deep Learning Examples
Use Case: Recommender
Framework: Other
Latest Version: 20.10.0
Modified: November 4, 2022
Compressed Size: 27.83 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achievable with NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training mode.

Training performance benchmark

We provide eight scripts to benchmark training performance, one for each combination of platform (DGX A100 or DGX-1), precision (TF32/FP32 or AMP), and GPU count (1 or 8); a sketch of how the reported throughput is derived follows the list:

```bash
bash scripts/DGXA100_benchmark_training_tf32_1gpu.sh
bash scripts/DGXA100_benchmark_training_amp_1gpu.sh
bash scripts/DGXA100_benchmark_training_tf32_8gpu.sh
bash scripts/DGXA100_benchmark_training_amp_8gpu.sh
bash scripts/DGX1_benchmark_training_fp32_1gpu.sh
bash scripts/DGX1_benchmark_training_amp_1gpu.sh
bash scripts/DGX1_benchmark_training_fp32_8gpu.sh
bash scripts/DGX1_benchmark_training_amp_8gpu.sh
```
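
The throughput figures these scripts produce (reported in the Training performance results tables below) are expressed in samples per second. A minimal sketch of how such a number is derived; the step count and timing here are illustrative, not values taken from the scripts:

```python
def samples_per_second(num_steps: int, global_batch_size: int, elapsed_s: float) -> float:
    # Throughput = (training steps * samples per step) / wall-clock seconds.
    return num_steps * global_batch_size / elapsed_s

# Illustrative values only (not measured): 100 benchmark steps at the
# single-GPU batch size of 131,072 used below, taking 40 s of wall-clock time.
print(f"{samples_per_second(100, 131_072, 40.0):,.0f} samples/s")  # 327,680
```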

Results

The following sections provide details on how we achieved our performance and accuracy in training.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the trainer/task.py training script in the TensorFlow NGC container on an NVIDIA DGX A100 (8x A100 40GB).

| GPUs | Batch size / GPU | Accuracy - TF32 (MAP@12) | Accuracy - mixed precision (MAP@12) | Time to train - TF32 (minutes) | Time to train - mixed precision (minutes) | Time-to-train speedup (TF32 to mixed precision) |
|------|------------------|--------------------------|-------------------------------------|--------------------------------|-------------------------------------------|-------------------------------------------------|
| 1    | 131,072          | 0.67683                  | 0.67632                             | 341                            | 359                                       | -                                               |
| 8    | 16,384           | 0.67709                  | 0.67721                             | 93                             | 107                                       | -                                               |

To achieve the same results, follow the steps in the Quick Start Guide.
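
The MAP@12 metric reported throughout is mean average precision with a 12-item cutoff. A minimal, framework-agnostic sketch of the computation (the helper names and toy data are illustrative, not taken from the repository):

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=12):
    # Average precision over the top-k ranked candidates.
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            score += hits / rank          # precision at this hit
    return score / min(len(relevant_ids), k) if relevant_ids else 0.0

def map_at_12(predictions):
    # predictions: list of (ranked candidate ids, set of relevant ids).
    aps = [average_precision_at_k(r, rel, 12) for r, rel in predictions]
    return sum(aps) / len(aps)

# Illustrative: two impressions, one clicked item each.
print(map_at_12([
    (["a", "b", "c"], {"b"}),   # clicked item ranked 2nd -> AP = 0.5
    (["x", "y", "z"], {"x"}),   # clicked item ranked 1st -> AP = 1.0
]))                             # MAP@12 = 0.75
```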

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the trainer/task.py training script in the TensorFlow NGC container on an NVIDIA DGX-1 (8x V100 16GB).

| GPUs | Batch size / GPU | Accuracy - FP32 (MAP@12) | Accuracy - mixed precision (MAP@12) | Time to train - FP32 (minutes) | Time to train - mixed precision (minutes) | Time-to-train speedup (FP32 to mixed precision) |
|------|------------------|--------------------------|-------------------------------------|--------------------------------|-------------------------------------------|-------------------------------------------------|
| 1    | 131,072          | 0.67648                  | 0.67744                             | 654                            | 440                                       | 1.49                                            |
| 8    | 16,384           | 0.67692                  | 0.67725                             | 190                            | 185                                       | 1.03                                            |

To achieve the same results, follow the steps in the Quick Start Guide.

Training accuracy plots

Models trained with FP32, TF32, and Automatic Mixed Precision (AMP) achieve similar final MAP@12 accuracy.

[Figure: validation MAP@12 during training for FP32/TF32 and AMP runs]

Training stability test

The Wide & Deep model was trained for 54,713 training steps, starting from six different initial random seeds for each setup. Training was performed in the 20.10-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) and DGX-1 (8x V100 16GB) machines, with and without mixed precision enabled. After training, the models were evaluated on the validation set. The following table summarizes the final MAP@12 scores on the validation set.

| Setup                    | Average MAP@12 | Standard deviation | Minimum | Maximum |
|--------------------------|----------------|--------------------|---------|---------|
| DGX A100 TF32            | 0.67709        | 0.00094            | 0.67463 | 0.67813 |
| DGX A100 mixed precision | 0.67721        | 0.00048            | 0.67643 | 0.67783 |
| DGX-1 FP32               | 0.67692        | 0.00060            | 0.67587 | 0.67791 |
| DGX-1 mixed precision    | 0.67725        | 0.00064            | 0.67561 | 0.67803 |
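
The statistics in this table are plain aggregates over the six per-seed results. A minimal sketch, assuming hypothetical per-seed MAP@12 values (the individual measurements are not published here):

```python
from statistics import mean, stdev

# Hypothetical MAP@12 scores from six random seeds for one setup;
# the real per-seed values are not listed in this document.
scores = [0.67709, 0.67683, 0.67813, 0.67463, 0.67721, 0.67766]

print(f"average {mean(scores):.5f}  std {stdev(scores):.5f}  "
      f"min {min(scores):.5f}  max {max(scores):.5f}")
```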

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the benchmark scripts from the scripts directory in the TensorFlow NGC container on an NVIDIA DGX A100 (8x A100 40GB). Improved multi-GPU scaling for this model is under development.

| GPUs | Batch size / GPU | Throughput - TF32 (samples/s) | Throughput - mixed precision (samples/s) | Strong scaling - TF32 | Strong scaling - mixed precision |
|------|------------------|-------------------------------|------------------------------------------|-----------------------|----------------------------------|
| 1    | 131,072          | 349,879                       | 332,529                                  | 1.00                  | 1.00                             |
| 8    | 16,384           | 1,283,457                     | 1,111,976                                | 3.67                  | 3.34                             |
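
The strong-scaling columns are the ratio of 8-GPU to 1-GPU throughput (ideal scaling would be 8.00). Recomputing them from the table's own numbers:

```python
# DGX A100 throughput (samples/s) from the table above.
tf32_1, tf32_8 = 349_879, 1_283_457
amp_1,  amp_8  = 332_529, 1_111_976

print(f"TF32 strong scaling, 8 GPUs: {tf32_8 / tf32_1:.2f}")  # 3.67
print(f"AMP  strong scaling, 8 GPUs: {amp_8 / amp_1:.2f}")    # 3.34
```
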
Training performance: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the benchmark scripts from the scripts directory in the TensorFlow NGC container on an NVIDIA DGX-1 (8x V100 16GB). Improved multi-GPU scaling for this model is under development.

| GPUs | Batch size / GPU | Throughput - FP32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (FP32 to mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
|------|------------------|-------------------------------|------------------------------------------|----------------------------------------------|-----------------------|----------------------------------|
| 1    | 131,072          | 182,510                       | 271,366                                  | 1.49                                         | 1.00                  | 1.00                             |
| 8    | 16,384           | 626,301                       | 643,334                                  | 1.03                                         | 3.43                  | 2.37                             |
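
Likewise, the throughput speedup column is the ratio of mixed-precision to FP32 throughput at the same GPU count. Recomputing it from the table:

```python
# DGX-1 throughput (samples/s) from the table above.
print(f"1 GPU : {271_366 / 182_510:.2f}x")   # 1.49
print(f"8 GPUs: {643_334 / 626_301:.2f}x")   # 1.03
```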