
DLRM for TensorFlow2


Description

The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs.

Publisher: NVIDIA Deep Learning Examples

Latest Version: 22.06.0

Modified: April 4, 2023

Compressed Size: 57.49 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance on a specific batch size, follow the instructions in the Quick Start Guide. You can also add the --max_steps 1000 option if you want a reliable throughput measurement without running the entire training.

If you haven't downloaded the dataset yet, you can run on synthetic data instead by adding the --dataset_type synthetic option.
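
For example, a single-GPU training throughput benchmark on synthetic data could be launched roughly as follows. This is a sketch modeled on the inference command shown below; the --mode train value and the exact launcher arguments are assumptions and may need adjusting for your setup:

horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe \
    numactl --interleave=all -- \
    python -u main.py \
        --dataset_type synthetic \
        --amp \
        --max_steps 1000 \
        --mode train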

Inference performance benchmark

To benchmark the inference performance on a specific batch size, run:

horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe \
    numactl --interleave=all -- \
    python -u main.py \
        --dataset_path /data/dlrm/ \
        --amp \
        --restore_checkpoint_path <checkpoint_path> \
        --mode inference
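
Here <checkpoint_path> should point to a checkpoint produced by a prior training run, and --amp enables mixed precision; omitting --amp is expected to yield the TF32/FP32 figures reported below.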

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

We used three model size variants to show memory scalability in a multi-GPU setup:

| Name | Dataset | Number of parameters | Model size |
|-------------|-------------------|----------------------|------------|
| small | Criteo 1TB, FL=15 | 4.2B | 15.6 GiB |
| large | Criteo 1TB, FL=3 | 22.8B | 84.9 GiB |
| extra large | Criteo 1TB, FL=0 | 113B | 421 GiB |

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running training scripts as described in the Quick Start Guide in the DLRM Docker container.

| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - TF32 | Accuracy (AUC) - mixed precision | Time to train - TF32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (TF32 to mixed precision) |
|------|------------|------------------|-----------------------|----------------------------------|--------------------------------|-------------------------------------------|--------------------------------------------------|
| 1 | small | 64k | 0.8025 | 0.8025 | 26.75 | 16.27 | 1.64 |
| 8 | large | 8k | 0.8027 | 0.8026 | 8.77 | 6.57 | 1.33 |
| 8 | extra large | 8k | 0.8026 | 0.8026 | 10.47 | 9.08 | 1.15 |

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running training scripts as described in the Quick Start Guide in the DLRM Docker container.

| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (FP32 to mixed precision) |
|------|------------|------------------|-----------------------|----------------------------------|--------------------------------|-------------------------------------------|--------------------------------------------------|
| 1 | small | 64k | 0.8027 | 0.8025 | 109.63 | 34.83 | 3.15 |
| 8 | large | 8k | 0.8028 | 0.8026 | 26.01 | 13.73 | 1.89 |

Training accuracy: NVIDIA DGX-2 (16x V100 32GB)

Our results were obtained by running training scripts as described in the Quick Start Guide in the DLRM Docker container.

| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (FP32 to mixed precision) |
|------|------------|------------------|-----------------------|----------------------------------|--------------------------------|-------------------------------------------|--------------------------------------------------|
| 1 | small | 64k | 0.8026 | 0.8026 | 105.13 | 33.37 | 3.15 |
| 8 | large | 8k | 0.8027 | 0.8027 | 21.21 | 11.43 | 1.86 |
| 16 | large | 4k | 0.8025 | 0.8026 | 15.52 | 10.88 | 1.43 |

Training stability test

The histograms below show the distribution of ROC AUC results achieved at the end of training for each precision/hardware platform tested. There are no statistically significant differences between precisions, numbers of GPUs, or hardware platforms. Using the larger dataset has a modest, positive impact on the final AUC score.


Figure 4. Results of stability tests for DLRM.

Training performance results

We used throughput in items processed per second as the performance metric.
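
The speedup columns report the ratio of the two throughput numbers in a row; for example, in the first DGX A100 row below, 4.47M items/s with mixed precision versus 2.68M items/s with TF32 gives a speedup of 4.47 / 2.68 ≈ 1.67.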

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by following the commands from the Quick Start Guide in the DLRM Docker container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in items per second) were averaged over 1000 training steps.

| GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) |
|------|------------|------------------|-------------------|------------------------------|----------------------------------------------|
| 1 | small | 64k | 2.68M | 4.47M | 1.67 |
| 8 | large | 8k | 9.39M | 13.31M | 1.42 |
| 8 | extra large | 8k | 9.93M | 12.1M | 1.22 |

To achieve these results, follow the steps in the Quick Start Guide.

Training performance: comparison with CPU for the "extra large" model

For the "extra large" model (113B parameters) we also obtained CPU results for comparison using the same source code (using the --cpu command line flag for the CPU-only experiments).

We compare three hardware setups:

  • CPU only,
  • a single GPU that uses CPU memory for the largest embedding tables,
  • Hybrid-Parallel using the full DGX A100-80GB

| Hardware | Throughput [samples / second] | Speedup over CPU |
|----------|-------------------------------|------------------|
| 2xAMD EPYC 7742 | 17.7k | 1x |
| 1xA100-80GB + 2xAMD EPYC 7742 (large embeddings on CPU) | 768k | 43x |
| DGX A100 (8xA100-80GB) (hybrid parallel) | 12.1M | 683x |

Training performance: NVIDIA DGX-1 (8x V100 32GB)

| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
|------|------------|------------------|-------------------|------------------------------|----------------------------------------------|
| 1 | small | 64k | 0.648M | 2.06M | 3.18 |
| 8 | large | 8k | 2.9M | 5.89M | 2.03 |

To achieve the same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-2 (16x V100 32GB)

| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
|------|------------|------------------|-------------------|------------------------------|----------------------------------------------|
| 1 | small | 64k | 0.675M | 2.16M | 3.2 |
| 8 | large | 8k | 3.75M | 7.72M | 2.06 |
| 16 | large | 4k | 5.74M | 9.39M | 1.64 |

To achieve the same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (8x A100 80GB)

| GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Average latency - TF32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to TF32) |
|------|------------|------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|----------------------------------------------|
| 1 | small | 2048 | 1.43M | 1.54M | 1.48 | 1.33 | 1.08 |

Inference performance: NVIDIA DGX1V-32GB (8x V100 32GB)

| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Average latency - FP32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to FP32) |
|------|------------|------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|----------------------------------------------|
| 1 | small | 2048 | 0.765M | 1.05M | 2.90 | 1.95 | 1.37 |

Inference performance: NVIDIA DGX2 (16x V100 16GB)

| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Average latency - FP32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to FP32) |
|------|------------|------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|----------------------------------------------|
| 1 | small | 2048 | 1.03M | 1.37M | 2.10 | 1.63 | 1.53 |