
DLRM for PyTorch


Description

The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs.

Publisher

NVIDIA Deep Learning Examples

Use Case

Recommender

Framework

Other

Latest Version

21.10.0

Modified

November 4, 2022

Compressed Size

74.06 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance on a specific batch size, follow the instructions in the Quick Start Guide. You can also add the --max_steps 1000 --benchmark_warmup_steps 500 flags if you want to get a reliable throughput measurement without running the entire training.
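
For example, a shortened single-GPU throughput measurement could combine these flags with the training command used elsewhere in this document (a sketch; adjust the dataset path and precision flags to your setup):

    python -m dlrm.scripts.main --mode train --dataset /data \
        --max_steps 1000 --benchmark_warmup_steps 500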

You can create a synthetic dataset by running python -m dlrm.scripts.prepare_synthetic_dataset --synthetic_dataset_dir /tmp/dlrm_synthetic_data if you haven't yet downloaded the dataset.

Inference performance benchmark

To benchmark the inference performance on a specific batch size, run:

python -m dlrm.scripts.main --mode inference_benchmark --dataset /data

You can also create a synthetic dataset by running python -m dlrm.scripts.prepare_synthetic_dataset --synthetic_dataset_dir /tmp/dlrm_synthetic_data if you haven't yet downloaded the dataset.
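
If you use the synthetic route, a possible end-to-end flow is sketched below; it assumes the generated directory can be passed to the benchmark through the same --dataset flag (check the Quick Start Guide for the exact invocation):

    python -m dlrm.scripts.prepare_synthetic_dataset --synthetic_dataset_dir /tmp/dlrm_synthetic_data
    python -m dlrm.scripts.main --mode inference_benchmark --dataset /tmp/dlrm_synthetic_data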

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

We used three model size variants to show memory scalability in a multi-GPU setup:

| Model variant | Frequency threshold | Model size |
|---------------|---------------------|------------|
| small         | 15                  | 15 GB      |
| large         | 3                   | 82 GB      |
| xlarge        | 2                   | 142 GB     |

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the dlrm/scripts/main.py script as described in the Quick Start Guide in the DLRM Docker container using NVIDIA A100 80GB GPUs.

| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - TF32 | Accuracy (AUC) - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
|------|------------|------------------|-----------------------|----------------------------------|----------------------|---------------------------------|-------------------------------------------------|
| 8    | large      | 8k               | 0.802509              | 0.802528                         | 0:06:27              | 0:04:36                         | 1.40217                                         |
| 1    | small      | 64k              | 0.802537              | 0.802521                         | 0:24:26              | 0:17:47                         | 1.37395                                         |

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the dlrm/scripts/main.py script as described in the Quick Start Guide in the DLRM Docker container using NVIDIA V100 32GB GPUs.

| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision) |
|------|------------|------------------|-----------------------|----------------------------------|----------------------|---------------------------------|-------------------------------------------------|
| 8    | large      | 8k               | 0.802568              | 0.802562                         | 0:28:24              | 0:11:45                         | 2.41702                                         |
| 1    | small      | 64k              | 0.802784              | 0.802723                         | 1:58:10              | 0:38:17                         | 3.08663                                         |

Training accuracy plots

Models trained with FP32, TF32, and Automatic Mixed Precision (AMP) achieve similar accuracy.

The plots show the ROC AUC metric as a function of training steps (a step is a single batch) for the default precision (FP32 for the Volta architecture (DGX-1), TF32 for the Ampere architecture (DGX A100)) and for AMP, for all three datasets. All other training parameters are left at their defaults.


Figure 1. ROC AUC as a function of training steps for the FL3 dataset.


Figure 2. ROC AUC as a function of training steps for the FL15 dataset.

Training stability test

Training of the model is stable across multiple configurations, achieving a standard deviation of ROC AUC of about 10e-4. The model achieves similar ROC AUC scores across A100 and V100 GPUs and across training precisions. It was trained for one epoch (roughly 4 billion samples, 64014 batches), starting from 10 different initial random seeds for each setup. The training was performed in the pytorch:21.10-py3 NGC container with and without mixed precision enabled. The provided charts and numbers cover both single- and multi-GPU training. After training, the models were evaluated on the test set. The following plots compare the distributions of ROC AUC on the test set.


Figure 3. Training stability for the FL3 dataset: distribution of ROC AUC across different configurations. 'All configurations' refers to the distribution of ROC AUC over the Cartesian product of architecture and training precision.


Figure 4. Training stability for the FL15 dataset: distribution of ROC AUC across different configurations. 'All configurations' refers to the distribution of ROC AUC over the Cartesian product of architecture and training precision.

Impact of mixed precision on training accuracy

Training accuracy, measured as ROC AUC on the test set after the final epoch, was not impacted by enabling mixed precision. The obtained results were statistically similar. The similarity was measured according to the following procedure:

The model was trained 10 times with default settings (FP32 for the Volta architecture, TF32 for the Ampere architecture) and 10 times with AMP. After the last epoch, the ROC AUC score was calculated on the test set.
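
A minimal sketch of this sweep is shown below; it assumes the training script exposes a --seed flag (hypothetical here; check the script's help output for the actual option name):

    # Train 10 runs with the default precision and 10 runs with AMP (hypothetical --seed flag).
    for seed in $(seq 0 9); do
        python -m dlrm.scripts.main --mode train --dataset /data --seed $seed
        python -m dlrm.scripts.main --mode train --dataset /data --seed $seed --amp
    done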

Distributions for two hardware configurations (A100, V100) and two datasets are presented below.


Figure 5. Impact of AMP on the ROC AUC distribution for A100 and V100 GPUs for single- and multi-GPU training on a dataset with a frequency threshold of 3.


Figure 6. Impact of AMP on the ROC AUC distribution for A100 and V100 GPUs for single- and multi-GPU training on a dataset with a frequency threshold of 15.

The distributions of ROC AUC for single-precision training (TF32 for A100, FP32 for Volta) and for AMP training were compared in terms of mean, variance, and a Kolmogorov–Smirnov test to assess the statistical difference between single-precision and AMP results. Refer to the table below.

Full tabular data for AMP influence on AUC ROC:

| Hardware | Dataset | GPUs | Mean AUC ROC for full precision | Std AUC ROC for full precision | Mean AUC ROC for AMP | Std AUC ROC for AMP | KS test: statistic, p-value |
|----------|---------|------|---------------------------------|--------------------------------|----------------------|---------------------|------------------------------|
| DGX A100 | FL3     | 8    | 0.802681                        | 0.000073                       | 0.802646             | 0.000063            | ('0.400', '0.418')           |
| DGX-2    | FL3     | 16   | 0.802614                        | 0.000073                       | 0.802623             | 0.000122            | ('0.267', '0.787')           |

Sample size was set to 10 experiments for each training setup.

Training performance results

We used throughput in items processed per second as the performance metric.

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the following commands:

  • for single-GPU setup:
    python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs
    
  • for multi-GPU setup:
    python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
            bash  -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.main \
            --dataset /data --amp --cuda_graphs'
    

in the DLRM Docker container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in records per second) were averaged over an entire training epoch.

| GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) |
|------|------------|------------------|-------------------|------------------------------|----------------------------------------------|
| 8    | large      | 8k               | 11,400,000        | 16,500,000                   | 1.447                                        |
| 1    | small      | 64k              | 2,880,000         | 4,020,000                    | 1.396                                        |

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the following commands:

  • for single-GPU:
    python -m dlrm.scripts.main --mode train --dataset /data --amp --cuda_graphs
    
  • for multi-GPU:
    python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
            bash  -c './bind.sh  --cpu=exclusive -- python -m dlrm.scripts.main \
            --dataset /data --amp --cuda_graphs'
    

in the DLRM Docker container on NVIDIA DGX-1 (8x V100 32GB) GPUs. Performance numbers (in records per second) were averaged over an entire training epoch.

| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
|------|------------|------------------|-------------------|------------------------------|----------------------------------------------|
| 8    | large      | 8k               | 2,880,000         | 6,920,000                    | 2.403                                        |
| 1    | small      | 64k              | 672,000           | 2,090,000                    | 3.110                                        |

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-2 (16x V100 32GB)

Our results were obtained by running the following commands:

  • for single-GPU:
    python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs 
    
  • for multi-GPU:
    python -m torch.distributed.launch --no_python --use_env --nproc_per_node [8/16] \
            bash  -c './bind.sh  --cpu=exclusive -- python -m dlrm.scripts.main \
            --dataset /data --amp --cuda_graphs'
    
in the DLRM Docker container on NVIDIA DGX-2 (16x V100 32GB) GPUs. Performance numbers (in records per second) were averaged over an entire training epoch.

| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
|------|------------|------------------|-------------------|------------------------------|----------------------------------------------|
| 16   | large      | 4k               | 4,740,000         | 10,800,000                   | 2.278                                        |
| 8    | large      | 8k               | 3,330,000         | 7,930,000                    | 2.381                                        |
| 1    | small      | 64k              | 717,000           | 2,250,000                    | 3.138                                        |

To achieve these same results, follow the steps in the Quick Start Guide.
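
As a concrete instance of the multi-GPU command above, substituting 16 for the [8/16] placeholder launches training on all 16 GPUs of the DGX-2 (a sketch; binding options and paths are unchanged):

    python -m torch.distributed.launch --no_python --use_env --nproc_per_node 16 \
            bash -c './bind.sh --cpu=exclusive -- python -m dlrm.scripts.main \
            --dataset /data --amp --cuda_graphs'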

Inference performance results

Inference performance: NVIDIA A100 (1x A100 80GB)

Our results were obtained by running the inference benchmark (--mode inference_benchmark) in the DLRM Docker container on the NVIDIA A100 (1x A100 80GB) GPU.

| Batch size | Throughput Avg - mixed precision, CUDA Graphs ON | Latency Avg - mixed precision, CUDA Graphs ON | Throughput Avg - mixed precision, CUDA Graphs OFF | Latency Avg - mixed precision, CUDA Graphs OFF | Throughput Avg - TF32, CUDA Graphs ON | Latency Avg - TF32, CUDA Graphs ON | Throughput Avg - TF32, CUDA Graphs OFF | Latency Avg - TF32, CUDA Graphs OFF |
|---|---|---|---|---|---|---|---|---|
| 32768 | 14,796,024 | 0.00221 | 14,369,047 | 0.00228 | 8,832,225 | 0.00371 | 8,637,000 | 0.00379 |
| 16384 | 14,217,340 | 0.00115 | 13,673,623 | 0.00120 | 8,540,191 | 0.00192 | 8,386,694 | 0.00195 |
| 8192 | 12,769,583 | 0.00064 | 11,336,204 | 0.00072 | 7,658,459 | 0.00107 | 7,463,740 | 0.00110 |
| 4096 | 10,556,140 | 0.00039 | 8,203,285 | 0.00050 | 6,777,965 | 0.00060 | 6,142,076 | 0.00067 |
| 2048 | 8,415,889 | 0.00024 | 4,785,479 | 0.00043 | 5,214,990 | 0.00039 | 4,365,954 | 0.00047 |
| 1024 | 5,045,754 | 0.00020 | 2,357,953 | 0.00043 | 3,854,504 | 0.00027 | 2,615,601 | 0.00039 |
| 512 | 3,168,261 | 0.00016 | 1,190,989 | 0.00043 | 2,441,310 | 0.00021 | 1,332,944 | 0.00038 |
| 256 | 1,711,749 | 0.00015 | 542,310 | 0.00047 | 1,365,320 | 0.00019 | 592,034 | 0.00043 |
| 128 | 889,777 | 0.00014 | 274,223 | 0.00047 | 790,984 | 0.00016 | 300,908 | 0.00043 |
| 64 | 459,728 | 0.00014 | 136,180 | 0.00047 | 416,463 | 0.00015 | 150,382 | 0.00043 |
| 32 | 222,386 | 0.00014 | 70,107 | 0.00046 | 174,163 | 0.00018 | 75,768 | 0.00042 |
| 16 | 117,386 | 0.00014 | 34,983 | 0.00046 | 108,992 | 0.00015 | 38,369 | 0.00042 |
| 8 | 59,200 | 0.00014 | 18,852 | 0.00042 | 55,661 | 0.00014 | 19,440 | 0.00041 |
| 4 | 29,609 | 0.00014 | 8,505 | 0.00047 | 27,957 | 0.00014 | 10,206 | 0.00039 |
| 2 | 14,066 | 0.00014 | 4,610 | 0.00043 | 13,010 | 0.00015 | 5,229 | 0.00038 |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

Our results were obtained by running the inference benchmark (--mode inference_benchmark) in the DLRM Docker container on NVIDIA DGX-1 (1x V100 32GB) GPU.

| Batch size | Throughput Avg - mixed precision, CUDA Graphs ON | Latency Avg - mixed precision, CUDA Graphs ON | Throughput Avg - mixed precision, CUDA Graphs OFF | Latency Avg - mixed precision, CUDA Graphs OFF | Throughput Avg - FP32, CUDA Graphs ON | Latency Avg - FP32, CUDA Graphs ON | Throughput Avg - FP32, CUDA Graphs OFF | Latency Avg - FP32, CUDA Graphs OFF |
|---|---|---|---|---|---|---|---|---|
| 32768 | 6,716,240 | 0.00488 | 6,792,739 | 0.00482 | 1,809,345 | 0.01811 | 1,802,851 | 0.01818 |
| 16384 | 6,543,544 | 0.00250 | 6,520,519 | 0.00251 | 1,754,713 | 0.00934 | 1,745,214 | 0.00939 |
| 8192 | 6,215,194 | 0.00132 | 6,074,446 | 0.00135 | 1,669,188 | 0.00491 | 1,656,393 | 0.00495 |
| 4096 | 5,230,443 | 0.00078 | 4,901,451 | 0.00084 | 1,586,666 | 0.00258 | 1,574,068 | 0.00260 |
| 2048 | 4,261,124 | 0.00048 | 3,523,239 | 0.00058 | 1,462,006 | 0.00140 | 1,416,985 | 0.00145 |
| 1024 | 3,306,724 | 0.00031 | 2,047,274 | 0.00050 | 1,277,860 | 0.00080 | 1,161,032 | 0.00088 |
| 512 | 2,049,382 | 0.00025 | 1,005,919 | 0.00051 | 1,016,186 | 0.00050 | 841,732 | 0.00061 |
| 256 | 1,149,997 | 0.00022 | 511,102 | 0.00050 | 726,349 | 0.00035 | 485,162 | 0.00053 |
| 128 | 663,048 | 0.00019 | 264,015 | 0.00048 | 493,878 | 0.00026 | 238,936 | 0.00054 |
| 64 | 359,505 | 0.00018 | 132,913 | 0.00048 | 295,273 | 0.00022 | 124,120 | 0.00052 |
| 32 | 175,465 | 0.00018 | 64,287 | 0.00050 | 157,629 | 0.00020 | 63,919 | 0.00050 |
| 16 | 99,207 | 0.00016 | 31,062 | 0.00052 | 83,019 | 0.00019 | 34,660 | 0.00046 |
| 8 | 52,532 | 0.00015 | 16,492 | 0.00049 | 43,289 | 0.00018 | 17,893 | 0.00045 |
| 4 | 27,626 | 0.00014 | 8,391 | 0.00048 | 22,692 | 0.00018 | 8,923 | 0.00045 |
| 2 | 13,791 | 0.00015 | 4,146 | 0.00048 | 11,747 | 0.00017 | 4,487 | 0.00045 |

To achieve these same results, follow the steps in the Quick Start Guide.