The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs. It was first described in [Deep Learning Recommendation Model for Personalization and Recommendation Systems](https://arxiv.org/abs/1906.00091). This repository provides a reimplementation of the code-base provided originally [here](https://github.com/facebookresearch/dlrm). The scripts enable you to train DLRM on the [Criteo Terabyte Dataset](https://labs.criteo.com/2013/12/download-terabyte-click-logs/). Using the scripts provided here, you can efficiently train models that are too large to fit into a single GPU. This is because we use a hybrid-parallel approach, which combines model parallelism with data parallelism for different parts of the neural network. This is explained in details in the [next section](#hybrid-parallel-multi-gpu-with-all-2-all-communication). This model uses a slightly different preprocessing procedure than the one found in the original implementation. You can find a detailed description of the preprocessing steps in the [Dataset guidelines](#dataset-guidelines) section. Using DLRM, you can train a high-quality general model for recommendations. This model is trained with mixed precision using Tensor Cores on Volta, Turing and NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2x faster than training without Tensor Cores while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. ### Model architecture DLRM accepts two types of features: categorical and numerical. For each categorical feature, an embedding table is used to provide dense representation to each unique value. The dense features enter the model and are transformed by a simple neural network referred to as "bottom MLP". This part of the network consists of a series of linear layers with ReLU activations. The output of the bottom MLP and the embedding vectors are then fed into the "dot interaction" operation. The output of "dot interaction" is then concatenated with the features resulting from bottom MLP and fed into the "top MLP" which is a series of dense layers with activations. The model outputs a single number which can be interpreted as a likelihood of a certain user clicking an ad.

Figure 1. The architecture of DLRM.

### Default configuration The following features were implemented in this model: - general - static loss scaling for Tensor Cores (mixed precision) training - hybrid-parallel multi-GPU training - preprocessing - dataset preprocessing using Spark 3 on GPUs ### Feature support matrix The following features are supported by this model: | Feature | DLRM |----------------------|-------------------------- |Automatic mixed precision (AMP) | Yes |XLA | Yes |Hybrid-parallel multiGPU with Horovod all-to-all| Yes |Preprocessing on GPU with Spark 3| Yes |Multi-node training | Yes #### Features **Automatic Mixed Precision (AMP)** Enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environmental variable. **XLA** The training script supports a `--xla` flag. It can be used to enable XLA JIT compilation. Currently, we use [XLA Lite](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#xla-lite). It delivers a steady 10-30% performance boost depending on your hardware platform, precision, and the number of GPUs. It is turned off by default. **Horovod** Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the Horovod [official repository](https://github.com/horovod/horovod). **Hybrid-parallel multiGPU with Horovod all-to-all** Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the TensorFlow tutorial. For the detailed description of our multi-GPU approach, visit this [section](hybrid-parallel-multi-gpu-with-all-2-all-communication). **Multi-node training** This repository supports multinode training. For more information refer to the [multinode section](#multi-node-training) ### Mixed precision training Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3.4x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps: 1. Porting the model to use the FP16 data type where appropriate. 2. Adding loss scaling to preserve small gradient values. The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK. For information about: - How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation. - Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog. #### Enabling mixed precision Mixed precision training is turned off by default. To turn it on, issue the `--amp` flag to the `main.py` script. #### Enabling TF32 TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations. For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post. TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default. ### Hybrid-parallel multi-GPU with all-2-all communication Many recommendation models contain very large embedding tables. As a result, the model is often too large to fit onto a single device. This could be easily solved by training in a model-parallel way, using either the CPU or other GPUs as "memory donors". However, this approach is suboptimal as the "memory donor" devices' compute is not utilized. In this repository, we use the model-parallel approach for the bottom part of the model (Embedding Tables + bottom MLP) while using a usual data parallel approach for the top part of the model (Dot Interaction + top MLP). This way, we can train models much larger than what would normally fit into a single GPU while at the same time making the training faster by using multiple GPUs. We call this approach hybrid-parallel training. The transition from model-parallel to data-parallel in the middle of the neural net needs a specific multi-GPU communication pattern called [all-2-all](https://en.wikipedia.org/wiki/All-to-all_\(parallel_pattern\)) which is available in our [TensorFlow 2 21.02-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow/tags) NGC Docker container. In the [original DLRM whitepaper](https://arxiv.org/abs/1906.00091) this has been referred to as "butterfly shuffle".

Figure 2. The default multi-GPU mode.

In this repository we train models of three sizes: "small" (15.6 GiB), "large" (84.9 GiB) and "extra large" (421 GiB). The "large" and "extra large" models need multiple GPUs as they will not fit into a single GPU memory. #### Embedding table placement and load balancing (default mode) By default, we use the following heuristic for dividing the work between the GPUs: - The bottom MLP is placed on GPU-0 and no embedding tables are placed on this device. - The tables are sorted from the largest to the smallest. - Set `max_tables_per_gpu := ceil(number_of_embedding_tables / number_of_available_gpus)`. - Repeat until all embedding tables have an assigned device: - Out of all the available GPUs, find the one with the largest amount of unallocated memory. - Place the largest unassigned embedding table on this GPU. Raise an exception if it does not fit. - If the number of embedding tables on this GPU is now equal to `max_tables_per_gpu`, remove this GPU from the list of available GPUs, so that no more embedding tables will be placed on this GPU. This ensures the all-2-all communication is well-balanced between all devices. #### Training very large embedding tables The default multi-GPU paradigm described above has a constraint – each individual table has to fit into a single device's memory. If that's not met, then an Out-of-Memory error will be raised. To enable experimentation with very large models, we provide a way of circumventing this constraint by passing the `--columnwise_split --data_parallel_bottom_mlp` command-line flags. As a result, each table will be split across the latent space dimension. Some dimensions of the latent space will be placed on one GPU and the rest of them are stored on other GPUs. This means that a table that originally encoded C unique categories into D dense dimensions will now become N separate tables of shape `[C, D / N]` each stored on a different GPU, where N is the number of GPUs used. Symbolically, the computations are exactly equivalent. The figure below illustrates this paradigm for a model with 2 embedding tables distributed across two GPUs. Note that this approach is currently slower than the default mode described above.

Figure 3. The "columnwise split" multi-GPU mode.

We tested this approach by training a DLRM model on the Criteo Terabyte dataset with the frequency limiting option turned off (set to zero). The weights of the resulting model take 421 GiB. The largest table weighs 140 GiB. Here are the commands you can use to reproduce this: ``` # build and run the preprocessing container as in the Quick Start Guide # then when preprocessing set the frequency limit to 0: ./prepare_dataset.sh DGX2 0 # build and run the training container same as in the Quick Start Guide # then append options necessary for training very large embedding tables: horovodrun -np 8 -H localhost:8 --mpi-args=--oversubscribe numactl --interleave=all -- python -u main.py --dataset_path /data/dlrm/ --amp --tf_gpu_memory_limit_gb 72 --columnwise_split --data_parallel_bottom_mlp --xla ``` When using this method on a DGX A100 with 8 A100-80GB GPUs and a large-enough dataset, it is possible to train a single embedding table of up to 600 GB. You can also use multi-node training (described below) to train even larger recommender systems. This mode was used to train the 421GiB "extra large" model in [the DGX A100-80G performance section](#training-performance-nvidia-dgx-a100-8x-a100-80gb). #### Multi-node training Multi-node training is supported. Depending on the exact interconnect hardware and model configuration, you might experience only a modest speedup with multi-node. Multi-node training can also be used to train larger models. For example, to train a 1.68 TB variant of DLRM on multi-node, you can run: ``` cmd='numactl --interleave=all -- python -u main.py --dataset_path /data/dlrm/full_criteo_data --amp \ --tf_gpu_memory_limit_gb 72 --columnwise_split --data_parallel_bottom_mlp \ --embedding_dim 512 --bottom_mlp_dims 512,256,512' \ srun_flags='--mpi=pmix' \ cont=nvidia_dlrm_tf \ mounts=/data/dlrm:/data/dlrm \ sbatch -n 32 -N 4 -t 00:20:00 slurm_multinode.sh ``` ### Preprocessing on GPU with Spark 3 Refer to the ["Preprocessing with Spark" section](#preprocess-with-spark) for a detailed description of the Spark 3 GPU functionality. ### BYO dataset functionality overview This section describes how you can train the DeepLearningExamples RecSys models on your own datasets without changing the model or data loader and with similar performance to the one published in each repository. This can be achieved thanks to Dataset Feature Specification, which describes how the dataset, data loader and model interact with each other during training, inference and evaluation. Dataset Feature Specification has a consistent format across all recommendation models in NVIDIA's DeepLearningExamples repository, regardless of dataset file type and the data loader, giving you the flexibility to train RecSys models on your own datasets. - [Glossary](#glossary) - [Dataset Feature Specification](#dataset-feature-specification) - [Data Flow in Recommendation Models in DeepLearning examples](#data-flow-in-nvidia-deep-learning-examples-recommendation-models) - [Example of Dataset Feature Specification](#example-of-dataset-feature-specification) - [BYO dataset functionality](#byo-dataset-functionality) #### Glossary The Dataset Feature Specification consists of three mandatory and one optional section: feature_spec provides a base of features that may be referenced in other sections, along with their metadata. Format: dictionary (feature name) => (metadata name => metadata value)
source_spec provides information necessary to extract features from the files that store them. Format: dictionary (mapping name) => (list of chunks)
* Mappings are used to represent different versions of the dataset (think: train/validation/test, k-fold splits). A mapping is a list of chunks.
* Chunks are subsets of features that are grouped together for saving. For example, some formats may constrain data saved in one file to a single data type. In that case, each data type would correspond to at least one chunk. Another example where this might be used is to reduce file size and enable more parallel loading. Chunk description is a dictionary of three keys:
* type provides information about the format in which the data is stored. Not all formats are supported by all models.
* features is a list of features that are saved in a given chunk. Order of this list may matter: for some formats, it is crucial for assigning read data to the proper feature.
* files is a list of paths to files where the data is saved. For Feature Specification in yaml format, these paths are assumed to be relative to the yaml file's directory (basename). Order of this list matters: It is assumed that rows 1 to i appear in the first file, rows i+1 to j in the next one, etc.
channel_spec determines how features are used. It is a mapping (channel name) => (list of feature names). Channels are model specific magic constants. In general, data within a channel is processed using the same logic. Example channels: model output (labels), categorical ids, numerical inputs, user data, and item data. metadata is a catch-all, wildcard section: If there is some information about the saved dataset that does not fit into the other sections, you can store it here. #### Dataset feature specification Data flow can be described abstractly: Input data consists of a list of rows. Each row has the same number of columns; each column represents a feature. The columns are retrieved from the input files, loaded, aggregated into channels and supplied to the model/training script. FeatureSpec contains metadata to configure this process and can be divided into three parts: * Specification of how data is organized on disk (source_spec). It describes which feature (from feature_spec) is stored in which file and how files are organized on disk. * Specification of features (feature_spec). Describes a dictionary of features, where key is feature name and values are features' characteristics such as dtype and other metadata (for example, cardinalities for categorical features) * Specification of model's inputs and outputs (channel_spec). Describes a dictionary of model's inputs where keys specify model channel's names and values specify lists of features to be loaded into that channel. Model's channels are groups of data streams to which common model logic is applied, for example categorical/continuous data, user/item ids. Required/available channels depend on the model The FeatureSpec is a common form of description regardless of underlying dataset format, dataset data loader form and model. #### Data flow in NVIDIA Deep Learning Examples recommendation models The typical data flow is as follows: * S.0. Original dataset is downloaded to a specific folder. * S.1. Original dataset is preprocessed into Intermediary Format. For each model, the preprocessing is done differently, using different tools. The Intermediary Format also varies (for example, for DLRM PyTorch, the Intermediary Format is a custom binary one.) * S.2. The Preprocessing Step outputs Intermediary Format with dataset split into training and validation/testing parts along with the Dataset Feature Specification yaml file. Metadata in the preprocessing step is automatically calculated. * S.3. Intermediary Format data together with Dataset Feature Specification are fed into training/evaluation scripts. Data loader reads Intermediary Format and feeds the data into the model according to the description in the Dataset Feature Specification. * S.4. The model is trained and evaluated

Fig.1. Data flow in Recommender models in NVIDIA Deep Learning Examples repository. Channels of the model are drawn in green.

#### Example of dataset feature specification As an example, let's consider a Dataset Feature Specification for a small CSV dataset for some abstract model. ```yaml feature_spec: user_gender: dtype: torch.int8 cardinality: 3 #M,F,Other user_age: #treated as numeric value dtype: torch.int8 user_id: dtype: torch.int32 cardinality: 2655 item_id: dtype: torch.int32 cardinality: 856 label: dtype: torch.float32 source_spec: train: - type: csv features: - user_gender - user_age files: - train_data_0_0.csv - train_data_0_1.csv - type: csv features: - user_id - item_id - label files: - train_data_1.csv test: - type: csv features: - user_id - item_id - label - user_gender - user_age files: - test_data.csv channel_spec: numeric_inputs: - user_age categorical_user_inputs: - user_gender - user_id categorical_item_inputs: - item_id label_ch: - label ``` The data contains five features: (user_gender, user_age, user_id, item_id, label). Their data types and necessary metadata are described in the feature specification section. In the source mapping section, two mappings are provided: one describes the layout of the training data, the other of the testing data. The layout for training data has been chosen arbitrarily to showcase the flexibility. The train mapping consists of two chunks. The first one contains user_gender and user_age, saved as a CSV, and is further broken down into two files. For specifics of the layout, refer to the following example and consult the glossary. The second chunk contains the remaining columns and is saved in a single file. Notice that the order of columns is different in the second chunk - this is alright, as long as the order matches the order in that file (that is, columns in the .csv are also switched) Let's break down the train source mapping. The table contains example data color-paired to the files containing it.

The channel spec describes how the data will be consumed. Four streams will be produced and available to the script/model. The feature specification does not specify what happens further: names of these streams are only lookup constants defined by the model/script. Based on this example, we can speculate that the model has three input channels: numeric_inputs, categorical_user_inputs, categorical_item_inputs, and one output channel: label. Feature names are internal to the FeatureSpec and can be freely modified. #### BYO dataset functionality In order to train any Recommendation model in NVIDIA Deep Learning Examples one can follow one of three possible ways: * One delivers already preprocessed dataset in the Intermediary Format supported by data loader used by the training script (different models use different data loaders) together with FeatureSpec yaml file describing at least specification of dataset, features and model channels * One uses a transcoding script * One delivers dataset in non-preprocessed form and uses preprocessing scripts that are a part of the model repository. In order to use already existing preprocessing scripts, the format of the dataset needs to match the one of the original datasets. This way, the FeatureSpec file will be generated automatically, but the user will have the same preprocessing as in the original model repository.