This resource uses open-source code maintained on GitHub (see the Quick Start Guide section) and is available for download from NGC.
Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, data scientists are increasingly turning from traditional machine learning methods to highly expressive deep learning models to improve the quality of their recommendations.
Google's Wide & Deep Learning for Recommender Systems has emerged as a popular model for Click Through Rate (CTR) prediction tasks thanks to its power of generalization (deep part) and memorization (wide part). The difference between this Wide & Deep Recommender Model and the model from the paper is the size of the deep part of the model. Originally, in Google's paper, the fully connected part was three layers of 1,024, 512, and 256 neurons. Our model consists of five layers, each of 1,024 neurons.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and NVIDIA Ampere GPU architectures, so researchers can get results up to 3.5 times faster than training without Tensor Cores while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
Wide & Deep refers to a class of networks that use the output of two parts working in parallel - wide model and deep model - to make a binary prediction of CTR. The wide model is a linear model of features together with their transforms. The deep model is a series of five hidden MLP layers of 1,024 neurons. The model can handle both numerical continuous features as well as categorical features represented as dense embeddings. The architecture of the model is presented in Figure 1.
Figure 1. The architecture of the Wide & Deep model.
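To make the wide/deep split concrete, the following is a minimal Keras sketch of the idea. The layer sizes, vocabulary size, and feature names are illustrative assumptions, not the repository's actual model code.

```python
import tensorflow as tf

# Illustrative sizes only; the real model derives these from the dataset spec.
NUM_NUMERIC = 16      # number of continuous features
VOCAB_SIZE = 10000    # categorical vocabulary (one shared table for brevity)
EMBEDDING_DIM = 64

numeric_in = tf.keras.Input(shape=(NUM_NUMERIC,), name="numeric")
categorical_in = tf.keras.Input(shape=(1,), dtype=tf.int32, name="categorical")

# Deep part: dense embeddings + five hidden layers of 1,024 neurons each.
embedded = tf.keras.layers.Flatten()(
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(categorical_in))
deep = tf.keras.layers.Concatenate()([numeric_in, embedded])
for _ in range(5):
    deep = tf.keras.layers.Dense(1024, activation="relu")(deep)
deep_out = tf.keras.layers.Dense(1, use_bias=False)(deep)

# Wide part: a linear model over the features.
wide_embedded = tf.keras.layers.Flatten()(
    tf.keras.layers.Embedding(VOCAB_SIZE, 1)(categorical_in))  # linear weights
wide_out = tf.keras.layers.Dense(1, use_bias=False)(
    tf.keras.layers.Concatenate()([numeric_in, wide_embedded]))

# Sum the two branches and squash to a click probability.
prob = tf.keras.layers.Activation("sigmoid")(
    tf.keras.layers.Add()([wide_out, deep_out]))
model = tf.keras.Model([numeric_in, categorical_in], prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
```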
As a reference dataset, we used a subset of the features engineered by the 19th place finisher in the Kaggle Outbrain Click Prediction Challenge. This competition challenged competitors to predict the likelihood that a particular ad on a website's display would be clicked. Competitors were given information about the user, display, document, and ad in order to train their models. More information can be found here.
The Outbrain Dataset is preprocessed to produce the features that are input to the model. To give context to the acceleration numbers described below, some important properties of our features and model are as follows.
Features:
- Request Level:
  - scalar numeric features (dtype=float32)
  - one-hot categorical features (dtype=int32)
  - multi-hot categorical features (dtype=int32)
- Item Level:
  - scalar numeric features (dtype=float32)
  - categorical features (dtype=int32)
Features describe both the user (Request Level features) and the item (Item Level features). The model output y is the probability of a click given the Request Level and Item Level features. For more information about feature preprocessing, go to Dataset preprocessing.
Model accuracy is defined with the MAP@12 metric, the same metric used to assess model accuracy in the original Kaggle Outbrain Click Prediction Challenge. In this repository, the leaked clicked ads are not taken into account, since in an industrial setup data scientists do not have access to leaked information when training the model. For more information about the data leak in the Kaggle Outbrain Click Prediction Challenge, visit this blogpost by the 19th place finisher in that competition.
The training and evaluation script also reports binary cross-entropy (BCE) loss values.
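For reference, MAP@12 averages the precision-at-hit over all displays; because each display in the Outbrain data has a single clicked ad, the per-display average precision reduces to a reciprocal-rank term. A plain NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def average_precision_at_k(ranked_ad_ids, clicked_ad_id, k=12):
    """AP@k for one display; with a single clicked ad this is 1/rank of the hit."""
    top_k = list(ranked_ad_ids)[:k]
    if clicked_ad_id in top_k:
        return 1.0 / (top_k.index(clicked_ad_id) + 1)
    return 0.0

# MAP@12 is the mean of AP@12 over all displays in the evaluation set.
displays = [(["ad3", "ad1", "ad7"], "ad1"),   # hit at rank 2 -> 0.5
            (["ad2", "ad5"], "ad9")]          # miss -> 0.0
print(np.mean([average_precision_at_k(r, c) for r, c in displays]))  # 0.25
```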
This model supports the following features:
| Feature | Wide & Deep |
|---------|-------------|
| Horovod Multi-GPU (NCCL) | Yes |
| Accelerated Linear Algebra (XLA) | Yes |
| Automatic mixed precision (AMP) | Yes |
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, refer to the official Horovod repository.
Multi-GPU training with Horovod: Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, refer to the example sources in this repository or to the TensorFlow tutorial.
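A typical Horovod setup for TensorFlow 2 looks roughly as follows. This is a generic sketch (the model and dataset are assumed to be defined elsewhere), not the exact code used by this repository.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across GPUs with NCCL allreduce.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
model.compile(optimizer=optimizer, loss="binary_crossentropy")

# Broadcast the initial variables from rank 0 so all workers start in sync.
model.fit(
    dataset,
    epochs=1,
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,  # log only from one worker
)
```

Such a script would be launched with, for example, `horovodrun -np 8 python train.py`.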
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. Enabling XLA results in improvements to speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled. For more information on XLA, visit the official XLA page.
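For illustration, in TensorFlow 2 a function can be JIT-compiled with XLA via the jit_compile flag. This generic sketch is not this repository's exact mechanism:

```python
import tensorflow as tf

# jit_compile=True asks XLA to fuse the ops in this function into
# optimized kernels; the first call triggers compilation.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([128, 512])
w = tf.random.normal([512, 1024])
b = tf.zeros([1024])
y = dense_relu(x, w, b)
```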
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For more information:
For information on the influence of mixed precision training on model accuracy in training and inference, go to Training accuracy results.
To enable mixed precision for Wide & Deep training, add the --amp flag to the training script. Refer to the Quick Start Guide for more information.
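Under the hood, the same idea is exposed in Keras through a global mixed precision policy plus loss scaling. The following generic sketch illustrates the mechanism and is not this repository's implementation:

```python
import tensorflow as tf

# Compute in float16 on Tensor Cores while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1),
    # Keep the final activation in float32 for numerical stability.
    tf.keras.layers.Activation("sigmoid", dtype="float32"),
])

# Loss scaling preserves small gradient values that would underflow in float16.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss="binary_crossentropy")
```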
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models requiring high dynamic ranges for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
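If you need to compare results against full FP32 math, TensorFlow exposes a public switch for TF32 execution; a minimal sketch:

```python
import tensorflow as tf

# TF32 is on by default on Ampere; disable it to force full-precision FP32 math.
tf.config.experimental.enable_tensor_float_32_execution(False)
print(tf.config.experimental.tensor_float_32_execution_enabled())  # False
```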
Request level features: Features that describe the person and the context for which we wish to make recommendations.
Item level features: Features that describe the objects we are considering recommending.
This section describes how you can train the DeepLearningExamples RecSys models on your own datasets without changing the model or data loader and with similar performance to the one published in each repository. This can be achieved thanks to Dataset Feature Specification, which describes how the dataset, data loader and model interact with each other during training, inference and evaluation. Dataset Feature Specification has a consistent format across all recommendation models in NVIDIA's DeepLearningExamples repository, regardless of dataset file type and the data loader, giving you the flexibility to train RecSys models on your own datasets.
The Dataset Feature Specification consists of three mandatory and one optional section:
- feature_spec provides a base of features that may be referenced in other sections, along with their metadata. Format: dictionary (feature name) => (metadata name => metadata value)
- source_spec provides information necessary to extract features from the files that store them. Format: dictionary (mapping name) => (list of chunks)
- channel_spec determines how features are used. It is a mapping (channel name) => (list of feature names). Channels are model-specific magic constants. In general, data within a channel is processed using the same logic. Example channels: model output (labels), categorical ids, numerical inputs, user data, and item data.
- metadata is a catch-all, wildcard section: if there is some information about the saved dataset that does not fit into the other sections, you can store it here.
Data flow can be described abstractly: Input data consists of a list of rows. Each row has the same number of columns; each column represents a feature. The columns are retrieved from the input files, loaded, aggregated into channels, and supplied to the model/training script.
FeatureSpec contains metadata to configure this process and can be divided into three parts:
- Specification of how the data is organized on disk (source_spec). It describes which feature (from feature_spec) is stored in which file and how files are organized on disk.
- Specification of features (feature_spec). It describes a dictionary of features, where the key is the feature name and the value holds the feature's characteristics, such as dtype and other metadata (for example, cardinalities for categorical features).
- Specification of the model's inputs and outputs (channel_spec). It describes a dictionary of the model's input channels, where keys are channel names and values are lists of features to be loaded into that channel. A model's channels are groups of data streams to which common model logic is applied, for example categorical/continuous data or user/item ids. The required and available channels depend on the model.
The FeatureSpec is a common form of description regardless of underlying dataset format, dataset data loader form, and model.
The typical data flow is shown in the figure below.
Fig.1. Data flow in Recommender models in NVIDIA Deep Learning Examples repository. Channels of the model are drawn in green.
For example, let's consider a Dataset Feature Specification for a small CSV dataset for some abstract model.
```yaml
feature_spec:
  user_gender:
    dtype: torch.int8
    cardinality: 3 #M,F,Other
  user_age: #treated as numeric value
    dtype: torch.int8
  user_id:
    dtype: torch.int32
    cardinality: 2655
  item_id:
    dtype: torch.int32
    cardinality: 856
  label:
    dtype: torch.float32

source_spec:
  train:
    - type: csv
      features:
        - user_gender
        - user_age
      files:
        - train_data_0_0.csv
        - train_data_0_1.csv
    - type: csv
      features:
        - user_id
        - item_id
        - label
      files:
        - train_data_1.csv
  test:
    - type: csv
      features:
        - user_id
        - item_id
        - label
        - user_gender
        - user_age
      files:
        - test_data.csv

channel_spec:
  numeric_inputs:
    - user_age
  categorical_user_inputs:
    - user_gender
    - user_id
  categorical_item_inputs:
    - item_id
  label_ch:
    - label
```
The data contains five features: (user_gender, user_age, user_id, item_id, label). Their data types and necessary metadata are described in the feature specification section.
In the source mapping section, two mappings are provided: one describes the layout of the training data, and the other of the testing data. The layout for the training data has been chosen arbitrarily to showcase the flexibility. The train mapping consists of two chunks. The first one contains user_gender and user_age, saved as a CSV and further broken down into two files. For specifics of the layout, refer to the example above and consult the glossary. The second chunk contains the remaining columns and is saved in a single file. Notice that the order of columns is different in the second chunk; this is alright, as long as the order matches the order in that file (that is, the columns in the .csv are also switched).
Let's break down the train source mapping.
The channel spec describes how the data will be consumed. Four streams will be produced and made available to the script/model. The feature specification does not specify what happens further: the names of these streams are only lookup constants defined by the model/script. Based on this example, we can speculate that the model has three input channels: numeric_inputs, categorical_user_inputs, and categorical_item_inputs, and one output channel: label_ch. Feature names are internal to the FeatureSpec and can be freely modified.
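To illustrate how a training script might consume such a file, here is a hedged sketch that loads the YAML and resolves each channel to its features' metadata. The file name and the use of PyYAML are assumptions for the example, not the repository's actual loading code.

```python
import yaml  # PyYAML

# Hypothetical file name; any FeatureSpec-formatted yaml works the same way.
with open("feature_spec.yaml") as f:
    spec = yaml.safe_load(f)

# Resolve each channel to the metadata of the features it carries.
for channel, feature_names in spec["channel_spec"].items():
    for name in feature_names:
        meta = spec["feature_spec"][name]
        print(channel, name, meta.get("dtype"), meta.get("cardinality"))

# List which files hold which features in the "train" mapping's chunks.
for chunk in spec["source_spec"]["train"]:
    print(chunk["type"], chunk["features"], chunk["files"])
```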
In order to train any recommendation model in NVIDIA Deep Learning Examples, you can follow one of three possible ways:
- You deliver an already preprocessed dataset in the Intermediary Format supported by the data loader used by the training script (different models use different data loaders), together with a FeatureSpec yaml file describing at least the specification of the dataset, features, and model channels.
- You use a transcoding script.
- You deliver a dataset in non-preprocessed form and use the preprocessing scripts that are a part of the model repository. To use the existing preprocessing scripts, the format of the dataset needs to match that of one of the original datasets. This way, the FeatureSpec file will be generated automatically, and you get the same preprocessing as in the original model repository.