Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, Data Scientists are increasingly turning from more traditional machine learning methods to highly expressive deep learning models to improve the quality of their recommendations.
Google's Wide & Deep Learning for Recommender Systems has emerged as a popular model for Click Through Rate (CTR) prediction tasks thanks to its power of generalization (deep part) and memorization (wide part). The differences between this Wide & Deep Recommender Model and the model from the paper is the size of the deep part of the model. Originally, in Google's paper, the fully connected part was three layers of 1024, 512, and 256 neurons. Our model consists of 5 layers each of 1024 neurons.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and NVIDIA Ampere GPU architectures. Therefore, researchers can get results 3.5 times faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
Wide & Deep refers to a class of networks that use the output of two parts working in parallel - wide model and deep model - to make a binary prediction of CTR. The wide model is a linear model of features together with their transforms. The deep model is a series of 5 hidden MLP layers of 1024 neurons. The model can handle both numerical continuous features as well as categorical features represented as dense embeddings. The architecture of the model is presented in Figure 1.
Figure 1. The architecture of the Wide & Deep model.
Applications and dataset
As a reference dataset, we used a subset of the features engineered by the 19th place finisher in the Kaggle Outbrain Click Prediction Challenge. This competition challenged competitors to predict the likelihood with which a particular ad on a website's display would be clicked on. Competitors were given information about the user, display, document, and ad in order to train their models. More information can be found here.
The Outbrain Dataset is preprocessed in order to get features input to the model. To give context to the acceleration numbers described below, some important properties of our features and model are as follows.
- 5 scalar numeric features
- 8 one-hot categorical features
- 3 multi-hot categorical features
- 11 trainable embeddings of (dimension, cardinality of categorical variable, hotness for multi-hot):
(128,300000), (19,4), (128,100000), (64,4000), (64,1000), (64,2500), (64,300), (64,2000), (64, 350, 3), (64, 10000, 3), (64, 100, 3)
- 11 trainable embeddings for wide part of size 1 (serving as an embedding from the categorical to scalar space for input to the wide portion of the model)
- 5 scalar numeric features
- 8 scalar numeric features
- 5 one-hot categorical features
- 5 trainable embeddings of (dimension, cardinality of categorical variable): (128,250000), (64,2500), (64,4000), (64,1000), (128,5000)
- 5 trainable embeddings for wide part of size 1 (working as trainable one-hot embeddings)
- 8 scalar numeric features
Features describe both the user (Request Level features) and Item (Item Level Features).
- Input dimension is 29 (16 categorical and 13 numerical features)
- Total embedding dimension is 1235
- 5 hidden layers each with size 1024
- Total number of model parameter is ~92M
- Output dimension is 1 (
yis the probability of click given Request-level and Item-level features)
- Loss function: Binary Crossentropy
For more information about feature preprocessing, go to Dataset preprocessing.
Model accuracy metric
Model accuracy is defined with the MAP@12 metric. This metric follows the way of assessing model accuracy in the original Kaggle Outbrain Click Prediction Challenge. In this repository, the leaked clicked ads are not taken into account since in industrial setup Data Scientists do not have access to leaked information when training the model. For more information about data leak in Kaggle Outbrain Click Prediction challenge, visit this blogpost by the 19th place finisher in that competition.
Training and evaluation script also reports Loss (BCE) values.
Feature support matrix
The following features are supported by this model:
|Wide & Deep
|Horovod Multi-GPU (NCCL)
|Accelerated Linear Algebra (XLA)
|Automatic mixed precision (AMP)
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see: Horovod: Official repository.
Multi-GPU training with Horovod Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see: TensorFlow tutorial.
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. Enabling XLA results in improvements to speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled. For more information on XLA visit official XLA page.
Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:
- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For more information:
- How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
- Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
- How to access and enable AMP for TensorFlow, see Using TF-AMP from the TensorFlow User Guide.
For information on the influence of mixed precision training on model accuracy in train and inference, go to Training accuracy results.
Enabling mixed precision
To enable Wide & Deep training to use mixed precision, add the additional flag
--amp to the training script. Refer to the Quick Start Guide for more information.
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
Request level features Features that describe the person and context to which we wish to make recommendations.
Item level features Features that describe those objects which we are considering recommending.