Wide & Deep for TensorFlow1

Wide & Deep Recommender model.
NVIDIA Deep Learning Examples
Latest Version: April 4, 2023
Compressed Size: 27.83 KB

This resource uses open-source code maintained on GitHub (see the Quick Start Guide section) and is available for download from NGC.

Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, data scientists are increasingly turning from more traditional machine learning methods to highly expressive deep learning models to improve the quality of their recommendations. Google's Wide & Deep Learning for Recommender Systems has emerged as a popular model for these problems, both for its robustness to signal sparsity and for its user-friendly implementation in TensorFlow.

The difference between this Wide & Deep Recommender Model and the model from the paper is the size of the Deep part of the model. In Google's paper, the fully connected part has three layers of 1024, 512, and 256 neurons. Our model consists of 5 layers of 1024 neurons each.

The model enables you to train a recommender model that combines the memorization of the Wide part and generalization of the Deep part of the network.

This model is trained with mixed precision using Tensor Cores on the NVIDIA Volta, Turing, and NVIDIA Ampere GPU architectures. As a result, researchers can get results 1.49 times faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

Wide & Deep refers to a class of networks that combine the outputs of two parts working in parallel, a wide model and a deep model, to make recommendation predictions. The wide model is a generalized linear model of features together with their transforms. The deep model is a series of 5 hidden MLP layers of 1024 neurons each, beginning with a dense embedding of features. The architecture is presented in Figure 1.

Figure 1. The architecture of the Wide & Deep model.
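
The two-branch structure described above can be sketched in plain Python. This is a toy illustration, not the repository's TensorFlow implementation: the layer sizes, random weights, and helper names (`dense`, `init_layer`, `wide_and_deep_forward`) are all assumptions made for the example, and the two branches are assumed to be combined by summing their logits, as in Google's paper.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Fully connected layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def wide_and_deep_forward(wide_x, deep_x, wide_params, hidden_layers, head):
    # Wide part: a generalized linear model over the (transformed) features.
    wide_w, wide_b = wide_params
    wide_logit = sum(w * x for w, x in zip(wide_w, wide_x)) + wide_b

    # Deep part: the dense embedding vector passes through a stack of ReLU layers.
    h = deep_x
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    deep_logit = dense(h, *head)[0]

    # Assumption: the two logits are summed and squashed into a probability.
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit)))


rng = random.Random(0)
# Toy sizes for readability; the model described here uses 5 layers of 1024 units.
N_WIDE, N_DEEP, HIDDEN, N_LAYERS = 6, 8, 16, 5

wide_params = ([rng.uniform(-0.1, 0.1) for _ in range(N_WIDE)], 0.0)
hidden_layers = []
n_in = N_DEEP
for _ in range(N_LAYERS):
    hidden_layers.append(init_layer(rng, n_in, HIDDEN))
    n_in = HIDDEN
head = init_layer(rng, HIDDEN, 1)

wide_x = [rng.random() for _ in range(N_WIDE)]
deep_x = [rng.random() for _ in range(N_DEEP)]
prob = wide_and_deep_forward(wide_x, deep_x, wide_params, hidden_layers, head)
print(f"predicted click probability: {prob:.4f}")
```

In a real model, both branches are trained jointly, so the gradient of the loss flows through the shared sigmoid into the wide weights and the deep layers at the same time.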

Applications and dataset

As a reference dataset, we used a subset of the features engineered by the 19th-place finisher in the Kaggle Outbrain Click Prediction Challenge. This competition challenged competitors to predict the likelihood that a particular ad on a website's display would be clicked. Competitors were given information about the user, display, document, and ad in order to train their models. More information can be found here.

Default configuration

For reference, and to give context to the acceleration numbers described below, some important properties of our features and model are as follows:

  • Features

    • Request Level
      • 16 scalar numeric features (shape=(1,))
      • 12 one-hot categorical features (all int dtype)
        • 5 indicator embeddings with sizes 2, 2, 3, 3, 6
        • 7 trainable embeddings
          • all except two have an embedding size of 64 (the remaining two have 128). Note that, for all categorical features, we do not use this information to short-circuit the lookups by treating them as a single multi-hot lookup; our API is fully general to any combination of embedding sizes.
          • all use hash bucketing with num_buckets= 300k, 100k, 4k, 2.5k, 2k, 1k, and 300 respectively
      • 3 multi-hot categorical features (all int dtype)
        • all trainable embeddings
        • all with embedding size 64
        • all use hash bucketing with num_buckets= 10k, 350, and 100 respectively
    • Item Level
      • 16 scalar numeric features
      • 4 one-hot categorical features (all int dtype)
        • embedding sizes of 128, 64, 64, 64 respectively
        • hash bucketing with num_buckets= 250k, 4k, 2.5k, and 1k respectively
      • 3 multi-hot categorical features (all int dtype)
        • all with embedding size 64
        • hash bucketing with num_buckets= 10k, 350, and 100 respectively
    • All features are used in both wide and deep branches of the network
  • Model

    • Total embedding dimension is 1328
    • 5 hidden layers each with size 1024
    • Output dimension is 1 (probability of click)
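
The total embedding dimension of 1328 can be checked against the feature list above with a few lines of arithmetic. The sketch below assumes that each scalar numeric feature contributes one dimension and that each multi-hot feature is pooled into a single vector of its embedding size; the dictionary layout and the `total_width` helper are illustrative only.

```python
# Feature widths transcribed from the list above.
request_level = {
    "scalar numeric": [1] * 16,
    "indicator embeddings": [2, 2, 3, 3, 6],
    "trainable one-hot embeddings": [64] * 5 + [128] * 2,
    "multi-hot embeddings": [64] * 3,
}
item_level = {
    "scalar numeric": [1] * 16,
    "one-hot embeddings": [128, 64, 64, 64],
    "multi-hot embeddings": [64] * 3,
}


def total_width(groups):
    """Sum the widths of every feature in every group."""
    return sum(sum(widths) for widths in groups.values())


request_dim = total_width(request_level)
item_dim = total_width(item_level)
total_dim = request_dim + item_dim
print(request_dim, item_dim, total_dim)  # 800 528 1328
```

Under these assumptions the request-level features account for 800 dimensions and the item-level features for 528, which together match the stated total.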

Feature support matrix

The following features are supported by this model:

Feature                            Wide & Deep
Horovod Multi-GPU                  Yes
Automatic mixed precision (AMP)    Yes

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the official Horovod repository.

Multi-GPU training with Horovod

Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the TensorFlow tutorial.

Mixed precision

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta and Turing architectures, significant training speedups can be achieved by switching to mixed precision, with up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

  1. Porting the model to use the FP16 data type where appropriate.
  2. Adding loss scaling to preserve small gradient values.
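
The second step can be illustrated without any GPU. The following minimal sketch uses the standard library's half-precision `struct` format to show a small gradient underflowing in FP16 and surviving once it is scaled. The scale factor of 1024 and the gradient value are hypothetical, and real frameworks typically choose the scale dynamically.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical static scale factor
true_grad = 1e-8     # under half of FP16's smallest positive value (about 6e-8)

# Without scaling, the gradient rounds to zero when stored in FP16.
unscaled = to_fp16(true_grad)

# With scaling, the loss (and hence every gradient) is multiplied by LOSS_SCALE
# before the FP16 backward pass, then divided out again in higher precision.
scaled_fp16 = to_fp16(true_grad * LOSS_SCALE)
recovered = scaled_fp16 / LOSS_SCALE

print(f"FP16 gradient without scaling: {unscaled}")
print(f"FP16 gradient with scaling:    {recovered:.3e}")
```

The recovered value is not bit-exact, because the scaled gradient is still rounded to FP16, but it stays close to the true gradient instead of vanishing.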

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

Enabling mixed precision

To enable mixed precision for Wide & Deep training, you don't need to perform input quantization; you only need to pass the additional --amp flag to the training script (see Quick Start Guide).
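
As a rough sketch of how a flag like --amp could be wired into a training script, the example below parses the flag and sets the `TF_ENABLE_AUTO_MIXED_PRECISION` environment variable, which NVIDIA's TensorFlow 1 containers recognize. Whether this repository's script uses exactly this mechanism is an assumption, and `parse_args` and `configure_precision` are hypothetical names.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--amp", action="store_true",
                        help="enable automatic mixed precision")
    return parser.parse_args(argv)


def configure_precision(args, env=os.environ):
    # Assumption: AMP is switched on through this environment variable, which
    # has to be set before the TensorFlow graph is built.
    if args.amp:
        env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
    return env


# Demonstrated on a throwaway dict so the real process environment is untouched.
env = configure_precision(parse_args(["--amp"]), env={})
print(env)
```

Without the flag, the mapping is left unchanged and training would run in the default precision.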

Impact of mixed precision on training accuracy

The accuracy of training, measured with the MAP@12 metric, was not impacted by enabling mixed precision. The obtained results were statistically similar (i.e., similar run-to-run variance was observed, with a standard deviation of about 0.002).
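
For readers unfamiliar with MAP@12, here is a minimal sketch of mean average precision at a cutoff of 12, following the common definition used in Kaggle competitions. It is not taken from the repository's evaluation code, and the ad identifiers and rankings are made up for illustration.

```python
def average_precision_at_k(ranked_items, relevant_items, k=12):
    """AP@k for one display: mean of precision@i over the ranks i that are hits."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)


def map_at_k(all_ranked, all_relevant, k=12):
    """Average AP@k over all displays."""
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)


# Three hypothetical displays; the clicked ad is ranked 1st, 2nd, and 4th.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i", "j"]]
clicked = [["a"], ["e"], ["j"]]
print(map_at_k(ranked, clicked))  # (1 + 1/2 + 1/4) / 3
```

When each display has exactly one clicked ad, AP@12 reduces to the reciprocal of the clicked ad's rank, or zero if it falls outside the top 12.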

Impact of mixed precision on inference accuracy

For our reference model, the average absolute error on the probability of interaction induced by reduced precision inference is 0.0002, producing a near-perfect fit between predictions produced by full and mixed precision models. Moreover, this error is uncorrelated with the magnitude of the predicted value, which means for most predictions of interest (i.e. greater than 0.01 or 0.1 likelihood of interaction), the relative magnitude of the error is approaching the noise floor of the problem.

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
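
The dynamic-range point above can be made concrete with a small emulation. TF32 keeps the 8 exponent bits of FP32 but only 10 mantissa bits, so the sketch below approximates it by clearing the 13 low mantissa bits of an FP32 value. Using truncation rather than the hardware's rounding is an assumption, and `to_tf32` and `fits_in_fp16` are illustrative helpers, not library functions.

```python
import struct


def to_tf32(x):
    """Emulate TF32 storage: FP32's 8 exponent bits, but only 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 low mantissa bits (truncation, an assumption)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_in_fp16(x):
    """Return False if x is too large in magnitude for IEEE half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


value = 1.0e5  # larger than FP16's maximum finite value of 65504
tf32_value = to_tf32(value)
rel_error = abs(tf32_value - value) / value

print(f"fits in FP16: {fits_in_fp16(value)}")
print(f"TF32 value:   {tf32_value}, relative error {rel_error:.2e}")
```

The value overflows FP16 but remains representable in the emulated TF32 format, at the cost of a relative error bounded by the 10-bit mantissa.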


Request level features: Features that describe the person or object to which we wish to make recommendations.

Item level features: Features that describe those objects which we are considering recommending.