SIM for TensorFlow2


Search-based Interest Model (SIM) is a system for predicting user behavior given sequences of previous interactions.


NVIDIA Deep Learning Examples

Use Case




Latest Version



November 4, 2022

Compressed Size

81.57 KB

This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and is available for download from NGC.

Search-based Interest Model (SIM) is a system for predicting user behavior given sequences of previous interactions. The model is based on the Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction paper, which reports that it has been deployed at Alibaba in the display advertising system. This repository provides a reimplementation of the code base originally provided for the SIM and DIEN models (DIEN is an inner component of the SIM model).

There are several differences between this and the original SIM model implementation. First, this model is implemented in TensorFlow 2 using Python 3.8 instead of TensorFlow 1 in Python 2.7. Second, this implementation utilizes the user dimension (identifiers), which makes it possible to train a personalized recommender system. Finally, the training code uses data preprocessed to TFRecord format, which improves data loading. We also include the scripts necessary to preprocess the Amazon Reviews dataset used in experiments.

The table below provides a fine-grained summary of the differences between this repository and the original implementation.

Aspect | Original implementation | This repository
Python | 2.7 | 3.8
Dataset size | 135K samples | 12M samples
Dataset format | CSV | TFRecord
Model: user id feature | not included | included
Model: batch normalization | included but not used correctly | not included
Model: output | two-dimensional softmax | one-dimensional sigmoid
Model: feature cardinalities | hardcoded | deduced from the dataset

In the authors' SIM implementation, the internals of the submodels differ slightly between the code and the original papers (DIN, DIEN, SIM). The core of our implementation follows the modules described in the papers. For exact implementation details, refer to the list below.

List of implementation differences between original SIM code and DIN/DIEN/SIM papers
  • Batch normalization before the MLP is not included in the papers.
  • Batch normalization in the code used trainable=False during the training phase.
  • ItemItemInteraction in DIN's attention module in the SIM implementation does not correspond to the activation unit in the DIN paper.
    • Element-wise subtraction and multiplication are fed to the MLP, skipping the outer product operation.
    • Sigmoids are used instead of PReLU/DICE in the MLP.
  • The soft search MLP is missing a middle layer in the implementation.
  • In the ESU part, multi-head attention is implemented as a DIN interaction block instead of typical multi-head attention.
  • The ESU part adds an additional embedding by summing all the embeddings passed from the GSU part.
  • The DIEN auxiliary loss uses an auxiliary network instead of the sigmoid of concatenated embeddings from the DIEN paper.
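
To make the ItemItemInteraction item in the list above more concrete, here is a minimal sketch of how the input to the attention MLP could be built from element-wise subtraction and multiplication alone, without an outer product. The function name is hypothetical, and concatenating the raw query and key embeddings alongside the difference and product is an assumption made for illustration, not a statement about either code base.

```python
from typing import List


def interaction_features(query: List[float], key: List[float]) -> List[float]:
    """Build the attention-MLP input for one (target item, history item) pair.

    Assumption: the raw embeddings are concatenated with their element-wise
    difference and product; no outer product is computed.
    """
    if len(query) != len(key):
        raise ValueError("query and key embeddings must have the same size")
    difference = [q - k for q, k in zip(query, key)]
    product = [q * k for q, k in zip(query, key)]
    return query + key + difference + product


# Made-up two-dimensional embeddings, for illustration only.
target = [1.0, 2.0]
history_item = [0.5, -1.0]
features = interaction_features(target, history_item)
print(features)  # [1.0, 2.0, 0.5, -1.0, 0.5, 3.0, 0.5, -2.0]
```

The resulting vector grows linearly with the embedding size, whereas an outer product would grow quadratically.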


The model enables you to train a high-quality, personalized, sequential neural network-based recommender system.

This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 1.48x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

The SIM model consists of two components: the General Search Unit (GSU) and the Exact Search Unit (ESU). The goal of the former is to filter a possibly long historical user behavior sequence down to a shorter, relevant sequence. The ESU, in turn, uses the most recent user behaviors to score a candidate item, for example, to estimate the click-through rate for a candidate ad. Both parts are trained jointly using data on past user behaviors.
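
The filtering role of the GSU can be illustrated with a short sketch. It assumes that relevance is scored with an inner product between the candidate item embedding and each behavior embedding and that the k best-scoring behaviors are kept in chronological order. The function names and the toy embeddings are hypothetical; this is not the repository's implementation.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def general_search(target: Sequence[float],
                   history: List[Sequence[float]],
                   k: int) -> List[int]:
    """Return the indices of the k behaviors most relevant to the target.

    Assumption: relevance is the inner product with the target embedding.
    Indices are returned in their original (chronological) order.
    """
    scores = [dot(target, item) for item in history]
    ranked = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up embeddings: a long history of four behaviors, filtered down to two.
target_item = [1.0, 0.0]
long_history = [[0.9, 0.1], [-0.5, 0.3], [0.2, 0.8], [0.7, -0.2]]
selected = general_search(target_item, long_history, k=2)
print(selected)  # [0, 3]
```

Only the selected behaviors would then be passed on to the more expensive part of the model.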

A model architecture diagram is presented below.

Figure 1. The architecture of the model.

The embeddings in the model architecture diagram are obtained by passing each feature from the dataset through the Embedding Layer. Item features from the target item, the short behavior history, and the long behavior history share embedding tables.
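
A minimal sketch of what sharing an embedding table means in practice: the same table is used to look up the target item, the short history, and the long history, so a given item id always maps to the same vector. The class name, table size, and ids are hypothetical and chosen only for illustration.

```python
import random
from typing import List


class EmbeddingTable:
    """A toy embedding table: one vector per categorical id."""

    def __init__(self, cardinality: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(cardinality)
        ]

    def lookup(self, ids: List[int]) -> List[List[float]]:
        """Return the embedding vector for every id, in order."""
        return [self.weights[i] for i in ids]


# One table for the item id feature, shared by all three inputs.
item_table = EmbeddingTable(cardinality=10, dim=4)
target_emb = item_table.lookup([3])
short_history_emb = item_table.lookup([3, 7])
long_history_emb = item_table.lookup([1, 3, 5, 7, 9])

# Item 3 resolves to the very same vector in every input.
print(target_emb[0] is short_history_emb[0] is long_history_emb[1])  # True
```

Sharing the table keeps the parameter count independent of how many inputs reference the same feature.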

Figure 2. Embedding of input features.

Default configuration

The following features are implemented in this model:

  • general
    • dynamic loss scaling for Tensor Cores (mixed precision) training
    • data-parallel multi-GPU training
  • preprocessing
    • dataset preprocessing using the NVTabular library

The following performance optimizations were implemented in this model:

Feature support matrix

This model supports the following features:

Feature | SIM v1.0 TF2
Horovod Multi-GPU (NCCL) | Yes
Accelerated Linear Algebra (XLA) | Yes
Automatic mixed precision (AMP) | Yes
Preprocessing on GPU with NVTabular | Yes
BYO dataset | Yes


Multi-GPU training with Horovod

Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, refer to the example sources in this repository or to the TensorFlow tutorial.

Accelerated Linear Algebra (XLA)

XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. Enabling XLA results in improvements to speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled.

Automatic Mixed Precision (AMP)

AMP enables mixed precision training without any changes to the code base by performing automatic graph rewrites and loss scaling controlled by an environment variable.

Preprocessing on GPU with NVTabular

Amazon Reviews dataset preprocessing can be conducted using NVTabular. For more information on the framework, refer to this blog post.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the NVIDIA Volta architecture, and continuing with the NVIDIA Ampere architecture, switching to mixed precision has yielded significant training speedups: up to 3x overall on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:

  1. Porting the model to use the FP16 data type where appropriate.
  2. Adding loss scaling to preserve small gradient values.

This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full mixed precision methodology in your existing TensorFlow model code. AMP enables mixed precision training on NVIDIA Volta, and NVIDIA Ampere GPU architectures automatically. The TensorFlow framework code makes all necessary model changes internally.

In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
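
Why loss scaling preserves small gradient values can be shown with the standard library alone. The sketch below rounds numbers to IEEE half precision with struct and pairs that with a simplified dynamic loss scaler. The class name, the initial scale, and the growth interval are assumptions for illustration and do not describe the exact policy used inside TensorFlow.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


class DynamicLossScaler:
    """Simplified dynamic loss scaling (hypothetical parameters).

    The scale is halved whenever non-finite gradients are seen and doubled
    after `growth_interval` consecutive steps with finite gradients.
    """

    def __init__(self, initial_scale: float = 2.0 ** 15,
                 growth_interval: int = 2000) -> None:
        self.scale = initial_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads_are_finite: bool) -> bool:
        """Adjust the scale; return True if the step should be applied."""
        if not grads_are_finite:
            self.scale = max(self.scale / 2.0, 1.0)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0
            self._good_steps = 0
        return True


tiny_gradient = 1e-8
unscaled = to_fp16(tiny_gradient)                    # underflows in FP16
recovered = to_fp16(tiny_gradient * 1024.0) / 1024.0  # survives when scaled
print(unscaled, recovered > 0.0)

scaler = DynamicLossScaler(initial_scale=1024.0, growth_interval=2)
applied = [scaler.update(False), scaler.update(True), scaler.update(True)]
print(applied, scaler.scale)
```

The gradient is multiplied by the scale before it is cast to FP16 and divided by the same scale afterwards in higher precision, which is why the value is no longer flushed to zero.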

Enabling mixed precision

To train SIM with mixed precision, pass the --amp flag to the training script. Refer to the Quick Start Guide for more information.

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require a high dynamic range for weights or activations.

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
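
The trade-off between range and precision mentioned above can be emulated on the CPU. TF32 keeps the 8-bit exponent of FP32 and uses a 10-bit mantissa, so the sketch below clears the 13 lowest mantissa bits of an FP32 value. Plain truncation is an assumption made for simplicity; the rounding performed by the hardware may differ. The function name is hypothetical.

```python
import struct


def truncate_to_tf32(value: float) -> float:
    """Emulate TF32 precision by truncating an FP32 mantissa to 10 bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Precision: 2**-10 is the smallest step above 1.0 that survives.
kept = truncate_to_tf32(1.0 + 2.0 ** -10)
dropped = truncate_to_tf32(1.0 + 2.0 ** -11)
print(kept, dropped)

# Range: a value near the top of the FP32 range is still representable,
# while packing it as FP16 fails.
large = truncate_to_tf32(1e38)
try:
    struct.pack("<e", 1e38)
    fp16_overflows = False
except (OverflowError, struct.error):
    fp16_overflows = True
print(large > 9e37, fp16_overflows)
```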

BYO dataset functionality overview

This section describes how you can train the DeepLearningExamples RecSys models on your own datasets without changing the model or data loader and with similar performance to the one published in each repository. This can be achieved thanks to Dataset Feature Specification, which describes how the dataset, data loader, and model interact with each other during training, inference, and evaluation. Dataset Feature Specification has a consistent format across all recommendation models in NVIDIA's DeepLearningExamples repository, regardless of dataset file type and the data loader, giving you the flexibility to train RecSys models on your own datasets.

BYO dataset glossary

The Dataset Feature Specification consists of three mandatory and one optional section:

feature_spec provides a base of features that may be referenced in other sections, along with their metadata. Format: dictionary (feature name) => (metadata name => metadata value)

source_spec provides information necessary to extract features from the files that store them. Format: dictionary (mapping name) => (list of chunks)

  • Mappings are used to represent different versions of the dataset (think: train/validation/test, k-fold splits). A mapping is a list of chunks.
  • Chunks are subsets of features that are grouped together for saving. For example, some formats may constrain data saved in one file to a single data type. In that case, each data type would correspond to at least one chunk. Another example where this might be used is to reduce file size and enable more parallel loading. Chunk description is a dictionary of three keys:
    • type provides information about the format in which the data is stored. Not all formats are supported by all models.
    • features is a list of features that are saved in a given chunk. The order of this list may matter: for some formats, it is crucial for assigning read data to the proper feature.
    • files is a list of paths to files where the data is saved. For Feature Specification in yaml format, these paths are assumed to be relative to the yaml file's directory (dirname). The order of this list matters: it is assumed that rows 1 to i appear in the first file, rows i+1 to j in the next one, etc.
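
The path rule for the files key can be sketched as follows. The helper name and the location /data/sim/feature_spec.yaml are hypothetical; the sketch only shows paths being resolved against the directory that holds the yaml file while keeping the list order intact.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def resolve_chunk_files(spec_path: str, chunk: Dict) -> List[str]:
    """Resolve a chunk's relative file paths against the yaml file's directory.

    The order of the returned list matches the order in the chunk, because
    the row order of the dataset depends on it.
    """
    base_dir = PurePosixPath(spec_path).parent
    return [str(base_dir / name) for name in chunk["files"]]


chunk = {
    "type": "csv",
    "features": ["user_gender", "user_age"],
    "files": ["train_data_0_0.csv", "train_data_0_1.csv"],
}
paths = resolve_chunk_files("/data/sim/feature_spec.yaml", chunk)
print(paths)
# ['/data/sim/train_data_0_0.csv', '/data/sim/train_data_0_1.csv']
```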

channel_spec determines how features are used. It is a mapping (channel name) => (list of feature names).

Channels are model-specific magic constants. In general, data within a channel is processed using the same logic. Example channels: model output (labels), categorical ids, numerical inputs, user data, and item data.

metadata is a catch-all, wildcard section: If there is some information about the saved dataset that does not fit into the other sections, you can store it here.

Dataset feature specification

Data flow can be described abstractly: Input data consists of a list of rows. Each row has the same number of columns; each column represents a feature. The columns are retrieved from the input files, loaded, aggregated into channels and supplied to the model/training script.

FeatureSpec contains metadata to configure this process and can be divided into three parts:

  • Specification of how data is organized on disk (source_spec). It describes which feature (from feature_spec) is stored in which file and how files are organized on disk.

  • Specification of features (feature_spec). Describes a dictionary of features, where the key is the feature name and the values are the feature's characteristics, such as dtype and other metadata (for example, cardinalities for categorical features).

  • Specification of the model's inputs and outputs (channel_spec). Describes a dictionary of the model's inputs, where keys specify the names of the model's channels and values specify lists of features to be loaded into each channel. The model's channels are groups of data streams to which common model logic is applied, for example categorical/continuous data, and user/item ids. Required/available channels depend on the model.

The FeatureSpec is a common form of description regardless of underlying dataset format, dataset data loader form, and model.
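
Because source_spec and channel_spec both refer to features by name, a simple consistency check is to confirm that every referenced name is defined in feature_spec. The following sketch works on a plain dictionary such as one loaded from the yaml file. The function name and error messages are hypothetical, and this is not necessarily how the repository validates its specifications.

```python
from typing import Dict, List


def validate_feature_spec(spec: Dict) -> List[str]:
    """Return a list of problems; an empty list means the spec is consistent."""
    errors = []
    known = set(spec.get("feature_spec", {}))
    for mapping_name, chunks in spec.get("source_spec", {}).items():
        for index, chunk in enumerate(chunks):
            for feature in chunk.get("features", []):
                if feature not in known:
                    errors.append(
                        f"source_spec[{mapping_name}][{index}]: "
                        f"unknown feature {feature!r}"
                    )
    for channel, features in spec.get("channel_spec", {}).items():
        for feature in features:
            if feature not in known:
                errors.append(
                    f"channel_spec[{channel}]: unknown feature {feature!r}"
                )
    return errors


good_spec = {
    "feature_spec": {"user_id": {"cardinality": 2655}, "label": {}},
    "source_spec": {
        "train": [{"type": "csv", "features": ["user_id", "label"],
                   "files": ["train_data_1.csv"]}],
    },
    "channel_spec": {"categorical_user_inputs": ["user_id"], "label": ["label"]},
}
bad_spec = dict(good_spec, channel_spec={"label": ["click"]})

print(validate_feature_spec(good_spec))  # []
print(validate_feature_spec(bad_spec))
```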

Data flow in NVIDIA Deep Learning Examples recommendation models

The typical data flow is as follows:

  • S.0. Original dataset is downloaded to a specific folder.
  • S.1. Original dataset is preprocessed into Intermediary Format. For each model, the preprocessing is done differently, using different tools. The Intermediary Format also varies (for example, for DLRM PyTorch, the Intermediary Format is a custom binary one.)
  • S.2. The Preprocessing Step outputs Intermediary Format with dataset split into training and validation/testing parts along with the Dataset Feature Specification yaml file. Metadata in the preprocessing step is automatically calculated.
  • S.3. Intermediary Format data, together with the Dataset Feature Specification, are fed into training/evaluation scripts. The data loader reads Intermediary Format and feeds the data into the model according to the description in the Dataset Feature Specification.
  • S.4. The model is trained and evaluated.

Figure 3. Data flow in Recommender models in NVIDIA Deep Learning Examples repository. Channels of the model are drawn in green.

Example of dataset feature specification

As an example, let's consider a Dataset Feature Specification for a small CSV dataset for some abstract model.

feature_spec:
  user_gender:
    dtype: torch.int8
    cardinality: 3 #M,F,Other
  user_age: #treated as numeric value
    dtype: torch.int8
  user_id:
    dtype: torch.int32
    cardinality: 2655
  item_id:
    dtype: torch.int32
    cardinality: 856
  label:
    dtype: torch.float32

source_spec:
  train:
    - type: csv
      features:
        - user_gender
        - user_age
      files:
        - train_data_0_0.csv
        - train_data_0_1.csv
    - type: csv
      features:
        - user_id
        - item_id
        - label
      files:
        - train_data_1.csv
  test:
    - type: csv
      features:
        - user_id
        - item_id
        - label
        - user_gender
        - user_age
      files:
        - test_data.csv

channel_spec:
  numeric_inputs:
    - user_age
  categorical_user_inputs:
    - user_gender
    - user_id
  categorical_item_inputs:
    - item_id
  label:
    - label

The data contains five features: (user_gender, user_age, user_id, item_id, label). Their data types and necessary metadata are described in the feature specification section.

In the source mapping section, two mappings are provided: one describes the layout of the training data, and the other describes the layout of the testing data. The layout for training data has been chosen arbitrarily to showcase the flexibility. The train mapping consists of two chunks. The first one contains user_gender and user_age, saved as a CSV, and is further broken down into two files. For specifics of the layout, refer to the following example and consult the glossary. The second chunk contains the remaining columns and is saved in a single file. Notice that the order of columns is different in the second chunk. This is fine, as long as the order matches the order in that file (that is, the columns in the .csv are also switched).

Let's break down the train source mapping. The table contains example data color-paired to the files containing it.
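
As a rough illustration of how the train mapping could be assembled, the sketch below reads each chunk's files in order, stacks them row-wise, and then joins the chunks column-wise. It assumes header-less CSV files and keeps their contents in memory. All values are made up, and the helper names are hypothetical rather than part of the repository's data loader.

```python
import csv
import io
from typing import Dict, List


def read_chunk(chunk: Dict, file_contents: Dict[str, str]) -> List[Dict[str, str]]:
    """Read a chunk's files in the listed order and name columns by position."""
    rows = []
    for name in chunk["files"]:
        for values in csv.reader(io.StringIO(file_contents[name])):
            rows.append(dict(zip(chunk["features"], values)))
    return rows


def read_mapping(chunks: List[Dict], file_contents: Dict[str, str]) -> List[Dict[str, str]]:
    """Join all chunks of a mapping column-wise into full rows."""
    per_chunk = [read_chunk(chunk, file_contents) for chunk in chunks]
    if len({len(rows) for rows in per_chunk}) != 1:
        raise ValueError("all chunks of a mapping must have the same row count")
    merged = []
    for parts in zip(*per_chunk):
        row = {}
        for part in parts:
            row.update(part)
        merged.append(row)
    return merged


train_mapping = [
    {"type": "csv", "features": ["user_gender", "user_age"],
     "files": ["train_data_0_0.csv", "train_data_0_1.csv"]},
    {"type": "csv", "features": ["user_id", "item_id", "label"],
     "files": ["train_data_1.csv"]},
]
# Made-up file contents: three rows in total, the first chunk split 2 + 1.
files = {
    "train_data_0_0.csv": "0,24\n1,31\n",
    "train_data_0_1.csv": "2,47\n",
    "train_data_1.csv": "11,301,1\n12,302,0\n13,303,1\n",
}
rows = read_mapping(train_mapping, files)
print(rows[2])
```

Row three comes from the second file of the first chunk and the third line of the second chunk, which is why the order of the files list matters.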

The channel spec describes how the data will be consumed. Four streams will be produced and available to the script/model. The feature specification does not specify what happens further: names of these streams are only lookup constants defined by the model/script. Based on this example, we can speculate that the model has three input channels: numeric_inputs, categorical_user_inputs, categorical_item_inputs, and one output channel: label. Feature names are internal to the FeatureSpec and can be freely modified.
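
The grouping of features into streams can be sketched in a few lines. The channel names are the ones mentioned above; the assignment of features to channels is inferred from the feature names, and the row values are made up. The function name is hypothetical.

```python
from typing import Dict, List


def split_into_channels(row: Dict[str, float],
                        channel_spec: Dict[str, List[str]]) -> Dict[str, List[float]]:
    """Group the features of one row into the streams named by channel_spec."""
    return {
        channel: [row[feature] for feature in features]
        for channel, features in channel_spec.items()
    }


channel_spec = {
    "numeric_inputs": ["user_age"],
    "categorical_user_inputs": ["user_gender", "user_id"],
    "categorical_item_inputs": ["item_id"],
    "label": ["label"],
}
row = {"user_gender": 2, "user_age": 47, "user_id": 13, "item_id": 303, "label": 1.0}
streams = split_into_channels(row, channel_spec)
print(streams)
```

Renaming a feature only requires updating the specification consistently, because the model looks data up by channel name, not by feature name.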

BYO dataset functionality

To train any recommendation model in NVIDIA Deep Learning Examples on your own data, you can follow one of three approaches:

  • Deliver preprocessed datasets in the Intermediary Format supported by the data loader used by the training script (different models use different data loaders), together with a FeatureSpec yaml file describing at least the specification of the dataset, features, and model channels.

  • Use a transcoding script (not yet supported in the SIM model).

  • Deliver datasets in non-preprocessed form and use the preprocessing scripts that are part of the model repository. To use the existing preprocessing scripts, the format of the dataset needs to match one of the original datasets. This way, the FeatureSpec file will be generated automatically, but the preprocessing will be the same as in the original model repository.


Auxiliary loss is used to improve training of the DIEN model (and therefore of SIM as well). It is constructed based on consecutive user actions from their short behavior history.

The DIEN model was proposed in the Deep Interest Evolution Network for Click-Through Rate Prediction paper as an extension of the DIN model. It can also be used as a backbone for processing short interaction sequences in the SIM model.

The DIN model was proposed in the Deep Interest Network for Click-Through Rate Prediction paper. It can be used as a backbone for processing short interaction sequences in the SIM model.

Long user behavior history is the record of past user interactions. It is processed by the General Search Unit part of the SIM model (refer to Figure 1). This is typically a lightweight model aimed at processing longer sequences.

Short user behavior history is the record of the most recent user interactions. It is processed by the more computationally intensive Exact Search Unit part of the SIM model (refer to Figure 1).

User behaviors are users' interactions with given items of interest. Example interactions include reviewed items for the Amazon Reviews dataset or clicks in the e-commerce domain. All the systems contained in this repository focus on modeling user interactions.