DLRM for PyTorch | NVIDIA NGC

NVIDIA

DLRM for PyTorch

Resource

NVIDIA

DLRM for PyTorch

The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs.

The following sections provide greater details of the dataset, running training and inference, and the training results.

Scripts and sample code

The dlrm/scripts/main.py script provides an entry point to most of the functionality. Using different command-line flags allows you to run training, validation, and benchmark both training and inference on real or synthetic data.

Utilities related to loading the data reside in the data directory.

Command-line options

The dlrm/scripts/main.py script supports a number of command-line flags. You can get the descriptions of those by running python -m dlrm.scripts.main --help.

The following example output is printed when running the model:

Epoch:[0/1] [200/128028]  eta: 1:28:44  loss: 0.1782  step_time: 0.041657  lr: 0.8794
Epoch:[0/1] [400/128028]  eta: 1:25:15  loss: 0.1403  step_time: 0.038504  lr: 1.7544
Epoch:[0/1] [600/128028]  eta: 1:23:56  loss: 0.1384  step_time: 0.038422  lr: 2.6294
Epoch:[0/1] [800/128028]  eta: 1:23:13  loss: 0.1370  step_time: 0.038421  lr: 3.5044
Epoch:[0/1] [1000/128028]  eta: 1:22:45  loss: 0.1362  step_time: 0.038464  lr: 4.3794
Epoch:[0/1] [1200/128028]  eta: 1:22:24  loss: 0.1346  step_time: 0.038455  lr: 5.2544
Epoch:[0/1] [1400/128028]  eta: 1:22:07  loss: 0.1339  step_time: 0.038459  lr: 6.1294
Epoch:[0/1] [1600/128028]  eta: 1:21:52  loss: 0.1320  step_time: 0.038481  lr: 7.0044
Epoch:[0/1] [1800/128028]  eta: 1:21:39  loss: 0.1315  step_time: 0.038482  lr: 7.8794
Epoch:[0/1] [2000/128028]  eta: 1:21:27  loss: 0.1304  step_time: 0.038466  lr: 8.7544
Epoch:[0/1] [2200/128028]  eta: 1:21:15  loss: 0.1305  step_time: 0.038430  lr: 9.6294

Getting the data

This example uses the Criteo Terabyte Dataset. The first 23 days are used as the training set. The last day is split in half. The first part, referred to as "test", is used for validating training results. The second one, referred to as "validation", is unused.

Dataset guidelines

The preprocessing steps applied to the raw data include:

Replacing the missing values with 0
Replacing the categorical values that exist fewer than FL times with a special value (FL value is called a frequency threshold or a frequency limit)
Converting the hash values to consecutive integers
Adding 3 to all the numerical features so that all of them are greater or equal to 1
Taking a natural logarithm of all numerical features

BYO dataset

This implementation supports using other datasets thanks to BYO dataset functionality. The BYO dataset functionality allows users to plug in their dataset in a common fashion for all Recommender models that support this functionality. Using BYO dataset functionality, the user does not have to modify the source code of the model thanks to the Feature Specification file. For general information on how BYO dataset works, refer to the BYO dataset overview section.

There are three ways to plug in user's dataset:

1. Provide an unprocessed dataset in a format matching the one used by Criteo 1TB, then use Criteo 1TB's preprocessing. Feature Specification file is then generated automatically.

The required format of the user's dataset is:

The data should be split into text files. Each line of those text files should contain a single training example. An example should consist of multiple fields separated by tabulators:

The first field is the label – 1 for a positive example and 0 for negative.
The next N tokens should contain the numerical features separated by tabs.
The next M tokens should contain the hashed categorical features separated by tabs.

The correct dataset files together with the Feature Specification yaml file will be generated automatically by preprocessing script.

For an example of using this process, refer to the Quick Start Guide

2. Provide a CSV containing preprocessed data and a simplified Feature Specification yaml file, then transcode the data with `transcode.py` script

This option should be used if the user has their own CSV file with a preprocessed dataset they want to train on.

The required format of the user's dataset is:

CSV files containing the data, already split into train and test sets.
Feature Specification yaml file describing the layout of the CSV data

For an example of a feature specification file, refer to the tests/transcoding folder.

The CSV containing the data:

should be already split into train and test
should contain no header
should contain one column per feature, in the order specified by the list of features for that chunk in the source_spec section of the feature specification file
categorical features should be non-negative integers in the range [0,cardinality-1] if cardinality is specified

The Feature Specification yaml file:

needs to describe the layout of data in CSV files
may contain information about cardinalities. However, if set to auto, they will be inferred from the data by the transcoding script.

Refer to tests/transcoding/small_csv.yaml for an example of the yaml Feature Specification.

The following example shows how to use this way of plugging user's dataset:

Prepare your data and save the path:

DATASET_PARENT_DIRECTORY=/raid/dlrm

Build the DLRM image with:

docker build -t nvidia_dlrm_pyt .

Launch the container with:

docker run --runtime=nvidia -it --rm --ipc=host  -v ${DATASET_PARENT_DIRECTORY}:/data nvidia_dlrm_preprocessing bash

If you are just testing the process, you can create synthetic csv data:

python -m dlrm.scripts.gen_csv --feature_spec_in tests/transcoding/small_csv.yaml

Convert the data:

mkdir /data/conversion_output
python -m dlrm.scripts.transcode --input /data --output /data/converted

You may need to tune the --chunk_size parameter. Higher values speed up the conversion but require more RAM.

This will convert the data from /data and save the output in /data/converted. A feature specification file describing the new data will be automatically generated.

To run the training on 1 GPU:

python -m dlrm.scripts.main --mode train --dataset /data/converted --amp --cuda_graphs

multi-GPU for DGX A100:

python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
          bash  -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.main \
          --dataset /data/converted --seed 0 --epochs 1 --amp --cuda_graphs'

multi-GPU for DGX-1 and DGX-2:

python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
          bash  -c './bind.sh  --cpu=exclusive -- python -m dlrm.scripts.main \
          --dataset /data/converted --seed 0 --epochs 1 --amp --cuda_graphs'

3. Provide a fully preprocessed dataset, saved in split binary files, and a Feature Specification yaml file

This is the option to choose if you want full control over preprocessing and/or want to preprocess data directly to the target format.

Your final output will need to contain a Feature Specification yaml describing data and file layout. For an example feature specification file, refer to tests/feature_specs/criteo_f15.yaml

For details, refer to the BYO dataset overview section.

Channel definitions and requirements

This model defines three channels:

categorical, accepting an arbitrary number of features
numerical, accepting an arbitrary number of features
label, accepting a single feature

The training script expects two mappings:

train
test

For performance reasons:

The only supported dataset type is split binary
Splitting chunks into multiple files is not supported.
Each categorical feature has to be provided in a separate chunk
All numerical features have to be provided in a single chunk
All numerical features have to appear in the same order in channel_spec and source_spec
Only integer types are supported for categorical features
Only float16 is supported for numerical features

BYO dataset constraints for the model

There are the following constraints of BYO dataset functionality for this model:

The performance of the model depends on the dataset size. Generally, the model should scale better for datasets containing more data points. For a smaller dataset, you might experience slower performance than the one reported for Criteo
Using other datasets might require tuning some hyperparameters (for example, learning rate, beta1 and beta2) to reach desired accuracy.
The optimized cuda interaction kernels for FP16 and TF32 assume that the number of categorical variables is smaller than WARP_SIZE=32 and embedding size is <=128

Preprocessing

The preprocessing scripts provided in this repository support running both on CPU and GPU using NVtabular (GPU only) and Apache Spark 3.0.

Please note that the preprocessing will require about 4TB of disk storage.

The syntax for the preprocessing script is as follows:

cd /workspace/dlrm/preproc
./prepare_dataset.sh <frequency_threshold> <GPU|CPU> <NVTabular|Spark>

For the Criteo Terabyte dataset, we recommend a frequency threshold of FL=3(when using A100 40GB or V100 32 GB) or FL=2(when using A100 80GB) if you intend to run the hybrid-parallel mode on multiple GPUs. If you want to make the model fit into a single NVIDIA Tesla V100-32GB, you can set FL=15.

The first argument means the frequency threshold to apply to the categorical variables. For a frequency threshold FL, the categorical values that occur less often than FL will be replaced with one special value for each category. Thus, a larger value of FL will require smaller embedding tables and will substantially reduce the overall size of the model.

The second argument is the hardware to use (either GPU or CPU).

The third arguments is a framework to use (either NVTabular or Spark). In case of choosing a CPU preprocessing this argument is omitted as it only Apache Spark is supported on CPU.

The preprocessing scripts make use of the following environment variables to configure the data directory paths:

download_dir – this directory should contain the original Criteo Terabyte CSV files
spark_output_path – directory to which the parquet data will be written
conversion_intermediate_dir – directory used for storing intermediate data used to convert from parquet to train-ready format
final_output_dir – directory to store the final results of the preprocessing which can then be used to train DLRM

In the final_output_dir will be three subdirectories created: train, test, validation, and one json file – model_size.json – containing a maximal index of each category. The train is the train dataset transformed from day_0 to day_22. The test is the test dataset transformed from the prior half of day_23. The validation is the dataset transformed from the latter half of day_23.

The model is tested on 3 datasets resulting from Criteo dataset preprocessing: small (Freqency threshold = 15), large (Freqency threshold = 3) and xlarge (Freqency threshold = 2). Each dataset occupies approx 370GB of disk space. Table below presents information on the supercomputer and GPU count that are needed to train model on particular dataset.

Dataset	GPU VRAM consumption*	Model checkpoint size*	FL setting	DGX A100 40GB, 1GPU	DGX A100 40GB, 8GPU	DGX A100 80GB, 1GPU	DGX A100 80GB, 8GPU	DGX-1** or DGX-2, 1 GPU	DGX-1** or DGX-2, 8GPU	DGX-2, 16GPU
small (FL=15)	20.5	15.0	15	Yes	Yes	Yes	Yes	Yes	Yes	Yes
large (FL=3)	132.3	81.9	3	NA	Yes	NA	Yes	NA	Yes	Yes
xlarge (FL=2)	198.8	141.3	2	NA	NA	NA	Yes	NA	NA	NA

*with default embedding dimension setting **DGX-1 V100 32GB

NVTabular

NVTabular preprocessing is calibrated to run on DGX A100 and DGX-2 AI systems. However, it should be possible to change the values of ALL_DS_MEM_FRAC, TRAIN_DS_MEM_FRAC, TEST_DS_MEM_FRAC, VALID_DS_MEM_FRAC in preproc/preproc_NVTabular.py, so that they'll work on also on other hardware platforms such as DGX-1 or a custom one.

Spark

The script spark_data_utils.py is a PySpark application, which is used to preprocess the Criteo Terabyte Dataset. In the Docker image, we have installed Spark 3.0.1, which will start a standalone cluster of Spark. The scripts run_spark_cpu.sh and run_spark_gpu.sh start Spark, then run several PySpark jobs with spark_data_utils.py.

Note that the Spark job requires about 3TB disk space used for data shuffling.

Spark preprocessing is calibrated to run on DGX A100 and DGX-2 AI systems. However, it should be possible to change the values in preproc/DGX-2_config.sh or preproc/DGX-A100_config.sh so that they'll work on also on other hardware platforms such as DGX-1 or a custom one.

Training process

The main training script resides in dlrm/scripts/main.py. Once the training is completed, it stores the checkpoint in the path specified by --save_checkpoint_path and a JSON training log in --log_path. The quality of the predictions generated by the model is measured by the ROC AUC metric. The speed of training and inference is measured by throughput i.e., the number of samples processed per second. We use mixed precision training with static loss scaling for the bottom and top MLPs while embedding tables are stored in FP32 format.

Inference process

This section describes inference with PyTorch in Python. If you're interested in inference using the Triton Inference Server, refer to https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/DLRM/triton/README.md file.

Two modes for inference are currently supported by the dlrm/scripts/main.py script:

Inference benchmark – this mode will measure and print out throughput and latency numbers for multiple batch sizes. You can activate it by passing the --mode inference_benchmark command line flag. The batch sizes to be tested can be set with the --inference_benchmark_batch_sizes command-line argument.
Test-only – this mode can be used to run a full validation on a checkpoint to measure ROC AUC. You can enable it by passing --mode test.

Deploying DLRM Using NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. More information on how to perform inference using NVIDIA Triton Inference Server can be found in https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/DLRM/triton/README.md.