DLRM for PyTorch | NVIDIA NGC

The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs.

To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of DLRM on the Criteo Terabyte dataset. For the specifics concerning training and inference, see the Advanced section.

Clone the repository.

git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Recommendation/DLRM

Download the dataset.

Download the data by following the instructions at http://labs.criteo.com/2013/12/download-terabyte-click-logs/. After you have downloaded and unpacked it, set CRITEO_DATASET_PARENT_DIRECTORY to the directory that contains the unpacked dataset:

CRITEO_DATASET_PARENT_DIRECTORY=/raid/dlrm

We recommend storing the dataset on the fastest file system available; a slow one can turn I/O into the bottleneck.
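Later steps mount this directory into the containers, so it can help to verify it before going further. The following is a minimal sketch, not part of the repository: check_dataset_dir is a hypothetical helper name, and it only checks that the path is non-empty and is an existing directory.

```shell
# Hypothetical helper (not part of the DLRM repository):
# fail early if the dataset parent directory is unset or missing.
check_dataset_dir() {
    if [ -z "$1" ]; then
        echo "CRITEO_DATASET_PARENT_DIRECTORY is not set" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "not a directory: $1" >&2
        return 1
    fi
    echo "ok: $1"
}

# Example call:
#   check_dataset_dir "$CRITEO_DATASET_PARENT_DIRECTORY"
```

With the value used above, the call prints ok: /raid/dlrm when that directory exists and returns a non-zero status otherwise.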

Build the DLRM Docker containers.

docker build -t nvidia_dlrm_pyt .
docker build -t nvidia_dlrm_preprocessing -f Dockerfile_preprocessing . --build-arg DGX_VERSION=[DGX-2|DGX-A100]
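The bracketed DGX_VERSION=[DGX-2|DGX-A100] above is a placeholder: pass exactly one of the two values. As a rough sketch, the choice could be validated before building. The function name preproc_build_cmd is hypothetical, and the command is printed rather than executed.

```shell
# Hypothetical helper: print the preprocessing-image build command
# for one of the two DGX_VERSION values named above.
preproc_build_cmd() {
    case "$1" in
        DGX-2|DGX-A100)
            printf '%s\n' "docker build -t nvidia_dlrm_preprocessing -f Dockerfile_preprocessing . --build-arg DGX_VERSION=$1"
            ;;
        *)
            echo "unsupported DGX_VERSION: $1 (expected DGX-2 or DGX-A100)" >&2
            return 1
            ;;
    esac
}

# Example call:
#   preproc_build_cmd DGX-A100
```

Printing the command first makes it easy to inspect before running it by hand.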

Start an interactive session in the preprocessing container (nvidia_dlrm_preprocessing, built in the previous step). It can be launched with:

docker run --runtime=nvidia -it --rm --ipc=host  -v ${CRITEO_DATASET_PARENT_DIRECTORY}:/data/dlrm nvidia_dlrm_preprocessing bash

Preprocess the dataset.

Here are a few examples of preprocessing commands. Out of the box, preprocessing is supported on DGX-2 and DGX A100 systems. For details on how the scripts work, a description of the dataset types (small FL=15, large FL=3, xlarge FL=2), system requirements, setup instructions for different systems, and the full list of parameters, consult the preprocessing section. For an explanation of the FL parameter, see the Dataset Guidelines and Preprocessing sections.

Depending on the dataset type (small FL=15, large FL=3, xlarge FL=2), run one of the following commands:

4.1. Preprocess to small dataset (FL=15) with Spark GPU:

cd /workspace/dlrm/preproc
./prepare_dataset.sh 15 GPU Spark

4.2. Preprocess to large dataset (FL=3) with Spark GPU:

cd /workspace/dlrm/preproc
./prepare_dataset.sh 3 GPU Spark

4.3. Preprocess to xlarge dataset (FL=2) with Spark GPU:

cd /workspace/dlrm/preproc
./prepare_dataset.sh 2 GPU Spark
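The three variants above differ only in the first argument to prepare_dataset.sh. As a small sketch of that mapping, the helpers below translate a dataset type into its FL value and print the matching command. The names fl_for_dataset and preprocess_cmd are hypothetical and not part of the repository.

```shell
# Hypothetical helper: map a dataset type to the FL value listed above.
fl_for_dataset() {
    case "$1" in
        small)  echo 15 ;;
        large)  echo 3 ;;
        xlarge) echo 2 ;;
        *)
            echo "unknown dataset type: $1" >&2
            return 1
            ;;
    esac
}

# Print (do not run) the Spark GPU preprocessing command for a dataset type.
preprocess_cmd() {
    fl=$(fl_for_dataset "$1") || return 1
    echo "./prepare_dataset.sh $fl GPU Spark"
}

# Example call, from /workspace/dlrm/preproc inside the container:
#   preprocess_cmd large
```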

Start training.

First, start the Docker container (the --security-opt seccomp=unconfined option is needed to take full advantage of processor affinity in multi-GPU training):

docker run --security-opt seccomp=unconfined --runtime=nvidia -it --rm --ipc=host  -v ${PWD}/data:/data nvidia_dlrm_pyt bash

single-GPU:

python -m dlrm.scripts.main --mode train --dataset /data/dlrm/binary_dataset/ --amp --cuda_graphs

multi-GPU for DGX A100:

python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
          bash  -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.main \
          --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp --cuda_graphs'

multi-GPU for DGX-1 and DGX-2:

python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
          bash  -c './bind.sh  --cpu=exclusive -- python -m dlrm.scripts.main \
          --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp --cuda_graphs'
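The three training commands above share the same entry point and dataset path and differ mainly in the launcher and the bind.sh arguments. The sketch below collects them in one place and prints the command for a given system instead of running it. The function name train_cmd and the labels single, dgxa100, dgx1, and dgx2 are hypothetical conveniences, not options of the repository.

```shell
# Hypothetical helper: print the training command for a given system label.
train_cmd() {
    main="python -m dlrm.scripts.main"
    data="--dataset /data/dlrm/binary_dataset/"
    launch="python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8"
    opts="--seed 0 --epochs 1 --amp --cuda_graphs"
    case "$1" in
        single)
            printf '%s\n' "$main --mode train $data --amp --cuda_graphs"
            ;;
        dgxa100)
            printf '%s\n' "$launch bash -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh $main $data $opts'"
            ;;
        dgx1|dgx2)
            printf '%s\n' "$launch bash -c './bind.sh --cpu=exclusive -- $main $data $opts'"
            ;;
        *)
            echo "unknown system: $1" >&2
            return 1
            ;;
    esac
}

# Example call:
#   train_cmd dgxa100
```

Keeping the shared parts in variables makes the per-system differences (the bind.sh CPU and memory binding arguments) easy to spot.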

Start validation/evaluation. If you want to run validation or evaluation, you can either:

use the checkpoint obtained from the training commands above, or
download the pretrained checkpoint from NGC.

To download the checkpoint from NGC, visit the ngc.nvidia.com website and browse the available models. Download the checkpoint files and unzip them to a path of your choice, for example, $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/. The checkpoint requires around 15 GB of disk space.
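Since the checkpoint needs roughly 15 GB, checking free space on the target file system before downloading may save a failed unzip. This is a minimal sketch using the POSIX df utility. The helper names avail_kb and enough_kb are hypothetical, and treating "15 GB" as 15 GiB is an assumption.

```shell
# Hypothetical helper: available space, in KiB, on the file system holding $1.
avail_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Hypothetical helper: succeed if $1 KiB is at least $2 GiB.
enough_kb() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example call (the 15 GiB threshold is an assumption based on "around 15 GB"):
#   enough_kb "$(avail_kb "$CRITEO_DATASET_PARENT_DIRECTORY")" 15 \
#       || echo "less than 15 GiB free" >&2
```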

Commands:

single-GPU:

python -m dlrm.scripts.main --mode test --dataset /data/dlrm/binary_dataset/ --load_checkpoint_path $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/checkpoint

multi-GPU for DGX A100:

python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
          bash  -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.main \
          --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp --load_checkpoint_path $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/checkpoint'

multi-GPU for DGX-1 and DGX-2:

python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
          bash  -c './bind.sh  --cpu=exclusive -- python -m dlrm.scripts.main \
          --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp --load_checkpoint_path $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/checkpoint'