HugeCTR Release Container


HugeCTR is a high-efficiency GPU framework designed for training Click-Through-Rate (CTR) estimation models. HugeCTR is a component of NVIDIA Merlin, a recommender system framework in open beta.



Latest Tag



April 4, 2023

Compressed Size

5.27 GB

Multinode Support


Multi-Arch Support


v2.3 (Latest) Security Scan Results

No results available.

This container is deprecated. Please use one of the HugeCTR compatible containers found here.

HugeCTR is a recommender-specific framework capable of distributed training across multiple GPUs and nodes for Click-Through-Rate (CTR) estimation. It is a component of NVIDIA Merlin, an open beta framework that accelerates the entire pipeline, from data ingestion and training to deploying GPU-accelerated recommender systems. HugeCTR supports model-parallel embedding tables and data-parallel neural networks and their variants, such as Wide and Deep Learning (WDL), Deep Cross Network (DCN), DeepFM, and Deep Learning Recommendation Model (DLRM). For additional information, see the HugeCTR User Guide.

Design Goals:

  • Fast: HugeCTR is a speed-of-light CTR model framework.
  • Dedicated: HugeCTR provides the essentials so that you can train your CTR model in an efficient manner.
  • Easy: Regardless of whether you are a data scientist or machine learning practitioner, we've made it easy for anybody to use HugeCTR.

Release Notes

What's New in Version 2.3

We’ve implemented the following enhancements to improve usability and performance:

  • Python Interface: To enhance the interoperability with NVTabular and other Python-based libraries, we're introducing a new Python interface for HugeCTR. If you are already using HugeCTR with JSON, the transition to Python will be seamless for you as you'll only have to locate the file and set the PYTHONPATH environment variable. You can still configure your model in your JSON config file, but the training options such as batch_size must be specified through hugectr.solver_parser_helper() in Python. For additional information regarding how to use the HugeCTR Python API and comprehend its API signature, see our Jupyter Notebook tutorial.

  • HugeCTR Embedding with TensorFlow: To help users easily integrate HugeCTR’s optimized embedding into their TensorFlow workflow, we now offer the HugeCTR embedding layer as a TensorFlow plugin. To better understand how to install, use, and verify it, see our Jupyter notebook tutorial. It also demonstrates how you can create a new Keras layer, EmbeddingLayer, based on the helper code that we provide.

  • Model Oversubscription: To support models whose large embedding tables exceed a single GPU's memory limit, we added a new model prefetching feature that lets you load a subset of an embedding table into the GPU in a coarse-grained, on-demand manner during the training stage. To use this feature, you need to split your dataset into multiple sub-datasets while extracting the unique key sets from them. This feature can currently only be used with the Norm dataset format and its corresponding file list. It will eventually support all embedding types and dataset formats. We revised our criteo2hugectr tool to support key set extraction for the Criteo dataset. For additional information, see our Python Jupyter Notebook to learn how to use this feature with the Criteo dataset. Please note that the Criteo dataset is a common use case, but model prefetching is not limited to this dataset.

  • Enhanced AUC Implementation: To enhance the performance of our AUC computation on multi-node environments, we redesigned our AUC implementation to improve how the computational load gets distributed across nodes.

  • Epoch-Based Training: In addition to max_iter, a HugeCTR user can set num_epochs in the Solver clause of their JSON config file. This mode can only currently be used with Norm dataset formats and their corresponding file lists. All dataset formats will be supported in the future.

  • Multi-Node Training Tutorial: To better support multi-node training use cases, we added a new step-by-step tutorial.

  • Power Law Distribution Support with Data Generator: Because of the increased need for generating a random dataset whose categorical features follow the power-law distribution, we revised our data generation tool to support this use case. For additional information, refer to the --long-tail description here.

  • Multi-GPU Preprocessing Script for Criteo Samples: Multiple GPUs can now be used when preparing the dataset for our samples. For additional information, see how is used to preprocess the Criteo dataset for DCN, DeepFM, and W&D samples.
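
As a rough illustration of the Epoch-Based Training item above, the sketch below adds num_epochs to the solver clause of a config that has already been loaded as a Python dictionary. The top-level key name "solver", the helper name set_num_epochs, and the sample values are assumptions made for illustration and are not taken from the HugeCTR documentation, so check them against your own config file.

```python
import copy
import json


def set_num_epochs(config, num_epochs):
    """Return a copy of `config` with num_epochs set in its solver clause.

    Assumption: the solver clause lives under the top-level key "solver".
    """
    if num_epochs < 1:
        raise ValueError("num_epochs must be a positive integer")
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    updated.setdefault("solver", {})["num_epochs"] = num_epochs
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed config; real files contain many more fields.
    sample = {"solver": {"max_iter": 10000}}
    print(json.dumps(set_num_epochs(sample, 5), indent=2))
```

Returning a modified copy keeps the original config available, which may be convenient when comparing iteration-based and epoch-based runs.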

Known Issues

  • Since the automatic plan file generator is not able to handle systems that contain one GPU, a user must manually create a JSON plan file with the following parameters and give it the name listed in the HugeCTR configuration file: {"type": "all2all", "num_gpus": 1, "main_gpu": 0, "num_steps": 1, "num_chunks": 1, "plan": [[0, 0]], "chunks": [1]}.
  • If using a system that contains two GPUs with two NVLink connections, the auto plan file generator will print the following warning message: RuntimeWarning: divide by zero encountered in true_divide. This is an erroneous warning message and should be ignored.
  • The current plan file generator doesn't support a system where the NVSwitch or a full peer-to-peer connection between all nodes is unavailable.
  • Users need to set the CUDA_DEVICE_ORDER environment variable (export CUDA_DEVICE_ORDER=PCI_BUS_ID) to ensure that the CUDA runtime and driver have a consistent GPU numbering.
  • LocalizedSlotSparseEmbeddingOneHot only supports a single-node machine where all the GPUs are fully connected such as NVSwitch.
  • HugeCTR version 2.2.1 crashes when running our DLRM sample on DGX2 due to a CUDA Graph issue. To run the sample on DGX2, disable the use of CUDA Graph with "cuda_graph": false even if it degrades the performance a bit. We are working on fixing this issue. This issue doesn't exist when using the DGX A100.
  • The model prefetching feature is only available in Python. Currently, a user can only use this feature with the DistributedSlotSparseEmbeddingHash embedding and the Norm dataset format on single GPUs. This feature will eventually support all embedding types and dataset formats.
  • The HugeCTR embedding TensorFlow plugin only works with single-node machines.
  • The HugeCTR embedding TensorFlow plugin assumes that the input keys are in int64 and its output is in float.
  • When using our embedding plugin, please note that the fprop_v3 function, which is available in, only works with DistributedSlotSparseEmbeddingHash.
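
For the single-GPU known issue above, the manual plan file can be written with a few lines of Python. This is only a sketch: the plan parameters are copied from that item, while the helper name write_single_gpu_plan and any output file name you pass to it are hypothetical. The file name must match the one listed in your HugeCTR configuration file.

```python
import json

# Plan parameters copied from the single-GPU known issue.
SINGLE_GPU_PLAN = {
    "type": "all2all",
    "num_gpus": 1,
    "main_gpu": 0,
    "num_steps": 1,
    "num_chunks": 1,
    "plan": [[0, 0]],
    "chunks": [1],
}


def write_single_gpu_plan(path):
    """Write the single-GPU plan to `path` as JSON (hypothetical helper)."""
    with open(path, "w") as plan_file:
        json.dump(SINGLE_GPU_PLAN, plan_file)
```

Call write_single_gpu_plan with the plan file name that your configuration file expects.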

Quickstart Guide for HugeCTR on NGC

The quickstart guide will help you launch the HugeCTR Container.

1. Launch the Container

Before running the container, use docker pull to ensure an up-to-date image is installed.

  1. In the Tags tab, find the tag of an image which you want to use.

  2. From the Pull Command section, copy the docker pull ... command by clicking the icon on the right side. Then, replace the default tag with the tag you decided to use, e.g., hugectr:.

  3. Execute the pull command on the host machine. It may take several minutes to pull the container image. Once it is finished, you are ready to launch your container.

  4. Launch the container.

If you have Docker 19.03 or later, a typical command to launch the container is:

docker run --gpus all --rm -u $(id -u):$(id -g) -v $(pwd):/hugectr -w /hugectr huge_ctr --train /hugectr/dlrm_fp16_1gpu.json

If you have Docker 19.02 or earlier, a typical command to launch the container is:

nvidia-docker run --rm -u $(id -u):$(id -g) -v $(pwd):/hugectr -w /hugectr huge_ctr --train /hugectr/dlrm_fp16_1gpu.json


  • --rm will delete the container when finished

  • -u $(id -u):$(id -g) makes the container run as the same user as on the host machine

  • -v $(pwd):/hugectr mounts the current host directory into the container for importing and exporting data

  • :v2.3 chooses the v2.3 version of the HugeCTR Container

  • -w /hugectr sets the working directory inside the container to /hugectr

  • huge_ctr --train performs HugeCTR model training inside the container

  • /hugectr/dlrm_fp16_1gpu.json is the absolute path of the JSON file inside the container

    NOTE: If you mount directories into the container with -v for importing and exporting data, double-check that the JSON config file path you specify is the one visible inside the container. For example, if the real path of a JSON config file is /first/second/dlrm_fp16_1gpu.json, the path inside the container with the mount (-v /first/second/:/hugectr) is /hugectr/dlrm_fp16_1gpu.json. The absolute path of the training data must follow the same rule.
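
The path-mapping rule in the NOTE, and the choice between the two launch commands, can be sketched in Python as follows. The helper names are hypothetical, and the image argument is a placeholder that you must replace with the image reference copied from the Pull Command section. The sketch only assembles the argument list and does not start Docker.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_dir, container_dir):
    """Translate a host path under `host_dir` to its path under the mount."""
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(host_dir))
    return str(PurePosixPath(container_dir) / relative)


def uses_gpus_flag(docker_version):
    """Return True for Docker 19.03 or later ("major.minor[.patch]")."""
    major, minor = (int(part) for part in docker_version.split(".")[:2])
    return (major, minor) >= (19, 3)


def build_launch_command(docker_version, image, host_dir, host_config, uid, gid):
    """Assemble the training launch command as an argument list."""
    if uses_gpus_flag(docker_version):
        prefix = ["docker", "run", "--gpus", "all"]
    else:
        prefix = ["nvidia-docker", "run"]
    container_config = to_container_path(host_config, host_dir, "/hugectr")
    return prefix + [
        "--rm",
        "-u", "{}:{}".format(uid, gid),
        "-v", "{}:/hugectr".format(host_dir),
        "-w", "/hugectr",
        image,  # placeholder: the image reference you pulled
        "huge_ctr", "--train", container_config,
    ]
```

Because to_container_path raises ValueError when the config file is not under the mounted directory, a wrong path is caught before the container is launched.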

For more instructions, see the HugeCTR Mainpage.

2. Train on a multi-node environment

To train models using multiple nodes, please refer to the tutorial.

3. Train by using the Python interface

If you'd like to quickly train a model using the Python interface, follow the steps below. For more information, please refer to the HugeCTR Mainpage.

  1. Start the container with your local host directory (/your/host/dir) mounted by running the following command:

    docker run --runtime=nvidia --rm -v /your/host/dir:/your/container/dir -w /your/container/dir -it -u $(id -u):$(id -g) -it

    NOTE: The /your/host/dir directory is visible inside the container as /your/container/dir, which is also your starting directory.

  2. Inside the container, copy the DCN JSON config file to our mounted directory (/your/container/dir).

    This config file specifies the DCN model architecture and its optimizer. With any Python use case, the solver clause within the config file is not used at all.

  3. Generate a synthetic dataset based on the config file by running the following command:

    data_generator ./dcn.json ./dataset_dir 434428 1

    The following files are created: ./file_list.txt, ./file_list_test.txt, and ./dataset_dir/*.

  4. Write a simple Python script using the hugectr module as shown here:

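    import sys
    import hugectr
    from mpi4py import MPI

    def train(json_config_file):
      # Training options are set here rather than in the solver clause of the JSON file.
      solver_config = hugectr.solver_parser_helper(batchsize = 16384,
                                                   batchsize_eval = 16384,
                                                   vvgpu = [[0,1,2,3,4,5,6,7]],
                                                   repeat_dataset = True)
      sess = hugectr.Session(solver_config, json_config_file)
      sess.start_data_reading()
      for i in range(10000):
        sess.train()  # run one training iteration
        if (i % 100 == 0):
          # Report the loss every 100 iterations.
          loss = sess.get_current_loss()
          print("[HUGECTR][INFO] iter: {}; loss: {}".format(i, loss))

    if __name__ == "__main__":
      # The JSON config file path is the first command-line argument.
      json_config_file = sys.argv[1]
      train(json_config_file)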

    NOTE: Update the vvgpu (the active GPUs), batchsize, and batchsize_eval parameters according to your GPU system.

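  5. Train the model by passing the config file to your script, where your_script.py is a placeholder for the file in which you saved the code from step 4:

    python your_script.py dcn.json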

4. Synthetic Data Generation and Benchmark

For quick benchmarking and research use, you can generate a synthetic dataset as shown below, without any additional modification to your JSON file. Datasets in both Norm format (with header) and Raw format (without header) can be generated with data_generator.

  • For Norm format:
    $ ./data_generator your_config.json data_folder vocabulary_size max_nnz #files #samples per file
    $ ./huge_ctr --train your_config.json
  • For Raw format:
    $ ./data_generator your_config.json
    $ ./huge_ctr --train your_config.json


  • data_folder: You have to specify the folder to store the generated data
  • vocabulary_size: Vocabulary size of your target dataset
  • max_nnz: [1, max_nnz] values will be generated for each feature (slot) in the dataset. Note that max_nnz*slot_num should be less than max_feature_num in your data layer.
  • #files: number of data files to generate (optional)
  • #samples per file: number of samples per file (optional)
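
The Norm-format argument order and the max_nnz constraint described above can be checked before invoking the tool. The following is a minimal sketch: the helper name norm_generator_args is hypothetical, and slot_num and max_feature_num are assumed to be values that you read from the data layer of your own config file.

```python
def norm_generator_args(config, data_folder, vocabulary_size, max_nnz,
                        slot_num, max_feature_num,
                        num_files=None, samples_per_file=None):
    """Build the data_generator argument list for the Norm format."""
    if max_nnz < 1:
        raise ValueError("max_nnz must be at least 1")
    # Constraint from the parameter list: max_nnz * slot_num < max_feature_num.
    if max_nnz * slot_num >= max_feature_num:
        raise ValueError("max_nnz * slot_num must be less than max_feature_num")
    # The optional arguments are positional, so #files must come first.
    if samples_per_file is not None and num_files is None:
        raise ValueError("samples_per_file requires num_files")

    args = ["./data_generator", config, data_folder,
            str(vocabulary_size), str(max_nnz)]
    if num_files is not None:
        args.append(str(num_files))
    if samples_per_file is not None:
        args.append(str(samples_per_file))
    return args
```

The returned list can be passed to subprocess.run once data_generator is available in the working directory.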

Supported Compute Capabilities

Compute Capability    GPU
60                    NVIDIA P100 (Pascal)
70                    NVIDIA V100 (Volta)
75                    NVIDIA T4 (Turing)
80                    NVIDIA A100 (Ampere)

Technical Support

If you encounter any issues and/or have questions, please file an issue here so that we can provide you with the necessary resolutions and answers. To further advance the Merlin/HugeCTR Roadmap, we encourage you to share all the details regarding your recommender system pipeline using this survey.

Suggested Reading

HugeCTR User Guide:

Questions and Answers:

Sample models and their end-to-end instructions:

NVIDIA Developer Blog: Announcing NVIDIA Merlin: An Application Framework for Deep Recommender Systems