Nemotron-4-340B-Reward

Description

Nemotron-4-340B-Reward

Publisher

NVIDIA

Latest Version

1.1

Modified

November 12, 2024

Size

635.23 GB

Model Overview

The Nemotron-4-340B-Reward is a multidimensional Reward Model (outputs multiple scalar values) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. Nemotron-4-340B-Reward consists of the Nemotron-4-340B-Base model and a linear layer that converts the final layer representation of the end-of-response token into five scalar values, each corresponding to a HelpSteer attribute. It supports a context length of up to 4,096 tokens.

Given a conversation with multiple turns between user and assistant, it rates the following attributes (between 0 and 4) for every assistant turn.

Helpfulness: Overall helpfulness of the response to the prompt.
Correctness: Inclusion of all pertinent facts without errors.
Coherence: Consistency and clarity of expression.
Complexity: Intellectual depth required to write the response (i.e., whether the response can be written by anyone with basic language competency or requires deep domain expertise).
Verbosity: Amount of detail included in the response, relative to what is asked for in the prompt.

Under the NVIDIA Open Model License, NVIDIA confirms: Models are commercially usable. You are free to create and distribute Derivative Models. NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.

License:

NVIDIA Open Model License

Intended use

Nemotron-4-340B-Reward is a pre-trained Reward Model intended for use in English Synthetic Data Generation and English Reinforcement Learning from AI Feedback (RLAIF).

Nemotron-4-340B-Reward can be used in the alignment stage to align pre-trained models to human preferences. It can also be applied in scenarios such as Reward-Model-as-a-Judge.
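One way to picture the Reward-Model-as-a-Judge scenario is best-of-N selection: several candidate responses to the same prompt are scored, and the one with the highest score on a chosen attribute is kept. The sketch below is illustrative only. The attribute names come from this model card, while the function name and the example scores are hypothetical and are not output from the model.

```python
from typing import Dict, List

# Attribute names as listed in this model card.
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


def pick_best(candidates: List[Dict[str, float]], key: str = "helpfulness") -> int:
    """Return the index of the candidate with the highest score for `key`.

    Hypothetical helper: each candidate is a dict of attribute scores that
    would, in practice, come from the served reward model.
    """
    if key not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {key}")
    if not candidates:
        raise ValueError("no candidates to compare")
    return max(range(len(candidates)), key=lambda i: candidates[i][key])


# Made-up scores for two responses to the same prompt.
scores = [
    {"helpfulness": 2.1, "correctness": 2.4, "coherence": 3.9, "complexity": 1.2, "verbosity": 1.0},
    {"helpfulness": 3.4, "correctness": 3.5, "coherence": 3.8, "complexity": 1.9, "verbosity": 2.2},
]
best_index = pick_best(scores)  # compares on helpfulness by default
```

Ranking on a single attribute keeps the example simple; a weighted combination of attributes would be an equally reasonable design, depending on the use case.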

Model Developer: NVIDIA

Model Dates: Nemotron-4-340B-Reward was trained between December 2023 and May 2024.

Data Freshness: The pretraining data has a cutoff of June 2023.

Required Hardware

BF16 Inference:

16x H100 (2x H100 nodes)
16x A100 80GB (2x A100 80GB nodes)
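A rough back-of-the-envelope calculation suggests why two nodes are listed. It assumes about 340 billion parameters (read off the model name), 2 bytes per parameter in BF16, and 80 GB of memory per GPU. It counts weights only and ignores activations and other runtime overhead.

```python
# Assumption: roughly 340e9 parameters, inferred from the model name.
PARAMS = 340e9
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each value in 16 bits
GPU_MEMORY_GB = 80         # assumed per-GPU memory (80 GB variants)

weights_gb = PARAMS * BYTES_PER_PARAM_BF16 / 1e9  # decimal gigabytes
one_node_gb = 8 * GPU_MEMORY_GB                   # a single 8-GPU node
two_nodes_gb = 16 * GPU_MEMORY_GB                 # the listed 16-GPU setup

print(f"weights alone: {weights_gb:.0f} GB")
print(f"one 8-GPU node: {one_node_gb} GB, two nodes: {two_nodes_gb} GB")
print("fits on one node:", weights_gb <= one_node_gb)
print("fits on two nodes:", weights_gb <= two_nodes_gb)
```

Under these assumptions the weights alone exceed the combined memory of a single 8-GPU node, while two nodes leave headroom for activations.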

Usage:

Nemotron-4-340B-Reward is compatible with NVIDIA NeMo Framework, and you can use the model with NeMo Aligner following the SteerLM training user guide.

Spin up an inference server within the NeMo Aligner container

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
      rm_model_file=Nemotron-4-340B-Reward \
      trainer.num_nodes=2 \
      trainer.devices=8 \
      ++model.tensor_model_parallel_size=8 \
      ++model.pipeline_model_parallel_size=2 \
      inference.micro_batch_size=2 \
      inference.port=1424
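The parallelism settings in the command above multiply out to the number of GPUs requested: 2 nodes x 8 devices = 16 GPUs, and tensor parallel 8 x pipeline parallel 2 = 16 model-parallel ranks. The small check below sketches that arithmetic. It assumes the usual convention for Megatron-style model parallelism, in which the total GPU count must be a multiple of the tensor-parallel size times the pipeline-parallel size; the function name is hypothetical and is not part of NeMo Aligner.

```python
from typing import Tuple


def model_parallel_layout(
    num_nodes: int,
    devices_per_node: int,
    tensor_parallel: int,
    pipeline_parallel: int,
) -> Tuple[bool, int]:
    """Return (fits, replicas) for a given GPU and parallelism layout.

    `fits` is True when the GPU count is a multiple of the model-parallel
    size; `replicas` is how many full copies of the model that allows.
    """
    world_size = num_nodes * devices_per_node
    model_parallel_size = tensor_parallel * pipeline_parallel
    fits = world_size % model_parallel_size == 0
    replicas = world_size // model_parallel_size
    return fits, replicas


# Values taken from the serve command above.
fits, replicas = model_parallel_layout(
    num_nodes=2, devices_per_node=8, tensor_parallel=8, pipeline_parallel=2
)
print(fits, replicas)
```

If the number of nodes or devices is changed, the two parallel sizes need to be adjusted so that this check still passes.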

Annotate data files using the served reward model. As an example, this can be the Open Assistant train/val files. Then follow the next step in the SteerLM training user guide to train a SteerLM model.

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train.jsonl \
      --output-file=data/oasst/train_labeled.jsonl \
      --port=1424

Alternatively, this can be any conversational data file (in .jsonl) in the following format, where each line looks like

{
    "conversations": [
              {"value": <user_turn_1>, "from": "User", "label": None},
              {"value": <assistant_turn_1>, "from": "Assistant", "label": <formatted_label_1>},
              {"value": <user_turn_2>, "from": "User", "label": None},
              {"value": <assistant_turn_2>, "from": "Assistant", "label": <formatted_label_2>},
          ],
    "mask": "User"
}

Ideally, each <formatted_label_n> is the ground truth label for the assistant turn. If ground truth labels are not available, we can use helpfulness:4,correctness:4,coherence:4,complexity:2,verbosity:2 (i.e., defaulting to moderate complexity and verbosity; adjust if needed) or simply helpfulness:-1. The label must not be None or an empty string.
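As a minimal sketch, a line in this format could be produced with a few lines of Python. The helper name `make_record` is hypothetical. The sketch also assumes that the `None` shown for user turns is written as JSON `null` in the .jsonl file, which is what `json.dumps` emits for Python's `None`.

```python
import json
from typing import Dict, List, Tuple

# Fallback label quoted from this model card, for data without ground truth.
DEFAULT_LABEL = "helpfulness:4,correctness:4,coherence:4,complexity:2,verbosity:2"


def make_record(turns: List[Tuple[str, str]], label: str = DEFAULT_LABEL) -> Dict:
    """Build one conversation record from (user_text, assistant_text) pairs.

    The same label is applied to every assistant turn; pass per-turn ground
    truth instead if it is available.
    """
    if not label:
        # The card states the assistant label must not be None or empty.
        raise ValueError("assistant label must not be None or an empty string")
    conversations = []
    for user_text, assistant_text in turns:
        conversations.append({"value": user_text, "from": "User", "label": None})
        conversations.append({"value": assistant_text, "from": "Assistant", "label": label})
    return {"conversations": conversations, "mask": "User"}


# One record per line is the .jsonl convention.
line = json.dumps(make_record([("What is 2 + 2?", "2 + 2 equals 4.")]))
print(line)
```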

Model Architecture:

Nemotron-4-340B-Reward is extended from Nemotron-4-340B-Base with an additional linear layer. It was trained with a global batch-size of 128.

Architecture Type: Transformer Decoder (auto-regressive language model)
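The additional linear layer can be pictured as a single matrix-vector product applied to the final-layer hidden state of the end-of-response token, yielding one scalar per attribute. The sketch below uses plain Python lists and a toy hidden size of 8 with random weights; none of these numbers reflect the real model, and the last position of the sequence stands in for the end-of-response token.

```python
import random
from typing import List, Sequence

ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


def reward_head(
    hidden_states: Sequence[Sequence[float]],  # shape [seq_len][hidden_size]
    weight: Sequence[Sequence[float]],         # shape [num_attributes][hidden_size]
    bias: Sequence[float],                     # shape [num_attributes]
) -> List[float]:
    """Apply a linear layer to the hidden state of the last token."""
    last = hidden_states[-1]  # stand-in for the end-of-response token
    return [
        sum(w * h for w, h in zip(row, last)) + b
        for row, b in zip(weight, bias)
    ]


# Toy sizes and random values, purely for illustration.
rng = random.Random(0)
HIDDEN_SIZE, SEQ_LEN = 8, 4
hidden_states = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(SEQ_LEN)]
weight = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in ATTRIBUTES]
bias = [0.0] * len(ATTRIBUTES)

scores = dict(zip(ATTRIBUTES, reward_head(hidden_states, weight, bias)))
```

With random weights the resulting numbers are meaningless; the point is only the shape of the computation, which maps one hidden vector to five scalars.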

Dataset & Training

Nemotron-4-340B-Reward was trained for 2 epochs using the NVIDIA HelpSteer2 data. The HelpSteer2 dataset is a permissively licensed preference dataset (CC-BY-4.0) with ten thousand English response pairs and can be found here.

Evaluation Results

Reward Bench Primary Dataset

Evaluated using RewardBench - as introduced in the paper RewardBench: Evaluating Reward Models for Language Modeling.

Overall	Chat	Chat-Hard	Safety	Reasoning
92.0	95.8	87.1	91.5	93.7
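As a quick consistency check on the table above, the Overall figure matches the unweighted mean of the four category scores after rounding to one decimal place. This is only an arithmetic observation about the reported numbers, not a statement of how the RewardBench leaderboard computes its Overall column.

```python
# Category scores copied from the table above.
category_scores = {
    "Chat": 95.8,
    "Chat-Hard": 87.1,
    "Safety": 91.5,
    "Reasoning": 93.7,
}

mean_score = sum(category_scores.values()) / len(category_scores)
print(round(mean_score, 1))  # compare with the reported Overall of 92.0
```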

Limitations

The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.