NVIDIA Retriever and RAG Eval

  • Description: NVIDIA Evals Factory-compatible container with Retriever and RAG Eval support
  • Publisher: NVIDIA
  • Latest Tag: 25.05.1
  • Modified: June 24, 2025
  • Compressed Size: 287.97 MB
  • Multinode Support: No
  • Multi-Arch Support: No

RAG and Retriever Evaluation Toolkit

This toolkit provides a standardized way to run RAG (Retrieval Augmented Generation) and Retriever evaluations.

Index

  • Overview
  • Quick Start Guide
  • Advanced Usage
    • Evaluation Job Config
    • Metrics
  • How Evaluation Works
  • Troubleshooting

Overview

This toolkit allows users to evaluate:

  • Retriever Pipelines: Assess the performance of document retrieval systems.
  • RAG Pipelines: Evaluate end-to-end RAG systems, including retrieval, context augmentation, and language model generation.

Evaluations are configured using YAML files, specifying datasets, models, metrics, and other parameters.

Quick Start Guide

Prerequisites

  • Python 3.10+
  • Necessary API keys for model endpoints (e.g., NVIDIA API Catalog)
  • Access to a running Milvus instance if your retriever configuration in the evaluation job file specifies Milvus as the backend (e.g., uses milvus_host, milvus_port).

Launching an evaluation task

  1. Prepare your Evaluation Job File(s): You'll need a YAML configuration file for your evaluation. See Evaluation Job Config for details.

  2. Run the evaluation:

    There are two main ways to run the evaluation:

    Option 1: Running inside a benchmark container (using Docker)

    # Replace YOUR_IMAGE_ID_OR_TAG with your actual Docker image ID or tag
    docker run \
      --network milvus-network \
      -e NVIDIA_API_KEY="YOUR_NVIDIA_API_KEY" \
      YOUR_IMAGE_ID_OR_TAG \
      nv_eval run_eval --run_config /workspace/tests/rag/test_tc9_answer_generation_answer_evaluation_rags.yml --output_dir /workspace/results
    
    • --network milvus-network: Connects the container to a Docker network named milvus-network. This is necessary if your evaluation uses a Milvus instance running in another Docker container on this network.
      • Alternatively, you can configure the file-based version of the Milvus vector database, e.g. see eval_factory_pipeline/tests/rag/test_tc5_e2e_squad.yml.
    • -e NVIDIA_API_KEY="YOUR_NVIDIA_API_KEY": Passes your NVIDIA API key as an environment variable to the container. Replace "YOUR_NVIDIA_API_KEY" with your actual key.
    • YOUR_IMAGE_ID_OR_TAG: The ID or tag of your evaluation Docker image.
    • The nv_eval run_eval command is then executed inside the container, referencing the configuration file bundled in the image under /workspace/tests.

    Option 2: Running in a local Python environment

    Use the core-evals-rag-retriever-eval command-line interface.

    core-evals-rag-retriever-eval run_eval --run_config eval_factory_pipeline/tests/retriever/test_embedding_only_nfcorpus.yml --output_dir output/test_embedding_only_nfcorpus
    

Checking results

Results will be saved to the <output_dir>/results.yml file. The structure and content of the results will depend on the evaluation type (RAG or Retriever) and the metrics configured.

Example output structure:

command: "export API_KEY=$NVIDIA_API_KEY &&  rag_retriever_eval \\\n  --api_endpoint\
  \ '{\"api_key\": \"NVIDIA_API_KEY\", \"model_id\": \"nvdev/meta/llama-3.1-8b-instruct\"\
  , \"stream\": null, \"type\": \"chat\", \"url\": \"https://integrate.api.nvidia.com/v1\"\
  }' \\\n  --output_dir output/test_tc5_final \\\n  --pipeline '{\"context_ordering\"\
  : \"desc\", \"params\": {\"prompt_template_path\": \"/workspace/tests/rag/templates/prompt_template.jinja\"\
  }, \"retriever\": {\"pipeline\": {\"index_embedding_model\": {\"api_endpoint\":\
  \ {\"api_key\": null, \"format\": \"nim\", \"model_id\": \"nvidia/nv-embedqa-e5-v5\"\
  , \"url\": \"https://integrate.api.nvidia.com/v1\"}}, \"params\": {\"component_inputs_template\"\
  : \"{\\\"embedder\\\": {\\\"text\\\": \\\"${query}\\\"} }\", \"index_pipeline_yaml_file\"\
  : \"/workspace/tests/retriever/templates/dense_only/milvus_index_nim.yaml\"\
  , \"milvus_collection_name\": \"rag_test\", \"milvus_host\": \"localhost\", \"milvus_password\"\
  : \"\", \"milvus_port\": \"19530\", \"query_pipeline_yaml_file\": \"/workspace/tests/retriever/templates/dense_only/milvus_query_nim.yaml\"\
  }, \"query_embedding_model\": {\"api_endpoint\": {\"api_key\": null, \"format\"\
  : \"nim\", \"model_id\": \"nvidia/nv-embedqa-e5-v5\", \"url\": \"https://integrate.api.nvidia.com/v1\"\
  }}, \"top_k\": 10}}}' \\\n  --tasks '{\"rag\": {\"dataset\": {\"format\": \"squad\"\
  , \"path\": \"/workspace/tests/datasets/fiqa_synthetic_squad.json\"\
  }, \"metrics\": {\"rag_answer_relevancy\": {\"params\": {}, \"type\": \"ragas\"\
  }, \"rag_faithfulness\": {\"params\": {}, \"type\": \"ragas\"}, \"retriever_ndcg_cut_10\"\
  : {\"params\": {}, \"type\": \"pytrec_eval\"}, \"retriever_ndcg_cut_5\": {\"params\"\
  : {}, \"type\": \"pytrec_eval\"}, \"retriever_recall_10\": {\"params\": {}, \"type\"\
  : \"pytrec_eval\"}, \"retriever_recall_5\": {\"params\": {}, \"type\": \"pytrec_eval\"\
  }}, \"params\": {\"judge_embeddings\": \"nvidia/nv-embedqa-e5-v5\", \"judge_embeddings_api_key\"\
  : null, \"judge_embeddings_url\": \"https://integrate.api.nvidia.com/v1\", \"judge_llm\"\
  : \"nvdev/meta/llama-3.1-8b-instruct\", \"judge_llm_api_key\": null, \"judge_llm_url\"\
  : \"https://integrate.api.nvidia.com/v1\", \"judge_max_retries\": 5, \"judge_max_workers\"\
  : 2, \"judge_request_timeout\": 120}, \"type\": \"rag\"}}' \\\n  --type rag"
config:
  output_dir: output/test_tc5_final
  params:
    extra:
      pipeline:
        context_ordering: desc
        params:
          prompt_template_path: /workspace/tests/rag/templates/prompt_template.jinja
        retriever:
          pipeline:
            index_embedding_model:
              api_endpoint:
                api_key: null
                format: nim
                model_id: nvidia/nv-embedqa-e5-v5
                url: https://integrate.api.nvidia.com/v1
            params:
              component_inputs_template: '{"embedder": {"text": "${query}"} }'
              index_pipeline_yaml_file: /workspace/tests/retriever/templates/dense_only/milvus_index_nim.yaml
              milvus_collection_name: rag_test
              milvus_host: localhost
              milvus_password: ''
              milvus_port: '19530'
              query_pipeline_yaml_file: /workspace/tests/retriever/templates/dense_only/milvus_query_nim.yaml
            query_embedding_model:
              api_endpoint:
                api_key: null
                format: nim
                model_id: nvidia/nv-embedqa-e5-v5
                url: https://integrate.api.nvidia.com/v1
            top_k: 10
      tasks:
        rag:
          dataset:
            format: squad
            path: /workspace/tests/datasets/fiqa_synthetic_squad.json
          metrics:
            rag_answer_relevancy:
              params: {}
              type: ragas
            rag_faithfulness:
              params: {}
              type: ragas
            retriever_ndcg_cut_10:
              params: {}
              type: pytrec_eval
            retriever_ndcg_cut_5:
              params: {}
              type: pytrec_eval
            retriever_recall_10:
              params: {}
              type: pytrec_eval
            retriever_recall_5:
              params: {}
              type: pytrec_eval
          params:
            judge_embeddings: nvidia/nv-embedqa-e5-v5
            judge_embeddings_api_key: null
            judge_embeddings_url: https://integrate.api.nvidia.com/v1
            judge_llm: nvdev/meta/llama-3.1-8b-instruct
            judge_llm_api_key: null
            judge_llm_url: https://integrate.api.nvidia.com/v1
            judge_max_retries: 5
            judge_max_workers: 2
            judge_request_timeout: 120
          type: rag
  type: rag
git_hash: null
results:
  groups: {}
  tasks:
    rag:
      metrics:
        rag_answer_relevancy:
          scores:
            answer_relevancy:
              stats: {}
              value: 0.572086403028116
        rag_faithfulness:
          scores:
            faithfulness:
              stats: {}
              value: 0.7331979311011568
        retriever_retriever.ndcg_cut_10:
          scores:
            ndcg_cut_10:
              stats: {}
              value: 0.9442803161530771
        retriever_retriever.ndcg_cut_5:
          scores:
            ndcg_cut_5:
              stats: {}
              value: 0.9400758007845321
        retriever_retriever.recall_10:
          scores:
            recall_10:
              stats: {}
              value: 0.9817073170731707
        retriever_retriever.recall_5:
          scores:
            recall_5:
              stats: {}
              value: 0.9695121951219512
target:
  api_endpoint:
    api_key: NVIDIA_API_KEY
    model_id: nvdev/meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1
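
If you prefer to read the scores programmatically, a minimal sketch along these lines works, assuming PyYAML is available and the path matches your --output_dir (the nesting follows the example above):

import yaml  # PyYAML, assumed to be installed

# Path assumes --output_dir output/test_tc5_final, as in the example above
with open("output/test_tc5_final/results.yml") as f:
    results = yaml.safe_load(f)

# Walk the results -> tasks -> metrics -> scores structure shown above
for task_name, task in results["results"]["tasks"].items():
    for metric_name, metric in task["metrics"].items():
        for score_name, score in metric["scores"].items():
            print(f"{task_name}/{metric_name}/{score_name}: {score['value']}")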

Advanced Usage

Evaluation Job Config

See eval_factory_pipeline/tests for examples.

The evaluation job configuration is a YAML file that specifies the evaluation type, dataset, metrics, and target system to evaluate. Here's a breakdown of its structure:

config:
  type: rag  # Either 'rag' or 'retriever'
  params:
    extra:
      tasks:
        rag:  # This key should match the 'type' above
          type: rag
          dataset:
            format: ragas  # Format of the dataset (e.g., 'ragas', 'squad', 'beir')
            path: /path/to/dataset.jsonl  # Path to the dataset file
          params:
            judge_llm: meta/llama-3.1-8b-instruct  # LLM used for evaluation
            judge_llm_url: https://integrate.api.nvidia.com/v1
            judge_llm_api_key: null  # Will use environment variable if null
            judge_embeddings: nvidia/nv-embedqa-e5-v5  # Embedding model for evaluation
            judge_embeddings_url: https://integrate.api.nvidia.com/v1
            judge_embeddings_api_key: null
            judge_request_timeout: 120
            judge_max_retries: 5
            judge_max_workers: 2
          metrics:
            rag_faithfulness:  # Metric name
              type: ragas  # Calculation method: `pytrec_eval` or `ragas`
              params: {}  # Additional parameters for this metric
            # ... more metrics ...
      pipeline:
        context_ordering: desc  # How to order retrieved contexts: 'asc' or 'desc'
        retriever:  # Optional retriever configuration
          # ... retriever details see below ...
target:
  api_endpoint:  # Generative LLM endpoint
    url: https://integrate.api.nvidia.com/v1
    model_id: meta/llama-3.1-8b-instruct
    api_key: NVIDIA_API_KEY  # Unlike other sections, this is the name of an environment variable, not the key itself
    type: chat

target Section

This section defines the target endpoint for the generative LLM.

  • api_endpoint: Configuration for the generative LLM endpoint.
    • url: The API endpoint URL.
    • model_id: The model identifier to use.
    • api_key: The name of an environment variable that holds the key (unlike other sections, where the key value itself can be provided).
    • type: The endpoint type; it should be a chat endpoint.

config.params.extra.tasks Section

This section defines one or more evaluation tasks; a sketch for composing this block programmatically follows the list below.

  • dataset: Specifies the dataset format (squad, beir, ragas, etc.) and path.
    • For beir format, path can be a BEIR dataset identifier (e.g., fiqa) which the tool can download.
    • For other formats like squad or ragas, path should be the file path to the dataset (e.g., /path/to/your/dataset.json or /path/to/your/dataset.jsonl).
  • params: Contains parameters for the evaluation process, especially for the "judge" models used in RAGAS metrics.
    • judge_llm, judge_llm_url, judge_llm_api_key: Configuration for the LLM used for judging (e.g., for faithfulness, answer_relevancy).
    • judge_embeddings, judge_embeddings_url, judge_embeddings_api_key: Configuration for the embedding model used by the judge (e.g., for answer_similarity).
  • metrics: A dictionary of metrics to compute.
    • Keys are custom metric names (e.g., retriever_recall_5, rag_faithfulness).
    • type: Specifies the metric calculation method (e.g., pytrec_eval, ragas).
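
If you generate many job files (for example, to sweep metric combinations), this block can be composed in Python and serialized with PyYAML. The sketch below only mirrors the structure described above, using the example metric names and judge parameters; it is not an exhaustive schema:

import yaml  # PyYAML, assumed to be installed

# Mirror of the config.params.extra.tasks structure described above
tasks = {
    "rag": {
        "type": "rag",
        "dataset": {"format": "ragas", "path": "/path/to/dataset.jsonl"},
        "params": {
            "judge_llm": "meta/llama-3.1-8b-instruct",
            "judge_llm_url": "https://integrate.api.nvidia.com/v1",
            "judge_llm_api_key": None,  # null: resolved from an environment variable
            "judge_embeddings": "nvidia/nv-embedqa-e5-v5",
            "judge_embeddings_url": "https://integrate.api.nvidia.com/v1",
            "judge_embeddings_api_key": None,
        },
        # Keys are custom metric names; 'type' selects the calculation backend
        "metrics": {
            "rag_faithfulness": {"type": "ragas", "params": {}},
            "retriever_recall_5": {"type": "pytrec_eval", "params": {}},
        },
    }
}

job = {"config": {"type": "rag", "params": {"extra": {"tasks": tasks}}}}
print(yaml.safe_dump(job, sort_keys=False))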

config.params.extra.pipeline Section

This section describes the system being evaluated.

  • params.prompt_template_path: Path to a Jinja2 file defining how the query and retrieved documents are presented to the generative LLM.
  • retriever.pipeline:
    • query_embedding_model: Defines the model used to embed queries.
    • index_embedding_model: Defines the model used to embed documents for the index.
    • reranker_model: Optional. Configuration for a reranking model applied after initial retrieval.
    • top_k: Number of documents to retrieve.
    • params: Parameters specific to the retriever implementation (e.g., Haystack pipeline files, Milvus connection details).
      • index_pipeline_yaml_file, query_pipeline_yaml_file: Paths to Haystack pipeline definitions.
      • component_inputs_template: Template for providing input to Haystack components.
      • milvus_host, milvus_port, milvus_collection_name: Details for connecting to a Milvus vector database.

RAG pipeline example:

config:
  type: rag
  params:
    extra:
      ...
      pipeline:
        context_ordering: desc # Optional: 'asc' or 'desc'
        params:
          prompt_template_path: ... # Path to a Jinja2 file defining how the query and retrieved documents are presented to the generative LLM endpoint.
        retriever:             # Retriever configuration (see retriever target below)
          pipeline:
            top_k: 10
            query_embedding_model:
              api_endpoint:
                url: https://integrate.api.nvidia.com/v1
                model_id: nvidia/nv-embedqa-e5-v5
                api_key: null
                format: nim
            index_embedding_model: # Often same as query_embedding_model
              api_endpoint:
                url: https://integrate.api.nvidia.com/v1
                model_id: nvidia/nv-embedqa-e5-v5
                api_key: null
                format: nim
            params: # Retriever-specific parameters (e.g., for Haystack, Milvus)
              index_pipeline_yaml_file: retriever_templates/dense_only/milvus_index_nim.yaml
              query_pipeline_yaml_file: retriever_templates/dense_only/milvus_query_nim.yaml
              component_inputs_template: '{"embedder": {"text": "${query}"} }' # For Haystack
              milvus_host: 172.20.0.2
              milvus_port: "19530"
              milvus_collection_name: rag_test
              # Note: If using Milvus-backed retrieval, ensure you have a Milvus server running and accessible 
              # with the specified host, port, and that the collection can be created or already exists 
              # as per your pipeline's requirements.

Retriever pipeline example:

target:
  type: retriever_pipeline
  retriever:
    pipeline:
      query_embedding_model:
        api_endpoint:
          url: https://integrate.api.nvidia.com/v1
          model_id: nvidia/nv-embedqa-e5-v5
          api_key: null # Set via env var or here
      index_embedding_model: # Often same as query_embedding_model
        api_endpoint:
          url: https://integrate.api.nvidia.com/v1
          model_id: nvidia/nv-embedqa-e5-v5
          api_key: null # Set via env var or here
      reranker_model: null # Optional: Configuration for a reranker model
      top_k: 10
      params: # Retriever-specific parameters (e.g., for Haystack, Milvus)
        index_pipeline_yaml_file: retriever_templates/dense_only/milvus_index_nim.yaml
        query_pipeline_yaml_file: retriever_templates/dense_only/milvus_query_nim.yaml
        component_inputs_template: '{"embedder": {"text": "${query}"} }' # For Haystack
        milvus_host: localhost
        milvus_port: "19530"
        milvus_collection_name: nfcorpus_test
        # Note: If using Milvus-backed retrieval, ensure you have a Milvus server running and accessible 
        # with the specified host, port, and that the collection can be created or already exists 
        # as per your pipeline's requirements.
        retriever_name: nim-retriever # Identifier for the retriever
        retriever_type: nvidia-nemo-nim # Type of retriever

API Keys

API keys for model endpoints (generation, embedding, judge LLMs, judge embeddings) can be:

  1. Set directly in the YAML file (e.g., api_key: "YOUR_API_KEY").
  2. Set to null in the YAML and provided via an environment variable. The specific environment variable name may depend on the toolkit's implementation, but NVIDIA_API_KEY is the common convention for NVIDIA endpoints. The launchers also fall back to the API_KEY environment variable as a default, so try that first (see the launcher sketch below).
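
For example, here is a small launcher sketch that reads the key from the environment and fails fast if it is missing; it reuses the local-environment CLI and paths from the Quick Start Guide, so adjust them to your setup:

import os
import subprocess
import sys

# Fail fast if neither conventional environment variable is set
api_key = os.environ.get("NVIDIA_API_KEY") or os.environ.get("API_KEY")
if not api_key:
    sys.exit("Set NVIDIA_API_KEY (or API_KEY) before launching the evaluation.")

# Reuse the local-environment CLI shown in the Quick Start Guide
subprocess.run(
    [
        "core-evals-rag-retriever-eval", "run_eval",
        "--run_config", "eval_factory_pipeline/tests/retriever/test_embedding_only_nfcorpus.yml",
        "--output_dir", "output/test_embedding_only_nfcorpus",
    ],
    env={**os.environ, "NVIDIA_API_KEY": api_key, "API_KEY": api_key},
    check=True,
)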

Metrics

Retriever Metrics

Calculated when a retriever is part of the RAG pipeline, or for standalone retriever evaluations. A minimal pytrec_eval sketch follows the list below.

  • pytrec_eval based:
    • retriever_recall_K (e.g., retriever_recall_5, retriever_recall_10)
    • retriever_ndcg_cut_K (e.g., retriever_ndcg_cut_5, retriever_ndcg_cut_10)
    • Other metrics supported by pytrec_eval, such as MAP and MRR, are also available.
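
As an illustration of what these metrics compute, here is a standalone pytrec_eval sketch on toy data; the qrels and run values are made up, and in practice the toolkit builds these structures from your dataset and retrieval output:

import pytrec_eval  # assumed to be installed (pip install pytrec_eval)

# Ground-truth relevance judgments: query id -> {doc id: relevance grade}
qrels = {
    "q1": {"d1": 1, "d3": 1},
    "q2": {"d2": 1},
}

# Retrieval run: query id -> {doc id: retrieval score}
run = {
    "q1": {"d1": 0.9, "d2": 0.5, "d3": 0.4},
    "q2": {"d2": 0.8, "d1": 0.3},
}

# Measures are parameterized with a dot; result keys use underscores
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"recall.5", "ndcg_cut.5"})
per_query = evaluator.evaluate(run)

# Average the per-query scores, roughly as reported in results.yml
for measure in ("recall_5", "ndcg_cut_5"):
    mean = sum(scores[measure] for scores in per_query.values()) / len(per_query)
    print(measure, round(mean, 4))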

RAG Metrics

Calculated during RAG evaluation, often using the Ragas library. These metrics typically require "judge" LLMs and "judge" embedding models; a small standalone Ragas sketch follows the list below.

  • ragas based:
    • rag_faithfulness: Measures if the answer is supported by the retrieved context.
    • rag_answer_correctness: Measures the accuracy of the answer against a ground truth.
    • rag_answer_relevancy: Measures how relevant the answer is to the question.
    • rag_answer_similarity: Measures the semantic similarity between the generated answer and the ground truth answer.
    • rag_context_recall: Measures the proportion of relevant documents retrieved.
    • rag_context_precision: Measures the proportion of retrieved documents that are relevant.
    • rag_answer_accuracy: Measures the accuracy of the answer against a ground truth.
    • rag_context_relevance: Measures the relevance of the retrieved context to the question.
    • rag_response_groundedness: Measures the groundedness of the answer in the retrieved context.
    • rag_context_entity_recall: Measures the proportion of relevant entities mentioned in the question that are found in the retrieved context.
    • rag_noise_sensitivity: Measures the robustness of the answer to noise in the retrieved context.
    • (The list of supported RAGAS metrics can be found in rag_eval/evaluations/rag/rag_eval_launcher.py)
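
For orientation, a direct Ragas call outside the toolkit looks roughly like the sketch below. The exact imports, column names, and judge wiring vary between Ragas releases, so treat this as a version-dependent sketch rather than the toolkit's internal code:

from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy sample in the question/answer/contexts layout used by classic Ragas metrics
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
})

# Ragas needs judge models; the toolkit wires in judge_llm / judge_embeddings for you.
# When calling Ragas directly, configure its judge LLM and embeddings per its docs
# (e.g., via the llm= and embeddings= arguments of evaluate in recent releases).
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)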

How Evaluation Works

Retriever Evaluation (RetrieverEvalLauncher)

  1. Initialization: Sets up API keys and output directories, and modifies the Haystack pipeline YAMLs (index_pipeline_yaml_file, query_pipeline_yaml_file) with dynamic parameters from the job config (e.g., model endpoints, Milvus details).
  2. Dataset Setup: Downloads/locates the dataset (e.g., BEIR datasets like fiqa).
  3. Evaluation (RetrieverEval):
    • Uses the configured (and potentially modified) Haystack pipelines to perform retrieval for each query in the dataset against the document corpus.
    • Saves the retrieved document IDs and scores.
    • Computes metrics like recall@k, nDCG@k using pytrec_eval.
  4. Cleanup: Optionally cleans up resources like Milvus collections.
  5. Results: Saves scores and the dataset augmented with retrieved contexts.

RAG Evaluation (RagEvalLauncher)

The RAG evaluation orchestrates several steps (a conceptual sketch follows the list):

  1. API Key Setup: Manages API keys for various components (generator LLM, retriever models, judge LLM, judge embeddings).
  2. (Optional) Retrieval Step:
    • If a retriever is configured within the RAG pipeline target, it invokes the RetrieverEvalLauncher's logic (_execute_task) to perform retrieval.
    • The dataset with retrieved contexts from this step is then used for answer generation.
    • If no retriever is configured, the original dataset is used directly for generation (assuming it contains contexts, or generation is context-free).
  3. (Optional) Answer Generation Step (RAGAnswerGenerator):
    • If a generative LLM endpoint is configured (the target.api_endpoint section), this step generates answers.
    • It uses the (potentially context-augmented) dataset and the prompt_template to formulate prompts for the LLM.
    • The generated answers are added to the dataset.
  4. Answer Evaluation Step (RAGEvaluator):
    • This step evaluates the answers (either generated or from the input dataset).
    • It uses the configured ragas (and potentially other) metrics.
    • Requires judge_llm and judge_embeddings for many RAGAS metrics.
    • Computes metrics like rag_faithfulness, rag_answer_correctness, etc.
  5. Results Aggregation: Combines metrics from the retrieval step (if any) and the answer evaluation step into a final result.
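
Conceptually, the flow boils down to something like the following; every function name here is a hypothetical placeholder for the toolkit's internal steps, shown only to make the ordering concrete:

# Hypothetical placeholders standing in for RetrieverEvalLauncher,
# RAGAnswerGenerator, and RAGEvaluator; not the toolkit's actual API.
def retrieve_contexts(dataset, retriever_config):
    # Step 2: run the retriever and attach top-k contexts to each example
    return [{**example, "contexts": ["..."]} for example in dataset]

def generate_answers(dataset, prompt_template):
    # Step 3: fill the prompt template with query + contexts and call the target LLM
    return [{**example, "answer": "..."} for example in dataset]

def judge_answers(dataset, metrics):
    # Step 4: score answers with Ragas / pytrec_eval style metrics
    return {name: 0.0 for name in metrics}

dataset = [{"question": "..."}]
dataset = retrieve_contexts(dataset, retriever_config={})       # optional retrieval step
dataset = generate_answers(dataset, prompt_template=None)       # optional generation step
results = judge_answers(dataset, metrics=["rag_faithfulness"])  # answer evaluation step
print(results)                                                  # step 5 aggregates these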

Troubleshooting

  • API Key Errors: Ensure API keys are correctly set either in the YAML or as environment variables. Check the specific key names expected (e.g., NVIDIA_API_KEY if using null in YAML for NVIDIA endpoints).
  • File Not Found (Datasets/Configs): Verify all paths in the YAML config are correct and accessible from where you run the evaluation. For BEIR datasets specified by name (e.g., fiqa), ensure an internet connection is available for download if not already cached.
  • Output Directory Exists: Some tools require the output directory to not exist prior to a run. Check specific error messages.
  • Metric Calculation Issues:
    • For RAGAS metrics, ensure judge_llm and judge_embeddings (and their API keys/URLs) are correctly configured if the chosen metrics require them.
    • Consult the Ragas and pytrec_eval documentation for specific metric requirements.
  • Haystack Pipeline Errors: If using custom Haystack pipelines, ensure they are correctly defined and all components are compatible. The tool modifies these pipelines with runtime parameters; check the modified versions in the output_dir for debugging.
  • Milvus Connection Issues: If using Milvus, ensure milvus_host and milvus_port are correct and that Milvus is running and accessible (a quick connectivity check is sketched below). For CI tests, Milvus Lite is used, which doesn't require a separate Milvus server (it uses a local database file specified by milvus_uri).
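
As a quick connectivity check, assuming pymilvus is installed and your instance uses the host and port from your job config:

from pymilvus import connections, utility  # pip install pymilvus

# Match milvus_host / milvus_port from your evaluation job config
connections.connect(alias="default", host="localhost", port="19530")
print("Collections visible to this client:", utility.list_collections())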