Supported platform: Linux / amd64
This toolkit provides a standardized way to run RAG (Retrieval Augmented Generation) and Retriever evaluations. It allows users to evaluate both end-to-end RAG pipelines and standalone retriever pipelines.
Evaluations are configured using YAML files, specifying datasets, models, metrics, and other parameters.
To get started:

1. Set up Milvus (if your evaluation uses it) and note its connection details (`milvus_host`, `milvus_port`); if Milvus runs in Docker, see the sketch after this list.
2. Prepare your Evaluation Job File(s): You'll need a YAML configuration file for your evaluation. See Evaluation Job Config for details.
3. Run the evaluation.
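If your Milvus instance runs as a Docker container, the evaluation container must be able to reach it over a shared Docker network. A minimal sketch using standard Docker commands (the container name `milvus-standalone` is an assumption; substitute your own):

```bash
# Create the shared network (the docker run example below joins it via --network)
docker network create milvus-network
# Attach an already-running Milvus container to that network
docker network connect milvus-network milvus-standalone
```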
There are two main ways to run the evaluation:
**Option 1: Running inside a benchmark container (using Docker)**

```bash
# Replace YOUR_IMAGE_ID_OR_TAG with your actual Docker image ID or tag.
# The -v mount below assumes your config files live under the current
# directory; adjust the host path to wherever your files actually are.
docker run \
  --network milvus-network \
  -e NVIDIA_API_KEY="YOUR_NVIDIA_API_KEY" \
  -v "$(pwd)":/workspace \
  YOUR_IMAGE_ID_OR_TAG \
  nv_eval run_eval --run_config /workspace/tests/rag/test_tc9_answer_generation_answer_evaluation_rags.yml --output_dir /workspace/results
```
- `--network milvus-network`: Connects the container to a Docker network named `milvus-network`. This is necessary if your evaluation uses a Milvus instance running in another Docker container on this network (for an example of such a configuration, see `eval_factory_pipeline/tests/rag/test_tc5_e2e_squad.yml`).
- `-e NVIDIA_API_KEY="YOUR_NVIDIA_API_KEY"`: Passes your NVIDIA API key as an environment variable to the container. Replace `"YOUR_NVIDIA_API_KEY"` with your actual key.
- `YOUR_IMAGE_ID_OR_TAG`: The ID or tag of your evaluation Docker image.

The `nv_eval run_eval` command is then executed inside the container, referencing the mounted configuration file.
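To verify that the Milvus container actually joined the network, a standard Docker inspection (not toolkit-specific) can help:

```bash
# List the names of all containers attached to milvus-network
docker network inspect milvus-network --format '{{range .Containers}}{{.Name}} {{end}}'
```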
**Option 2: Running in a local Python environment**

Use the `core-evals-rag-retriever-eval` command-line interface:

```bash
core-evals-rag-retriever-eval run_eval --run_config eval_factory_pipeline/tests/retriever/test_embedding_only_nfcorpus.yml --output_dir output/test_embedding_only_nfcorpus
```
Results will be saved to the `<output_dir>/results.yml` file. The structure and content of the results will depend on the evaluation type (RAG or Retriever) and the metrics configured.
Example output structure:
command: "export API_KEY=$NVIDIA_API_KEY && rag_retriever_eval \\\n --api_endpoint\
\ '{\"api_key\": \"NVIDIA_API_KEY\", \"model_id\": \"nvdev/meta/llama-3.1-8b-instruct\"\
, \"stream\": null, \"type\": \"chat\", \"url\": \"https://integrate.api.nvidia.com/v1\"\
}' \\\n --output_dir output/test_tc5_final \\\n --pipeline '{\"context_ordering\"\
: \"desc\", \"params\": {\"prompt_template_path\": \"/workspace/tests/rag/templates/prompt_template.jinja\"\
}, \"retriever\": {\"pipeline\": {\"index_embedding_model\": {\"api_endpoint\":\
\ {\"api_key\": null, \"format\": \"nim\", \"model_id\": \"nvidia/nv-embedqa-e5-v5\"\
, \"url\": \"https://integrate.api.nvidia.com/v1\"}}, \"params\": {\"component_inputs_template\"\
: \"{\\\"embedder\\\": {\\\"text\\\": \\\"${query}\\\"} }\", \"index_pipeline_yaml_file\"\
: \"/workspace/tests/retriever/templates/dense_only/milvus_index_nim.yaml\"\
, \"milvus_collection_name\": \"rag_test\", \"milvus_host\": \"localhost\", \"milvus_password\"\
: \"\", \"milvus_port\": \"19530\", \"query_pipeline_yaml_file\": \"/workspace/tests/retriever/templates/dense_only/milvus_query_nim.yaml\"\
}, \"query_embedding_model\": {\"api_endpoint\": {\"api_key\": null, \"format\"\
: \"nim\", \"model_id\": \"nvidia/nv-embedqa-e5-v5\", \"url\": \"https://integrate.api.nvidia.com/v1\"\
}}, \"top_k\": 10}}}' \\\n --tasks '{\"rag\": {\"dataset\": {\"format\": \"squad\"\
, \"path\": \"/workspace/tests/datasets/fiqa_synthetic_squad.json\"\
}, \"metrics\": {\"rag_answer_relevancy\": {\"params\": {}, \"type\": \"ragas\"\
}, \"rag_faithfulness\": {\"params\": {}, \"type\": \"ragas\"}, \"retriever_ndcg_cut_10\"\
: {\"params\": {}, \"type\": \"pytrec_eval\"}, \"retriever_ndcg_cut_5\": {\"params\"\
: {}, \"type\": \"pytrec_eval\"}, \"retriever_recall_10\": {\"params\": {}, \"type\"\
: \"pytrec_eval\"}, \"retriever_recall_5\": {\"params\": {}, \"type\": \"pytrec_eval\"\
}}, \"params\": {\"judge_embeddings\": \"nvidia/nv-embedqa-e5-v5\", \"judge_embeddings_api_key\"\
: null, \"judge_embeddings_url\": \"https://integrate.api.nvidia.com/v1\", \"judge_llm\"\
: \"nvdev/meta/llama-3.1-8b-instruct\", \"judge_llm_api_key\": null, \"judge_llm_url\"\
: \"https://integrate.api.nvidia.com/v1\", \"judge_max_retries\": 5, \"judge_max_workers\"\
: 2, \"judge_request_timeout\": 120}, \"type\": \"rag\"}}' \\\n --type rag"
config:
output_dir: output/test_tc5_final
params:
extra:
pipeline:
context_ordering: desc
params:
prompt_template_path: /workspace/tests/rag/templates/prompt_template.jinja
retriever:
pipeline:
index_embedding_model:
api_endpoint:
api_key: null
format: nim
model_id: nvidia/nv-embedqa-e5-v5
url: https://integrate.api.nvidia.com/v1
params:
component_inputs_template: '{"embedder": {"text": "${query}"} }'
index_pipeline_yaml_file: /workspace/tests/retriever/templates/dense_only/milvus_index_nim.yaml
milvus_collection_name: rag_test
milvus_host: localhost
milvus_password: ''
milvus_port: '19530'
query_pipeline_yaml_file: /workspace/tests/retriever/templates/dense_only/milvus_query_nim.yaml
query_embedding_model:
api_endpoint:
api_key: null
format: nim
model_id: nvidia/nv-embedqa-e5-v5
url: https://integrate.api.nvidia.com/v1
top_k: 10
tasks:
rag:
dataset:
format: squad
path: /workspace/tests/datasets/fiqa_synthetic_squad.json
metrics:
rag_answer_relevancy:
params: {}
type: ragas
rag_faithfulness:
params: {}
type: ragas
retriever_ndcg_cut_10:
params: {}
type: pytrec_eval
retriever_ndcg_cut_5:
params: {}
type: pytrec_eval
retriever_recall_10:
params: {}
type: pytrec_eval
retriever_recall_5:
params: {}
type: pytrec_eval
params:
judge_embeddings: nvidia/nv-embedqa-e5-v5
judge_embeddings_api_key: null
judge_embeddings_url: https://integrate.api.nvidia.com/v1
judge_llm: nvdev/meta/llama-3.1-8b-instruct
judge_llm_api_key: null
judge_llm_url: https://integrate.api.nvidia.com/v1
judge_max_retries: 5
judge_max_workers: 2
judge_request_timeout: 120
type: rag
type: rag
git_hash: null
results:
groups: {}
tasks:
rag:
metrics:
rag_answer_relevancy:
scores:
answer_relevancy:
stats: {}
value: 0.572086403028116
rag_faithfulness:
scores:
faithfulness:
stats: {}
value: 0.7331979311011568
retriever_retriever.ndcg_cut_10:
scores:
ndcg_cut_10:
stats: {}
value: 0.9442803161530771
retriever_retriever.ndcg_cut_5:
scores:
ndcg_cut_5:
stats: {}
value: 0.9400758007845321
retriever_retriever.recall_10:
scores:
recall_10:
stats: {}
value: 0.9817073170731707
retriever_retriever.recall_5:
scores:
recall_5:
stats: {}
value: 0.9695121951219512
target:
api_endpoint:
api_key: NVIDIA_API_KEY
model_id: nvdev/meta/llama-3.1-8b-instruct
type: chat
url: https://integrate.api.nvidia.com/v1
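If you need to pull a single score out of `results.yml` in a script, any YAML query tool works against the structure above. The sketch below assumes the third-party `yq` utility, which is not part of this toolkit:

```bash
# Prints 0.7331979311011568 for the example results above
yq '.results.tasks.rag.metrics.rag_faithfulness.scores.faithfulness.value' \
  output/test_tc5_final/results.yml
```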
See `eval_factory_pipeline/tests` for examples.
The evaluation job configuration is a YAML file that specifies the evaluation type, dataset, metrics, and target system to evaluate. Here's a breakdown of its structure:
```yaml
config:
  type: rag  # Either 'rag' or 'retriever'
  params:
    extra:
      tasks:
        rag:  # This key should match the 'type' above
          type: rag
          dataset:
            format: ragas  # Format of the dataset (e.g., 'ragas', 'squad', 'beir')
            path: /path/to/dataset.jsonl  # Path to the dataset file
          params:
            judge_llm: meta/llama-3.1-8b-instruct  # LLM used for evaluation
            judge_llm_url: https://integrate.api.nvidia.com/v1
            judge_llm_api_key: null  # Will use environment variable if null
            judge_embeddings: nvidia/nv-embedqa-e5-v5  # Embedding model for evaluation
            judge_embeddings_url: https://integrate.api.nvidia.com/v1
            judge_embeddings_api_key: null
            judge_request_timeout: 120
            judge_max_retries: 5
            judge_max_workers: 2
          metrics:
            rag_faithfulness:  # Metric name
              type: ragas  # Calculation method: `pytrec_eval` or `ragas`
              params: {}  # Additional parameters for this metric
            # ... more metrics ...
      pipeline:
        context_ordering: desc  # How to order retrieved contexts: 'asc' or 'desc'
        retriever:  # Optional retriever configuration
          # ... retriever details, see below ...
target:
  api_endpoint:  # Generative LLM endpoint
    url: https://integrate.api.nvidia.com/v1
    model_id: meta/llama-3.1-8b-instruct
    api_key: NVIDIA_API_KEY  # Only here it should be an environment variable name
    type: chat
```
### `target` Section

This section defines the target endpoint for the generative LLM.

- `api_endpoint`: Configuration for the generative LLM endpoint.
  - `url`: The API endpoint URL.
  - `model_id`: The model identifier to use.
  - `api_key`: Must be provided as an environment variable name, not by value as in other sections.
  - `type`: The endpoint should be a `chat` endpoint.
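For reference, a complete minimal `target` block (values taken from the template above); note that `api_key` carries the environment variable's name, not its value:

```yaml
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: meta/llama-3.1-8b-instruct
    api_key: NVIDIA_API_KEY  # name of the env var holding the key, not the key itself
    type: chat
```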
### `config.params.extra.tasks` Section

This section defines one or more evaluation tasks.

- `dataset`: Specifies the dataset format (`squad`, `beir`, `ragas`, etc.) and path (for a minimal `beir` example, see the sketch after this list).
  - For the `beir` format, `path` can be a BEIR dataset identifier (e.g., `fiqa`), which the tool can download.
  - For `squad` or `ragas`, `path` should be the file path to the dataset (e.g., `/path/to/your/dataset.json` or `/path/to/your/dataset.jsonl`).
- `params`: Contains parameters for the evaluation process, especially for the "judge" models used in RAGAS metrics.
  - `judge_llm`, `judge_llm_url`, `judge_llm_api_key`: Configuration for the LLM used for judging (e.g., for faithfulness, answer_relevancy).
  - `judge_embeddings`, `judge_embeddings_url`, `judge_embeddings_api_key`: Configuration for the embedding model used by the judge (e.g., for answer_similarity).
- `metrics`: A dictionary of metrics to compute, keyed by metric name (e.g., `retriever_recall_5`, `rag_faithfulness`).
  - `type`: Specifies the metric calculation method (e.g., `pytrec_eval`, `ragas`).
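As referenced in the `dataset` bullet above, a minimal block for a downloadable BEIR corpus might look like this (a sketch; `fiqa` is the identifier used elsewhere in this README):

```yaml
dataset:
  format: beir
  path: fiqa  # BEIR dataset identifier; downloaded by the tool if not cached locally
```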
### `config.params.extra.pipeline` Section

This section describes the system being evaluated.

- `params.prompt_template_path`: Path to a Jinja2 file defining how the query and retrieved documents are presented to the generative LLM (a sketch of such a template follows the RAG pipeline example below).
- `retriever.pipeline`:
  - `query_embedding_model`: Defines the model used to embed queries.
  - `index_embedding_model`: Defines the model used to embed documents for the index.
  - `reranker_model`: Optional. Configuration for a reranking model applied after initial retrieval.
  - `top_k`: Number of documents to retrieve.
  - `params`: Parameters specific to the retriever implementation (e.g., Haystack pipeline files, Milvus connection details).
    - `index_pipeline_yaml_file`, `query_pipeline_yaml_file`: Paths to Haystack pipeline definitions.
    - `component_inputs_template`: Template for providing input to Haystack components.
    - `milvus_host`, `milvus_port`, `milvus_collection_name`: Details for connecting to a Milvus vector database.

RAG pipeline example:
```yaml
config:
  type: rag
  params:
    extra:
      ...
      pipeline:
        context_ordering: desc  # Optional: 'asc' or 'desc'
        params:
          prompt_template_path: ...  # Path to a Jinja2 file defining how the query and retrieved documents are presented to the generative LLM endpoint.
        retriever:  # Retriever configuration (see retriever target below)
          pipeline:
            top_k: 10
            query_embedding_model:
              api_endpoint:
                url: https://integrate.api.nvidia.com/v1
                model_id: nvidia/nv-embedqa-e5-v5
                api_key: null
                format: nim
            index_embedding_model:  # Often same as query_embedding_model
              api_endpoint:
                url: https://integrate.api.nvidia.com/v1
                model_id: nvidia/nv-embedqa-e5-v5
                api_key: null
                format: nim
            params:  # Retriever-specific parameters (e.g., for Haystack, Milvus)
              index_pipeline_yaml_file: retriever_templates/dense_only/milvus_index_nim.yaml
              query_pipeline_yaml_file: retriever_templates/dense_only/milvus_query_nim.yaml
              component_inputs_template: '{"embedder": {"text": "${query}"} }'  # For Haystack
              milvus_host: 172.20.0.2
              milvus_port: "19530"
              milvus_collection_name: rag_test
              # Note: If using Milvus-backed retrieval, ensure you have a Milvus server running and accessible
              # with the specified host and port, and that the collection can be created or already exists
              # as per your pipeline's requirements.
```
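The prompt template file itself is not reproduced in this document. A minimal Jinja2 sketch of what `prompt_template_path` might point to, assuming the template is rendered with a `query` string and a list of retrieved `contexts` (both variable names are assumptions; check the actual template for its contract):

```jinja
Answer the question using only the context below.

Context:
{% for doc in contexts %}
- {{ doc }}
{% endfor %}

Question: {{ query }}
Answer:
```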
Retriever pipeline example:
```yaml
target:
  type: retriever_pipeline
  retriever:
    pipeline:
      query_embedding_model:
        api_endpoint:
          url: https://integrate.api.nvidia.com/v1
          model_id: nvidia/nv-embedqa-e5-v5
          api_key: null  # Set via env var or here
      index_embedding_model:  # Often same as query_embedding_model
        api_endpoint:
          url: https://integrate.api.nvidia.com/v1
          model_id: nvidia/nv-embedqa-e5-v5
          api_key: null  # Set via env var or here
      reranker_model: null  # Optional: Configuration for a reranker model
      top_k: 10
      params:  # Retriever-specific parameters (e.g., for Haystack, Milvus)
        index_pipeline_yaml_file: retriever_templates/dense_only/milvus_index_nim.yaml
        query_pipeline_yaml_file: retriever_templates/dense_only/milvus_query_nim.yaml
        component_inputs_template: '{"embedder": {"text": "${query}"} }'  # For Haystack
        milvus_host: localhost
        milvus_port: "19530"
        milvus_collection_name: nfcorpus_test
        # Note: If using Milvus-backed retrieval, ensure you have a Milvus server running and accessible
        # with the specified host and port, and that the collection can be created or already exists
        # as per your pipeline's requirements.
  retriever_name: nim-retriever  # Identifier for the retriever
  retriever_type: nvidia-nemo-nim  # Type of retriever
```
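For local runs without a standalone Milvus server, the troubleshooting notes at the end of this document mention Milvus Lite configured via `milvus_uri`. A sketch of how that might replace the host/port pair in the retriever `params` (exact key support depends on your toolkit version):

```yaml
params:
  index_pipeline_yaml_file: retriever_templates/dense_only/milvus_index_nim.yaml
  query_pipeline_yaml_file: retriever_templates/dense_only/milvus_query_nim.yaml
  milvus_uri: ./milvus_lite.db  # local database file used by Milvus Lite
  milvus_collection_name: nfcorpus_test
```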
### API Keys

API keys for model endpoints (generation, embedding, judge LLMs, judge embeddings) can be:

- Set directly in the YAML (e.g., `api_key: "YOUR_API_KEY"`).
- Left as `null` in the YAML and provided via an environment variable. The specific environment variable name might depend on the toolkit's implementation, but `NVIDIA_API_KEY` is a common convention for NVIDIA endpoints. The launchers also include logic to pick up a default key from the `API_KEY` environment variable; try that first.
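In practice, that means exporting a key before the run. The two variable names below are the conventions this README mentions; the value shown is a placeholder:

```bash
export NVIDIA_API_KEY="nvapi-..."   # placeholder; use your real key
export API_KEY="$NVIDIA_API_KEY"    # generic fallback the launchers check
```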
### Retriever Metrics

Calculated if a retriever is part of the RAG pipeline or for standalone retriever evaluations.

`pytrec_eval`-based:

- `retriever_recall_K` (e.g., `retriever_recall_5`, `retriever_recall_10`)
- `retriever_ndcg_cut_K` (e.g., `retriever_ndcg_cut_5`, `retriever_ndcg_cut_10`)
- Other metrics supported by `pytrec_eval`, such as MAP, MRR, etc.

### RAG Metrics

Calculated during RAG evaluation, often using the Ragas library. These metrics typically require "judge" LLMs and "judge" embedding models.
`ragas`-based:

- `rag_faithfulness`: Measures if the answer is supported by the retrieved context.
- `rag_answer_correctness`: Measures the accuracy of the answer against a ground truth.
- `rag_answer_relevancy`: Measures how relevant the answer is to the question.
- `rag_answer_similarity`: Measures the semantic similarity between the answer and the ground-truth answer.
- `rag_context_recall`: Measures the proportion of relevant documents retrieved.
- `rag_context_precision`: Measures the proportion of retrieved documents that are relevant.
- `rag_answer_accuracy`: Measures the accuracy of the answer against a ground truth.
- `rag_context_relevance`: Measures the relevance of the retrieved context to the question.
- `rag_response_groundedness`: Measures the groundedness of the answer in the retrieved context.
- `rag_context_entity_recall`: Measures the proportion of relevant entities from the ground truth that are found in the retrieved context.
- `rag_noise_sensitivity`: Measures the robustness of the answer to noise in the retrieved context.
### Retriever Evaluation (`RetrieverEvalLauncher`)

The retriever evaluation:

- Prepares the Haystack pipelines from the template files (`index_pipeline_yaml_file`, `query_pipeline_yaml_file`) with dynamic parameters from the job config (e.g., model endpoints, Milvus details).
- Downloads BEIR datasets when a dataset identifier is given (e.g., `fiqa`).
- Runs the evaluation (`RetrieverEval`): retrieval metrics are computed with `pytrec_eval`.

### RAG Evaluation (`RagEvalLauncher`, see `rag_eval/evaluations/rag/rag_eval_launcher.py`)

The RAG evaluation orchestrates several steps:

1. Retrieval: reuses the `RetrieverEvalLauncher`'s logic (`_execute_task`) to perform retrieval.
2. Answer generation (`RAGAnswerGenerator`): if a generative model (`target.rag.pipeline.model`) is configured, this step generates answers, using the `prompt_template` to formulate prompts for the LLM.
3. Answer evaluation (`RAGEvaluator`): computes `ragas` (and potentially other) metrics, using the `judge_llm` and `judge_embeddings` for many RAGAS metrics (`rag_faithfulness`, `rag_answer_correctness`, etc.).
### Troubleshooting

- API keys: ensure the required environment variables are set (e.g., `NVIDIA_API_KEY` if using `null` in the YAML for NVIDIA endpoints).
- Datasets: when using a BEIR dataset identifier (e.g., `fiqa`), ensure an internet connection is available for download if the dataset is not already cached.
- Judge models: check that `judge_llm` and `judge_embeddings` (and their API keys/URLs) are correctly configured if the chosen metrics require them.
- Logs: inspect the files written to `output_dir` for debugging.
- Milvus: verify that `milvus_host` and `milvus_port` are correct and that Milvus is running and accessible (a quick check follows below). For CI tests, Milvus Lite is used, which doesn't require a separate Milvus server (it uses a local database file specified by `milvus_uri`).
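As mentioned in the Milvus bullet above, a quick, toolkit-independent reachability check (host and port taken from your config):

```bash
# Succeeds if the TCP port is open; -z = scan without sending data
nc -z localhost 19530 && echo "Milvus reachable" || echo "Milvus not reachable"
```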