Supported platform: Linux / amd64
`agentic-eval` currently supports evaluating custom data using Ragas (v0.2.14) metrics.

Data is loaded from a JSONL file, with each entry representing a single user interaction. The data schema must conform to the Ragas format; the individual metrics require slightly different schemas.
The `topic_adherence` metric evaluates the ability of the AI to stay on predefined domains during the interactions.

Each data entry should follow the schema below:
```json
{
  "user_input": [
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"},
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"}
  ],
  "reference_topics": ["science"]
}
```
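A data entry like this can also be assembled programmatically with the Ragas message classes (`HumanMessage`, `AIMessage`, `ToolMessage`, `ToolCall` from `ragas.messages`) before serializing it to JSONL. A minimal sketch with illustrative contents:

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# Illustrative conversation; real contents come from your agent traces
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the weather like in Paris?"),
        AIMessage(
            content="Let me check the forecast.",
            tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
        ),
        ToolMessage(content="Sunny, 22 degrees Celsius"),
        AIMessage(content="It is sunny and 22 degrees Celsius in Paris."),
    ],
    reference_topics=["weather"],
)

sample_dict = sample.to_dict()  # dict form, ready for json.dumps
```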
A dynamic data template is supported for the `topic_adherence` metric. If your dataset uses different keys for each row, for example `messages` and `reference`, you can use the following template instead:
```json
{
  "user_input": "{{ item.messages | tojson }}",
  "reference_topics": "{{ item.reference | tojson }}"
}
```
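The template syntax appears to be Jinja2-style, with the raw row exposed as `item`. Purely as an illustration of the mapping (not the tool's actual rendering code), here is a sketch using the `jinja2` package with hypothetical row keys:

```python
import json

from jinja2 import Environment

# Hypothetical raw dataset row that uses "messages" and "reference" keys
row = {
    "messages": [
        {"content": "Tell me about photosynthesis.", "type": "human"},
        {"content": "Photosynthesis converts light into chemical energy.", "type": "ai"},
    ],
    "reference": ["science"],
}

# The data template from above: each value is a Jinja2 expression
template = {
    "user_input": "{{ item.messages | tojson }}",
    "reference_topics": "{{ item.reference | tojson }}",
}

env = Environment()
mapped = {
    key: json.loads(env.from_string(expr).render(item=row))
    for key, expr in template.items()
}
print(mapped["reference_topics"])  # ['science']
```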
The `tool_call_accuracy` metric evaluates the performance of the LLM in identifying and calling the required tools to complete a given task.

Note: This metric does not use an LLM judge.

Each data entry should follow the schema below:
```json
{
  "user_input": [
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"},
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"}
  ],
  "reference_tool_calls": [
    {"name": "", "args": {"": ""}},
    {"name": "", "args": {"": ""}}
  ]
}
```
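Because this metric does not need an LLM judge, a single entry can be scored directly with Ragas. This is shown only to illustrate what the metric computes; the evaluation harness does this for you:

```python
import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

# Illustrative entry: the agent is expected to call get_weather(city="Paris")
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the weather in Paris?"),
        AIMessage(
            content="Checking.",
            tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
        ),
        ToolMessage(content="Sunny, 22 degrees Celsius"),
        AIMessage(content="It is sunny and 22 degrees Celsius in Paris."),
    ],
    reference_tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
)

score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when the predicted tool calls match the reference
```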
A dynamic data template is supported for the `tool_call_accuracy` metric. If your dataset uses different keys for each row, for example `messages` and `reference`, you can use the following template instead:
```json
{
  "user_input": "{{ item.messages | tojson }}",
  "reference_tool_calls": "{{ item.reference | tojson }}"
}
```
There are two variants of the agent goal accuracy metric: `agent_goal_accuracy_with_reference` and `agent_goal_accuracy_without_reference`. They evaluate the performance of the LLM in identifying and achieving the goals of the user.

Each data entry should follow the schema below:
```json
{
  "user_input": [
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"},
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"}
  ],
  "reference": ""
}
```
Remove the `reference` field if using `agent_goal_accuracy_without_reference`.
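Such an entry can be built the same way as the other multi-turn samples, with the expected outcome placed in `reference`. A sketch with illustrative values (omit `reference` for the without-reference variant):

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# Illustrative goal-oriented interaction
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book me a table for two tonight at 7pm."),
        AIMessage(
            content="Booking now.",
            tool_calls=[ToolCall(name="book_table", args={"guests": 2, "time": "19:00"})],
        ),
        ToolMessage(content="Reservation confirmed."),
        AIMessage(content="Your table for two at 7pm is booked."),
    ],
    reference="A table for two is successfully reserved for 7pm.",
)
```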
A dynamic data template is supported for the `agent_goal_accuracy_with_reference` and `agent_goal_accuracy_without_reference` metrics. If your dataset uses different keys for each row, for example `messages` and `ground_truth`, you can use the following template instead:
```json
{
  "user_input": "{{ item.messages | tojson }}",
  "reference": "{{ item.ground_truth | tojson }}"
}
```
The `answer_accuracy` metric measures the agreement between a model's response and a reference ground truth for a given question. This metric was primarily designed for evaluating RAG pipelines, but it can also be used for agents to evaluate the quality of the response.

Each data entry should follow the schema below:
```json
{
  "user_input": "",
  "response": "",
  "reference": ""
}
```
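Since `answer_accuracy` entries are flat, a JSONL dataset can be produced with plain dictionaries. A minimal sketch; file name and contents are illustrative:

```python
import json

# Illustrative question / answer / reference triples
entries = [
    {
        "user_input": "When was the Eiffel Tower completed?",
        "response": "The Eiffel Tower was completed in 1889.",
        "reference": "It was completed in 1889.",
    },
]

# One JSON object per line, as expected for a JSONL dataset
with open("answer_accuracy.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```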
Each message in `user_input` has a `content` field and a `type` field; messages of type `ai` may additionally carry `tool_calls`, where each tool call has a `name` and `args`.
A programmatic way to convert an example Ragas sample to a JSONL file is:

```python
import json

from ragas.dataset_schema import MultiTurnSample

sample: MultiTurnSample  # a previously constructed Ragas sample

sample_dict = sample.to_dict()
with open(FILE_NAME, "w") as file:
    file.write(json.dumps(sample_dict))
```
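To serialize a whole dataset rather than a single sample, the same idea extends to one JSON object per line. A sketch, assuming `samples` holds previously constructed samples and the output path is arbitrary:

```python
import json

from ragas.dataset_schema import MultiTurnSample

samples: list[MultiTurnSample] = []  # fill with previously constructed samples

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample.to_dict()) + "\n")
```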
Refer to the Ragas API documentation for a better understanding of the schema.

The `topic_adherence`, `agent_goal_accuracy_with_reference`, `agent_goal_accuracy_without_reference`, and `answer_accuracy` metrics use an LLM underneath to do the evaluation. To use these four metrics, you must set up an LLM judge.
Two types of judges are supported: `openai` and `nvidia-nim`. If no judge model is provided, OpenAI `gpt-4o` is used as the default judge model; `API_KEY` is required to access the OpenAI API.
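Conceptually, the judge is a chat model handed to the Ragas metrics. A rough sketch of the underlying Ragas usage (not the agentic-eval configuration itself; it assumes the `langchain-openai` package is installed):

```python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AgentGoalAccuracyWithReference

# Wrap a LangChain chat model so Ragas can use it as the judge;
# the API key can also be passed explicitly via api_key=...
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=1, max_tokens=1024))

metric = AgentGoalAccuracyWithReference(llm=judge)
# metric.multi_turn_ascore(sample) would now use gpt-4o as the judge
```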
The evaluation can be run using NVIDIA's Eval Factory framework. Here's how to invoke it:

```bash
nv_eval run_eval --run_config /path/to/config.yml --output_dir /path/to/results
```
The evaluation runs are configured using a YAML file. Here's an example configuration:
```yaml
config:
  type: agentic_eval_tool_call_accuracy  # or other metric types
  params:
    temperature: 1
    parallelism: 10
    max_new_tokens: 1024
    max_retries: 10
    request_timeout: 10
    extra:
      judge_sanity_check: True
      dataset_path: "/path/to/dataset.jsonl"
      data_template: "/path/to/template.json"  # optional
```
- `type`: The metric type to evaluate (e.g., `agentic_eval_tool_call_accuracy`)
- `params`: General parameters for the evaluation run
  - `temperature`: Controls randomness in the model's output
  - `parallelism`: Number of parallel evaluation tasks
  - `max_new_tokens`: Maximum number of tokens to generate
  - `max_retries`: Maximum number of retry attempts for failed evaluations
  - `request_timeout`: Timeout for API requests in seconds
  - `extra`: Additional metric-specific parameters
    - `judge_sanity_check`: Enable/disable judge model sanity checks
    - `dataset_path`: Path to the dataset file
    - `data_template`: Optional path to a data template file for custom data formats

Example runs for the different metrics:

```bash
nv_eval run_eval --run_config /path/to/test_answer_accuracy.yml --output_dir /path/to/results
nv_eval run_eval --run_config /path/to/test_agent_goal_acc_ref.yml --output_dir /path/to/results
nv_eval run_eval --run_config /path/to/test_topic_adherence.yml --output_dir /path/to/results
nv_eval run_eval --run_config /path/to/test_tool_call.yml --output_dir /path/to/results
```
When using custom data formats, you can specify a template file that defines how to map your data to the required schema:
```json
{
  "user_input": "{{ item.conversations | tojson }}",
  "reference_tool_calls": "{{ item.reference | tojson }}"
}
```
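A raw dataset row that this particular template expects would carry the custom keys, for example (purely illustrative values):

```python
# Hypothetical raw JSONL row using "conversations" and "reference" keys;
# the template above maps it onto "user_input" and "reference_tool_calls"
row = {
    "conversations": [
        {"content": "What is the weather in Paris?", "type": "human"},
        {"content": "", "type": "ai",
         "tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]},
        {"content": "Sunny, 22 degrees Celsius", "type": "tool"},
        {"content": "It is sunny and 22 degrees Celsius in Paris.", "type": "ai"},
    ],
    "reference": [{"name": "get_weather", "args": {"city": "Paris"}}],
}
```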
- `--dataset_path`: Required. Path to a JSONL dataset file formatted according to the schemas above.
- `--output_dir`: Optional. Directory where results will be saved.
- `--metric_name`: Required. One of: `topic_adherence`, `tool_call_accuracy`, `agent_goal_accuracy_with_reference`, `agent_goal_accuracy_without_reference`, `answer_accuracy`.
- `--metric_mode`: Optional. Specific mode for `topic_adherence` (`precision`, `recall`, or `f1`; default `f1`).
- `--judge_model_type`: Optional. One of: `openai`, `nvidia-nim`. Default: `openai`.
- `--judge_model_args`: Optional. JSON string for configuring [ChatOpenAI](https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html) for the judge model. Example: `{"model": "gpt-4o", "temperature": 1, "max_tokens": 1024}`
- `--ragas_run_config`: Optional. JSON string for the Ragas [RunConfig](https://docs.ragas.io/en/stable/references/run_config/#ragas.run_config.RunConfig). Example: `{"max_workers": 2}`
- `--data_template`: Optional. JSON string for data template configuration.
- `--judge_sanity_check`: Optional. Boolean to enable/disable the judge model sanity check. Default: `true`.
For setting up the judge model, you need to provide the API key (e.g., `api_key=123456`) either via `--judge_model_args` or via the environment variable `API_KEY`.
```bash
agentic_eval \
  --dataset_path data.jsonl \
  --output_dir ./results \
  --metric_name topic_adherence \
  --metric_mode recall \
  --judge_model_args '{"model": "gpt-4o", "temperature": 1, "max_tokens": 1024, "api_key": "YOUR_API_KEY"}'
```

```bash
agentic_eval \
  --dataset_path data.jsonl \
  --output_dir ./results \
  --metric_name topic_adherence \
  --metric_mode recall \
  --judge_model_args '{"base_url": "https://meta-llama3-1-8b-instruct.dev.aire.nvidia.com/v1", "model": "gpt-4o", "temperature": 1, "max_tokens": 1024}'
```

```bash
agentic_eval \
  --dataset_path data/adherence.jsonl \
  --output_dir ./results \
  --metric_name topic_adherence \
  --metric_mode recall \
  --judge_model_args '{"base_url": "https://integrate.api.nvidia.com/v1", "model": "nvdev/deepseek-ai/deepseek-r1", "temperature": 1, "max_tokens": 1024, "api_key": "YOUR_API_KEY"}'
```
If `--output_dir` is set, two results files will be saved there:

- `<metric_name>.jsonl`: the scores for each data entry
- `scores.jsonl`: the final score across all the data
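For quick inspection of the per-entry file, something like the following works. This is a sketch only; it assumes each line carries a numeric `score` field and uses a hypothetical results path, both of which may differ from the actual output:

```python
import json

scores = []
with open("results/topic_adherence.jsonl") as f:  # hypothetical output path
    for line in f:
        entry = json.loads(line)
        scores.append(entry["score"])  # assumed field name

print(f"entries: {len(scores)}, mean score: {sum(scores) / len(scores):.3f}")
```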