Supported platform: Linux / amd64
`agentic-eval` currently supports evaluating custom data using Ragas (v0.2.14) metrics.

Data is loaded from a JSONL file, with each entry representing a single user interaction. The data schema must conform to the Ragas format; the individual metrics require slightly different schemas.
The `topic_adherence` metric evaluates the ability of the AI to stay on predefined domains during the interactions.

Each data entry should follow the schema below:
```json
{
  "user_input": [
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"},
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"}
  ],
  "reference_topics": ["science"]
}
```
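A data entry like this can also be assembled programmatically with the Ragas message classes (`HumanMessage`, `AIMessage`, `ToolMessage`, `ToolCall` from `ragas.messages`) before serializing it to JSONL. A minimal sketch with illustrative contents:

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# Illustrative conversation; real contents come from your agent traces
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the weather like in Paris?"),
        AIMessage(
            content="Let me check the forecast.",
            tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
        ),
        ToolMessage(content="Sunny, 22 degrees Celsius"),
        AIMessage(content="It is sunny and 22 degrees Celsius in Paris."),
    ],
    reference_topics=["weather"],
)

sample_dict = sample.to_dict()  # dict form, ready for json.dumps
```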
A dynamic data template is supported for the `topic_adherence` metric. If your dataset uses different keys for each row, for example `messages` and `reference`, you can use the following template instead:
```json
{
  "user_input": "{{ item.messages | tojson }}",
  "reference_topics": "{{ item.reference | tojson }}"
}
```
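The template syntax appears to be Jinja2-style, with the raw row exposed as `item`. Purely as an illustration of the mapping (not the tool's actual rendering code), here is a sketch using the `jinja2` package with hypothetical row keys:

```python
import json

from jinja2 import Environment

# Hypothetical raw dataset row that uses "messages" and "reference" keys
row = {
    "messages": [
        {"content": "Tell me about photosynthesis.", "type": "human"},
        {"content": "Photosynthesis converts light into chemical energy.", "type": "ai"},
    ],
    "reference": ["science"],
}

# The data template from above: each value is a Jinja2 expression
template = {
    "user_input": "{{ item.messages | tojson }}",
    "reference_topics": "{{ item.reference | tojson }}",
}

env = Environment()
mapped = {
    key: json.loads(env.from_string(expr).render(item=row))
    for key, expr in template.items()
}
print(mapped["reference_topics"])  # ['science']
```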
The `tool_call_accuracy` metric evaluates the performance of the LLM in identifying and calling the required tools to complete a given task.

Note: This metric does not use an LLM judge.

Each data entry should follow the schema below:
```json
{
  "user_input": [
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"},
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"}
  ],
  "reference_tool_calls": [
    {"name": "", "args": {"": ""}},
    {"name": "", "args": {"": ""}}
  ]
}
```
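Because this metric does not need an LLM judge, a single entry can be scored directly with Ragas. This is shown only to illustrate what the metric computes; the evaluation harness does this for you:

```python
import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

# Illustrative entry: the agent is expected to call get_weather(city="Paris")
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the weather in Paris?"),
        AIMessage(
            content="Checking.",
            tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
        ),
        ToolMessage(content="Sunny, 22 degrees Celsius"),
        AIMessage(content="It is sunny and 22 degrees Celsius in Paris."),
    ],
    reference_tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
)

score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when the predicted tool calls match the reference
```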
A dynamic data template is supported for the `tool_call_accuracy` metric. If your dataset uses different keys for each row, for example `messages` and `reference`, you can use the following template instead:
```json
{
  "user_input": "{{ item.messages | tojson }}",
  "reference_tool_calls": "{{ item.reference | tojson }}"
}
```
There are two variants of the agent goal accuracy metric: `agent_goal_accuracy_with_reference` and `agent_goal_accuracy_without_reference`. They evaluate the performance of the LLM in identifying and achieving the goals of the user.

Each data entry should follow the schema below:
```json
{
  "user_input": [
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"},
    {"content": "", "type": "human"},
    {"content": "", "type": "ai", "tool_calls": [{"name": "", "args": {"": ""}}]},
    {"content": "", "type": "tool"},
    {"content": "", "type": "ai"}
  ],
  "reference": ""
}
```
Remove the `reference` field if using `agent_goal_accuracy_without_reference`.
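Such an entry can be built the same way as the other multi-turn samples, with the expected outcome placed in `reference`. A sketch with illustrative values (omit `reference` for the without-reference variant):

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# Illustrative goal-oriented interaction
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book me a table for two tonight at 7pm."),
        AIMessage(
            content="Booking now.",
            tool_calls=[ToolCall(name="book_table", args={"guests": 2, "time": "19:00"})],
        ),
        ToolMessage(content="Reservation confirmed."),
        AIMessage(content="Your table for two at 7pm is booked."),
    ],
    reference="A table for two is successfully reserved for 7pm.",
)
```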
A dynamic data template is supported for the `agent_goal_accuracy_with_reference` and `agent_goal_accuracy_without_reference` metrics. If your dataset uses different keys for each row, for example `messages` and `ground_truth`, you can use the following template instead:
```json
{
  "user_input": "{{ item.messages | tojson }}",
  "reference": "{{ item.ground_truth | tojson }}"
}
```
The `answer_accuracy` metric measures the agreement between a model's response and a reference ground truth for a given question. This metric was primarily designed for evaluating RAG pipelines, but it can also be used for agents to evaluate the quality of the response.

Each data entry should follow the schema below:
```json
{
  "user_input": "",
  "response": "",
  "reference": ""
}
```
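Since `answer_accuracy` entries are flat, a JSONL dataset can be produced with plain dictionaries. A minimal sketch; file name and contents are illustrative:

```python
import json

# Illustrative question / answer / reference triples
entries = [
    {
        "user_input": "When was the Eiffel Tower completed?",
        "response": "The Eiffel Tower was completed in 1889.",
        "reference": "It was completed in 1889.",
    },
]

# One JSON object per line, as expected for a JSONL dataset
with open("answer_accuracy.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```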
Each message in `user_input` has a `content` field and a `type` field; messages of type `ai` may additionally carry `tool_calls`, where each tool call has a `name` and `args`.
A programmatic way to convert an example Ragas sample to a JSONL file is:

```python
import json

from ragas.dataset_schema import MultiTurnSample

sample: MultiTurnSample  # a previously constructed Ragas sample

sample_dict = sample.to_dict()
with open(FILE_NAME, "w") as file:
    file.write(json.dumps(sample_dict))
```
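To serialize a whole dataset rather than a single sample, the same idea extends to one JSON object per line. A sketch, assuming `samples` holds previously constructed samples and the output path is arbitrary:

```python
import json

from ragas.dataset_schema import MultiTurnSample

samples: list[MultiTurnSample] = []  # fill with previously constructed samples

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample.to_dict()) + "\n")
```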
Refer to the Ragas API documentation for a better understanding of the schema.

The `topic_adherence`, `agent_goal_accuracy_with_reference`, `agent_goal_accuracy_without_reference`, and `answer_accuracy` metrics use an LLM underneath to do the evaluation. To use these four metrics, you must set up an LLM judge.
Two types of judges are supported: `openai` and `nvidia-nim`. If no judge model is provided, OpenAI `gpt-4o` is used as the default judge model; `API_KEY` is required to access the OpenAI API.
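Conceptually, the judge is a chat model handed to the Ragas metrics. A rough sketch of the underlying Ragas usage (not the agentic-eval configuration itself; it assumes the `langchain-openai` package is installed):

```python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AgentGoalAccuracyWithReference

# Wrap a LangChain chat model so Ragas can use it as the judge;
# the API key can also be passed explicitly via api_key=...
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=1, max_tokens=1024))

metric = AgentGoalAccuracyWithReference(llm=judge)
# metric.multi_turn_ascore(sample) would now use gpt-4o as the judge
```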
The evaluation can be run using NVIDIA's Eval Factory framework. Here's how to invoke it:

```bash
nv_eval run_eval --run_config /path/to/config.yml --output_dir /path/to/results
```
The evaluation runs are configured using a YAML file. Here's an example configuration:
```yaml
config:
  type: agentic_eval_tool_call_accuracy  # or other metric types
  params:
    temperature: 1
    parallelism: 10
    max_new_tokens: 1024
    max_retries: 10
    request_timeout: 10
    extra:
      judge_sanity_check: True
      dataset_path: "/path/to/dataset.jsonl"
      data_template: "/path/to/template.json"  # optional
```
- `type`: The metric type to evaluate (e.g., `agentic_eval_tool_call_accuracy`)
- `params`: General parameters for the evaluation run
  - `temperature`: Controls randomness in the model's output
  - `parallelism`: Number of parallel evaluation tasks
  - `max_new_tokens`: Maximum number of tokens to generate
  - `max_retries`: Maximum number of retry attempts for failed evaluations
  - `request_timeout`: Timeout for API requests in seconds
  - `extra`: Additional metric-specific parameters
    - `judge_sanity_check`: Enable/disable judge model sanity checks
    - `dataset_path`: Path to the dataset file
    - `data_template`: Optional path to a data template file for custom data formats

Example runs for the different metrics:

```bash
nv_eval run_eval --run_config /path/to/test_answer_accuracy.yml --output_dir /path/to/results
nv_eval run_eval --run_config /path/to/test_agent_goal_acc_ref.yml --output_dir /path/to/results
nv_eval run_eval --run_config /path/to/test_topic_adherence.yml --output_dir /path/to/results
nv_eval run_eval --run_config /path/to/test_tool_call.yml --output_dir /path/to/results
```
When using custom data formats, you can specify a template file that defines how to map your data to the required schema:
```json
{
  "user_input": "{{ item.conversations | tojson }}",
  "reference_tool_calls": "{{ item.reference | tojson }}"
}
```
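A raw dataset row that this particular template expects would carry the custom keys, for example (purely illustrative values):

```python
# Hypothetical raw JSONL row using "conversations" and "reference" keys;
# the template above maps it onto "user_input" and "reference_tool_calls"
row = {
    "conversations": [
        {"content": "What is the weather in Paris?", "type": "human"},
        {"content": "", "type": "ai",
         "tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]},
        {"content": "Sunny, 22 degrees Celsius", "type": "tool"},
        {"content": "It is sunny and 22 degrees Celsius in Paris.", "type": "ai"},
    ],
    "reference": [{"name": "get_weather", "args": {"city": "Paris"}}],
}
```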
- `--dataset_path`: Required. Path to a JSONL dataset file formatted according to the schemas above.
- `--output_dir`: Optional. Directory where results will be saved.
- `--metric_name`: Required. One of: `topic_adherence`, `tool_call_accuracy`, `agent_goal_accuracy_with_reference`, `agent_goal_accuracy_without_reference`, `answer_accuracy`.
- `--metric_mode`: Optional. Specific mode for `topic_adherence` (`precision`, `recall`, or `f1`; default `f1`).
- `--judge_model_type`: Optional. One of: `openai`, `nvidia-nim`. Default: `openai`.
- `--judge_model_args`: Optional. JSON string for configuring [ChatOpenAI](https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html) for the judge model. Example: `{"model": "gpt-4o", "temperature": 1, "max_tokens": 1024}`
- `--ragas_run_config`: Optional. JSON string for the Ragas [RunConfig](https://docs.ragas.io/en/stable/references/run_config/#ragas.run_config.RunConfig). Example: `{"max_workers": 2}`
- `--data_template`: Optional. JSON string for data template configuration.
- `--judge_sanity_check`: Optional. Boolean to enable/disable the judge model sanity check. Default: `true`.
For setting up the judge model, you need to provide the API key (e.g., `api_key=123456`) either via `--judge_model_args` or via the environment variable `API_KEY`.
```bash
agentic_eval \
  --dataset_path data.jsonl \
  --output_dir ./results \
  --metric_name topic_adherence \
  --metric_mode recall \
  --judge_model_args '{"model": "gpt-4o", "temperature": 1, "max_tokens": 1024, "api_key": "YOUR_API_KEY"}'
```

```bash
agentic_eval \
  --dataset_path data.jsonl \
  --output_dir ./results \
  --metric_name topic_adherence \
  --metric_mode recall \
  --judge_model_args '{"base_url": "https://meta-llama3-1-8b-instruct.dev.aire.nvidia.com/v1", "model": "gpt-4o", "temperature": 1, "max_tokens": 1024}'
```

```bash
agentic_eval \
  --dataset_path data/adherence.jsonl \
  --output_dir ./results \
  --metric_name topic_adherence \
  --metric_mode recall \
  --judge_model_args '{"base_url": "https://integrate.api.nvidia.com/v1", "model": "nvdev/deepseek-ai/deepseek-r1", "temperature": 1, "max_tokens": 1024, "api_key": "YOUR_API_KEY"}'
```
If `--output_dir` is set, two results files will be saved there:

- `<metric_name>.jsonl`: the scores for each data entry
- `scores.jsonl`: the final score across all the data
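For quick inspection of the per-entry file, something like the following works. This is a sketch only; it assumes each line carries a numeric `score` field and uses a hypothetical results path, both of which may differ from the actual output:

```python
import json

scores = []
with open("results/topic_adherence.jsonl") as f:  # hypothetical output path
    for line in f:
        entry = json.loads(line)
        scores.append(entry["score"])  # assumed field name

print(f"entries: {len(scores)}, mean score: {sum(scores) / len(scores):.3f}")
```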