# NVIDIA Evals Factory

The goal of NVIDIA Evals Factory is to advance and refine state-of-the-art methodologies for model evaluation, and deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.

# Quick start guide

NVIDIA Evals Factory provides you with evaluation clients that are specifically built to evaluate model endpoints using our Standard API.

## Launching an evaluation for an LLM

1. (Optional) Set a token for your API endpoint if it is protected:

   ```bash
   export MY_API_KEY="your_api_key_here"
   ```

2. List the available evaluations:

   ```bash
   $ eval-factory ls
   Available tasks:
   * ai2d_judge (in vlmevalkit)
   * chartqa (in vlmevalkit)
   * mathvista-mini (in vlmevalkit)
   * mmmu_judge (in vlmevalkit)
   * ocrbench (in vlmevalkit)
   * slidevqa (in vlmevalkit)
   ...
   ```

3. Run the evaluation of your choice:

   ```bash
   eval-factory run_eval \
       --eval_type ocrbench \
       --model_id microsoft/phi-4-multimodal-instruct \
       --model_url https://integrate.api.nvidia.com/v1/chat/completions \
       --model_type vlm \
       --api_key_name MY_API_KEY \
       --output_dir /workspace/results
   ```

4. Gather the results:

   ```bash
   cat /workspace/results/results.yml
   ```

# Command-Line Tool

Each package comes pre-installed with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for `vlmevalkit`:

## Commands

### 1. **List Evaluation Types**

```bash
eval-factory ls
```

Displays the evaluation types available within the harness.

### 2. **Run an evaluation**

The `eval-factory run_eval` command executes the evaluation process. Below are the flags and their descriptions:

### Required flags

* `--eval_type`: The type of evaluation to perform.
* `--model_id`: The name or identifier of the model to evaluate.
* `--model_url`: The API endpoint where the model is accessible.
* `--model_type`: The type of the model to evaluate, currently one of "chat", "completions", or "vlm".
* `--output_dir`: The directory to use as the working directory for the evaluation. The results, including the `results.yml` output file, are saved here.

### Optional flags

* `--api_key_name`: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
* `--run_config`: The path to a YAML file containing the evaluation definition.

### Example

```bash
core_evals_vlmevalkit run_eval \
    --eval_type ocrbench \
    --model_id my_model \
    --model_type vlm \
    --model_url http://localhost:8000/v1/chat/completions \
    --output_dir ./evaluation_results
```

If the model API requires authentication, set the API key in an environment variable and reference it using the `--api_key_name` flag:

```bash
export MY_API_KEY="your_api_key_here"

core_evals_vlmevalkit run_eval \
    --eval_type ocrbench \
    --model_id my_model \
    --model_type vlm \
    --model_url http://localhost:8000/v1/chat/completions \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
```

# Configuring evaluations via YAML

Evaluations in NVIDIA Evals Factory are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, which ensures consistency across evaluations.

Example of a YAML config:

```yaml
config:
  type: ocrbench
  params:
    parallelism: 50
    limit_samples: 20
target:
  api_endpoint:
    model_id: microsoft/phi-4-multimodal-instruct
    type: vlm
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key: NVIDIA_API_KEY
```

The priority of overrides, from highest to lowest, is as follows:

1. command line arguments
2. user config (as seen above)
3. task defaults (defined per task type)
4. framework defaults

The `--dry_run` option prints the final run configuration and command without executing the evaluation.
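The override order listed above can be pictured as a layered merge in which each higher-priority source overwrites only the keys it sets. The following is a minimal Python sketch of that idea, not the package's actual implementation: the helper names `deep_merge` and `resolve_config` are hypothetical, and the layer contents are illustrative values chosen for this example (the CLI-level `limit_samples` override in particular is an assumption).

```python
from copy import deepcopy
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged key by key; any other value is replaced wholesale.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(*layers: dict) -> dict:
    """Merge layers passed from lowest to highest priority."""
    return reduce(deep_merge, layers, {})


# Illustrative layers only (hypothetical values, lowest priority first).
framework_defaults = {"config": {"params": {"parallelism": 4, "limit_samples": None}}}
task_defaults = {"config": {"params": {"max_new_tokens": 2048}}}
user_config = {
    "config": {"type": "ocrbench", "params": {"parallelism": 50, "limit_samples": 20}}
}
cli_args = {"config": {"params": {"limit_samples": 5}}}

final = resolve_config(framework_defaults, task_defaults, user_config, cli_args)
print(final["config"]["params"])
```

In this sketch `parallelism` ends up with the user-config value (50), `limit_samples` with the command-line value (5), and `max_new_tokens` falls through from the task defaults. To see what the tool itself resolved for a real run, use `--dry_run`.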
### Example:

```bash
core_evals_vlmevalkit run_eval \
    --eval_type ocrbench \
    --model_id my_model \
    --model_type vlm \
    --model_url http://localhost:8000/v1/chat/completions \
    --output_dir .evaluation_results \
    --dry_run
```

Output:

```bash
Rendered config:

command: "cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'\n{\n  \"model\"\
  : {\n    \"{{target.api_endpoint.model_id.split('/')[-1]}}\": {\n      \"class\"\
  : \"CustomOAIEndpoint\",\n      \"model\": \"{{target.api_endpoint.model_id}}\"\
  ,\n      \"api_base\": \"{{target.api_endpoint.url}}\",\n      \"api_key_var_name\"\
  : \"{{target.api_endpoint.api_key}}\",\n      \"max_tokens\": {{config.params.max_new_tokens}},\n\
  \      \"temperature\": {{config.params.temperature}},{% if config.params.top_p\
  \ is not none %}\n      \"top_p\": {{config.params.top_p}},{% endif %}\n      \"\
  retry\": {{config.params.max_retries}},\n      \"timeout\": {{config.params.request_timeout}}{%\
  \ if config.params.extra.wait is defined %},\n      \"wait\": {{config.params.extra.wait}}{%\
  \ endif %}{% if config.params.extra.img_size is defined %},\n      \"img_size\"\
  : {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail\
  \ is defined %},\n      \"img_detail\": \"{{config.params.extra.img_detail}}\"{%\
  \ endif %}{% if config.params.extra.system_prompt is defined %},\n      \"system_prompt\"\
  : \"{{config.params.extra.system_prompt}}\"{% endif %}{% if config.params.extra.verbose\
  \ is defined %},\n      \"verbose\": {{config.params.extra.verbose}}{% endif %}\n\
  \    }\n  },\n  \"data\": {\n    \"{{config.params.extra.dataset.name}}\": {\n   \
  \   \"class\": \"{{config.params.extra.dataset.class}}\",\n      \"dataset\":\
  \ \"{{config.params.extra.dataset.name}}\",\n      \"model\": \"{{target.api_endpoint.model_id}}\"\
  \n    }\n  }\n}\nEOF\npython -m vlmeval.run \\\n  --config {{config.output_dir}}/vlmeval_config.json\
  \ \\\n  --work-dir {{config.output_dir}} \\\n  --api-nproc {{config.params.parallelism}}\
  \ \\\n  {%- if config.params.extra.judge is defined %}\n  --judge {{config.params.extra.judge.model}}\
  \ \\\n  --judge-args '{{config.params.extra.judge.args}}' \\\n  {%- endif %}\n \
  \ {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{%\
  \ endif %}\n"
framework_name: vlmevalkit
pkg_name: vlmeval
config:
  output_dir: .evaluation_results
  params:
    limit_samples: null
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 4
    task: null
    temperature: 0.0
    request_timeout: 60
    top_p: null
    extra:
      dataset:
        name: OCRBench
        class: OCRBench
  supported_endpoint_types:
  - vlm
  type: ocrbench
target:
  api_endpoint:
    api_key: null
    model_id: my_model
    stream: null
    type: vlm
    url: http://localhost:8000/v1/chat/completions

Rendered command:

cat > .evaluation_results/vlmeval_config.json << 'EOF'
{
  "model": {
    "my_model": {
      "class": "CustomOAIEndpoint",
      "model": "my_model",
      "api_base": "http://localhost:8000/v1/chat/completions",
      "api_key_var_name": "None",
      "max_tokens": 2048,
      "temperature": 0.0,
      "retry": 5,
      "timeout": 60
    }
  },
  "data": {
    "OCRBench": {
      "class": "OCRBench",
      "dataset": "OCRBench",
      "model": "my_model"
    }
  }
}
EOF
python -m vlmeval.run \
  --config .evaluation_results/vlmeval_config.json \
  --work-dir .evaluation_results \
  --api-nproc 4 \
```

# FAQ

## Deploying a model as an endpoint

NVIDIA Evals Factory uses a client-server architecture to interact with the model. As a prerequisite, the **model must be deployed as an endpoint with a NIM-compatible API**.

Users have the flexibility to deploy their model using their own infrastructure and tooling.

Servers with APIs that conform to the OpenAI/NIM API standard are expected to work out of the box.

### 3rd Party Source Code

Users can download the third-party source code through the URL provided in the container's README located in workdir.