# NVIDIA Evals Factory

The goal of NVIDIA Evals Factory is to advance and refine state-of-the-art methodologies for model evaluation, and deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.

# Quick start guide

NVIDIA Evals Factory provides you with evaluation clients that are specifically built to evaluate model endpoints using our Standard API.

## Launching an evaluation for an LLM

1. (Optional) Set a token for your API endpoint if it is protected:

   ```bash
   export MY_API_KEY="your_api_key_here"
   ```

2. List the available evaluations:

   ```bash
   $ eval-factory ls
   Available tasks:
   * bfclv2 (in bfcl)
   * bfclv2_ast (in bfcl)
   * bfclv3 (in bfcl)
   * bfclv3_ast (in bfcl)
   ...
   ```

3. Run the evaluation of your choice:

   ```bash
   eval-factory run_eval \
       --eval_type bfclv3_ast \
       --model_id meta/llama-3.1-70b-instruct \
       --model_url https://integrate.api.nvidia.com/v1/chat/completions \
       --model_type chat \
       --api_key_name MY_API_KEY \
       --output_dir /workspace/results
   ```

4. Gather the results:

   ```bash
   cat /workspace/results/results.yml
   ```

# Command-Line Tool

Each package comes pre-installed with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for `bfcl`:

## Commands

### 1. **List Evaluation Types**

```bash
eval-factory ls
```

Displays the evaluation types available within the harness.

### 2. **Run an evaluation**

The `eval-factory run_eval` command executes the evaluation process. Below are the flags and their descriptions:

### Required flags

* `--eval_type`: The type of evaluation to perform.
* `--model_id`: The name or identifier of the model to evaluate.
* `--model_url`: The API endpoint where the model is accessible.
* `--model_type`: The type of the model to evaluate, currently one of "chat", "completions", or "vlm".
* `--output_dir`: The directory to use as the working directory for the evaluation.
  The results, including the `results.yml` output file, will be saved here.

### Optional flags

* `--api_key_name`: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
* `--run_config`: The path to a YAML file containing the evaluation definition.

### Example

```bash
core_evals_bfcl run_eval \
    --eval_type bfclv3_ast \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --output_dir ./evaluation_results
```

If the model API requires authentication, set the API key in an environment variable and reference it using the `--api_key_name` flag:

```bash
export MY_API_KEY="your_api_key_here"

core_evals_bfcl run_eval \
    --eval_type bfclv3_ast \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
```

# Configuring evaluations via YAML

Evaluations in NVIDIA Evals Factory are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, which ensures consistency across evaluations.

Example of a YAML config:

```yaml
config:
  type: bfclv3_ast
  params:
    parallelism: 50
    limit_samples: 20
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key: NVIDIA_API_KEY
```

Settings are resolved in the following order of priority, from highest to lowest:

1. command line arguments
2. user config (as seen above)
3. task defaults (defined per task type)
4. framework defaults

The `--dry_run` option prints the final run configuration and command without executing the evaluation.
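The override order can be illustrated with a layered merge. The following is a minimal Python sketch, not the actual Evals Factory implementation: the function names `deep_merge` and `resolve_config`, the framework-default values, and the rule that a `None` value means "not set" are all assumptions made for illustration.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict in which `override` takes precedence over `base`.

    Nested dicts are merged key by key. A None value in `override` is
    treated as "not set" and leaves the base value alone (an assumption
    of this sketch, not documented behavior).
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        elif value is not None:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(framework_defaults, task_defaults, user_config, cli_args):
    """Apply the layers from lowest to highest priority."""
    result = {}
    for layer in (framework_defaults, task_defaults, user_config, cli_args):
        result = deep_merge(result, layer)
    return result


# Hypothetical layer contents, loosely modeled on the YAML example above.
framework_defaults = {"config": {"params": {"parallelism": 1}}}
task_defaults = {
    "config": {
        "type": "bfclv3_ast",
        "params": {"parallelism": 10, "task": "multi_turn,ast"},
    }
}
user_config = {
    "config": {"params": {"parallelism": 50, "limit_samples": 20}},
    "target": {
        "api_endpoint": {"model_id": "meta/llama-3.1-8b-instruct", "type": "chat"}
    },
}
cli_args = {"target": {"api_endpoint": {"model_id": "my_model"}}}

final = resolve_config(framework_defaults, task_defaults, user_config, cli_args)
print(final)
```

In this sketch the user config raises `parallelism` above the task default, while the command-line `model_id` replaces the one from the user config; values that no higher layer sets, such as `task`, fall through from the task defaults.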
### Example:

```bash
core_evals_bfcl run_eval \
    --eval_type bfclv3_ast \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --output_dir .evaluation_results \
    --dry_run
```

Output:

```bash
Rendered config:

command: '{% if target.api_endpoint.api_key is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && {% if target.api_endpoint.api_key is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} '
framework_name: bfcl
pkg_name: bfcl
config:
  output_dir: .evaluation_results
  params:
    limit_samples: null
    max_new_tokens: null
    max_retries: null
    parallelism: 10
    task: multi_turn,ast
    temperature: null
    timeout: null
    top_p: null
    extra: {}
  supported_endpoint_types:
  - llm
  - vlm
  type: bfclv3_ast
target:
  api_endpoint:
    api_key: null
    model_id: my_model
    stream: null
    type: chat
    url: http://localhost:8000

Rendered command:

bfcl generate --model my_model --test-category multi_turn,ast --model-mapping oai --result-dir .evaluation_results --model-args base_url=http://localhost:8000 --num-threads 10 && bfcl evaluate --model my_model --test-category multi_turn,ast --model-mapping oai --result-dir .evaluation_results --score-dir .evaluation_results
```

# FAQ

## BFCL only - API Keys for Executable Test Categories

If you want to run executable test categories, you must provide API keys. Add the keys to your `.env` file so that the placeholder values used in questions/params/answers can be replaced with real data.
There are 4 API keys to include:

1. RAPID-API Key:
   - Yahoo Finance
   - Real Time Amazon Data
   - Urban Dictionary
   - Covid 19
   - Time zone by Location

   All the Rapid APIs we use have a free usage tier. You need to **subscribe** to those API providers to set up the executable test environment, but it is _free of charge_!

2. Exchange Rate API
3. OMDB API
4. Geocode API

## Deploying a model as an endpoint

NVIDIA Evals Factory uses a client-server communication architecture to interact with the model. As a prerequisite, the **model must be deployed as an endpoint with a NIM-compatible API**.

Users have the flexibility to deploy their model using their own infrastructure and tooling.

Servers with APIs that conform to the OpenAI/NIM API standard are expected to work seamlessly out of the box.

### 3rd Party Source Code

Users can download the third-party source code through the URL provided in the container's README located in workdir.