Supported platform: Linux / amd64
The goal of NVIDIA Evals Factory is to advance and refine state-of-the-art methodologies for model evaluation, and deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.
NVIDIA Evals Factory provides evaluation clients that are specifically built to evaluate model endpoints using our Standard API. A quick end-to-end example:
export MY_API_KEY="your_api_key_here"
$ core_evals_garak ls
Available tasks:
* garak (in garak)
...
core_evals_garak run_eval \
--eval_type garak \
--model_id microsoft/phi-4-mini-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
cat /workspace/results/results.yml
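Since results.yml is plain YAML, it can also be inspected programmatically. A minimal sketch, assuming PyYAML is available in your Python environment (adjust the path to your own output directory):

# pretty-print the results file as JSON for easier reading
python3 -c "import json, yaml; print(json.dumps(yaml.safe_load(open('/workspace/results/results.yml')), indent=2, default=str))"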
Each package comes pre-installed with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the garak package:
core_evals_garak ls
Displays the evaluation types available within the harness.
The core_evals_garak run_eval command executes the evaluation process. Below are the flags and their descriptions:

--eval_type <string>
The type of evaluation to perform.
--model_id <string>
The name or identifier of the model to evaluate.
--model_url <url>
The API endpoint where the model is accessible.
--model_type <string>
The type of the model to evaluate, currently either "chat", "completions", or "vlm".
--output_dir <directory>
The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here. Make sure to use the absolute path.
--api_key_name <string>
The name of the environment variable that stores the Bearer token for the API, if authentication is required.
--run_config <path>
Specifies the path to a YAML file containing the evaluation definition.

Example usage:

core_evals_garak run_eval \
--eval_type garak \
--model_id my_model \
--model_type chat \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir /workspace/evaluation_results
If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:
export MY_API_KEY="your_api_key_here"
core_evals_garak run_eval \
--eval_type garak \
--model_id my_model \
--model_type chat \
--model_url http://localhost:8000/v1/chat/completions \
--api_key_name MY_API_KEY \
--output_dir /workspace/evaluation_results
Evaluations in NVIDIA Evals Factory are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow the Standard API, which ensures consistency across evaluations.
Example of a YAML config:
config:
  type: garak
  params:
    parallelism: 50
    limit_samples: 20
    extra:
      probes: atkgen.Tox
target:
  api_endpoint:
    model_id: microsoft/phi-4-mini-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key: NVIDIA_API_KEY
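To launch an evaluation from such a file, save it locally (the name run_config.yaml below is just an illustration) and point the --run_config flag at it; remaining settings such as the output directory can still be supplied on the command line:

core_evals_garak run_eval \
--eval_type garak \
--run_config run_config.yaml \
--output_dir /workspace/evaluation_results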
The priority of overrides is as follows: command-line arguments take precedence over values in the run config file, which in turn take precedence over the framework defaults.
The --dry_run option allows you to print the final run configuration and command without executing the evaluation:
core_evals_garak run_eval \
--eval_type garak \
--model_id my_model \
--model_type chat \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir /workspace/evaluation_results \
--dry_run
Output:
Rendered config:
command: "cat > garak_config.yaml << 'EOF'\nplugins:\n probe_spec: {{config.params.extra.probes}}\n\
\ extended_detectors: true\n model_type: {% if target.api_endpoint.type == \"\
completions\" %}nim.NVOpenAICompletion{% elif target.api_endpoint.type == \"chat\"\
\ %}nim.NVOpenAIChat{% endif %}\n model_name: {{target.api_endpoint.model_id}}\n\
\ generators:\n nim:\n uri: {{target.api_endpoint.url | replace('/chat/completions',\
\ '') | replace('/completions', '')}}\n {% if config.params.temperature is\
\ not none %}temperature: {{config.params.temperature}}{% endif %}\n {% if\
\ config.params.top_p is not none %}top_p: {{config.params.top_p}}{% endif %}\n\
\ {% if config.params.max_new_tokens is not none %}max_tokens: {{config.params.max_new_tokens}}{%\
\ endif %}\nsystem:\n parallel_attempts: {{config.params.parallelism}}\n lite:\
\ false\nEOF\n{% if target.api_endpoint.api_key is not none %}\nexport NIM_API_KEY=${{target.api_endpoint.api_key}}\
\ &&\n{% else %}\nexport NIM_API_KEY=dummy &&\n{% endif %}\nexport XDG_DATA_HOME={{config.output_dir}}\
\ &&\ngarak --config garak_config.yaml --report_prefix=results\n"
framework_name: garak
pkg_name: garak
config:
  output_dir: /workspace/evaluation_results
  params:
    limit_samples: null
    max_new_tokens: 150
    max_retries: null
    parallelism: 32
    task: garak
    temperature: 0.1
    request_timeout: null
    top_p: 0.7
    extra:
      probes: ansiescape.AnsiEscaped,ansiescape.AnsiRaw,atkgen.Tox,continuation.ContinueSlursReclaimedSlurs,dan.Ablation_Dan_11_0,dan.AutoDANCached,dan.DanInTheWild,divergence.Repeat,encoding.InjectAscii85,encoding.InjectBase16,encoding.InjectBase2048,encoding.InjectBase32,encoding.InjectBase64,encoding.InjectBraille,encoding.InjectEcoji,encoding.InjectHex,encoding.InjectMorse,encoding.InjectNato,encoding.InjectROT13,encoding.InjectUU,encoding.InjectZalgo,exploitation.JinjaTemplatePythonInjection,exploitation.SQLInjectionEcho,goodside.Tag,goodside.ThreatenJSON,goodside.WhoIsRiley,grandma.Slurs,grandma.Substances,grandma.Win10,grandma.Win11,latentinjection.LatentInjectionFactSnippetEiffel,latentinjection.LatentInjectionFactSnippetLegal,latentinjection.LatentInjectionReport,latentinjection.LatentInjectionResume,latentinjection.LatentInjectionTranslationEnFr,latentinjection.LatentInjectionTranslationEnZh,latentinjection.LatentJailbreak,latentinjection.LatentWhois,latentinjection.LatentWhoisSnippet,leakreplay.GuardianCloze,leakreplay.GuardianComplete,leakreplay.LiteratureCloze,leakreplay.LiteratureComplete,leakreplay.NYTCloze,leakreplay.NYTComplete,leakreplay.PotterCloze,leakreplay.PotterComplete,lmrc.Bullying,lmrc.Deadnaming,lmrc.QuackMedicine,lmrc.SexualContent,lmrc.Sexualisation,lmrc.SlurUsage,malwaregen.Evasion,malwaregen.Payload,malwaregen.SubFunctions,malwaregen.TopLevel,misleading.FalseAssertion,packagehallucination.JavaScript,packagehallucination.Python,packagehallucination.Ruby,packagehallucination.Rust,phrasing.FutureTense,phrasing.PastTense,promptinject.HijackHateHumans,promptinject.HijackKillHumans,promptinject.HijackLongPrompt,realtoxicityprompts.RTPBlank,snowball.GraphConnectivity,suffix.GCGCached,tap.TAPCached,topic.WordnetControversial,xss.ColabAIDataLeakage,xss.MarkdownImageExfil,xss.MdExfil20230929,xss.StringAssemblyDataExfil
  supported_endpoint_types:
  - chat
  - completions
  type: garak
target:
  api_endpoint:
    api_key: null
    model_id: my_model
    stream: null
    type: chat
    url: http://localhost:8000/v1/chat/completions
Rendered command:
cat > garak_config.yaml << 'EOF'
plugins:
  probe_spec: ansiescape.AnsiEscaped,ansiescape.AnsiRaw,atkgen.Tox,continuation.ContinueSlursReclaimedSlurs,dan.Ablation_Dan_11_0,dan.AutoDANCached,dan.DanInTheWild,divergence.Repeat,encoding.InjectAscii85,encoding.InjectBase16,encoding.InjectBase2048,encoding.InjectBase32,encoding.InjectBase64,encoding.InjectBraille,encoding.InjectEcoji,encoding.InjectHex,encoding.InjectMorse,encoding.InjectNato,encoding.InjectROT13,encoding.InjectUU,encoding.InjectZalgo,exploitation.JinjaTemplatePythonInjection,exploitation.SQLInjectionEcho,goodside.Tag,goodside.ThreatenJSON,goodside.WhoIsRiley,grandma.Slurs,grandma.Substances,grandma.Win10,grandma.Win11,latentinjection.LatentInjectionFactSnippetEiffel,latentinjection.LatentInjectionFactSnippetLegal,latentinjection.LatentInjectionReport,latentinjection.LatentInjectionResume,latentinjection.LatentInjectionTranslationEnFr,latentinjection.LatentInjectionTranslationEnZh,latentinjection.LatentJailbreak,latentinjection.LatentWhois,latentinjection.LatentWhoisSnippet,leakreplay.GuardianCloze,leakreplay.GuardianComplete,leakreplay.LiteratureCloze,leakreplay.LiteratureComplete,leakreplay.NYTCloze,leakreplay.NYTComplete,leakreplay.PotterCloze,leakreplay.PotterComplete,lmrc.Bullying,lmrc.Deadnaming,lmrc.QuackMedicine,lmrc.SexualContent,lmrc.Sexualisation,lmrc.SlurUsage,malwaregen.Evasion,malwaregen.Payload,malwaregen.SubFunctions,malwaregen.TopLevel,misleading.FalseAssertion,packagehallucination.JavaScript,packagehallucination.Python,packagehallucination.Ruby,packagehallucination.Rust,phrasing.FutureTense,phrasing.PastTense,promptinject.HijackHateHumans,promptinject.HijackKillHumans,promptinject.HijackLongPrompt,realtoxicityprompts.RTPBlank,snowball.GraphConnectivity,suffix.GCGCached,tap.TAPCached,topic.WordnetControversial,xss.ColabAIDataLeakage,xss.MarkdownImageExfil,xss.MdExfil20230929,xss.StringAssemblyDataExfil
  extended_detectors: true
  model_type: nim.NVOpenAIChat
  model_name: my_model
  generators:
    nim:
      uri: http://localhost:8000/v1
      temperature: 0.1
      top_p: 0.7
      max_tokens: 150
system:
  parallel_attempts: 32
  lite: false
EOF
export NIM_API_KEY=dummy &&
export XDG_DATA_HOME=/workspace/evaluation_results &&
garak --config garak_config.yaml --report_prefix=results
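You can also combine --run_config with --dry_run to inspect how file-based settings and command-line flags are merged before launching anything. A sketch, reusing the hypothetical run_config.yaml from the configuration section and overriding the endpoint URL on the command line:

core_evals_garak run_eval \
--eval_type garak \
--run_config run_config.yaml \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir /workspace/evaluation_results \
--dry_run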
NVIDIA Evals Factory uses a client-server communication architecture to interact with the model. As a prerequisite, the model must be deployed as an endpoint with a NIM-compatible API.
Users have the flexibility to deploy their model using their own infrastructure and tooling.
Servers with APIs that conform to the OpenAI/NIM API standard are expected to work seamlessly out of the box.
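If you are unsure whether your deployment qualifies, a quick smoke test is to send a single OpenAI-style chat completion request to the endpoint before starting an evaluation (the URL, model name, and API key variable below are placeholders):

# expect an HTTP 200 response whose JSON body contains a "choices" array
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $MY_API_KEY" \
-d '{"model": "my_model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'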