NGC Catalog

NemoGuard JailbreakDetect

Description: Container for classifying jailbreak attempts using NemoGuard JailbreakDetect
Publisher: NVIDIA
Latest Tag: 1.0.0
Modified: May 24, 2025
Compressed Size: 9.58 GB
Multinode Support: No
Multi-Arch Support: No
Security Scan Results (1.0.0, Latest): Linux / amd64


Ardennes is a random forest model trained by NVIDIA on snowflake-arctic-embed-m-long embeddings to detect attempts to jailbreak large language models. At the time of release, it is the best known publicly available model for detecting LLM jailbreak attempts.

Additional details about the model, including comparisons to other public models, are available in the accompanying paper, accepted to the 2025 AAAI Workshop on AI for Cyber Security (AICS).

Ardennes NIM Usage

Setup

One-time access setup, as needed:

export NGC_API_KEY=<YOUR NGC API KEY>
docker login nvcr.io

# Username: $oauthtoken
# Password: <NGC_API_KEY>

Note: if you have logged in to nvcr.io with a different key before, docker login may succeed silently using cached credentials instead of prompting for the new key. In that case, remove the cached entry from ~/.docker/config.json and log in again.
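Given the cached-credentials pitfall above, it can help to fail fast when the key is missing before attempting a login. A minimal sketch (the guard function and dummy fallback value are illustrative, not part of the NIM setup; the commented login line uses Docker's standard --password-stdin flag to avoid the interactive prompt):

```shell
# require_env NAME: succeed and report if the named variable is set, else fail.
require_env() {
  name="$1"
  eval "value=\${$name:-}"
  if [ -n "$value" ]; then
    echo "$name is set"
  else
    echo "error: $name is not set" >&2
    return 1
  fi
}

# In real use NGC_API_KEY comes from the export above; a dummy keeps the sketch self-contained.
NGC_API_KEY="${NGC_API_KEY:-dummy-for-demo}"
require_env NGC_API_KEY

# Then log in non-interactively, so a stale cached credential is never silently reused:
# echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```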

Serving the model as a NIM

We provide the Ardennes model as an NVIDIA NIM, so you can simply pull the image from nvcr.io and run it.

#!/bin/bash

export NGC_API_KEY=<your NGC personal key with access to the "nvstaging/nim" org/team>
export NIM_IMAGE='nvcr.io/nvstaging/nim/ardennes-jailbreak-arctic-nim:v0.1'
export MODEL_NAME='ardennes-jailbreak-arctic'
docker pull $NIM_IMAGE

And go!

docker run -it --name=$MODEL_NAME \
    --gpus=all --runtime=nvidia \
    -e NGC_API_KEY="$NGC_API_KEY" \
    -p 8000:8000 \
    $NIM_IMAGE

Note: -p 8000:8000 publishes the API port to the host (plain --expose only documents the port and would leave the endpoint unreachable from outside the container).
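The server needs some time to load the model before it can answer requests. A bounded-retry polling sketch you can run before sending traffic (the function name, probe body, and retry budget are illustrative assumptions, not part of the NIM contract; it probes the classify endpoint described below):

```shell
# wait_for_nim URL TRIES: poll the endpoint until it answers, capped at TRIES attempts.
wait_for_nim() {
  url="$1"
  tries="$2"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s silences progress output, -f turns HTTP errors into a nonzero exit code.
    if curl -sf -o /dev/null -H 'Content-Type: application/json' \
         --data '{"input": "ping"}' "$url"; then
      echo "NIM is ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "NIM not ready after $tries attempts" >&2
  return 1
}

# Example (uncomment once the container above is running):
# wait_for_nim http://0.0.0.0:8000/v1/classify 30
```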

Running inference with the NIM

The running NIM container exposes a standard REST API; send POST requests with a JSON body to the /v1/classify endpoint to get model responses.

$ curl http://0.0.0.0:8000/v1/classify \
    --header "Content-Type: application/json" \
    --header "Accept: application/json" \
    --data '{"input": "hello this is a test"}'

This returns a JSON dictionary with the model's prediction of whether the provided input is a jailbreak attempt, together with the classifier's score.

{"jailbreak": false, "score": -0.9921652427737031}
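To consume the response in a script, the JSON can be parsed with python3 from the shell without extra dependencies. A sketch using the sample payload shown above (the variable names are illustrative):

```shell
# Extract the "jailbreak" field from a classify response.
RESPONSE='{"jailbreak": false, "score": -0.9921652427737031}'   # in practice: RESPONSE=$(curl ... /v1/classify)
JAILBREAK=$(printf '%s' "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["jailbreak"])')
echo "jailbreak: $JAILBREAK"   # prints: jailbreak: False
```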

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.