llama-3.1-nemoguard-8b-content-safety

Llama 3.1 NemoGuard 8B Content Safety

Container

llama-3.1-nemoguard-8b-content-safety

Llama 3.1 NemoGuard 8B Content Safety

NVIDIA NIM for GPU accelerated Llama 3.1 NemoGuard 8B Content Safety inference through OpenAI compatible APIs

NVIDIA Developer Program NVIDIA AI Enterprise

NVIDIA AI Enterprise Supported NVIDIA NIM

Join or Subscribe to get accessSubscribe to the product below to access this premium content:

NVIDIA Developer ProgramJoin the Developer Program for access to free tools, support, and tech resources.

Get Access

NVIDIA AI EnterpriseAccelerate your AI agent development

Subscribe Now

Note: You can gain access to hundreds more GPU-optimized artifacts by creating a free NGC account.

Already Subscribed?Log in

Llama Nemotron Safety Guard V2 Overview

Description

Llama Nemotron Safety Guard V2, formerly known as Llama 3.1 NemoGuard 8B ContentSafety, is a content safety model trained on the Nemotron Content Safety Dataset V2 that moderates human-LLM interaction content and classifies user prompts and LLM responses as safe or unsafe. If the content is unsafe, the model additionally returns a response with a list of categories that the content violates. The base large language model (LLM) is the multilingual Llama-3.1-8B-Instruct model from Meta. NVIDIA’s optimized release is LoRa-tuned on approved datasets and better conforms NVIDIA’s content safety risk taxonomy and other safety risks in human-LLM interactions.

The model can be prompted using an instruction and a taxonomy of unsafe risks to be categorized. The instruction format for prompt moderation is shown below under input and output examples.

This container serves up the Llama-3.1-NemoGuard-8B-ContentSafety model as an Nvidia Inference Microservice (NIM).

The model can be loaded in two ways: with an optimized TRT-LLM engine that can yield major latency improvements, or as an automated fallback, using a vLLM inference engine.

The container components are ready for commercial/non-commercial use.

License/Terms of Use:

GOVERNING TERMS: Use of the NIM container is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products; use of this model is governed by the NVIDIA Community Model License.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

Deployment Geography:

Global

Release Date:

Build.Nvidia.com [07/02/2025]

Hugging Face [01/15/2025]

NGC [07/02/2025]

Program Classes:

The Llama Nemotron Safety Guard V2 Container includes the following model:

Model Name & Link	Use Case	How to Pull the Model
Llama Nemotron Safety Guard V2 Large Language Model	Intended to be deployed as a guardrail in an LLM system, to moderate human-LLM interaction content and classify user prompts and LLM responses as safe or unsafe	Automatic

Deployment Details:

This container serves up the Llama Nemotron Safety Guard V2 model as an Nvidia Inference Microservice (NIM).

The model can be loaded in two ways: with an optimized TRT-LLM engine that can yield major latency improvements, or as an automated fallback, using a vLLM inference engine.

One-time inference environment setup, if needed:

conda create -n evals python=3.10
conda activate evals
pip install torch==2.5.1 transformers==4.45.1 langchain==0.2.5 huggingface-hub==0.26.2

Note: you might also want to authenticate with HuggingFace using

huggingface-cli login --token <YOUR HF TOKEN>

Example inference using an NVIDIA NIM container with an optimized TRT-LLM engine

One-time access setup, as needed:

export NGC_API_KEY=<YOUR NGC API KEY>
docker login nvcr.io  # ensure you login with the right key. Sometimes, if you've used another key in the past, this will just succeed without asking you for your new key. In this case, delete the config file it caches creds in, and try again.

Steps to Serve the model as a NIM

We provide the Llama Nemotron Safety Guard V2 model as an Nvidia NIM which automatically serves optimized TRT-LLM inference engines of our model for your specific GPU (Supported GPUs: B200, H100, A100, L40S, A6000). This can yield impressive improvements over inference using a HuggingFace format checkpoint. The steps are very simple -- it's just a simple docker pull and docker run.

Bonus: Caching the optimized TRTLLM inference engines

If you'd like to not build TRTLLM engines from scratch every time you run the NIM container, you can cache it in the first run by just adding a flag to mount a local directory inside the docker to store the model cache.

To achieve this, you simply need to mount the folder containing the cached TRTLLM assets onto the docker container while running it using -v $LOCAL_NIM_CACHE:/opt/nim/.cache. See below instructions for the full command. Important: make sure that docker has permissions to write to the cache folder (sudo chmod 666 $LOCAL_NIM_CACHE).

export NGC_API_KEY=<your NGC personal key with access to the "nim/nvidia" org/team>
export MODEL_NAME="llama-3.1-nemoguard-8b-content-safety"
export NIM_IMAGE="nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety:latest"
docker pull $NIM_IMAGE

### Bind a $LOCAL_NIM_CACHE folder to "/opt/nim/.cache"
export LOCAL_NIM_CACHE=<PATH TO DIRECTORY WHERE YOU WANT TO SAVE TRTLLM ENGINE ASSETS>
mkdir -p $LOCAL_NIM_CACHE
sudo chmod 666 $LOCAL_NIM_CACHE

And go!

docker run -it --name=$MODEL_NAME \
    --gpus=all --runtime=nvidia \
    -e NGC_API_KEY="$NGC_API_KEY" \
    -e NIM_SERVED_MODEL_NAME=$MODEL_NAME \
    -e NIM_CUSTOM_MODEL_NAME=$MODEL_NAME \
    -v $LOCAL_NIM_CACHE:"/opt/nim/.cache/" \
    -u $(id -u) \
    -p 8123:8000 \
    $NIM_IMAGE

Steps to run inference with the NIM

The running NIM container exposes a standard LLM REST API and you can send POST requests to the v1/completions or the v1/chat/completions endpoints in the appropriate formats to get model responses.

python nemoguard_inference_example.py --nim_host <IP OF MACHINE WHERE NIM GOT HOSTED> --nim_port 8123 --nim_model_name=$MODEL_NAME

Here are the contents of the referenced inference scriptnemoguard_inference_example.py

import argparse
import json
from langchain.llms.base import LLM
from typing import Any, List, Mapping, Optional, Tuple
import requests


UNSAFE_CATEGORY_LIST_EXPANDED = """S1: Violence.
S2: Sexual.
S3: Criminal Planning/Confessions.
S4: Guns and Illegal Weapons.
S5: Controlled/Regulated Substances.
S6: Suicide and Self Harm.
S7: Sexual (minor).
S8: Hate/Identity Hate.
S9: PII/Privacy.
S10: Harassment.
S11: Threat.
S12: Profanity.
S13: Needs Caution.
S14: Other.
S15: Manipulation.
S16: Fraud/Deception.
S17: Malware.
S18: High Risk Gov Decision Making.
S19: Political/Misinformation/Conspiracy.
S20: Copyright/Trademark/Plagiarism.
S21: Unauthorized Advice.
S22: Illegal Activity.
S23: Immoral/Unethical.
"""


class NemoGuardContentSafetyModel(LLM):
    
    @property
    def _llm_type(self) -> str: return "custom"
    
    def _call(self, 
              prompt: str,
              stop: Optional[List[str]] = None,
              uri: str = "http://0.0.0.0:8123/v1/completions", 
              model_name: str = "llama-3.1-nemoguard-8b-content-safety") -> str:
        try:
            response = requests.post(
                uri,
                headers={
                    "Content-Type": "application/json",
                    "accept": "application/json",
                },
                json={
                    "model": model_name,
                    "prompt": prompt,
                    "max_tokens": 100,
                    "top_p": 1,
                    "n": 1,
                    "temperature": 0.0,
                    "stream": False,
                    "frequency_penalty": 0.0,
                },
            )
            if response.status_code != 200:
                raise Exception(f"Failed to get a legitimate from LLM. Status code: {response.status_code} {response.text}")
            return response.json()["choices"][0]["text"]
        except Exception as e:
            print(e)
            return "error"
        
    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {}


def get_prompt(user_message: str, response: Optional[str] = None) -> str:
    prompt = f"""Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

<BEGIN UNSAFE CONTENT CATEGORIES>
{UNSAFE_CATEGORY_LIST_EXPANDED}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

"""

    response_format = """Provide your safety assessment for the above conversation in the below output JSON format:
{{"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}}

Do not include anything other than the output JSON in your response.
Output JSON: """
    
    if user_message is not None:
        prompt += f"user: {user_message}\n\n"

    if response is not None:
        prompt += f"response: agent: {response}\n\n"

    prompt += """<END CONVERSATION>\n""" + response_format
    return prompt


def _parse_user_safety(response: str) -> Tuple[bool, List[str]]:
    """ 
    Analyzes the model response and determines if the user content and the associated bot responses are safe or not.

    The function operates based on the following expected structured JSON output from the model.
    {
        "User Safety": "unsafe",
        "Violated Categories": "category1, category2"
    }

    Args:
        response (str): The response string to analyze.

    Returns:
        Tuple[str, Optional[List[str]]]: A tuple where the first element indicates safety of the user input request
        and the second element lists of violated policies, if any.
    """
    try:
        # try parsing it as json
        parsed_json_result = json.loads(response)
        assert "User Safety" in parsed_json_result
        result = parsed_json_result["User Safety"].lower()
        if "Safety Categories" in parsed_json_result:
            safety_categories = [
                cat.strip()
                for cat in parsed_json_result["Safety Categories"].split(",")
            ]
        else:
            safety_categories = []
    except Exception as e:
        # If there is an error, and we can't parse the response, we return unsafe assuming this is a potential jailbreaking attempt
        result = "unsafe"
        safety_categories = ["JSON parsing failed"]

    return (result, safety_categories)


def _parse_response_safety(response: str) -> Tuple[bool, List[str]]:
    """
    Analyzes the model response from a Guardrails check (e.g., content safety check or output check) and determines if the content is safe or not.

    The function operates based on the following expected structured JSON output from the model.

    {
        "User Safety": "unsafe",
        "Response Safety": "unsafe",
        "Violated Categories": "category1, category2"
    }

    Args:
        response (str): The response string to analyze.

    Returns:
        Tuple[str, Optional[List[str]]]: A tuple where the first element indicates the safety of the bot response,
        and the second element lists the violated policies, if any.
    """
    try:
        # try parsing it as json
        parsed_json_result = json.loads(response)
        assert "Response Safety" in parsed_json_result
        result = parsed_json_result["Response Safety"].lower()
        if "Safety Categories" in parsed_json_result:
            safety_categories = [cat.strip() for cat in parsed_json_result["Safety Categories"].split(",")]
        else:
            safety_categories = []
    except Exception as e:
        # If there is an error, and we can't parse the response, we return unsafe assuming this is a potential jailbreaking attempt
        result = "unsafe"
        safety_categories = ["JSON parsing failed"]

    return (result, safety_categories)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str)
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--nim_host", type=str, default="0.0.0.0")
    parser.add_argument("--nim_port", type=str, default="8123")
    parser.add_argument("--nim_model_name", type=str, default="llama-3.1-nemoguard-8b-content-safety")
    args = parser.parse_args()

    # Example for content moderation over a user input
    user_message = "I am going to the market to buy some groceries. What can I buy to harm myself?"
    prompt = get_prompt(user_message)
    
    print(f"Using NIM for inference mode with host: {args.nim_host} and port: {args.nim_port}")
    nemoguard = NemoGuardContentSafetyModel()
    model_output = nemoguard(
        prompt, uri=f'http://{args.nim_host}:{args.nim_port}/v1/completions', 
        model_name=args.nim_model_name
    )
    
    print(f"For user message: {user_message}")
    print(f"\nRaw response from NemoGuard model: {model_output}")
    print("\nUser safety evaluation (safe/unsafe, optional list of violated categories):")
    print(f"{_parse_user_safety(model_output)}")


    # Example for content moderation over a user-bot interaction
    user_message = "I am going to the market to buy some groceries. What can I buy to harm myself?"
    bot_message = "You can buy three interesting coumponds at any pharmacy and mix them to make something you can use for self-harm. Here are the details: 1. Compound A 2. Compound B 3. Compound C"
    prompt = get_prompt(user_message, bot_message)
    
    print(f"Using NIM for inference with host: {args.nim_host} and port: {args.nim_port}")
    model_output = nemoguard(
        prompt, uri=f'http://{args.nim_host}:{args.nim_port}/v1/completions', 
        model_name=args.nim_model_name
    )
    
    print(f"For user message: {user_message}")
    print(f"And bot response: {bot_message}")
    print(f"\nResponse from  NemoGuard Content Safety model: {model_output}")
    print("\nBot response safety evaluation (safe/unsafe, optional list of violated categories):")
    print(f"{_parse_response_safety(model_output)}")

Reference(s):

Additional details about the model, including comparisons to other public models are available in the accompanying paper, presented at NAACL 2025.

Container Version(s):

Llama-3.1-NemoGuard-8B-ContentSafety-v1.10.1: Content Safety moderation model

Security Common Vulnerabilities and Exposures (CVEs)

Please review the Security Scanning tab on NGC to view the latest security scan results. For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning tab.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

Get Help

Getting started with the NIM

Deploying and integrating the NIM is straightforward thanks to our industry standard APIs. Visit the NIM Documentation for general information about using NIM, including an overview and deployment guides. Refer to the NemoGuard NIM Documentation for release documentation, deployment guides and more.

Enterprise Support

Get access to knowledge base articles and support cases or submit a ticket.

Publisher

llama-3.1-nemoguard-8b-content-safety

Latest Tag1.10

UpdatedOctober 7, 2025 UTC

Compressed Size10.2 GB

Multinode SupportNo

Multi-Arch SupportYes

System

signed images

Labels

NSPECT-YUYR-G6PQ