NVIDIA NIM for GPU accelerated Llama 3.1 NemoGuard 8B Content Safety inference through OpenAI compatible APIs


Llama Nemotron Safety Guard V2 Overview
Description
Llama Nemotron Safety Guard V2, formerly known as Llama 3.1 NemoGuard 8B ContentSafety, is a content safety model trained on the Nemotron Content Safety Dataset V2 that moderates human-LLM interaction content and classifies user prompts and LLM responses as safe or unsafe. If the content is unsafe, the model additionally returns a response with a list of categories that the content violates. The base large language model (LLM) is the multilingual Llama-3.1-8B-Instruct model from Meta. NVIDIA’s optimized release is LoRa-tuned on approved datasets and better conforms NVIDIA’s content safety risk taxonomy and other safety risks in human-LLM interactions.
The model can be prompted using an instruction and a taxonomy of unsafe risks to be categorized. The instruction format for prompt moderation is shown below under input and output examples.
This container serves up the Llama-3.1-NemoGuard-8B-ContentSafety model as an Nvidia Inference Microservice (NIM).
The model can be loaded in two ways: with an optimized TRT-LLM engine that can yield major latency improvements, or as an automated fallback, using a vLLM inference engine.
The container components are ready for commercial/non-commercial use.
License/Terms of Use:
GOVERNING TERMS: Use of the NIM container is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products; use of this model is governed by the NVIDIA Community Model License.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Deployment Geography:
Global
Release Date:
Build.Nvidia.com [07/02/2025]
Hugging Face [01/15/2025]
NGC [07/02/2025]
Program Classes:
The Llama Nemotron Safety Guard V2 Container includes the following model:
| Model Name & Link | Use Case | How to Pull the Model |
|---|---|---|
| Llama Nemotron Safety Guard V2 Large Language Model | Intended to be deployed as a guardrail in an LLM system, to moderate human-LLM interaction content and classify user prompts and LLM responses as safe or unsafe | Automatic |
Deployment Details:
This container serves up the Llama Nemotron Safety Guard V2 model as an Nvidia Inference Microservice (NIM).
The model can be loaded in two ways: with an optimized TRT-LLM engine that can yield major latency improvements, or as an automated fallback, using a vLLM inference engine.
One-time inference environment setup, if needed:
Note: you might also want to authenticate with HuggingFace using
Example inference using an NVIDIA NIM container with an optimized TRT-LLM engine
One-time access setup, as needed:
Steps to Serve the model as a NIM
We provide the Llama Nemotron Safety Guard V2 model as an Nvidia NIM which automatically serves optimized TRT-LLM inference engines of our model for your specific GPU (Supported GPUs: B200, H100, A100, L40S, A6000). This can yield impressive improvements over inference using a HuggingFace format checkpoint.
The steps are very simple -- it's just a simple docker pull and docker run.
Bonus: Caching the optimized TRTLLM inference engines
If you'd like to not build TRTLLM engines from scratch every time you run the NIM container, you can cache it in the first run by just adding a flag to mount a local directory inside the docker to store the model cache.
To achieve this, you simply need to mount the folder containing the cached TRTLLM assets onto the docker container while running it using -v $LOCAL_NIM_CACHE:/opt/nim/.cache. See below instructions for the full command. Important: make sure that docker has permissions to write to the cache folder (sudo chmod 666 $LOCAL_NIM_CACHE).
And go!
Steps to run inference with the NIM
The running NIM container exposes a standard LLM REST API and you can send POST requests to the v1/completions or the v1/chat/completions endpoints in the appropriate formats to get model responses.
Here are the contents of the referenced inference scriptnemoguard_inference_example.py
Reference(s):
Additional details about the model, including comparisons to other public models are available in the accompanying paper, presented at NAACL 2025.
Container Version(s):
Llama-3.1-NemoGuard-8B-ContentSafety-v1.10.1: Content Safety moderation model
Security Common Vulnerabilities and Exposures (CVEs)
Please review the Security Scanning tab on NGC to view the latest security scan results.
For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning tab.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Get Help
Getting started with the NIM
Deploying and integrating the NIM is straightforward thanks to our industry standard APIs. Visit the NIM Documentation for general information about using NIM, including an overview and deployment guides. Refer to the NemoGuard NIM Documentation for release documentation, deployment guides and more.
Enterprise Support
Get access to knowledge base articles and support cases or submit a ticket.