Architecture: Linux / amd64
This container serves the Llama-3.1-NemoGuard-8B-Topic-Control model as an NVIDIA Inference Microservice (NIM). More instructions on how to deploy and use this container can be found here.
The model can be loaded in two ways: with an optimized TRT-LLM engine, which can yield major latency improvements, or, as an automatic fallback, with a vLLM inference engine.
To run the example client script shown later in this page, set up a small Python environment, export your NGC API key, and log in to the NGC container registry:
conda create -n evals python=3.10
conda activate evals
pip install requests
export NGC_API_KEY=<YOUR NGC API KEY>
docker login nvcr.io
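When docker login nvcr.io prompts for credentials, the username is the literal string $oauthtoken and the password is your NGC API key. A non-interactive equivalent, if you prefer to script the login:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin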
We provide the TopicControl model as an NVIDIA NIM, which automatically serves an optimized TRT-LLM inference engine of our model for your specific GPU (supported GPUs: A100, H100, L40S, A6000). This can yield substantial improvements over inference using a HuggingFace-format checkpoint. The steps are simple: a docker pull followed by a docker run:
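A minimal sketch of the basic pull-and-run sequence (no engine caching; it reuses the NGC_API_KEY exported above and the same image, model name, and flags as the full command shown further below):
export NIM_IMAGE="nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-topic-control:1.0.0"
export MODEL_NAME="llama-3.1-nemoguard-8b-topic-control"
docker pull $NIM_IMAGE
docker run -it --name=$MODEL_NAME \
  --gpus=all --runtime=nvidia \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_SERVED_MODEL_NAME=$MODEL_NAME \
  -e NIM_CUSTOM_MODEL_NAME=$MODEL_NAME \
  -p 8000:8000 \
  $NIM_IMAGE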
If you'd prefer not to build the TRT-LLM engine from scratch every time you run the NIM container, you can cache it on the first run by adding a flag that mounts a local directory into the container to store the model cache.
To achieve this, simply mount the folder containing the cached TRT-LLM assets into the container at run time using -v $LOCAL_NIM_CACHE:/opt/nim/.cache. See below for the full command.
Important: make sure that Docker has permission to write to the cache folder.
export NGC_API_KEY=<your NGC personal key with access to the NIM container>
export NIM_IMAGE="nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-topic-control:1.0.0"
export MODEL_NAME="llama-3.1-nemoguard-8b-topic-control"
docker pull $NIM_IMAGE
### Bind a $LOCAL_NIM_CACHE folder to "/opt/nim/.cache"
export LOCAL_NIM_CACHE=<PATH TO DIRECTORY WHERE YOU WANT TO SAVE TRTLLM ENGINE ASSETS>
mkdir -p $LOCAL_NIM_CACHE
sudo chmod a+rwx $LOCAL_NIM_CACHE
docker run -it --name=$MODEL_NAME \
--gpus=all --runtime=nvidia \
-e NGC_API_KEY="$NGC_API_KEY" \
-v $LOCAL_NIM_CACHE:"/opt/nim/.cache/" \
-e NIM_SERVED_MODEL_NAME=$MODEL_NAME \
-e NIM_CUSTOM_MODEL_NAME=$MODEL_NAME \
-u $(id -u) \
-p 8000:8000 \
$NIM_IMAGE
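Once the container is up, check that it is ready to serve requests before sending traffic. NIM containers generally expose a readiness endpoint; the path below follows common NIM conventions and may vary by release:
curl http://0.0.0.0:8000/v1/health/ready
The startup logs also typically indicate whether the optimized TRT-LLM profile or the vLLM fallback was selected for your GPU.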
The running NIM container exposes a standard LLM REST API; you can send POST requests to the v1/completions or v1/chat/completions endpoints in the appropriate format to get model responses.
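As a quick sanity check from the host, you can hit the chat completions endpoint directly with curl. The payload below mirrors the Python client that follows; the system and user messages are placeholders:
curl -s http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-nemoguard-8b-topic-control",
        "messages": [
          {"role": "system", "content": "<topic-control system prompt>"},
          {"role": "user", "content": "<user message to classify>"}
        ],
        "max_tokens": 20,
        "temperature": 0.0
      }'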
The following script shows an example of how to run inference with the NIM container.
import argparse
from typing import Any, List, Mapping, Optional, Union
import requests
TOPIC_SAFETY_OUTPUT_RESTRICTION = (
    'If any of the above conditions are violated, please respond with "off-topic". '
    'Otherwise, respond with "on-topic". '
    'You must respond with "on-topic" or "off-topic".'
)
class TopicGuard:
    """Thin client for the TopicControl NIM chat completions endpoint."""

    def __init__(self, host: str = "0.0.0.0", port: str = "8000", model_name: str = "llama-3.1-nemoguard-8b-topic-control"):
        self.uri = f'http://{host}:{port}/v1/chat/completions'
        self.model_name = model_name

    def __call__(self, prompt: List[dict]) -> str:
        return self._call(prompt)

    def _call(self, prompt: List[dict], stop: Optional[List[str]] = None) -> str:
        # Send the chat-formatted prompt to the NIM endpoint and return the model's text response.
        try:
            response = requests.post(
                self.uri,
                headers={
                    "Content-Type": "application/json",
                    "Accept": "application/json",
                },
                json={
                    "model": self.model_name,
                    "messages": prompt,
                    "max_tokens": 20,
                    "top_p": 1,
                    "n": 1,
                    "temperature": 0.0,
                    "stream": False,
                    "frequency_penalty": 0.0,
                },
            )
            if response.status_code != 200:
                raise Exception(f"Error response from the LLM. Status code: {response.status_code} {response.text}")
            return response.json()["choices"][0]["message"]["content"].strip()
        except Exception as e:
            print(e)
            return "error"
def format_prompt(system_prompt: str, user_message: str) -> List[dict]:
    # Append the required output restriction to the system prompt if it is missing,
    # then build the chat-style message list expected by the endpoint.
    system_prompt = system_prompt.strip()
    if not system_prompt.endswith(TOPIC_SAFETY_OUTPUT_RESTRICTION):
        system_prompt = f"{system_prompt}\n\n{TOPIC_SAFETY_OUTPUT_RESTRICTION}"
    prompt = [{"role": "system", "content": system_prompt},
              {"role": "user", "content": user_message}]
    return prompt
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--nim_host", type=str, default="0.0.0.0")
    parser.add_argument("--nim_port", type=str, default="8000")
    parser.add_argument("--nim_model_name", type=str, default="llama-3.1-nemoguard-8b-topic-control")
    args = parser.parse_args()

    system_prompt = """You are to act as an investor relations bot for ABC, providing users with factual, publicly available information related to the company's financial performance and corporate updates. Your role is to ensure that you respond only to relevant queries and adhere to the following guidelines:
1. Do not answer questions about future predictions, such as profit forecasts or future revenue outlook.
2. Do not provide any form of investment advice, including recommendations to buy, sell, or hold ABC stock or any other securities. Never recommend any stock or investment.
3. Do not engage in discussions that require personal opinions or subjective judgments. Never make any subjective statements about ABC, its stock or its products.
4. If a user asks about topics irrelevant to ABC's investor relations or financial performance, politely redirect the conversation or end the interaction.
5. Your responses should be professional, accurate, and compliant with investor relations guidelines, focusing solely on providing transparent, up-to-date information about ABC that is already publicly available."""

    user_message = "Can you speculate on the potential impact of a recession on ABCs business?"

    print(f"Using Nim inference mode with host: {args.nim_host} and port: {args.nim_port}")
    topic_guard = TopicGuard(host=args.nim_host, port=args.nim_port, model_name=args.nim_model_name)
    prompt = format_prompt(system_prompt, user_message)
    response = topic_guard(prompt)

    print(f"For user message: {user_message}")
    print(f"\nResponse from TopicControl model: {response}")
Output:
Using Nim inference mode with host: 0.0.0.0 and port: 8000
For user message: Can you speculate on the potential impact of a recession on ABCs business?
Response from TopicControl model: off-topic
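For contrast, you can swap in a user message that stays within the assistant's scope, for example a question about already-reported financials; such an in-scope query is expected to come back labeled "on-topic":
user_message = "What were ABC's publicly reported revenues for the last fiscal year?"
prompt = format_prompt(system_prompt, user_message)
print(topic_guard(prompt))  # expected: "on-topic" for an in-scope query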
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.