Architecture: Linux / amd64
This container serves the Llama-3.1-NemoGuard-8B-Topic-Control model as an NVIDIA Inference Microservice (NIM). More instructions on how to deploy and use this container can be found here.
The model can be loaded in two ways: with an optimized TRT-LLM engine, which can yield major latency improvements, or, as an automatic fallback, with a vLLM inference engine.
To run the example client script shown later in this page, set up a small Python environment, export your NGC API key, and log in to the NGC container registry:
conda create -n evals python=3.10
conda activate evals
pip install requests
export NGC_API_KEY=<YOUR NGC API KEY>
docker login nvcr.io
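When docker login nvcr.io prompts for credentials, the username is the literal string $oauthtoken and the password is your NGC API key. A non-interactive equivalent, if you prefer to script the login:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin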
We provide the TopicControl model as an NVIDIA NIM, which automatically serves an optimized TRT-LLM inference engine of our model for your specific GPU (supported GPUs: A100, H100, L40S, A6000). This can yield substantial improvements over inference using a HuggingFace-format checkpoint. The steps are simple: a docker pull followed by a docker run:
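A minimal sketch of the basic pull-and-run sequence (no engine caching; it reuses the NGC_API_KEY exported above and the same image, model name, and flags as the full command shown further below):
export NIM_IMAGE="nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-topic-control:1.0.0"
export MODEL_NAME="llama-3.1-nemoguard-8b-topic-control"
docker pull $NIM_IMAGE
docker run -it --name=$MODEL_NAME \
  --gpus=all --runtime=nvidia \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_SERVED_MODEL_NAME=$MODEL_NAME \
  -e NIM_CUSTOM_MODEL_NAME=$MODEL_NAME \
  -p 8000:8000 \
  $NIM_IMAGE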
If you'd prefer not to build the TRT-LLM engine from scratch every time you run the NIM container, you can cache it on the first run by adding a flag that mounts a local directory into the container to store the model cache.
To achieve this, simply mount the folder containing the cached TRT-LLM assets into the container at run time using -v $LOCAL_NIM_CACHE:/opt/nim/.cache. See below for the full command.
Important: make sure that Docker has permission to write to the cache folder.
export NGC_API_KEY=<your NGC personal key with access to the NIM container>
export NIM_IMAGE="nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-topic-control:1.0.0"
export MODEL_NAME="llama-3.1-nemoguard-8b-topic-control"
docker pull $NIM_IMAGE
### Bind a $LOCAL_NIM_CACHE folder to "/opt/nim/.cache"
export LOCAL_NIM_CACHE=<PATH TO DIRECTORY WHERE YOU WANT TO SAVE TRTLLM ENGINE ASSETS>
mkdir -p $LOCAL_NIM_CACHE
sudo chmod a+rwx $LOCAL_NIM_CACHE
docker run -it --name=$MODEL_NAME \
--gpus=all --runtime=nvidia \
-e NGC_API_KEY="$NGC_API_KEY" \
-v $LOCAL_NIM_CACHE:"/opt/nim/.cache/" \
-e NIM_SERVED_MODEL_NAME=$MODEL_NAME \
-e NIM_CUSTOM_MODEL_NAME=$MODEL_NAME \
-u $(id -u) \
-p 8000:8000 \
$NIM_IMAGE
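Once the container is up, check that it is ready to serve requests before sending traffic. NIM containers generally expose a readiness endpoint; the path below follows common NIM conventions and may vary by release:
curl http://0.0.0.0:8000/v1/health/ready
The startup logs also typically indicate whether the optimized TRT-LLM profile or the vLLM fallback was selected for your GPU.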
The running NIM container exposes a standard LLM REST API; you can send POST requests to the v1/completions or v1/chat/completions endpoints in the appropriate format to get model responses.
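As a quick sanity check from the host, you can hit the chat completions endpoint directly with curl. The payload below mirrors the Python client that follows; the system and user messages are placeholders:
curl -s http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-nemoguard-8b-topic-control",
        "messages": [
          {"role": "system", "content": "<topic-control system prompt>"},
          {"role": "user", "content": "<user message to classify>"}
        ],
        "max_tokens": 20,
        "temperature": 0.0
      }'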
The following script shows an example of how to run inference with the NIM container.
import argparse
from typing import Any, List, Mapping, Optional, Union
import requests
TOPIC_SAFETY_OUTPUT_RESTRICTION = (
    'If any of the above conditions are violated, please respond with "off-topic". '
    'Otherwise, respond with "on-topic". '
    'You must respond with "on-topic" or "off-topic".'
)
class TopicGuard:
    """Thin client for the TopicControl NIM chat completions endpoint."""

    def __init__(self, host: str = "0.0.0.0", port: str = "8000", model_name: str = "llama-3.1-nemoguard-8b-topic-control"):
        self.uri = f'http://{host}:{port}/v1/chat/completions'
        self.model_name = model_name

    def __call__(self, prompt: List[dict]) -> str:
        return self._call(prompt)

    def _call(self, prompt: List[dict], stop: Optional[List[str]] = None) -> str:
        # Send the chat-formatted prompt to the NIM endpoint and return the model's text response.
        try:
            response = requests.post(
                self.uri,
                headers={
                    "Content-Type": "application/json",
                    "Accept": "application/json",
                },
                json={
                    "model": self.model_name,
                    "messages": prompt,
                    "max_tokens": 20,
                    "top_p": 1,
                    "n": 1,
                    "temperature": 0.0,
                    "stream": False,
                    "frequency_penalty": 0.0,
                },
            )
            if response.status_code != 200:
                raise Exception(f"Error response from the LLM. Status code: {response.status_code} {response.text}")
            return response.json()["choices"][0]["message"]["content"].strip()
        except Exception as e:
            print(e)
            return "error"
def format_prompt(system_prompt: str, user_message: str) -> List[dict]:
    # Append the required output restriction to the system prompt if it is missing,
    # then build the chat-style message list expected by the endpoint.
    system_prompt = system_prompt.strip()
    if not system_prompt.endswith(TOPIC_SAFETY_OUTPUT_RESTRICTION):
        system_prompt = f"{system_prompt}\n\n{TOPIC_SAFETY_OUTPUT_RESTRICTION}"
    prompt = [{"role": "system", "content": system_prompt},
              {"role": "user", "content": user_message}]
    return prompt
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--nim_host", type=str, default="0.0.0.0")
    parser.add_argument("--nim_port", type=str, default="8000")
    parser.add_argument("--nim_model_name", type=str, default="llama-3.1-nemoguard-8b-topic-control")
    args = parser.parse_args()

    system_prompt = """You are to act as an investor relations bot for ABC, providing users with factual, publicly available information related to the company's financial performance and corporate updates. Your role is to ensure that you respond only to relevant queries and adhere to the following guidelines:
1. Do not answer questions about future predictions, such as profit forecasts or future revenue outlook.
2. Do not provide any form of investment advice, including recommendations to buy, sell, or hold ABC stock or any other securities. Never recommend any stock or investment.
3. Do not engage in discussions that require personal opinions or subjective judgments. Never make any subjective statements about ABC, its stock or its products.
4. If a user asks about topics irrelevant to ABC's investor relations or financial performance, politely redirect the conversation or end the interaction.
5. Your responses should be professional, accurate, and compliant with investor relations guidelines, focusing solely on providing transparent, up-to-date information about ABC that is already publicly available."""

    user_message = "Can you speculate on the potential impact of a recession on ABCs business?"

    print(f"Using Nim inference mode with host: {args.nim_host} and port: {args.nim_port}")
    topic_guard = TopicGuard(host=args.nim_host, port=args.nim_port, model_name=args.nim_model_name)
    prompt = format_prompt(system_prompt, user_message)
    response = topic_guard(prompt)

    print(f"For user message: {user_message}")
    print(f"\nResponse from TopicControl model: {response}")
Output:
Using Nim inference mode with host: 0.0.0.0 and port: 8000
For user message: Can you speculate on the potential impact of a recession on ABCs business?
Response from TopicControl model: off-topic
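For contrast, you can swap in a user message that stays within the assistant's scope, for example a question about already-reported financials; such an in-scope query is expected to come back labeled "on-topic":
user_message = "What were ABC's publicly reported revenues for the last fiscal year?"
prompt = format_prompt(system_prompt, user_message)
print(topic_guard(prompt))  # expected: "on-topic" for an in-scope query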
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.