This example showcases RAG pipeline. It uses Llamaindex, nemollm inference microservice to host trt optimized llm and nemollm retriever embedding microservice. It uses milvus as vectorstore to store embeddings and generate response for query.
LLM Model | Embedding | Framework | Document Type | Vector Database | Model deployment platform |
---|---|---|---|---|---|
meta/llama3-8b-instruct | nv-embedqa-e5-v5 | llama-index | PDF/Text | milvus | On Prem |
You have the NGC CLI available on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.
You have Kubernetes installed and running Ubuntu 22.04. Refer to the Kubernetes documentation or the NVIDIA Cloud Native Stack repository for more information.
You have a default storage class available in the cluster for PVC provisioning. One option is the local path provisioner by Rancher. Refer to the installation section of the README in the GitHub repository.
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
kubectl get pods -n local-path-storage
kubectl get storageclass
If the local path storage class is not set as default, it can be made default using the command below
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
You have installed the NVIDIA GPU Operator following steps here
Fetch the Helm Chart from NGC
helm fetch https://helm.ngc.nvidia.com/nvidia/aiworkflows/charts/rag-app-text-chatbot-llamaindex-24.08.tgz --username='$oauthtoken' --password=<YOUR API KEY>
Deploy NVIDIA NIM LLM and NVIDIA NeMo Retriever Embedding Microservice following steps in this section.
Deploy Milvus vectorstore following steps in this section.
Create the example namespace
kubectl create namespace canonical-rag-llamaindex
export NGC_CLI_API_KEY="<YOUR NGC API KEY>"
helm install canonical-rag-llamaindex rag-app-text-chatbot-llamaindex-24.08.tgz -n canonical-rag-llamaindex --set imagePullSecret.password=$NGC_CLI_API_KEY
kubectl get pods -n canonical-rag-llamaindex
Example Output
NAME READY STATUS RESTARTS AGE
chain-server-748bb5c5ff-58cw7 1/1 Running 0 71s
rag-playground-855c7b9f65-qv42k 1/1 Running 0 71s
kubectl port-forward service/rag-playground -n canonical-rag-llamaindex 30001:3001
Open browser and access the rag-playground UI using http://localhost:30001/converse
Use the NVIDIA GPU Operator to install, configure, and manage the NVIDIA GPU driver and NVIDIA container runtime on the Kubernetes node.
Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
Install the Operator:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
Optional: Configure GPU time-slicing if you have fewer than three GPUs.
Create a file, time-slicing-config-all.yaml
, with the following content:
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config-all
data:
any: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 3
The sample configuration creates three replicas from each GPU on the node.
Add the config map to the Operator namespace:
kubectl create -n gpu-operator -f time-slicing-config-all.yaml
Configure the device plugin with the config map and set the default time-slicing configuration:
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
-n gpu-operator --type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
Verify that at least 3
GPUs are allocatable:
kubectl get nodes -l nvidia.com/gpu.present -o json | jq '.items[0].status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
Example Output
{
"nvidia.com/gpu": "3"
}
meta/llama3-8b-instruct
as default LLM model.Note: While deploying the NREM helm chart, use below step to forcefully set the embedding image path to the GA version of embedding model.
helm upgrade --install \
--namespace nrem \
--set image.repository= nvcr.io/nim/nvidia/nv-embedqa-e5-v5 \
--set image.tag=1.0.0 \
nemo-embedder \
text-embedding-nim-1.0.0.tgz
kubectl create namespace vectorstore
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
standalone:
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
helm install milvus milvus/milvus --set cluster.enabled=false --set etcd.replicaCount=1 --set minio.mode=standalone --set pulsar.enabled=false -f custom_value.yaml -n vectorstore
kubectl get pods -n vectorstore
NAME READY STATUS RESTARTS AGE
milvus-etcd-0 1/1 Running 0 5m34s
milvus-minio-76f9d647d5-44799 1/1 Running 0 5m34s
milvus-standalone-9ccf56df4-m4tpm 1/1 Running 3 (4m35s ago) 5m34
You can configure various parameters such as prompts and vectorstore using environment variables. Modify the environment variables in the env
section of the query service in the values.yaml file of the respective examples.
---
depth: 2
local: true
backlinks: none
---
Each example utilizes a prompt.yaml file that defines prompts for different contexts. These prompts guide the RAG model in generating appropriate responses. You can tailor these prompts to fit your specific needs and achieve desired responses from the models.
The prompts are loaded as a Python dictionary within the application. To access this dictionary, you can use the get_prompts()
function provided by the utils
module. This function retrieves the complete dictionary of prompts.
Consider we have following prompt.yaml
file which is under files
directory for all the helm charts
chat_template: |
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, while being safe.
Please ensure that your responses are positive in nature.
rag_template: |
You are a helpful AI assistant named Envie.
You will reply to questions only based on the context that you are provided.
If something is out of context, you will refrain from replying and politely decline to respond to the user.
You can access it's chat_template using following code in you chain server
from RAG.src.chain_server.utils import get_prompts
prompts = get_prompts()
chat_template = prompts.get("chat_template", "")
Once you have updated the prompt you can update the deployment for any of the examples by using the command below.
helm upgrade <rag-example-name> <rag-example-helm-chart-path> -n <rag-example-namespace> --set imagePullSecret.password=$NGC_CLI_API_KEY
The vector store can be modified from environment variables. You can update:
APP_VECTORSTORE_NAME
: This is the vector store name. Currently, we support milvus
and pgvector
Note: This only specifies the vector store name. The vector store container needs to be started separately.
APP_VECTORSTORE_URL
: The host machine URL where the vector store is running.
Learn more about how to use NVIDIA NIM microservices for RAG through our Deep Learning Institute. Access the course here.
The RAG applications are shared as reference architectures and are provided “as is”. The security of them in production environments is the responsibility of the end users deploying it. When deploying in a production environment, please have security experts review any potential risks and threats (including direct and indirect prompt injection); define the trust boundaries, secure the communication channels, integrate AuthN & AuthZ with appropriate access controls, keep the deployment including the containers up to date, ensure the containers are secure and free of vulnerabilities.
By downloading or using NVIDIA NIM inference microservices included in the AI Chatbot workflows you agree to the terms of the NVIDIA Software License Agreement and Product-specific Terms for AI products.