RAG Application: Llamaindex Text QA Chatbot
Description
A Helm chart demonstrating a basic RAG pipeline built with LlamaIndex, leveraging NVIDIA NIM LLMs and NeMo Retriever microservices deployed on-prem.
Publisher
NVIDIA
Latest Version
24.08
Compressed Size
9.01 KB
Modified
August 26, 2024

Llamaindex Text QA Chatbot

Description

This example showcases a RAG pipeline. It uses LlamaIndex as the framework, the NeMo LLM inference microservice to host a TensorRT-optimized LLM, and the NeMo Retriever embedding microservice. It uses Milvus as the vector store to hold embeddings and generates responses to queries.

| LLM Model | Embedding | Framework | Document Type | Vector Database | Model Deployment Platform |
| --- | --- | --- | --- | --- | --- |
| meta/llama3-8b-instruct | nv-embedqa-e5-v5 | llama-index | PDF/Text | milvus | On Prem |

Prerequisites

  • You have the NGC CLI available on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.

  • You have Kubernetes installed and running on Ubuntu 22.04. Refer to the Kubernetes documentation or the NVIDIA Cloud Native Stack repository for more information.

  • You have a default storage class available in the cluster for PVC provisioning. One option is the local path provisioner by Rancher. Refer to the installation section of the README in the GitHub repository.

    kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
    kubectl get pods -n local-path-storage
    kubectl get storageclass
    
  • If the local path storage class is not set as the default, make it the default with the following command:

    kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
    
  • You have installed the NVIDIA GPU Operator by following the steps in the Install the NVIDIA GPU Operator section below.

Deployment

  1. Fetch the Helm Chart from NGC

    helm fetch https://helm.ngc.nvidia.com/nvidia/aiworkflows/charts/rag-app-text-chatbot-llamaindex-24.08.tgz --username='$oauthtoken' --password=<YOUR API KEY>
    
  2. Deploy NVIDIA NIM LLM and NVIDIA NeMo Retriever Embedding Microservice following steps in this section.

  3. Deploy Milvus vectorstore following steps in this section.

  4. Create the example namespace.

    kubectl create namespace canonical-rag-llamaindex
    
  5. Export the NGC API Key in the environment.

    export NGC_CLI_API_KEY="<YOUR NGC API KEY>"
    
  6. Create the Helm pipeline instance and start the services.

    helm install canonical-rag-llamaindex rag-app-text-chatbot-llamaindex-24.08.tgz -n canonical-rag-llamaindex --set imagePullSecret.password=$NGC_CLI_API_KEY
    
  7. Verify the pods are running and ready.

    kubectl get pods -n canonical-rag-llamaindex

Example Output

NAME                              READY   STATUS    RESTARTS   AGE
chain-server-748bb5c5ff-58cw7     1/1     Running   0          71s
rag-playground-855c7b9f65-qv42k   1/1     Running   0          71s
  8. Access the app using port-forwarding.

    kubectl port-forward service/rag-playground -n canonical-rag-llamaindex 30001:3001

Open a browser and access the rag-playground UI at http://localhost:30001/converse

Install the NVIDIA GPU Operator

Use the NVIDIA GPU Operator to install, configure, and manage the NVIDIA GPU driver and NVIDIA container runtime on the Kubernetes node.

  1. Add the NVIDIA Helm repository:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
       && helm repo update
    
  2. Install the Operator:

    helm install --wait --generate-name \
       -n gpu-operator --create-namespace \
       nvidia/gpu-operator
    
  3. Optional: Configure GPU time-slicing if you have fewer than three GPUs.

    • Create a file, time-slicing-config-all.yaml, with the following content:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: time-slicing-config-all
      data:
        any: |-
          version: v1
          flags:
            migStrategy: none
          sharing:
            timeSlicing:
              resources:
              - name: nvidia.com/gpu
                replicas: 3
      

      The sample configuration creates three replicas from each GPU on the node.

    • Add the config map to the Operator namespace:

      kubectl create -n gpu-operator -f time-slicing-config-all.yaml
      
    • Configure the device plugin with the config map and set the default time-slicing configuration:

      kubectl patch clusterpolicies.nvidia.com/cluster-policy \
          -n gpu-operator --type merge \
          -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
      
    • Verify that at least 3 GPUs are allocatable:

      kubectl get nodes -l nvidia.com/gpu.present -o json | jq '.items[0].status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
      

      Example Output

      {
        "nvidia.com/gpu": "3"
      }
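The jq filter above keeps only the nonzero nvidia.com/* entries from the node's allocatable resources. The same selection can be sketched in Python; the allocatable map below is a hypothetical stand-in for what `kubectl get nodes -o json` would return:

```python
# Sample "status.allocatable" map for one node; the values are hypothetical
# stand-ins for real `kubectl get nodes -o json` output.
allocatable = {
    "cpu": "64",
    "memory": "263927396Ki",
    "nvidia.com/gpu": "3",
    "nvidia.com/mig-1g.5gb": "0",
}

# Keep only nvidia.com/* resources with a nonzero count, like the jq filter.
gpus = {
    key: value
    for key, value in allocatable.items()
    if key.startswith("nvidia.com/") and value != "0"
}

print(gpus)  # {'nvidia.com/gpu': '3'} with time-slicing replicas=3 on one GPU
```

With time-slicing configured for three replicas, a single physical GPU shows up as `"nvidia.com/gpu": "3"`, matching the example output above.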
      

Deploying NVIDIA NIM Microservices

Deploying NVIDIA NIM for LLMs

(Default flow deploys meta/llama3-8b-instruct)
  1. Follow the steps from the nim-deploy repository to deploy the NIM LLM microservice with meta/llama3-8b-instruct as the default LLM model.

Deploying NVIDIA NeMo Retriever Embedding Microservice

  1. Follow the steps here to fetch and deploy the NeMo Retriever Embedding Microservice Helm chart.

Note: While deploying the NREM Helm chart, use the step below to explicitly set the embedding image path to the GA version of the embedding model.

helm upgrade --install \
  --namespace nrem \
  --set image.repository=nvcr.io/nim/nvidia/nv-embedqa-e5-v5 \
  --set image.tag=1.0.0 \
  nemo-embedder \
  text-embedding-nim-1.0.0.tgz

Deploying Milvus Vectorstore Helm Chart

  1. Create a new namespace for the vector store.

    kubectl create namespace vectorstore
    
  2. Add the Milvus repository.

    helm repo add milvus https://zilliztech.github.io/milvus-helm/
    
  3. Update the Helm repository.

    helm repo update
    
  4. Create a file named custom_value.yaml with the content below to utilize GPUs.

    standalone:
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
    
  5. Install the Helm chart, pointing to the file created above with the -f argument as shown below.

    helm install milvus milvus/milvus --set cluster.enabled=false --set etcd.replicaCount=1 --set minio.mode=standalone --set pulsar.enabled=false -f custom_value.yaml -n vectorstore
    
  6. Check the status of the pods.

    kubectl get pods -n vectorstore
    
  7. All pods should be running and in a ready state within a couple of minutes.

    NAME                                READY   STATUS    RESTARTS        AGE
    milvus-etcd-0                       1/1     Running   0               5m34s
    milvus-minio-76f9d647d5-44799       1/1     Running   0               5m34s
    milvus-standalone-9ccf56df4-m4tpm   1/1     Running   3 (4m35s ago)   5m34s
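If you want to script the readiness check instead of eyeballing the table, the JSON form of the same command (`kubectl get pods -n vectorstore -o json`) can be inspected programmatically. This is a sketch only; the pod list below is a hypothetical stand-in for real kubectl output:

```python
# Hypothetical stand-in for `kubectl get pods -n vectorstore -o json` output.
pods = {
    "items": [
        {"metadata": {"name": "milvus-etcd-0"},
         "status": {"containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "milvus-standalone-9ccf56df4-m4tpm"},
         "status": {"containerStatuses": [{"ready": True}]}},
    ]
}

def all_ready(pod_list):
    """Return True when every container of every pod reports ready."""
    return all(
        container["ready"]
        for pod in pod_list["items"]
        for container in pod["status"].get("containerStatuses", [])
    )

print(all_ready(pods))  # True once all Milvus pods are up
```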

Configuring Examples

You can configure various parameters such as prompts and vectorstore using environment variables. Modify the environment variables in the env section of the query service in the values.yaml file of the respective examples.

Configuring Prompts


Each example utilizes a prompt.yaml file that defines prompts for different contexts. These prompts guide the RAG model in generating appropriate responses. You can tailor these prompts to fit your specific needs and achieve desired responses from the models.

Accessing Prompts

The prompts are loaded as a Python dictionary within the application. To access this dictionary, you can use the get_prompts() function provided by the utils module. This function retrieves the complete dictionary of prompts.

Consider the following prompt.yaml file, which is under the files directory for all the Helm charts:

chat_template: |
    You are a helpful, respectful and honest assistant.
    Always answer as helpfully as possible, while being safe.
    Please ensure that your responses are positive in nature.

rag_template: |
    You are a helpful AI assistant named Envie.
    You will reply to questions only based on the context that you are provided.
    If something is out of context, you will refrain from replying and politely decline to respond to the user.

You can access its chat_template using the following code in your chain server:

from RAG.src.chain_server.utils import get_prompts

prompts = get_prompts()

chat_template = prompts.get("chat_template", "")
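As a sketch of how a chain server might then use the two templates, the snippet below switches between chat_template and rag_template depending on whether retrieved context is present. The dict stands in for the result of get_prompts(), and build_messages is an illustrative helper, not part of the chart:

```python
# Hypothetical stand-in for get_prompts(); a real chain server loads these
# from the prompt.yaml shipped with the chart.
prompts = {
    "chat_template": "You are a helpful, respectful and honest assistant.",
    "rag_template": "You are a helpful AI assistant named Envie.",
}

chat_template = prompts.get("chat_template", "")

def build_messages(question, context=None):
    """Assemble a chat-style message list, switching to the RAG template
    when retrieved context is supplied."""
    system = prompts.get("rag_template", "") if context else chat_template
    messages = [{"role": "system", "content": system}]
    if context:
        messages.append({"role": "context", "content": context})
    messages.append({"role": "user", "content": question})
    return messages

msgs = build_messages("What is a vector store?", context="Milvus docs excerpt")
print(msgs[0]["content"])  # rag_template is used because context was given
```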

Once you have updated the prompt, you can update the deployment for any of the examples by using the command below.

helm upgrade <rag-example-name> <rag-example-helm-chart-path> -n <rag-example-namespace> --set imagePullSecret.password=$NGC_CLI_API_KEY

Configuring VectorStore

The vector store can be modified from environment variables. You can update:

  1. APP_VECTORSTORE_NAME: The vector store name. Currently, milvus and pgvector are supported. Note: This only specifies the vector store name; the vector store container needs to be started separately.

  2. APP_VECTORSTORE_URL: The host machine URL where the vector store is running.
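A minimal sketch of how a chain server might consume these two variables, assuming defaults of its own choosing (the default values here are illustrative, not the chart's):

```python
import os

# Read the vector store settings from the environment; the fallback values
# are illustrative defaults, not the chart's.
vectorstore_name = os.environ.get("APP_VECTORSTORE_NAME", "milvus")
vectorstore_url = os.environ.get("APP_VECTORSTORE_URL", "http://milvus:19530")

# Only milvus and pgvector are supported; fail fast on anything else.
if vectorstore_name not in ("milvus", "pgvector"):
    raise ValueError(f"Unsupported vector store: {vectorstore_name}")

print(vectorstore_name, vectorstore_url)
```

Remember that setting APP_VECTORSTORE_NAME only selects the client behavior; the corresponding vector store container must be running and reachable at APP_VECTORSTORE_URL.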

Additional Resources

Learn more about how to use NVIDIA NIM microservices for RAG through our Deep Learning Institute. Access the course here.

Security considerations

The RAG applications are shared as reference architectures and are provided "as is". Their security in production environments is the responsibility of the end users deploying them. When deploying in a production environment, have security experts review any potential risks and threats (including direct and indirect prompt injection); define the trust boundaries; secure the communication channels; integrate AuthN and AuthZ with appropriate access controls; keep the deployment, including the containers, up to date; and ensure the containers are secure and free of vulnerabilities.

Licenses

By downloading or using NVIDIA NIM inference microservices included in the AI Chatbot workflows you agree to the terms of the NVIDIA Software License Agreement and Product-specific Terms for AI products.