AI Chatbots with RAG - Docker workflow
Download this resource to run enterprise RAG applications based on NVIDIA services with Docker Compose.
Example RAG Applications
- Canonical RAG Llamaindex
- Canonical RAG Langchain
- Multimodal RAG
- Multi-Turn RAG
- Query Decomposition RAG
- Structured Data based RAG
Common Customizations
- Configuring Milvus as the Vector Database
- Configuring pgvector as the Vector Database
- Inference Model: Llama 3 70B Instruct
- Custom Chain Server
Common Prerequisites
- Install Docker Engine and Docker Compose.
- Verify that NVIDIA GPU driver version 535 or later is installed.
  Refer to the NVIDIA Linux driver installation instructions for more information.
- Install the NVIDIA Container Toolkit.
  Verify that the toolkit is installed and configured as the default container runtime.
- Create an NGC account and API key.
  Refer to the instructions to create an account and generate an NGC API key.
  Log in to the NVIDIA container registry (nvcr.io) with the Docker CLI, and export your key as the NGC_API_KEY environment variable.
  Refer to Accessing And Pulling an NGC Container Image via the Docker CLI for more information.
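Two of the prerequisites above, the minimum driver version and the exported NGC_API_KEY, can be checked programmatically before you start an example. The following is a minimal sketch and not part of this resource: the function names are hypothetical, and the driver version is assumed to be a dotted string such as the one printed by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import os

MIN_DRIVER_MAJOR = 535  # minimum driver version stated in the prerequisites


def driver_meets_minimum(version: str, minimum_major: int = MIN_DRIVER_MAJOR) -> bool:
    """Return True if a dotted driver version such as '535.104.05' is new enough."""
    major = int(version.strip().split(".")[0])
    return major >= minimum_major


def missing_prerequisites(driver_version: str, env: dict) -> list:
    """List human-readable problems; an empty list means both checks passed."""
    problems = []
    if not driver_meets_minimum(driver_version):
        problems.append(f"GPU driver {driver_version} is older than {MIN_DRIVER_MAJOR}")
    if not env.get("NGC_API_KEY"):
        problems.append("NGC_API_KEY is not exported")
    return problems


# Example with a hypothetical driver version and the current environment.
for problem in missing_prerequisites("535.104.05", dict(os.environ)):
    print(problem)
```

The sketch does not verify Docker, the NVIDIA Container Toolkit, or the registry login; those still need to be confirmed by hand.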
01: Canonical RAG Llamaindex
This example showcases a baseline LlamaIndex-based RAG pipeline. It deploys the NIM for LLMs microservice to host a TensorRT-optimized LLM, along with the NeMo Retriever Embedding microservice.
Milvus is deployed as the vector database and stores the embeddings that are retrieved to answer queries.
| LLM Model | Embedding | Framework | Document Type | Vector Database | Model deployment platform |
|---|---|---|---|---|---|
| meta/llama3-8b-instruct | nv-embedqa-e5-v5 | llama-index | PDF/Text | Milvus | On Prem |
Deployment
- Complete the Common Prerequisites.
- Run the pipeline.
- Check the status of the containers.
- Open your browser and interact with the RAG Playground at http://localhost:3001/converse.
02: Canonical RAG Langchain
This example showcases a baseline LangChain-based RAG pipeline. It deploys the NIM for LLMs microservice to host a TensorRT-optimized LLM, along with the NeMo Retriever Embedding microservice.
Milvus is deployed as the vector database and stores the embeddings that are retrieved to answer queries.
| LLM Model | Embedding | Framework | Document Type | Vector Database | Model deployment platform |
|---|---|---|---|---|---|
| meta/llama3-8b-instruct | nv-embedqa-e5-v5 | langchain | PDF/Text | Milvus | On Prem |
Deployment
- Complete the Common Prerequisites.
- Run the pipeline.
- Check the status of the containers.
- Open your browser and interact with the RAG Playground at http://localhost:3001/converse.
03: Multimodal RAG
This example showcases a multimodal use case in a RAG pipeline.
The example can interpret images embedded in PDF files, such as graphs and plots, alongside text and tables.
The example uses multimodal models from NVIDIA API Catalog to answer queries.
| LLM Model | Embedding | Framework | Document Type | Vector Database | Model deployment platform |
|---|---|---|---|---|---|
| meta/llama3-8b-instruct for response generation, ai-google-Deplot for graph-to-text conversion, ai-Neva-22B for image-to-text conversion | nv-embedqa-e5-v5 | Custom Python | PDF with images/PPTX | Milvus | Cloud - NVIDIA API Catalog for ai-google-Deplot and ai-Neva-22B. On Prem NeMo Inference Microservices for meta/llama3-8b-instruct model |
Deployment
- Complete the Common Prerequisites.
- Export your NVIDIA API key in your terminal.
  The key is used to access the Neva 22B and Deplot models from NVIDIA API Catalog.
- Run the pipeline.
- Check the status of the containers.
- Open your browser and interact with the RAG Playground at http://localhost:3001/converse.
04: Multi-Turn RAG
This example showcases a multi-turn use case in a RAG pipeline.
The example stores the conversation history and knowledge base in Milvus and retrieves them at runtime to understand contextual queries.
The example deploys NIM for LLMs and NeMo Retriever Embedding Microservice.
The example supports ingestion of PDF and .txt files.
The documents are ingested in a dedicated document vector store.
The prompt for the example is tuned to act as a document chat bot.
To maintain the conversation history, the chain server stores each previous user query and its generated answer as a text entry in a separate vector store dedicated to conversation history.
Both vector stores are part of a LangChain LCEL chain as LangChain Retrievers.
When the chain is invoked with a query, the query is passed through both retrievers.
The retrievers fetch context from the document vector store and the closest-matching entries from the conversation history vector store.
The document chunks retrieved from the document vector store are then passed through a reranker model to determine the most relevant top_k context.
The context is then passed to the LLM prompt for response generation.
| LLM Model | Embedding | Ranking (Optional) | Framework | Document Type | Vector Database | Model deployment platform |
|---|---|---|---|---|---|---|
| meta/llama3-8b-instruct | nv-embedqa-e5-v5 | nv-rerankqa-mistral-4b-v3 | LangChain Expression Language | PDF/Text | Milvus | On Prem |
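The retrieval flow described in this section (two stores, a rerank step for document chunks, then prompt assembly) can be illustrated with a small, dependency-free sketch. This is not the chain server code: every function name here is hypothetical, the stores are plain Python lists rather than Milvus collections, and a toy word-overlap score stands in for both the embedding similarity and the reranker model.

```python
def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split()))


def retrieve(query: str, store: list, limit: int) -> list:
    """Stand-in retriever: return the `limit` best-scoring entries of a store."""
    ranked = sorted(store, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:limit]


def rerank(query: str, chunks: list, top_k: int) -> list:
    """Stand-in reranker: keep the top_k chunks (a real reranker uses a model)."""
    return retrieve(query, chunks, top_k)


def build_prompt(query: str, document_store: list, history_store: list, top_k: int = 2) -> str:
    """Query both stores, rerank the document chunks, and assemble the LLM prompt."""
    candidates = retrieve(query, document_store, limit=4)
    context = rerank(query, candidates, top_k)
    history = retrieve(query, history_store, limit=1)
    return (
        "Conversation history:\n" + "\n".join(history)
        + "\n\nContext:\n" + "\n".join(context)
        + "\n\nQuestion: " + query
    )


def record_turn(history_store: list, query: str, answer: str) -> None:
    """Store the user query and the generated answer as one text entry."""
    history_store.append(f"User: {query}\nAssistant: {answer}")


# Illustrative data only.
documents = [
    "Milvus stores embeddings.",
    "The reranker orders chunks.",
    "Docker Compose starts services.",
]
history = []
record_turn(history, "What stores embeddings?", "Milvus stores embeddings.")
prompt = build_prompt("Which reranker orders chunks?", documents, history)
```

Only the document chunks pass through the rerank step; the conversation history is used as retrieved, which matches the description above.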
Deployment
- Complete the Common Prerequisites.
- Run the pipeline.
- Deploy the NVIDIA NeMo Retriever Text Reranking NIM container.
- Check the status of the containers.
- Open your browser and interact with the RAG Playground at http://localhost:3001/converse.
05: Query Decomposition RAG
This example showcases a RAG use case built using the task decomposition paradigm.
The chain server breaks down a query into smaller subtasks and then combines results from different subtasks to formulate the final answer.
It uses models from NVIDIA API Catalog and LangChain as the framework.
| LLM Model | Embedding | Framework | Document Type | Vector Database | Model deployment platform |
|---|---|---|---|---|---|
| meta/llama3-70b-instruct | nv-embedqa-e5-v5 | LangChain Agent | PDF/Text | Milvus | On Prem |
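The task decomposition paradigm described in this section follows a simple control flow: split the query, answer each subtask, then combine the partial results. The sketch below shows only that control flow. It is an assumption-laden illustration, not the chain server implementation: in the example an LLM agent performs these steps, whereas here the three roles are hypothetical plain functions operating on toy data.

```python
from typing import Callable, List


def answer_by_decomposition(
    query: str,
    decompose: Callable[[str], List[str]],
    answer_subtask: Callable[[str], str],
    combine: Callable[[str, List[str]], str],
) -> str:
    """Split a query into subtasks, answer each one, then merge the results."""
    subtasks = decompose(query)
    partial_answers = [answer_subtask(task) for task in subtasks]
    return combine(query, partial_answers)


# Hypothetical stand-ins for the LLM and the retriever.
def split_on_and(query: str) -> List[str]:
    """Naive decomposition: treat ' and ' as the boundary between subtasks."""
    return [part.strip(" ?") + "?" for part in query.split(" and ")]


TOY_ANSWERS = {"What is the GPU count?": "8", "what is the CPU count?": "2"}


def lookup(task: str) -> str:
    """Answer a subtask from the toy table instead of a knowledge base."""
    return TOY_ANSWERS.get(task, "unknown")


def join_answers(query: str, answers: List[str]) -> str:
    """Combine the partial answers into one final response."""
    return f"{query} -> " + ", ".join(answers)


result = answer_by_decomposition(
    "What is the GPU count and what is the CPU count?",
    split_on_and,
    lookup,
    join_answers,
)
```

Passing the three roles as callables keeps the control flow independent of any particular model or framework.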
Deployment
- Complete the Common Prerequisites.
- Run the pipeline.
- Check the status of the containers.
- Open your browser and interact with the RAG Playground at http://localhost:3001/converse.
06: Structured Data RAG
This example demonstrates a use case of RAG with structured CSV data.
This approach does not involve embedding models or vector database solutions and instead uses PandasAI for interacting with the CSV data.
For ingestion, the structured data is loaded from a CSV file into a Pandas dataframe.
The pipeline can ingest multiple CSV files, provided they have identical columns.
Ingestion of CSV files with differing columns is not currently supported and results in an exception.
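The identical-columns restriction can be sketched as follows. This is an illustration under stated assumptions, not the example's ingestion code: the example loads data into a Pandas dataframe, while this sketch uses the standard-library `csv` module to stay dependency-free, and the function name, the `ValueError`, and the sample rows are hypothetical.

```python
import csv
import io


def load_csv_sources(sources: list) -> list:
    """Read several CSV streams into one list of row dicts.

    Raises ValueError when a stream's header differs from the first one,
    mirroring the restriction that all ingested files share identical columns.
    """
    rows, expected_columns = [], None
    for source in sources:
        reader = csv.DictReader(source)
        columns = reader.fieldnames
        if expected_columns is None:
            expected_columns = columns
        elif columns != expected_columns:
            raise ValueError(f"columns {columns} do not match {expected_columns}")
        rows.extend(reader)
    return rows


# Two in-memory files with the same header are merged into one table.
first = io.StringIO("machineID,model,age\n1,model3,18\n")
second = io.StringIO("machineID,model,age\n2,model4,7\n")
combined = load_csv_sources([first, second])
```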
The core functionality utilizes a PandasAI agent to extract information from the dataframe.
This agent combines the query with the structure of the dataframe into an LLM prompt.
The LLM then generates Python code to extract the required information from the dataframe.
Subsequently, this generated code is executed on the dataframe, yielding the output dataframe.
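The prompt, generate, and execute loop described above can be sketched without PandasAI. Everything here is an assumption for illustration: the function names are hypothetical, a list of dicts stands in for the dataframe, and `fake_llm` returns a fixed snippet in place of a real model call. It does not reproduce the PandasAI API.

```python
def build_code_prompt(query: str, columns: list) -> str:
    """Combine the user query with the table structure into one LLM prompt."""
    return (
        "You are given a table named `rows` (a list of dicts) with columns: "
        + ", ".join(columns)
        + f".\nWrite Python that stores the answer to this question in `result`: {query}"
    )


def run_generated_code(code: str, rows: list):
    """Execute generated code against the rows and return its `result` variable."""
    namespace = {"rows": rows}
    exec(code, namespace)  # unsafe for untrusted code; sandbox anything LLM-generated
    return namespace["result"]


def answer(query: str, rows: list, generate_code):
    """Prompt a code generator, then execute what it returns on the data."""
    prompt = build_code_prompt(query, list(rows[0].keys()))
    return run_generated_code(generate_code(prompt), rows)


# Hypothetical stand-in for the LLM: always returns the same snippet.
def fake_llm(prompt: str) -> str:
    return "result = [r for r in rows if int(r['age']) > 10]"


machines = [{"machineID": "1", "age": "18"}, {"machineID": "2", "age": "7"}]
old_machines = answer("Which machines are older than 10 years?", machines, fake_llm)
```

Because the generated code is executed directly, this pattern is sensitive to prompt injection, which is one reason to review the security considerations at the end of this page before a production deployment.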
To test the example, sample CSV files are available.
These are part of the structured data example Helm chart and represent a subset of the Microsoft Azure Predictive Maintenance dataset from Kaggle.
The CSV data retrieval prompt is specifically tuned for three CSV files from this dataset: PdM_machines.csv, PdM_errors.csv, and PdM_failures.csv.
Specify the CSV file to use in the rag-app-structured-data-chatbot.yaml file within this resource by updating the environment variable CSV_NAME.
By default, the file is set to PdM_machines but you can change it to PdM_errors or PdM_failures.
Currently, customization of the CSV data retrieval prompt is not supported.
| LLM Model (Data Retrieval) | LLM Model (Response Paraphrasing) | Embedding | Framework | Document Type | Vector Database | Model deployment platform |
|---|---|---|---|---|---|---|
| meta/llama3-70b-instruct | meta/llama3-70b-instruct | Not Used | PandasAI | CSV | Not Used | On Prem |
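The CSV_NAME setting described in this section has one default and two alternatives. The sketch below shows how such a setting could be resolved and validated. It is hypothetical: this resource does not show how the chain server reads CSV_NAME, and the function name and the `ValueError` are assumptions made for illustration.

```python
import os

SUPPORTED_CSV_NAMES = ("PdM_machines", "PdM_errors", "PdM_failures")


def resolve_csv_name(env: dict) -> str:
    """Return the CSV_NAME setting, defaulting to PdM_machines when unset."""
    name = env.get("CSV_NAME", "PdM_machines")
    if name not in SUPPORTED_CSV_NAMES:
        raise ValueError(f"CSV_NAME must be one of {SUPPORTED_CSV_NAMES}, got {name!r}")
    return name


# Resolve against an explicit mapping; pass dict(os.environ) to use the real environment.
selected = resolve_csv_name({"CSV_NAME": "PdM_errors"})
```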
Deployment
- Complete the Common Prerequisites.
- Run the pipeline.
- Check the status of the containers.
- Open your browser and interact with the RAG Playground at http://localhost:3001/converse.
Configuring Milvus as the Vector Database
Perform the following steps to use Milvus as the vector database for an example.
- Edit the Docker Compose file for the example, such as rag-app-text-chatbot-llamaindex.yaml.
  Update the environment variables within the chain server service.
- Optional: Stop the running services.
- Stop and then start the services.
- Optional: View the chain server logs to confirm the vector database.
  - View the logs.
  - Try out a query after uploading any document and confirm that the log output includes the vector database.
Configuring pgvector as the Vector Database
- Edit the Docker Compose file for the example to use pgvector as the vector database.
  Update the environment variables within the chain server service.
  The Docker Compose file sets default values for the database user, password, and database.
  To override the defaults, edit the values in the Docker Compose file, or set the values in the compose.env file.
- Optional: Stop the running services.
- Stop and then start the services.
- Optional: View the chain server logs to confirm the vector database.
  - View the logs.
  - Try out a query after uploading any document and confirm that the log output includes the vector database.
Alternative LLM Model: Deploy Meta Llama 3 70B Instruct
The Llama 3 70B Instruct model with NVIDIA NIM for LLMs is the only supported alternative model.
- Edit the docker-compose-nim-ms.yaml file and set the image for the nemollm-inference-microservice service to nvcr.io/nim/meta/llama3-70b-instruct:1.0.0.
- Start the container.
Custom Chain Server
- Clone the repository.
- After you modify the chain server code, build the image.
- Tag and push the container to a private registry, such as NVIDIA NGC.
- Modify the compose file for the enterprise example that you want to run, and set your image for the chain server.
- Start the containers by running docker compose up -d.
Additional Resources
Learn more about how to use NVIDIA NIM microservices for RAG through our Deep Learning Institute. Access the course here.
Security considerations
The RAG applications are shared as reference architectures and are provided “as is”. Their security in production environments is the responsibility of the end users who deploy them. When deploying in a production environment, have security experts review any potential risks and threats (including direct and indirect prompt injection), define the trust boundaries, secure the communication channels, integrate AuthN and AuthZ with appropriate access controls, keep the deployment, including the containers, up to date, and ensure the containers are secure and free of vulnerabilities.
Licenses
By downloading or using this artifact included in the AI Chatbot workflows, you agree to the terms of the NVIDIA Software License Agreement and the Product-specific Terms for AI products.