This sample application demonstrates a fully functional co-pilot system performing Retrieval Augmented Generation with:

  • NVIDIA TensorRT
  • NVIDIA Triton
  • Milvus
  • LlamaIndex + LangChain
  • Meta's Llama 2 models




November 21, 2023
Augmenting an existing AI foundation model gives enterprises an advanced starting point and a low-cost way to generate accurate, clear responses for their specific use cases. The Retrieval Augmented Generation (RAG)-based AI chatbot workflow accelerates building and deploying enterprise LLM solutions and is currently in private early access for our NVIDIA AI Enterprise customers.

This RAG-based reference chatbot workflow contains:

  • NVIDIA NeMo framework - part of the NVIDIA AI Enterprise solution
  • NVIDIA TensorRT-LLM (TRT-LLM) for low-latency, high-throughput LLM inference
  • LangChain and LlamaIndex for combining language model components and easily constructing question-answering applications over a company’s database
  • Sample Jupyter notebooks and chatbot web application/API calls so that you can test the chat system interactively

Key benefits include:

  • The NeMo-powered LLM generates responses based on real-time information from the company’s knowledge base.
  • Inference is accelerated with Triton Inference Server and TRT-LLM.
  • The entire workflow can be deployed on your preferred on-prem or cloud platform.
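
The retrieve-then-generate pattern behind a RAG chatbot can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the workflow's actual code: the in-memory `DOCS` list stands in for a vector database such as Milvus, word-count cosine similarity stands in for learned embeddings, and `generate` is a hypothetical placeholder for the call to an LLM served behind Triton. All function names here are invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base (assumption: stands in for a vector store).
DOCS = [
    "Triton Inference Server hosts the optimized model for serving.",
    "Milvus is a vector database used to store document embeddings.",
    "Employees may carry over up to five vacation days each year.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        docs,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question in one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def generate(prompt):
    """Hypothetical placeholder for the LLM call; returns a stub string."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


def answer(query, docs=DOCS, k=1):
    """Retrieve supporting passages, then generate a grounded response."""
    passages = retrieve(query, docs, k)
    return generate(build_prompt(query, passages))
```

Calling `answer("How many vacation days can I carry over?")` retrieves the vacation-policy passage and places it in the prompt before the stubbed generation step. Keeping retrieval and generation as separate functions mirrors the way the knowledge base can be updated without retraining the model.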

Getting Started

To get started, review the documentation linked below to learn what this RAG-based AI chatbot workflow includes and how to run it.


By accessing NeMo as part of the AI chatbot with RAG workflow, you accept the terms and conditions of this End User License Agreement.