This recipe contains information and scripts to benchmark the speed performance of NVIDIA's Enterprise RAG Blueprint. This assumes you have already successfully deployed the RAG Blueprint in a K8s cluster and your service is ready to receive requests.
The scripts will launch a Kubernetes pod in the namespace of your deployed RAG pipeline and send requests to both the RAG service as well as directly to the LLM to gather latency and throughput metrics with a real-time inference use case across concurrency levels.
The GenAI-Perf tool is used to generate the requests and capture the following metrics: First Token Latency (FTL), Inter-Token Latency (ITL), Request Latency, Request Throughput, and Token Throughput.
For real-time inference we typically focus on FTL and ITL as they define the user experience. Offline or batch inference would typically focus on throughput metrics.
You will need:

- The `service:port` of both your RAG service and LLM deployed in the RAG pipeline
- The namespace in which your RAG pipeline is deployed
- An `NGC_API_KEY` (NGC Registry for access), which provides access to the following container: `nvcr.io/nvidia/tritonserver:<year>.<month>-py3-sdk`. You can find the latest `py3-sdk` tag on the Triton Inference Server page of the NGC catalog. Only the `-sdk` tags include GenAI-Perf, which is required for benchmarking.
- An `HF_TOKEN` (HuggingFace for access)

Set your default namespace to the one in which your RAG pipeline is deployed:
kubectl config set-context --current --namespace=<rag-pipeline-namespace>
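Optionally, confirm the default namespace took effect before proceeding; this is a standard kubectl check and not part of the Blueprint scripts.
kubectl config view --minify --output 'jsonpath={..namespace}'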
Start a pod on the Kubernetes cluster where your RAG pipeline is deployed using the latest Triton container with the `py3-sdk` tag. The pod name used here is `benchmark`, but you can name it whatever you want.
kubectl run benchmark --image=nvcr.io/nvidia/tritonserver:<year>.<month>-py3-sdk --command -- sleep 100000
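Before continuing, you may want to wait until the pod reports Ready; again, this is a standard kubectl check rather than part of the Blueprint scripts.
kubectl wait --for=condition=Ready pod/benchmark --timeout=120s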
Update `speed-config.sh`. Add your RAG service and LLM (non-RAG) service names and ports as defined in your RAG deployment:
export RAG_SERVICE="" # <service-name>:<port>
export NON_RAG_SERVICE="" # <service-name>:<port>
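If you are unsure of the exact service names and ports, listing the services in the namespace is one way to find them. The example exports below use hypothetical service names purely for illustration; substitute the names from your own deployment.
kubectl get svc
# Hypothetical examples only:
# export RAG_SERVICE="rag-server:8081"
# export NON_RAG_SERVICE="nim-llm:8000"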
Update the NIM model and tokenizer (only if you are using a model other than Llama 3.1 70B Instruct):
export NIM_MODEL_NAME="meta/llama-3.1-70b-instruct"
export NIM_MODEL_NAME_cleaned="meta-llama-3.1-70b-instruct"
export NIM_MODEL_TOKENIZER="meta-llama/Meta-Llama-3-70B-Instruct"
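If your LLM is served through a NIM's OpenAI-compatible API, you can usually confirm the model name it reports by querying its /v1/models endpoint. The command below assumes curl is available in the benchmark container and uses placeholder service/port values.
kubectl exec benchmark -- curl -s http://<non-rag-service>:<port>/v1/models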
Update the namespace you are using in your vector database, the number of tokens per chunk (`CHUNK_SIZE`, determined in the ingestion phase), and the number of tokens in the system prompt:
export NAMESPACE="wikipedia" # Vector database namespace
export CHUNK_SIZE=420 # Number of tokens
export RAG_PROMPT_EXTRA=100 # Number of Tokens
The default values for `VDB_K` and `Reranker_K` are 20 and 4 respectively (20 chunks are retrieved from the VDB and the Reranker ranks and filters them down to 4). If you changed these values in your deployment, adjust them here too:
export VDB_and_RERANKER_K="20/4"
Lastly, update the metadata about the experiment you are running and where to save the output files:
export GPU="H100"
export CLUSTER="My-Cluster"
export EXPERIMENT_NAME="RAG-Blueprint"
export OUTPUT="/tmp/output/"
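Putting the pieces together, a filled-in speed-config.sh might look like the following. The values are the examples used in this section, and the two service entries are placeholders; adjust everything to match your deployment.
# Example speed-config.sh -- illustrative values only
export RAG_SERVICE="rag-server:8081"        # placeholder <service-name>:<port>
export NON_RAG_SERVICE="nim-llm:8000"       # placeholder <service-name>:<port>
export NIM_MODEL_NAME="meta/llama-3.1-70b-instruct"
export NIM_MODEL_NAME_cleaned="meta-llama-3.1-70b-instruct"
export NIM_MODEL_TOKENIZER="meta-llama/Meta-Llama-3-70B-Instruct"
export NAMESPACE="wikipedia"                # Vector database namespace
export CHUNK_SIZE=420                       # Tokens per chunk
export RAG_PROMPT_EXTRA=100                 # Tokens in the system prompt
export VDB_and_RERANKER_K="20/4"
export GPU="H100"
export CLUSTER="My-Cluster"
export EXPERIMENT_NAME="RAG-Blueprint"
export OUTPUT="/tmp/output/"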
Copy the `sweep-speed.sh` and `speed-config.sh` scripts into the pod:
kubectl cp sweep-speed.sh benchmark:sweep-speed.sh
kubectl cp speed-config.sh benchmark:speed-config.sh
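You can verify both files landed in the pod before entering it (a standard kubectl check).
kubectl exec benchmark -- ls -l sweep-speed.sh speed-config.sh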
Enter the pod
kubectl exec --stdin --tty benchmark -- /bin/bash
Install the HuggingFace CLI and log in for tokenizer access using your `HF_TOKEN` (optional if you are not using a tokenizer that requires HF access):
pip install -U "huggingface_hub[cli]"
export HF_TOKEN=<hf-token>
huggingface-cli login --token $HF_TOKEN
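It is also worth confirming GenAI-Perf is available inside the container before starting the sweep.
genai-perf --help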
Run the sweeper:
. ./sweep-speed.sh
Example output
Total experiments to run: 45
[2025-02-28 05:10:30] Starting experiment: RAG-Off_CR:1_UseCase:chat_ISL:128_OSL:128_Model:meta-llama-3.1-70b-instruct_Cluster:_GPU:H100_Experiment:RAG-Blueprint_2025-02-28_05:10:30.883
genai-perf profile -m meta/llama-3.1-70b-instruct --endpoint-type chat --service-kind openai --streaming -u $SERVICE --num-prompts $total_requests --synthetic-input-tokens-mean $Request_ISL --synthetic-input-tokens-stddev 0 --concurrency $CR $NAMESPACE_PARAM --output-tokens-mean $OSL --extra-inputs max_tokens:$OSL --extra-inputs min_tokens:$OSL --extra-inputs ignore_eos:true --artifact-dir /tmp/output//$EXPORT_FILE --tokenizer meta-llama/Meta-Llama-3-70B-Instruct -- -v --max-threads=$CR --request-count $total_requests
...
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Time To First Token (ms) │ 59.16 │ 58.37 │ 71.83 │ 65.92 │ 59.41 │ 59.14 │
│ Time To Second Token (ms) │ 26.14 │ 25.64 │ 27.17 │ 26.73 │ 26.24 │ 26.20 │
│ Request Latency (ms) │ 3,397.78 │ 3,370.49 │ 3,416.71 │ 3,408.67 │ 3,399.19 │ 3,398.49 │
│ Inter Token Latency (ms) │ 26.34 │ 25.70 │ 27.38 │ 27.27 │ 26.50 │ 26.30 │
│ Output Sequence Length (tokens) │ 127.74 │ 123.00 │ 131.00 │ 129.53 │ 128.00 │ 128.00 │
│ Input Sequence Length (tokens) │ 128.00 │ 128.00 │ 128.00 │ 128.00 │ 128.00 │ 128.00 │
│ Output Token Throughput (per sec) │ 37.59 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 0.29 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 50.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Progress: 1 / 45 completed.
View the results in the output directory you configured (`OUTPUT="/tmp/output/"`):
ls /tmp/output/
RAG-Off_CR:1_UseCase:chat_ISL:128_OSL:128_Model:meta-llama-3.1-70b-instruct_Cluster:_GPU:H100_Experiment:RAG-Blueprint_2025-02-28_05:10:30.883
RAG-On_CHUNK-SIZE:420_SYS-PROMPT-SIZE:100_VDB-K:20_RERANKER-K:4_CR:1_UseCase:chat_ISL:128_OSL:128_Model:meta-llama-3.1-70b-instruct_Cluster:_GPU:H100_Experiment:RAG-Blueprint_2025-02-28_05:14:24.628
....
Find the output CSV in each experiment directory:
/tmp/output/RAG-Off_CR:1_UseCase:chat_ISL:128_OSL:128_Model:meta-llama-3.1-70b-instruct_Cluster:_GPU:H100_Experiment:RAG-Blueprint_2025-02-28_05:10:30.883/profile_export_genai_perf.csv
Example CSV output:
Metric,avg,min,max,p99,p95,p90,p75,p50,p25
Time To First Token (ms),23.94,22.92,30.75,30.10,27.51,24.26,23.38,23.20,23.02
Inter Token Latency (ms),11.54,11.45,11.70,11.69,11.64,11.58,11.55,11.53,11.52
Request Latency (ms),"5,919.16","5,912.73","5,934.45","5,933.31","5,928.74","5,923.03","5,920.17","5,918.56","5,915.29"
Output Sequence Length,511.90,505.00,516.00,515.91,515.55,515.10,512.75,512.00,511.25
Input Sequence Length,127.50,126.00,128.00,128.00,128.00,128.00,128.00,128.00,127.00
Metric,Value
Output Token Throughput (per sec),86.48
Request Throughput (per sec),0.17
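To analyze the results outside the pod, you can copy the whole output directory back to your workstation. Run this from outside the pod; the local destination path shown is arbitrary.
kubectl cp benchmark:/tmp/output ./rag-benchmark-output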
`sweep-speed.sh` runs a sweep over various parameters:

- `ISL/OSL` - Input Sequence Length / Output Sequence Length
- `Concurrency` - number of requests in flight at any given time
- `VDB_K` - number of chunks to retrieve from the Vector Database
- `Reranker_K` - number of chunks the Reranker will rank and filter

It runs at least three GenAI-Perf commands for each combination of ISL/OSL/Concurrency above:

- `RAG Off` - sends the request to the LLM by itself
- `RAG Off with extra ISL` - sends the request to the LLM by itself, with the approximated extra tokens that the RAG context vectors and system prompting add to the input sequence length
- `RAG On` - sends the request to the full RAG service

`RAG On` and `RAG Off with extra ISL` have additional GenAI-Perf commands depending on the combination of K chunks requested. The RAG Blueprint defaults to `ISL=128`, `OSL=512`, `Concurrency=1`, `VDB_K="1 5 10"`:
| Type | Combination (ISL, OSL, Concurrency, K) | # of GenAI-Perf Commands |
|---|---|---|
| RAG Off | (128, 512, 1, null) | 1 |
| RAG Off with extra ISL | (128, 512, 1, 1), (128, 512, 1, 5), (128, 512, 1, 10) | 3 |
| RAG On | (128, 512, 1, 1), (128, 512, 1, 5), (128, 512, 1, 10) | 3 |
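Conceptually, the sweep amounts to nested loops over these parameters. The sketch below is illustrative only and is not the actual contents of sweep-speed.sh; the loop values and the run_genai_perf helper are hypothetical.
# Illustrative sketch only -- not the real sweep-speed.sh
for CR in 1; do                                        # concurrency levels
  run_genai_perf "RAG-Off" 128 512 "$CR"               # one LLM-only run
  for K in 1 5 10; do                                  # VDB_K values
    run_genai_perf "RAG-Off-extra-ISL" 128 512 "$CR" "$K"
    run_genai_perf "RAG-On" 128 512 "$CR" "$K"
  done
done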
In the resulting plots, the red line indicates how fast we can process user requests with no extra context, the green line indicates how fast we can process user requests with the extra context (assuming that extra context arrived with no additional overhead), and the blue line represents the normal RAG pipeline.
The most impactful parameters of your pipeline are the relationship between `CHUNK_SIZE`, `VDB_K`, and `Reranker_K`, in terms of both speed and accuracy. The LLM is likely the largest bottleneck, but those parameters determine how much the LLM actually needs to process at any point in time. Reducing those three parameters will speed things up, but there is a speed/accuracy tradeoff.
Defining the latency constraints your users will expect helps you tune your pipeline while balancing speed and accuracy. Remember that no one likes a slow service, but no one will use your service if it provides poor results.
Metric | Perspective | Measurement | Description | Notes |
---|---|---|---|---|
First Token Latency (FTL) | User | p95 | How long until I get my first token | Example goal/constraint: < 1 second. Responsiveness of the service – critical for interactive applications. Highly dependent on ISL and the number/size of context vectors added to the ISL. |
Inter Token Latency (ITL) | User | p95 | How long do I wait between tokens | Example goal/constraint: < 100 ms. 100 ms ~= 450 words per minute per user, faster than a typical person can read. "Fast enough" is good enough in many use cases. This constraint might be much lower for use cases such as code generation. |
Request Latency (end-to-end) | User | p95 | How long until I get my final token | If I just want to copy/paste a summary or code generation, then I only care about when I get my last token. e2e latency = FTL + ((OSL -1) * ITL). |
Tokens Throughput (token/sec) | Service | Average | How many tokens is the service outputting per second | This metric can be misleading since it ignores the user experience waiting for tokens |
Request Throughput (req/sec) | Service | Average | How many requests is the service completing per second | This metric can be misleading since it ignores the user experience waiting for tokens |
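As a rough sanity check of the e2e formula using the averages from the example run above (FTL = 59.16 ms, ITL = 26.34 ms, OSL = 128): 59.16 + 127 × 26.34 ≈ 3,404 ms, which lines up with the measured average request latency of 3,397.78 ms.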
Inference use cases like chat, summarization, and analysis are characterized by the typical size of requests and responses, measured in tokens. Input sequence length (ISL) and output sequence length (OSL) combinations are used to define those use cases. For example, a chat application might expect a short ISL/OSL such as 128/128, whereas summarization is better reflected by 2048/512.
Concurrency is the number of parallel requests that your service is actively handling at the same time for a sustained period; it is not a one-time batch size. This more accurately reflects the needs of a real-time inference service.
The default use cases and concurrency values are examples; it is recommended to adjust them to reflect your expected business needs.
`CHUNK_SIZE` is the approximated size of the chunks of data retrieved from the VDB, measured in tokens. It is used for `RAG Off with extra ISL` to accurately represent the real chunk size seen in the `RAG On` scenario.
`RAG_PROMPT_EXTRA` is the approximated number of extra tokens added to a user's prompt in a RAG system, often called the system prompt. It is only used for `RAG Off with extra ISL` to accurately represent the real extra prompting seen in the `RAG On` scenario.
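For intuition, if the extra input for `RAG Off with extra ISL` is approximated as Reranker_K × CHUNK_SIZE + RAG_PROMPT_EXTRA (an assumption about how the padding is computed, not the script's documented formula), then with the defaults above a 128-token user prompt would grow to roughly 128 + 4 × 420 + 100 = 1,908 input tokens.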