This recipe provides instructions and scripts for benchmarking the inference performance of NVIDIA Inference Microservices (NIMs).
The script initializes containerized environments for inference benchmarking. It first deploys a container hosting the NIM service. A second container is launched with the Triton Server SDK and waits for the NIM service to be ready to receive requests. Then the second container executes the GenAI-Perf tool, systematically sending inference requests while capturing real-time performance metrics, including latency and throughput, across multiple concurrency levels.
The GenAI-Perf tool is used to generate the requests and capture the following metrics: First Token Latency (FTL), Inter-Token Latency (ITL), Request Latency, Request Throughput, and Token Throughput.
For real-time inference, it's essential to stay within First Token Latency and Inter-Token Latency constraints, as they define the user experience. Offline or batch inference generally prioritizes maximum throughput.
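As a rough rule of thumb (an approximation, not an exact identity), request latency decomposes as TTFT + (OSL - 1) x ITL, where OSL is the output sequence length. Using the sample CSV averages shown later in this recipe: 23.94 ms + (511.90 - 1) x 11.54 ms is approximately 5,919 ms, which matches the reported Request Latency.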
Prerequisites:

- NGC_API_KEY (NGC Registry for access), which provides access to the following containers:
  - nvcr.io/nvidia/tritonserver:25.01-py3-sdk
  - nvcr.io/nim/meta/llama3-70b-instruct:1.0.3
- NGC CLI: run ngc config set after installation; it requires the NGC_API_KEY from the first prerequisite.
- HF_TOKEN (HuggingFace for access)

We reference a number of Slurm commands and parameters in this document. A brief summary is included below. Note that these are a guide and might not be applicable to all environments; consult your system administrator for the parameters specific to your system.
Common parameters:

- SBATCH_PARTITION or -p - Partition (or queue) to use.
- SBATCH_ACCOUNT or -A - Slurm account to associate with your job, different from your user; meant for accounting purposes.
- SBATCH_GPUS_PER_NODE or --gres=gpu:<num gpus> - If your cluster is configured with GRES, this should be set to all GPUs in a node. Ignore if not configured.

These parameters can be set either by exporting the environment variable or by using the corresponding sbatch flag, as shown below.
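For example, to set them once via environment variables (placeholder values; use your site's account and partition):

# Slurm picks these up as defaults for subsequent sbatch calls
export SBATCH_ACCOUNT=<account>
export SBATCH_PARTITION=<partition>
# Only if your cluster is configured with GRES; set to all GPUs in a node
export SBATCH_GPUS_PER_NODE=<num gpus>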
Follow the steps below on your Slurm login node.
Set your HF token and NGC key (see prerequisites for access):
export NGC_API_KEY=<key>
export HF_TOKEN=<token>
Set your STAGE_PATH; this is the location where all benchmark artifacts will be stored.
export STAGE_PATH=<path> (e.g. /lustre/project/...)
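If the staging directory does not exist yet, create it before continuing (assuming you have write access to that filesystem):

mkdir -p "$STAGE_PATH"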
Run the setup script. It downloads the following containers as squashfs files:

nvcr.io/nvidia/tritonserver:25.01-py3-sdk --> nvidia+tritonserver+25.01.sqsh
nvcr.io/nim/meta/llama3-70b-instruct:1.0.3 --> nim+meta+llama3-70b-instruct+1.0.3.sqsh
sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N 1 ./setup.sh
Check the corresponding Slurm-<job_id>.out file for status information on downloading containers.
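After the setup job completes, you can sanity-check the downloads. A minimal sketch, assuming setup.sh places the .sqsh files directly under $STAGE_PATH (adjust the path if your setup stores them elsewhere):

# Both squashfs images should be present and multiple GB in size
ls -lh $STAGE_PATH/*.sqsh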
Download the Model
bash prepare_models.sh
Result:
Getting files to download...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • 131.4/131.4 GiB • Remaining: 0:00:00 • 166.8 MB/s • Elapsed: 0:14:23 • Total: 40 - Completed: 40 - Failed: 0
------------------------------------------------------------------------------------------------------
Download status: COMPLETED
Downloaded local path model: /lustre/../llama3-70b-instruct_vhf
Total files downloaded: 40
Total transferred: 131.43 GB
Started at: 2025-03-05 22:36:53
Completed at: 2025-03-05 22:51:17
Duration taken: 14m 23s
Getting files to download...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • 68.4/68.4 GiB • Remaining: 0:00:00 • 141.4 MB/s • Elapsed: 0:07:44 • Total: 10 - Completed: 10 - Failed: 0
-----------------------------------------------------------------------------------------------------------------------------------------------
Download status: COMPLETED
Downloaded local path model: /lustre/.../llama3-70b-instruct_vhf/llama3-70b-instruct_v0.10.0+cbc614f5-h100x4-fp8-throughput
Total files downloaded: 10
Total transferred: 68.44 GB
Started at: 2025-03-05 22:51:23
Completed at: 2025-03-05 22:59:08
Duration taken: 7m 44s
Make sure Total transferred: <size> GB reports the correct size; older NGC CLI versions will claim a successful download while only downloading a few KB. If so, install the latest NGC CLI. In addition, you should see in your $STAGE_PATH a folder called llama3-70b-instruct_vhf; verify that the weights and engine look like the listing below:
ls $STAGE_PATH/llama3-70b-instruct_vhf/
config.json model-00011-of-00030.safetensors model-00025-of-00030.safetensors
generation_config.json model-00012-of-00030.safetensors model-00026-of-00030.safetensors
LICENSE model-00013-of-00030.safetensors model-00027-of-00030.safetensors
llama3-70b-instruct_v0.10.0+cbc614f5-h100x4-fp8-throughput model-00014-of-00030.safetensors model-00028-of-00030.safetensors
model-00001-of-00030.safetensors model-00015-of-00030.safetensors model-00029-of-00030.safetensors
model-00002-of-00030.safetensors model-00016-of-00030.safetensors model-00030-of-00030.safetensors
model-00003-of-00030.safetensors model-00017-of-00030.safetensors model.safetensors.index.json
model-00004-of-00030.safetensors model-00018-of-00030.safetensors README.md
model-00005-of-00030.safetensors model-00019-of-00030.safetensors special_tokens_map.json
model-00006-of-00030.safetensors model-00020-of-00030.safetensors tokenizer_config.json
model-00007-of-00030.safetensors model-00021-of-00030.safetensors tokenizer.json
model-00008-of-00030.safetensors model-00022-of-00030.safetensors trtllm_engine
model-00009-of-00030.safetensors model-00023-of-00030.safetensors USE_POLICY.md
model-00010-of-00030.safetensors model-00024-of-00030.safetensors
ls $STAGE_PATH/llama3-70b-instruct_vhf/trtllm_engine/
checksums.blake3 LICENSE.txt metadata.json rank0.engine rank2.engine trt_llm_config.yaml
config.json NOTICE.txt rank1.engine rank3.engine
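As an additional check, the folder's on-disk size should roughly match the sum of the two downloads above (131.43 GB of weights plus 68.44 GB of engine files, about 200 GB in total):

# Report the total size of the downloaded weights and engine
du -sh $STAGE_PATH/llama3-70b-instruct_vhf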
Run the sbatch.sh script:
sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N 1 sbatch.sh
You should then see:
Submitted batch job <job-id>
You can check the job's status with:
squeue --job <job-id>
Once your job starts, check the progress of your script by viewing the server and benchmarking logs for your job-id:
tail -f $STAGE_PATH/logs/*<job-id>.out
You should see output like the following in these logs:
==> logs/server_<job-id>.out <==
INFO 02-28 15:38:31.420 server.py:82] Started server process [4179797]
INFO 02-28 15:38:31.421 on.py:48] Waiting for application startup.
INFO 02-28 15:38:31.422 on.py:62] Application startup complete.
INFO 02-28 15:38:31.424 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
==> logs/benchmarking-<job-id>.out <==
NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time to first token (ms) │ 16.26 │ 12.39 │ 17.25 │ 17.09 │ 16.68 │ 16.56 │
│ Inter token latency (ms) │ 1.85 │ 1.55 │ 2.04 │ 2.02 │ 1.97 │ 1.92 │
│ Request latency (ms) │ 499.20 │ 451.01 │ 554.61 │ 548.69 │ 526.13 │ 514.19 │
│ Output sequence length │ 261.90 │ 256.00 │ 298.00 │ 296.60 │ 270.00 │ 265.00 │
│ Input sequence length │ 550.06 │ 550.00 │ 553.00 │ 551.60 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │ 520.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request throughput (per sec) │ 1.99 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Benchmarking logs (benchmarking_<job-id>.out) and NIM inference service logs (server_<job-id>.out) will appear in your $STAGE_PATH/logs/ folder. CSV results for each GenAI-Perf command will appear under the results directory.
Example CSV Path:
${STAGE_PATH}/results/llama-3-70b-instruct_{GPUs}_{concurrency}_{use_case}_{ISL}_{OSL}_{job-id}/profile_export_genai_perf.csv
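To compare runs at a glance, you can extract a single metric from each per-run CSV. A minimal sketch using awk, assuming the directory layout above (the Time To First Token row contains no embedded commas, so a plain comma split is safe for this row):

# Print the average Time To First Token (second column) for every run
for f in ${STAGE_PATH}/results/*/profile_export_genai_perf.csv; do
  echo "$f: $(awk -F',' '/Time To First Token/ {print $2}' "$f") ms avg TTFT"
done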
Example CSV Data:
Metric,avg,min,max,p99,p95,p90,p75,p50,p25
Time To First Token (ms),23.94,22.92,30.75,30.10,27.51,24.26,23.38,23.20,23.02
Inter Token Latency (ms),11.54,11.45,11.70,11.69,11.64,11.58,11.55,11.53,11.52
Request Latency (ms),"5,919.16","5,912.73","5,934.45","5,933.31","5,928.74","5,923.03","5,920.17","5,918.56","5,915.29"
Output Sequence Length,511.90,505.00,516.00,515.91,515.55,515.10,512.75,512.00,511.25
Input Sequence Length,127.50,126.00,128.00,128.00,128.00,128.00,128.00,128.00,127.00
Metric,Value
Output Token Throughput (per sec),86.48
Request Throughput (per sec),0.17
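As a sanity check, the two summary numbers agree with the per-request metrics above: 0.17 requests/sec x ~511.9 output tokens per request is approximately 87 output tokens/sec, in line with the reported 86.48.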