This recipe contains information and scripts to produce performance results for the Mistral Hugging Face fine-tuning workload using PEFT and FSDP. The scripts handle environment setup, dataset setup, and launching benchmark jobs. This variant of the workload is best suited for GPU clusters with:
Performance for HF Mistral fine-tuning is measured in train samples per second, which is logged in the .out file associated with the job.
grep train_samples_per_second log-hf-mistral_7b_32_peft_fsdp_656947.out
{'train_runtime': 2950.1412, 'train_samples_per_second': 555.363, 'train_steps_per_second': 0.034, 'train_loss': 1.0721950674057006, 'epoch': 6.25}
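If you run the benchmark at several scales, a small loop can collect the metric from every result log. This is an optional convenience snippet, not part of the recipe's scripts; it assumes the log naming convention shown in this document and may need a different glob in your environment.

# Optional helper: print train_samples_per_second for every benchmark log
# Assumes logs follow the naming convention used by this recipe
for f in "$STAGE_PATH"/log-hf-mistral_7b_*_peft_fsdp_*.out; do
  echo "$f"
  grep -o "'train_samples_per_second': [0-9.]*" "$f"
done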
Mistral 7B BF16 | 8x H100 GPUs | 16x H100 GPUs | 32x H100 GPUs | 64x H100 GPUs | 128x H100 GPUs | 256x H100 GPUs |
---|---|---|---|---|---|---|
Train samples per second | 16.287 | 37.895 | 81.626 | 161.273 | 308.95 | 555.363 |
This recipe requires access to the Mistral model on Hugging Face. Instructions for requesting access are below.
Create a staging area by running the setup.sh script. The script converts the nvcr.io/nvidia/pytorch:24.02.framework Docker image into the nvidia+pytorch+24.02.framework.squash file under the $STAGE_PATH folder and downloads the DHS-LLM workshop source code.
# Set the path where all artifacts will be downloaded
export STAGE_PATH=<path to your shared file system folder> # e.g. /lustre/myproject/<userid>
# Set the Slurm partition to use
export SLURM_PARTITION="batch"
# Set the Slurm account to use
export SLURM_ACCOUNT="account_name"
# Run the setup
bash ./setup.sh
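After setup.sh finishes, a quick listing can confirm the expected artifacts exist. The directory name of the DHS-LLM workshop checkout is an assumption here; adjust the paths if your setup.sh lays things out differently.

# Sanity check (paths below are assumptions; adjust to your layout)
ls -lh "$STAGE_PATH"/nvidia+pytorch+24.02.framework.squash   # converted container image
find "$STAGE_PATH" -maxdepth 1 -iname "*dhs-llm*"            # DHS-LLM workshop source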
Access to Mistral 7B must be requested on the Hugging Face Mistral 7B model page (https://huggingface.co/mistralai/Mistral-7B-v0.1).
To download the model and dataset you will need to create a Hugging Face access token with READ privileges. You will use your HF user name and access token as the user/password for the git clones. For more information see: https://huggingface.co/docs/hub/en/security-tokens
Note: Cloning the model can take quite a while, and you will be prompted twice for your user/password. After the second prompt the clone may appear to hang while the large files download.
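To avoid re-typing the username/token at every prompt, you can optionally let git cache HTTPS credentials in memory. This is a convenience step, not part of the recipe.

# Optional: cache HTTPS credentials for one hour so later clones reuse them
git config --global credential.helper 'cache --timeout=3600'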
cd $STAGE_PATH
# Only needs to be performed once
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
If the model download step was successful, the model files should be present in the $STAGE_PATH/Mistral-7B-v0.1 folder.
git clone https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
If the dataset clone step was successful, the dataset files should be present in the $STAGE_PATH/ultrachat_200k/data folder.
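A simple way to verify both downloads is to list the target folders; the exact file names depend on the upstream Hugging Face repositories, so treat non-empty listings with large LFS files as the success signal.

# Verify the model and dataset clones (exact file names depend on the upstream repos)
ls -lh "$STAGE_PATH/Mistral-7B-v0.1"
ls -lh "$STAGE_PATH/ultrachat_200k/data"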
Once the environment has been prepared, it is time to train a model. Run the launch_hf_mistral_7b_peft_fsdp.sh script with sbatch to launch Hugging Face Mistral 7B model training on 1 to 64 nodes with BF16 precision.
Log files will be located under ${STAGE_PATH}/log-hf-mistral_7b_<num nodes>_peft_fsdp_<job id>.out
# Add -J <job name> and/or -A <account name> and/or -p <partition> and/or --gres gpu:8 to the sbatch command if needed
sbatch -N 8 launch_hf_mistral_7b_peft_fsdp.sh
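To reproduce the full results table above, the same script can be submitted once per node count. This is a minimal sketch assuming 8 GPUs per node; add the account/partition flags mentioned in the comment above if your cluster needs them.

# Sketch: submit one job per scale from the results table (8 GPUs per node assumed)
for nodes in 1 2 4 8 16 32; do
  sbatch -N "$nodes" launch_hf_mistral_7b_peft_fsdp.sh
done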
accelerate launch is run on every node, and pip install -r requirements.txt is executed as part of the srun command to ensure all compute nodes have the same environment. PYTHONPATH is set for this purpose.
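The snippet below only illustrates that per-node pattern; it is not the contents of launch_hf_mistral_7b_peft_fsdp.sh, and the script name train.py plus the accelerate flags are placeholders.

# Illustration only: one task per node installs dependencies, extends PYTHONPATH,
# and starts accelerate so every compute node runs in the same environment.
# train.py and the flag values are placeholders, not the recipe's real arguments.
srun --ntasks-per-node=1 bash -c '
  pip install -r requirements.txt
  export PYTHONPATH="$PWD:$PYTHONPATH"
  accelerate launch --num_machines "$SLURM_NNODES" --num_processes $((SLURM_NNODES * 8)) train.py
'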