This recipe contains information and scripts to produce performance results for the Mistral Hugging Face fine-tuning workload using PEFT and FSDP. The scripts handle environment setup, dataset setup, and launching benchmark jobs. This variant of the workload is best suited for GPU clusters with:
Performance for HF Mistral fine-tuning is measured in train samples per second, which is logged in the .out file associated with the job.
grep train_samples_per_second log-hf-mistral_7b_32_peft_fsdp_656947.out
{'train_runtime': 2950.1412, 'train_samples_per_second': 555.363, 'train_steps_per_second': 0.034, 'train_loss': 1.0721950674057006, 'epoch': 6.25}
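If you run the benchmark at several scales, a small loop can collect the metric from every result log. This is an optional convenience snippet, not part of the recipe's scripts; it assumes the log naming convention shown in this document and may need a different glob in your environment.

# Optional helper: print train_samples_per_second for every benchmark log
# Assumes logs follow the naming convention used by this recipe
for f in "$STAGE_PATH"/log-hf-mistral_7b_*_peft_fsdp_*.out; do
  echo "$f"
  grep -o "'train_samples_per_second': [0-9.]*" "$f"
done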
Mistral 7B BF16 | 8x H100 GPUs | 16x H100 GPUs | 32x H100 GPUs | 64x H100 GPUs | 128x H100 GPUs | 256x H100 GPUs |
---|---|---|---|---|---|---|
Train samples per second | 16.287 | 37.895 | 81.626 | 161.273 | 308.95 | 555.363 |
This recipe requires access to the Mistral model on Hugging Face. Instructions for requesting access are below.
Create a staging area by running the setup.sh script. The script converts the nvcr.io/nvidia/pytorch:24.02.framework Docker image into the nvidia+pytorch+24.02.framework.squash file under the $STAGE_PATH folder and downloads the DHS-LLM workshop source code.
# Set the path where all artifacts will be downloaded
export STAGE_PATH=<path to your shared file system folder> # e.g. /lustre/myproject/<userid>
# Set the Slurm partition to use
export SLURM_PARTITION="batch"
# Set the Slurm account to use
export SLURM_ACCOUNT="account_name"
# Run the setup
bash ./setup.sh
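After setup.sh finishes, a quick listing can confirm the expected artifacts exist. The directory name of the DHS-LLM workshop checkout is an assumption here; adjust the paths if your setup.sh lays things out differently.

# Sanity check (paths below are assumptions; adjust to your layout)
ls -lh "$STAGE_PATH"/nvidia+pytorch+24.02.framework.squash   # converted container image
find "$STAGE_PATH" -maxdepth 1 -iname "*dhs-llm*"            # DHS-LLM workshop source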
Access to Mistral 7B must be requested on the Hugging Face Mistral 7B model page (https://huggingface.co/mistralai/Mistral-7B-v0.1).
To download the model and dataset you will need to create a Hugging Face access token with READ privileges. You will use your HF user name and access token as the user/password for the git clones. For more information see: https://huggingface.co/docs/hub/en/security-tokens
Note: Cloning the model can take quite a while, and you will be prompted twice for your user/password. After the second prompt the clone may appear to hang while the large files download.
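To avoid re-typing the username/token at every prompt, you can optionally let git cache HTTPS credentials in memory. This is a convenience step, not part of the recipe.

# Optional: cache HTTPS credentials for one hour so later clones reuse them
git config --global credential.helper 'cache --timeout=3600'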
cd $STAGE_PATH
# Only needs to be performed once
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
If the model download step was successful, the model files should be present in the $STAGE_PATH/Mistral-7B-v0.1 folder.
git clone https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
If the dataset clone step was successful, the dataset files should be present in the $STAGE_PATH/ultrachat_200k/data folder.
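A simple way to verify both downloads is to list the target folders; the exact file names depend on the upstream Hugging Face repositories, so treat non-empty listings with large LFS files as the success signal.

# Verify the model and dataset clones (exact file names depend on the upstream repos)
ls -lh "$STAGE_PATH/Mistral-7B-v0.1"
ls -lh "$STAGE_PATH/ultrachat_200k/data"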
Once the environment has been prepared, it is time to train a model. Run the launch_hf_mistral_7b_peft_fsdp.sh script with sbatch to launch Hugging Face Mistral 7B model training on 1 to 64 nodes with BF16 precision.
Log files will be located under ${STAGE_PATH}/log-hf-mistral_7b_<num nodes>_peft_fsdp_<job id>.out
# Add -J <job name> and/or -A <account name> and/or -p <partition> and/or --gres gpu:8 to the sbatch command if needed
sbatch -N 8 launch_hf_mistral_7b_peft_fsdp.sh
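To reproduce the full results table above, the same script can be submitted once per node count. This is a minimal sketch assuming 8 GPUs per node; add the account/partition flags mentioned in the comment above if your cluster needs them.

# Sketch: submit one job per scale from the results table (8 GPUs per node assumed)
for nodes in 1 2 4 8 16 32; do
  sbatch -N "$nodes" launch_hf_mistral_7b_peft_fsdp.sh
done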
accelerate launch is run on every node, and pip install -r requirements.txt is executed as part of the srun command to ensure all compute nodes have the same environment. PYTHONPATH is set for this purpose.
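The snippet below only illustrates that per-node pattern; it is not the contents of launch_hf_mistral_7b_peft_fsdp.sh, and the script name train.py plus the accelerate flags are placeholders.

# Illustration only: one task per node installs dependencies, extends PYTHONPATH,
# and starts accelerate so every compute node runs in the same environment.
# train.py and the flag values are placeholders, not the recipe's real arguments.
srun --ntasks-per-node=1 bash -c '
  pip install -r requirements.txt
  export PYTHONPATH="$PWD:$PYTHONPATH"
  accelerate launch --num_machines "$SLURM_NNODES" --num_processes $((SLURM_NNODES * 8)) train.py
'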