This recipe contains information and scripts to produce performance results for the Llama 2 Hugging Face fine-tuning training workload using PEFT and FSDP. The scripts handle environment setup, dataset setup, and launching benchmark jobs. This variant of the workload is best-suited for GPU clusters with
Performance for HF Llama 2 fine-tuning is measured in train samples per second, which is logged in the .out file associated with the job. For example:
grep train_samples_per_second log-hf-llama2_70b_32_peft_fsdp_652934.out
{'train_runtime': 2577.7505, 'train_samples_per_second': 95.339, 'train_steps_per_second': 0.012, 'train_loss': 1.0156359354654947, 'epoch': 0.9}
wandb: train_samples_per_second 95.339
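To compare results across several runs, the same metric can be pulled out of every job log at once. A minimal sketch, assuming GNU grep and log files named according to the pattern described later in this recipe:

# Print the final train_samples_per_second reported by each job log
for f in ${STAGE_PATH}/log-hf-llama2_70b_*_peft_fsdp_*.out; do
  printf '%s\t' "$f"
  grep -oP "wandb:\s+train_samples_per_second\s+\K[0-9.]+" "$f" | tail -n 1
done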
LLAMA2 70b BF16 | 8x H100 GPUs | 16x H100 GPUs | 32x H100 GPUs | 64x H100 GPUs | 128x H100 GPUs | 256x H100 GPUs | 512x H100 GPUs |
---|---|---|---|---|---|---|---|
Train samples per second | 1.554 | 3.664 | 9.432 | 21.905 | 47.342 | 95.339 | 154.931 |
This recipe requires access to Hugging Face Llama 2. Instructions are below if needed.
Create a staging area by running the setup.sh script. The script converts the Docker image nvcr.io/nvidia/pytorch:24.02.framework into the nvidia+pytorch+24.02.framework.squash file under the $STAGE_PATH folder and downloads the DHS-LLM workshop source code.
# Set the path where all artifacts will be downloaded
export STAGE_PATH=<path to your shared file system folder>  # e.g. /lustre/myproject/<userid>
# Set the Slurm partition to use
export SLURM_PARTITION="batch"
# Set the Slurm account to use
export SLURM_ACCOUNT="account_name"
# Run the setup
bash ./setup.sh
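For reference, the container-to-squash-file conversion that setup.sh performs is roughly equivalent to an enroot import of the image. A hedged sketch, assuming enroot is installed; the actual script may use different options:

# Roughly what setup.sh does for the container image (illustrative only)
enroot import --output $STAGE_PATH/nvidia+pytorch+24.02.framework.squash docker://nvcr.io#nvidia/pytorch:24.02.framework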
Access to Llama 2 must be requested through Meta's website and then on the Hugging Face Llama page. The approval process is not automatic and can take a day or more.
To download the model and dataset you will need to create a Hugging Face access token with READ privileges. You will use your HF username and access token as the user/password for the git clones. For more information see: https://huggingface.co/docs/hub/en/security-tokens
Note: Cloning the model can take well over an hour, and you will be prompted twice for user/password. After the second prompt it will appear as if the clone has hung; this is expected while the large files download.
cd $STAGE_PATH
# Only needs to be performed once
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-70b-hf
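If you prefer to avoid the interactive password prompts, the access token can be embedded in the clone URL instead. This is a hedged alternative; <hf_user> and <hf_token> are placeholders, and be aware the token will be stored in your shell history and in .git/config:

# Non-interactive clone with the token in the URL (replace the placeholders)
git clone https://<hf_user>:<hf_token>@huggingface.co/meta-llama/Llama-2-70b-hf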
If the model download step was successful, the model files should be present in the $STAGE_PATH/Llama-2-70b-hf folder.
git clone https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
If the dataset clone step was successful, the dataset files should be present in the $STAGE_PATH/ultrachat_200k/data folder.
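A quick way to sanity-check both downloads before launching jobs; this is a minimal sketch, and the exact file names in each repository may change over time:

# Sanity-check the model and dataset checkouts
du -sh $STAGE_PATH/Llama-2-70b-hf                   # a complete 70B checkout is well over 100 GB
(cd $STAGE_PATH/Llama-2-70b-hf && git lfs fsck)     # verifies the LFS objects downloaded cleanly
ls $STAGE_PATH/ultrachat_200k/data                  # the dataset files should be listed here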
More information on the model and dataset can be found at https://huggingface.co/meta-llama/Llama-2-70b-hf and https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k, respectively.
Once the environment has been prepared, it is time to train a model. Run the launch_hf_llama2_70b_peft_fsdp.sh script with sbatch to launch Hugging Face LLAMA2 70b model training on 1 to 64 nodes with BF16 precision.
Log files will be located under ${STAGE_PATH}/log-hf-llama2_70b_<num nodes>_peft_fsdp_<job id>.out.
# Add -J <job name> and/or -A <account name> and/or -p <partition> and/or --gres gpu:8 to the sbatch command if needed
sbatch -N 8 launch_hf_llama2_70b_peft_fsdp.sh
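To reproduce the scaling table above, the same script can be submitted once per node count. A minimal sketch, assuming 8 GPUs per node and the SLURM_ACCOUNT / SLURM_PARTITION variables exported during setup:

# Submit one job per node count (1-64 nodes, i.e. 8-512 GPUs)
for nodes in 1 2 4 8 16 32 64; do
  sbatch -N $nodes -A $SLURM_ACCOUNT -p $SLURM_PARTITION launch_hf_llama2_70b_peft_fsdp.sh
done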
accelerate is launched on every node, and pip install -r requirements.txt is run as part of the srun command so that all compute nodes have the same environment; PYTHONPATH is set as part of this step.
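For orientation, the per-node step inside the launch script has roughly the following shape. This is an illustrative sketch only: the training script name, requirements path, PYTHONPATH target, and accelerate options are assumptions, not the recipe's actual values.

# Illustrative shape of the per-node launch (names and paths are hypothetical)
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun --ntasks-per-node=1 bash -c "
  pip install -r requirements.txt &&
  export PYTHONPATH=\$STAGE_PATH/DHS-LLM-Workshop:\$PYTHONPATH &&
  accelerate launch \
    --num_machines \$SLURM_NNODES \
    --machine_rank \$SLURM_NODEID \
    --num_processes \$((SLURM_NNODES * 8)) \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    train.py
"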