The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference mode.
To benchmark the training performance on a specific batch size, run:
Tacotron 2
For 1 GPU, mixed precision (AMP):
python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path> --amp
For 1 GPU, FP32/TF32:
python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path>
For multiple GPUs, mixed precision (AMP):
python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path> --amp
For multiple GPUs, FP32/TF32:
python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path>
WaveGlow
For 1 GPU, mixed precision (AMP):
python train.py -m WaveGlow -o <output_dir> -lr 1e-4 --epochs 10 -bs <batch_size> --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_1250_filelist.txt --dataset-path <dataset-path> --amp
For 1 GPU, FP32/TF32:
python train.py -m WaveGlow -o <output_dir> -lr 1e-4 --epochs 10 -bs <batch_size> --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_1250_filelist.txt --dataset-path <dataset-path>
For multiple GPUs, mixed precision (AMP):
python -m multiproc train.py -m WaveGlow -o <output_dir> -lr 1e-4 --epochs 10 -bs <batch_size> --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_1250_filelist.txt --dataset-path <dataset-path> --amp
For multiple GPUs, FP32/TF32:
python -m multiproc train.py -m WaveGlow -o <output_dir> -lr 1e-4 --epochs 10 -bs <batch_size> --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_1250_filelist.txt --dataset-path <dataset-path>
Each of these scripts runs for 10 epochs and, for each epoch, measures the average number of items per second. The performance results can be read from the `nvlog.json` files produced by the commands.
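If you want to post-process these logs, the short sketch below is one possible starting point. It assumes the DLLogger-style layout used for `nvlog.json` in this repository (one `DLLL `-prefixed JSON object per line, with metrics under a `data` field) and simply averages every metric whose name ends in `items_per_sec`; the prefix and field names are assumptions, so adjust them if your log looks different.

```python
# Minimal sketch: average every "*items_per_sec" metric found in an nvlog.json file.
# The "DLLL " prefix and the "data" field are assumptions about the DLLogger layout;
# adjust them if your log format differs.
import json
from collections import defaultdict

def average_items_per_sec(log_path):
    sums, counts = defaultdict(float), defaultdict(int)
    with open(log_path) as log_file:
        for line in log_file:
            line = line.strip()
            if not line.startswith("DLLL "):
                continue
            entry = json.loads(line[len("DLLL "):])
            for key, value in entry.get("data", {}).items():
                if key.endswith("items_per_sec") and isinstance(value, (int, float)):
                    sums[key] += value
                    counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

if __name__ == "__main__":
    for metric, value in sorted(average_items_per_sec("nvlog.json").items()):
        print(f"{metric}: {value:.1f} items/sec")
```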
To benchmark the inference performance with batch size 1, run:
Mixed precision (FP16):
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase_1_64.txt --fp16 --log-file=output/nvlog_fp16.json
FP32:
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase_1_64.txt --log-file=output/nvlog_fp32.json
The output log files will contain performance numbers for the Tacotron 2 model (number of output mel-spectrograms per second, reported as `tacotron2_items_per_sec`) and for WaveGlow (number of output samples per second, reported as `waveglow_items_per_sec`).
The `inference.py` script will run a few warmup iterations before running the benchmark.
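To compare the two runs, a sketch like the one below can read both log files and report the FP16/FP32 speed-up for the metrics above; as before, the `DLLL ` prefix and the `data` field are assumptions about the DLLogger layout and may need adjusting.

```python
# Minimal sketch: FP16 vs FP32 inference speed-up from the two log files above.
# The "DLLL " prefix and "data" field are assumptions about the DLLogger layout.
import json

METRICS = ("tacotron2_items_per_sec", "waveglow_items_per_sec")

def mean_metrics(log_path):
    values = {metric: [] for metric in METRICS}
    with open(log_path) as log_file:
        for line in log_file:
            if not line.startswith("DLLL "):
                continue
            data = json.loads(line[len("DLLL "):]).get("data", {})
            for metric in METRICS:
                if metric in data:
                    values[metric].append(data[metric])
    # Average over all logged iterations.
    return {metric: sum(v) / len(v) for metric, v in values.items() if v}

fp16 = mean_metrics("output/nvlog_fp16.json")
fp32 = mean_metrics("output/nvlog_fp32.json")
for metric in METRICS:
    if metric in fp16 and metric in fp32:
        print(f"{metric}: FP16 {fp16[metric]:,.0f}/s, FP32 {fp32[metric]:,.0f}/s, "
              f"speed-up {fp16[metric] / fp32[metric]:.2f}x")
```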
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results were obtained by running the `./platform/DGXA100_{tacotron2,waveglow}_{AMP,TF32}_{1,4,8}NGPU_train.sh` training script in the PyTorch-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. All of the results were produced using the `train.py` script as described in the Training process section of this document. For each model, the loss is taken from a sample run.
Loss (Model/Epoch) | 1 | 250 | 500 | 750 | 1000 |
---|---|---|---|---|---|
Tacotron 2 FP16 | 3.82 | 0.56 | 0.42 | 0.38 | 0.35 |
Tacotron 2 TF32 | 3.50 | 0.54 | 0.41 | 0.37 | 0.35 |
WaveGlow FP16 | -3.31 | -5.72 | -5.87 | -5.94 | -5.99 |
WaveGlow TF32 | -4.46 | -5.93 | -5.98 | | |
Figure 4. Tacotron 2 FP16 loss - batch size 128 (sample run)
Figure 5. Tacotron 2 TF32 loss - batch size 128 (sample run)
Figure 6. WaveGlow FP16 loss - batch size 10 (sample run)
Figure 7. WaveGlow TF32 loss - batch size 4 (sample run)
Our results were obtained by running the `./platform/DGX1_{tacotron2,waveglow}_{AMP,FP32}_{1,4,8}NGPU_train.sh` training script in the PyTorch-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. All of the results were produced using the `train.py` script as described in the Training process section of this document.
Loss (Model/Epoch) | 1 | 250 | 500 | 750 | 1000 |
---|---|---|---|---|---|
Tacotron 2 FP16 | 13.0732 | 0.5736 | 0.4408 | 0.3923 | 0.3735 |
Tacotron 2 FP32 | 8.5776 | 0.4807 | 0.3875 | 0.3421 | 0.3308 |
WaveGlow FP16 | -2.2054 | -5.7602 | -5.901 | -5.9706 | -6.0258 |
WaveGlow FP32 | -3.0327 | -5.858 | -6.0056 | -6.0613 | -6.1087 |
Figure 4. Tacotron 2 FP16 loss - batch size 104 (mean and std over 16 runs)
Figure 5. Tacotron 2 FP32 loss - batch size 48 (mean and std over 16 runs)
Figure 6. WaveGlow FP16 loss - batch size 10 (mean and std over 16 runs)
Figure 7. WaveGlow FP32 loss - batch size 4 (mean and std over 16 runs)
Our results were obtained by running the `./platform/DGXA100_{tacotron2,waveglow}_{AMP,TF32}_{1,4,8}NGPU_train.sh` training script in the PyTorch-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in output mel-spectrograms per second for Tacotron 2 and output samples per second for WaveGlow) were averaged over an entire training epoch.
This table shows the results for Tacotron 2:
Number of GPUs | Batch size per GPU | Mel-spectrograms per second with mixed precision | Mel-spectrograms per second with TF32 | Speed-up with mixed precision | Multi-GPU weak scaling with mixed precision | Multi-GPU weak scaling with TF32 |
---|---|---|---|---|---|---|
1 | 128 | 26,484 | 31,499 | 0.84 | 1.00 | 1.00 |
4 | 128 | 107,482 | 124,591 | 0.86 | 4.06 | 3.96 |
8 | 128 | 209,186 | 250,556 | 0.83 | 7.90 | 7.95 |
The following table shows the results for WaveGlow:
Number of GPUs | Batch size per GPU | Samples per second with mixed precision | Samples per second with TF32 | Speed-up with mixed precision | Multi-GPU weak scaling with mixed precision | Multi-GPU weak scaling with TF32 |
---|---|---|---|---|---|---|
1 | 10@FP16, 4@TF32 | 149,479 | 67,581 | 2.21 | 1.00 | 1.00 |
4 | 10@FP16, 4@TF32 | 532,363 | 233,846 | 2.28 | 3.56 | 3.46 |
8 | 10@FP16, 4@TF32 | 905,043 | 383,043 | 2.36 | 6.05 | 5.67 |
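The derived columns in these tables follow directly from the throughput columns: speed-up divides the mixed-precision throughput by the TF32 throughput at the same GPU count, and weak scaling divides the N-GPU throughput by the single-GPU throughput at the same precision. A quick check against the Tacotron 2 table above:

```python
# Reproduce the derived columns of the Tacotron 2 table above.
# throughput[(num_gpus, precision)] = mel-spectrograms per second from the table.
throughput = {
    (1, "amp"): 26_484, (1, "tf32"): 31_499,
    (4, "amp"): 107_482, (4, "tf32"): 124_591,
    (8, "amp"): 209_186, (8, "tf32"): 250_556,
}

for n in (1, 4, 8):
    speedup = throughput[(n, "amp")] / throughput[(n, "tf32")]        # e.g. 0.84 for 1 GPU
    scaling_amp = throughput[(n, "amp")] / throughput[(1, "amp")]     # e.g. 7.90 for 8 GPUs
    scaling_tf32 = throughput[(n, "tf32")] / throughput[(1, "tf32")]  # e.g. 7.95 for 8 GPUs
    print(f"{n} GPU(s): speed-up {speedup:.2f}, "
          f"weak scaling AMP {scaling_amp:.2f}, TF32 {scaling_tf32:.2f}")
```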
The following table shows the expected training time for convergence for Tacotron 2 (1501 epochs):
Number of GPUs | Batch size per GPU | Time to train with mixed precision (Hrs) | Time to train with TF32 (Hrs) | Speed-up with mixed precision |
---|---|---|---|---|
1 | 128 | 112 | 94 | 0.84 |
4 | 128 | 29 | 25 | 0.87 |
8 | 128 | 16 | 14 | 0.84 |
The following table shows the expected training time for convergence for WaveGlow (1001 epochs):
Number of GPUs | Batch size per GPU | Time to train with mixed precision (Hrs) | Time to train with TF32 (Hrs) | Speed-up with mixed precision |
---|---|---|---|---|
1 | 10@FP16, 4@TF32 | 188 | 416 | 2.21 |
4 | 10@FP16, 4@TF32 | 54 | 122 | 2.27 |
8 | 10@FP16, 4@TF32 | 33 | 75 | 2.29 |
Our results were obtained by running the `./platform/DGX1_{tacotron2,waveglow}_{AMP,FP32}_{1,4,8}NGPU_train.sh` training script in the PyTorch-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in output mel-spectrograms per second for Tacotron 2 and output samples per second for WaveGlow) were averaged over an entire training epoch.
This table shows the results for Tacotron 2:
Number of GPUs | Batch size per GPU | Mel-spectrograms per second with mixed precision | Mel-spectrograms per second with FP32 | Speed-up with mixed precision | Multi-GPU weak scaling with mixed precision | Multi-GPU weak scaling with FP32 |
---|---|---|---|---|---|---|
1 | 104@FP16, 48@FP32 | 15,891 | 9,174 | 1.73 | 1.00 | 1.00 |
4 | 104@FP16, 48@FP32 | 53,417 | 32,035 | 1.67 | 3.36 | 3.49 |
8 | 104@FP16, 48@FP32 | 115,032 | 58,703 | 1.96 | 7.24 | 6.40 |
The following table shows the results for WaveGlow:
Number of GPUs | Batch size per GPU | Samples per second with mixed precision | Samples per second with FP32 | Speed-up with mixed precision | Multi-GPU weak scaling with mixed precision | Multi-GPU weak scaling with FP32 |
---|---|---|---|---|---|---|
1 | 10@FP16, 4@FP32 | 105,873 | 33,761 | 3.14 | 1.00 | 1.00 |
4 | 10@FP16, 4@FP32 | 364,471 | 118,254 | 3.08 | 3.44 | 3.50 |
8 | 10@FP16, 4@FP32 | 690,909 | 222,794 | 3.10 | 6.53 | 6.60 |
To achieve these same results, follow the steps in the Quick Start Guide.
The following table shows the expected training time for convergence for Tacotron 2 (1501 epochs):
Number of GPUs | Batch size per GPU | Time to train with mixed precision (Hrs) | Time to train with FP32 (Hrs) | Speed-up with mixed precision |
---|---|---|---|---|
1 | 104@FP16, 48@FP32 | 181 | 333 | 1.84 |
4 | 104@FP16, 48@FP32 | 53 | 88 | 1.66 |
8 | 104@FP16, 48@FP32 | 31 | 48 | 1.56 |
The following table shows the expected training time for convergence for WaveGlow (1001 epochs):
Number of GPUs | Batch size per GPU | Time to train with mixed precision (Hrs) | Time to train with FP32 (Hrs) | Speed-up with mixed precision |
---|---|---|---|---|
1 | 10@FP16, 4@FP32 | 249 | 793 | 3.18 |
4 | 10@FP16, 4@FP32 | 78 | 233 | 3.00 |
8 | 10@FP16, 4@FP32 | 48 | 127 | 2.98 |
The following tables show inference statistics for the Tacotron 2 and WaveGlow text-to-speech system, gathered from 1000 inference runs on 1x A100, 1x V100, and 1x T4, respectively. Latency is measured from the start of Tacotron 2 inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor: the number of seconds of speech generated per second of compute.
Our results were obtained by running the inference-script-name.sh inference benchmarking script in the PyTorch-20.06-py3 NGC container on an NVIDIA DGX A100 (1x A100 40GB) GPU.
Batch size | Input length | Precision | WN channels | Avg latency (s) | Latency std (s) | Latency confidence interval 50% (s) | Latency confidence interval 90% (s) | Latency confidence interval 95% (s) | Latency confidence interval 99% (s) | Throughput (samples/sec) | Speed-up with mixed precision | Avg mels generated (81 mels=1 sec of speech) | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 128 | FP16 | 256 | 0.80 | 0.02 | 0.80 | 0.83 | 0.84 | 0.86 | 192,086 | 1.08 | 602 | 6.99 | 8.74 |
4 | 128 | FP16 | 256 | 1.05 | 0.03 | 1.05 | 1.09 | 1.10 | 1.13 | 602,856 | 1.20 | 619 | 7.19 | 6.85 |
1 | 128 | FP32 | 256 | 0.87 | 0.02 | 0.87 | 0.90 | 0.91 | 0.93 | 177,210 | 1.00 | 601 | 6.98 | 8.02 |
4 | 128 | FP32 | 256 | 1.27 | 0.03 | 1.26 | 1.31 | 1.32 | 1.35 | 500,458 | 1.00 | 620 | 7.20 | 5.67 |
1 | 128 | FP16 | 512 | 0.87 | 0.02 | 0.87 | 0.90 | 0.92 | 0.94 | 176,135 | 1.12 | 601 | 6.98 | 8.02 |
4 | 128 | FP16 | 512 | 1.37 | 0.03 | 1.36 | 1.42 | 1.43 | 1.45 | 462,691 | 1.32 | 619 | 7.19 | 5.25 |
1 | 128 | FP32 | 512 | 0.98 | 0.03 | 0.98 | 1.02 | 1.03 | 1.07 | 156,586 | 1.00 | 602 | 6.99 | 7.13 |
4 | 128 | FP32 | 512 | 1.81 | 0.05 | 1.79 | 1.86 | 1.90 | 1.93 | 351,465 | 1.00 | 620 | 7.20 | 3.98 |
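For reference, the derived columns can be reproduced from a single measurement as sketched below. The sketch assumes the 22,050 Hz sampling rate and 256-sample mel hop length used by this Tacotron 2/WaveGlow configuration; the result will not match the table exactly, since the table averages per-run quantities over 1000 runs.

```python
# Recompute the derived inference columns from a single measurement.
# Assumes the 22,050 Hz sampling rate and 256-sample hop length of this
# Tacotron 2 + WaveGlow configuration; table values differ slightly because
# they average per-run quantities over 1000 runs.
SAMPLING_RATE = 22050
HOP_LENGTH = 256

def derived_metrics(batch_size, mels_generated, latency_s):
    samples_per_item = mels_generated * HOP_LENGTH
    audio_length_s = samples_per_item / SAMPLING_RATE
    throughput = batch_size * samples_per_item / latency_s  # samples/sec
    rtf = audio_length_s / latency_s  # seconds of speech per second of compute
    return audio_length_s, throughput, rtf

# First A100 row above: batch size 1, 602 mels generated, 0.80 s average latency.
audio, tput, rtf = derived_metrics(1, 602, 0.80)
print(f"audio {audio:.2f} s, throughput {tput:,.0f} samples/s, RTF {rtf:.2f}")
# ~6.99 s of audio, ~193,000 samples/s, RTF ~8.74
```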
The following table shows inference statistics gathered on a single NVIDIA V100 GPU:
Batch size | Input length | Precision | WN channels | Avg latency (s) | Latency std (s) | Latency confidence interval 50% (s) | Latency confidence interval 90% (s) | Latency confidence interval 95% (s) | Latency confidence interval 99% (s) | Throughput (samples/sec) | Speed-up with mixed precision | Avg mels generated (81 mels=1 sec of speech) | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 128 | FP16 | 256 | 1.14 | 0.07 | 1.12 | 1.20 | 1.33 | 1.40 | 136,069 | 1.58 | 602 | 6.99 | 6.13 |
4 | 128 | FP16 | 256 | 1.52 | 0.05 | 1.52 | 1.58 | 1.61 | 1.65 | 416,688 | 1.72 | 619 | 7.19 | 4.73 |
1 | 128 | FP32 | 256 | 1.79 | 0.06 | 1.78 | 1.86 | 1.89 | 1.99 | 86,175 | 1.00 | 602 | 6.99 | 3.91 |
4 | 128 | FP32 | 256 | 2.61 | 0.07 | 2.61 | 2.71 | 2.74 | 2.78 | 242,656 | 1.00 | 619 | 7.19 | 2.75 |
1 | 128 | FP16 | 512 | 1.25 | 0.08 | 1.23 | 1.32 | 1.44 | 1.50 | 124,057 | 1.90 | 602 | 6.99 | 5.59 |
4 | 128 | FP16 | 512 | 2.11 | 0.06 | 2.10 | 2.19 | 2.22 | 2.29 | 300,505 | 2.37 | 620 | 7.20 | 3.41 |
1 | 128 | FP32 | 512 | 2.36 | 0.08 | 2.35 | 2.46 | 2.54 | 2.61 | 65,239 | 1.00 | 601 | 6.98 | 2.96 |
4 | 128 | FP32 | 512 | 5.00 | 0.14 | 4.96 | 5.18 | 5.26 | 5.42 | 126,810 | 1.00 | 618 | 7.18 | 1.44 |
The following table shows inference statistics gathered on a single NVIDIA T4 GPU:
Batch size | Input length | Precision | WN channels | Avg latency (s) | Latency std (s) | Latency confidence interval 50% (s) | Latency confidence interval 90% (s) | Latency confidence interval 95% (s) | Latency confidence interval 99% (s) | Throughput (samples/sec) | Speed-up with mixed precision | Avg mels generated (81 mels=1 sec of speech) | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 128 | FP16 | 256 | 1.23 | 0.05 | 1.22 | 1.29 | 1.33 | 1.42 | 125,397 | 2.46 | 602 | 6.99 | 5.68 |
4 | 128 | FP16 | 256 | 2.85 | 0.08 | 2.84 | 2.96 | 2.99 | 3.07 | 222,672 | 1.90 | 620 | 7.20 | 2.53 |
1 | 128 | FP32 | 256 | 3.03 | 0.10 | 3.02 | 3.14 | 3.19 | 3.32 | 50,900 | 1.00 | 602 | 6.99 | 2.31 |
4 | 128 | FP32 | 256 | 5.41 | 0.15 | 5.38 | 5.61 | 5.66 | 5.85 | 117,325 | 1.00 | 620 | 7.20 | 1.33 |
1 | 128 | FP16 | 512 | 1.75 | 0.08 | 1.73 | 1.87 | 1.91 | 1.98 | 88,319 | 2.79 | 602 | 6.99 | 4.00 |
4 | 128 | FP16 | 512 | 4.59 | 0.13 | 4.57 | 4.77 | 4.83 | 4.94 | 138,226 | 2.84 | 620 | 7.20 | 1.57 |
1 | 128 | FP32 | 512 | 4.87 | 0.14 | 4.86 | 5.03 | 5.13 | 5.27 | 31,630 | 1.00 | 602 | 6.99 | 1.44 |
4 | 128 | FP32 | 512 | 13.02 | 0.37 | 12.96 | 13.53 | 13.67 | 14.13 | 48,749 | 1.00 | 620 | 7.20 | 0.55 |
Our results were obtained by running the `./run_latency_tests.sh` script in the PyTorch-20.06-py3 NGC container. Please note that to reproduce the results, you need to provide pretrained checkpoints for Tacotron 2 and WaveGlow and edit the script to point to your checkpoint filenames.
To compare with inference performance on CPU with TorchScript, benchmark CPU inference with the `./run_latency_tests_cpu.sh` script to obtain numbers for batch sizes 1 and 4. Intel's optimizations for PyTorch on CPU are included; you need to set `export OMP_NUM_THREADS=<num physical cores>` according to the number of physical cores on your CPU. For reference, see https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html
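If you are not sure how many physical cores your machine has, a quick check like the sketch below can print a suggested setting; it assumes the optional `psutil` package is installed (counting cores via `lscpu` or `/proc/cpuinfo` works equally well).

```python
# Print a suggested OMP_NUM_THREADS setting based on the number of physical cores.
# Assumes the optional psutil package; lscpu or /proc/cpuinfo work just as well.
import psutil

physical_cores = psutil.cpu_count(logical=False)
print(f"export OMP_NUM_THREADS={physical_cores}")
```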