Jasper for PyTorch | NVIDIA NGC

The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR).

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance in a specific setting on the train-clean-100 subset of LibriSpeech, run:

BATCH_SIZE_SEQ=<BATCH_SIZES> NUM_GPUS_SEQ=<NUMS_OF_GPUS> bash scripts/train_benchmark.sh

By default, this script runs 2 epochs with the configuration configs/jasper10x5dr_speedp-online_train-benchmark.yaml, which applies gentle speed perturbation that does not change the length of the output. This lets training step times stabilize immediately in cuDNN benchmark mode. The script benchmarks batch size 32 on 1, 4, and 8 GPUs, and requires an 8x 32GB GPU machine.
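As a rough illustration of sweeping several settings, the sketch below calls the benchmark script once per (batch size, GPU count) pair. It passes a single value per variable, so it makes no assumption about how the script parses lists. The helper names `benchmark_env` and `run_sweep` and the `dry_run` flag are hypothetical and not part of the repository.

```python
import os
import subprocess

# Script path comes from the text above; everything else here is illustrative.
TRAIN_BENCHMARK = "scripts/train_benchmark.sh"


def benchmark_env(batch_size, num_gpus, base_env=None):
    """Return a copy of the environment with one benchmark setting applied."""
    env = dict(os.environ if base_env is None else base_env)
    # One value per variable, so no list separator has to be assumed.
    env["BATCH_SIZE_SEQ"] = str(batch_size)
    env["NUM_GPUS_SEQ"] = str(num_gpus)
    return env


def run_sweep(batch_sizes, gpu_counts, dry_run=True):
    """Plan (and optionally run) one benchmark call per setting."""
    planned = []
    for num_gpus in gpu_counts:
        for batch_size in batch_sizes:
            planned.append((batch_size, num_gpus))
            if not dry_run:
                # Only executed when dry_run=False, on a machine with the repo.
                subprocess.run(
                    ["bash", TRAIN_BENCHMARK],
                    env=benchmark_env(batch_size, num_gpus),
                    check=True,
                )
    return planned


# Default setting described above: batch size 32 on 1, 4, and 8 GPUs.
plan = run_sweep(batch_sizes=[32], gpu_counts=[1, 4, 8])
```

With `dry_run=True` the function only returns the planned settings, which is a cheap way to check the sweep before committing GPU time.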

Inference performance benchmark

To benchmark the inference performance on a specific batch size and audio length, run:

BATCH_SIZE_SEQ=<BATCH_SIZES> MAX_DURATION_SEQ=<DURATIONS> bash scripts/inference_benchmark.sh

By default, the script runs on a single GPU and evaluates on the dataset limited to utterances shorter than MAX_DURATION. It uses the model configuration configs/jasper10x5dr_speedp-online_speca.yaml.
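To make the duration limit concrete, here is a minimal sketch of restricting an evaluation set to utterances shorter than a given maximum. The `Utterance` record and its fields are hypothetical stand-ins; the actual manifest format used by the scripts may differ.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical manifest entry; the real dataset format may differ.
    audio_path: str
    duration_s: float
    transcript: str


def limit_by_duration(utterances, max_duration_s):
    """Keep only utterances strictly shorter than max_duration_s."""
    return [u for u in utterances if u.duration_s < max_duration_s]


# Made-up entries for illustration only.
manifest = [
    Utterance("a.flac", 2.4, "hello world"),
    Utterance("b.flac", 16.7, "a long utterance"),
    Utterance("c.flac", 9.9, "something in between"),
]
short = limit_by_duration(manifest, max_duration_s=10.0)
```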

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference. All models are trained on 960 hours of LibriSpeech with a maximum audio length of 16.7s. Each trained model is evaluated on LibriSpeech dev-clean, dev-other, test-clean, and test-other. Checkpoints for evaluation are chosen based on their word error rate on dev-clean.
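For readers unfamiliar with the metric, the sketch below computes a corpus-level word error rate as the word-level edit distance divided by the number of reference words, and then picks the checkpoint with the lowest dev-clean WER. It is a generic illustration, not the repository's evaluation code, and the checkpoint names and WER values in the usage lines are made up.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER in percent: total edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


def best_checkpoint(dev_clean_wer):
    """Return the checkpoint name with the lowest dev-clean WER."""
    return min(dev_clean_wer, key=dev_clean_wer.get)


# One substitution plus one deletion against four reference words.
wer = word_error_rate(["the cat sat down"], ["the cat sit"])
# Hypothetical checkpoint names and scores, for illustration only.
chosen = best_checkpoint({"ckpt_a": 3.5, "ckpt_b": 3.2, "ckpt_c": 3.4})
```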

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB). The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.

Number of GPUs	Batch size per GPU	Precision	dev-clean WER	dev-other WER	test-clean WER	test-other WER	Time to train
8	64	mixed	3.20	9.78	3.41	9.71	70 h

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB). The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.

Number of GPUs	Batch size per GPU	Precision	dev-clean WER	dev-other WER	test-clean WER	test-other WER	Time to train
8	64	mixed	3.26	10.00	3.54	9.80	130 h

We show the best of 5 runs (mixed precision) and 2 runs (FP32), chosen based on dev-clean WER. For FP32, two gradient accumulation steps were used.

Training stability test

The following tables compare greedy decoding word error rates across 8 training runs with different seeds for mixed precision training, one table per platform.

DGX A100 80GB, FP16, 8x GPU	Seed #1	Seed #2	Seed #3	Seed #4	Seed #5	Seed #6	Seed #7	Seed #8	Mean	Std
dev-clean	3.46	3.55	3.45	3.44	3.25	3.34	3.20	3.40	3.39	0.11
dev-other	10.30	10.77	10.36	10.26	9.99	10.18	9.78	10.32	10.25	0.27
test-clean	3.84	3.81	3.66	3.64	3.58	3.55	3.41	3.73	3.65	0.13
test-other	10.61	10.52	10.49	10.47	9.89	10.09	9.71	10.26	10.26	0.31

DGX-1 32GB, FP16, 8x GPU	Seed #1	Seed #2	Seed #3	Seed #4	Seed #5	Seed #6	Seed #7	Seed #8	Mean	Std
dev-clean	3.31	3.31	3.26	3.44	3.40	3.35	3.36	3.28	3.34	0.06
dev-other	10.02	10.01	10.00	10.06	10.05	10.03	10.10	10.04	10.04	0.03
test-clean	3.49	3.50	3.54	3.61	3.57	3.58	3.48	3.51	3.54	0.04
test-other	10.11	10.14	9.80	10.09	10.17	9.99	9.86	10.00	10.02	0.13
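The Mean and Std columns can be recomputed from the per-seed values. The sketch below does this for the DGX A100 dev-clean row. It assumes a population standard deviation (dividing by N); the source does not state which estimator was used, and the sample estimator (dividing by N-1) would give a slightly larger value.

```python
from statistics import mean, pstdev

# dev-clean WER for the eight DGX A100 seeds, copied from the table above.
dev_clean_a100 = [3.46, 3.55, 3.45, 3.44, 3.25, 3.34, 3.20, 3.40]

wer_mean = mean(dev_clean_a100)
# Population standard deviation; this choice of estimator is an assumption.
wer_std = pstdev(dev_clean_a100)
# Seeds are numbered from 1 in the table.
best_seed = dev_clean_a100.index(min(dev_clean_a100)) + 1

print(f"mean={wer_mean:.2f} std={wer_std:.2f} best seed=#{best_seed}")
```

The seed with the lowest dev-clean WER is the run reported in the DGX A100 training accuracy table.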

Training performance results

Our results were obtained by running the scripts/train.sh training script in the PyTorch 20.10-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
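As a sketch of what steady-state throughput means here, the function below discards an initial warm-up phase and divides the number of processed sequences by the time spent on the remaining steps. The function name, the warm-up length, and the timings are illustrative assumptions; the training script may measure this differently.

```python
def steady_state_throughput(step_times_s, batch_size_per_gpu, num_gpus,
                            warmup_steps):
    """Sequences per second over the steps that follow the warm-up phase."""
    steady = step_times_s[warmup_steps:]
    if not steady:
        raise ValueError("no steps left after discarding warm-up")
    # Each step processes one batch on every GPU.
    sequences = batch_size_per_gpu * num_gpus * len(steady)
    return sequences / sum(steady)


# Illustrative timings only: two slow warm-up steps, then stable steps.
step_times = [2.0, 1.0, 0.5, 0.5, 0.5, 0.5]
throughput = steady_state_throughput(step_times, batch_size_per_gpu=32,
                                     num_gpus=1, warmup_steps=2)
```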

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Batch size / GPU	GPUs	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	Weak scaling - TF32	Weak scaling - mixed precision
32	1	42.18	64.32	1.52	1.00	1.00
32	4	157.49	239.23	1.52	3.73	3.72
32	8	310.10	470.09	1.52	7.35	7.31
64	1	49.64	75.59	1.52	1.00	1.00
64	4	192.66	289.16	1.50	3.88	3.83
64	8	371.41	547.91	1.48	7.48	7.25

Note: Mixed precision permits larger batch sizes during training. We report the largest batch sizes (as powers of 2) that fit without gradient accumulation.
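The derived columns of the table follow directly from the throughput columns: the speedup is mixed precision throughput divided by TF32 throughput, and weak scaling is the multi-GPU throughput divided by the single-GPU throughput at the same per-GPU batch size. A small sketch using the batch-size-32 rows (the function names are illustrative):

```python
# (batch size per GPU, GPUs) -> (TF32, mixed precision) throughput in
# sequences per second, copied from the batch-size-32 rows of the table above.
throughput = {
    (32, 1): (42.18, 64.32),
    (32, 4): (157.49, 239.23),
    (32, 8): (310.10, 470.09),
}
TF32, MIXED = 0, 1  # column indices into the tuples above


def speedup(batch_size, gpus):
    """Mixed precision throughput relative to TF32 for one configuration."""
    tf32, mixed = throughput[(batch_size, gpus)]
    return mixed / tf32


def weak_scaling(batch_size, gpus, precision):
    """Throughput on `gpus` GPUs relative to 1 GPU, same per-GPU batch size."""
    multi = throughput[(batch_size, gpus)][precision]
    single = throughput[(batch_size, 1)][precision]
    return multi / single


for (batch_size, gpus) in throughput:
    print(batch_size, gpus,
          round(speedup(batch_size, gpus), 2),
          round(weak_scaling(batch_size, gpus, TF32), 2),
          round(weak_scaling(batch_size, gpus, MIXED), 2))
```

An ideal weak-scaling value equals the number of GPUs, so 7.35 on 8 GPUs corresponds to roughly 92% scaling efficiency for TF32.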