Jasper for PyTorch

Description

The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR).

Publisher: NVIDIA
Latest Version: 20.10.12
Modified: April 4, 2023
Compressed Size: 3.76 MB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following sections show how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance in a specific setting on the train-clean-100 subset of LibriSpeech, run:

BATCH_SIZE_SEQ=<BATCH_SIZES> NUM_GPUS_SEQ=<NUMS_OF_GPUS> bash scripts/train_benchmark.sh

By default, this script runs 2 epochs on the configuration configs/jasper10x5dr_speedp-online_train-benchmark.yaml, which applies gentle speed perturbation that does not change the length of the output, so that training step times stabilize immediately in cuDNN benchmark mode. The script benchmarks batch size 32 on 1, 4, and 8 GPUs, and requires an 8x 32GB GPU machine.
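
For example, the default sweep could be launched explicitly as follows (a sketch; passing the sequences as space-separated lists is an assumption about the script's interface):

BATCH_SIZE_SEQ="32" NUM_GPUS_SEQ="1 4 8" bash scripts/train_benchmark.sh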

Inference performance benchmark

To benchmark the inference performance on a specific batch size and audio length, run:

BATCH_SIZE_SEQ=<BATCH_SIZES> MAX_DURATION_SEQ=<DURATIONS> bash scripts/inference_benchmark.sh

By default, the script runs on a single GPU and evaluates on the dataset limited to utterances shorter than MAX_DURATION. It uses the model configuration configs/jasper10x5dr_speedp-online_speca.yaml.
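
For example, to sweep the batch sizes and utterance lengths reported in the inference tables below (a sketch; passing the sequences as space-separated lists is an assumption about the script's interface):

BATCH_SIZE_SEQ="1 2 4 8 16" MAX_DURATION_SEQ="2 7 16.7" bash scripts/inference_benchmark.sh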

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference. All models were trained on 960 hours of LibriSpeech with a maximum audio length of 16.7 s, and are evaluated on the LibriSpeech dev-clean, dev-other, test-clean, and test-other subsets. Checkpoints for evaluation are chosen based on their word error rate (WER) on dev-clean.
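
All accuracy numbers below are word error rates with greedy decoding. As a reference for the metric itself, here is a minimal WER sketch (word-level edit distance normalized by reference length; this is illustrative, not the repository's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ~ 0.167
```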

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 20.10-py3 NGC container on an NVIDIA DGX A100 (8x A100 80GB). The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.

| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
|---|---|---|---|---|---|---|---|
| 8 | 64 | mixed | 3.20 | 9.78 | 3.41 | 9.71 | 70 h |

Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

Our results were obtained by running the scripts/train.sh training script in the PyTorch 20.10-py3 NGC container on an NVIDIA DGX-1 (8x V100 32GB). The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.

| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
|---|---|---|---|---|---|---|---|
| 8 | 64 | mixed | 3.26 | 10.00 | 3.54 | 9.80 | 130 h |

We show the best of 5 runs (mixed precision) and 2 runs (FP32), chosen based on dev-clean WER. For FP32, two gradient accumulation steps were used.

Training stability test

The following tables compare greedy decoding word error rates across 8 training runs with different random seeds for mixed precision training; a quick check of the reported Mean and Std columns follows the tables.

| DGX A100 80GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|---|
| dev-clean | 3.46 | 3.55 | 3.45 | 3.44 | 3.25 | 3.34 | 3.20 | 3.40 | 3.39 | 0.11 |
| dev-other | 10.30 | 10.77 | 10.36 | 10.26 | 9.99 | 10.18 | 9.78 | 10.32 | 10.25 | 0.27 |
| test-clean | 3.84 | 3.81 | 3.66 | 3.64 | 3.58 | 3.55 | 3.41 | 3.73 | 3.65 | 0.13 |
| test-other | 10.61 | 10.52 | 10.49 | 10.47 | 9.89 | 10.09 | 9.71 | 10.26 | 10.26 | 0.31 |

| DGX-1 32GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|---|
| dev-clean | 3.31 | 3.31 | 3.26 | 3.44 | 3.40 | 3.35 | 3.36 | 3.28 | 3.34 | 0.06 |
| dev-other | 10.02 | 10.01 | 10.00 | 10.06 | 10.05 | 10.03 | 10.10 | 10.04 | 10.04 | 0.03 |
| test-clean | 3.49 | 3.50 | 3.54 | 3.61 | 3.57 | 3.58 | 3.48 | 3.51 | 3.54 | 0.04 |
| test-other | 10.11 | 10.14 | 9.80 | 10.09 | 10.17 | 9.99 | 9.86 | 10.00 | 10.02 | 0.13 |
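
As a quick check of the Mean and Std columns, consider the dev-clean row of the DGX A100 table (a minimal sketch; the use of the population standard deviation is an assumption consistent with the rounded values):

```python
from statistics import mean, pstdev

# dev-clean WER across the 8 seeds (DGX A100 80GB row above)
wer = [3.46, 3.55, 3.45, 3.44, 3.25, 3.34, 3.20, 3.40]
print(round(mean(wer), 2))    # 3.39
print(round(pstdev(wer), 2))  # 0.11
```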

Training performance results

Our results were obtained by running the scripts/train.sh training script in the PyTorch 20.10-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.

Training performance: NVIDIA DGX A100 (8x A100 80GB)

| Batch size / GPU | GPUs | Throughput - TF32 (seq/s) | Throughput - mixed precision (seq/s) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 32 | 1 | 42.18 | 64.32 | 1.52 | 1.00 | 1.00 |
| 32 | 4 | 157.49 | 239.23 | 1.52 | 3.73 | 3.72 |
| 32 | 8 | 310.10 | 470.09 | 1.52 | 7.35 | 7.31 |
| 64 | 1 | 49.64 | 75.59 | 1.52 | 1.00 | 1.00 |
| 64 | 4 | 192.66 | 289.16 | 1.50 | 3.88 | 3.83 |
| 64 | 8 | 371.41 | 547.91 | 1.48 | 7.48 | 7.25 |

Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.
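
For reference, weak scaling is the ratio of multi-GPU to single-GPU throughput at a fixed per-GPU batch size; in the table above, mixed precision at batch size 32 on 8 GPUs scales by 470.09 / 64.32 ≈ 7.31.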

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 16GB)

| Batch size / GPU | GPUs | Throughput - FP32 (seq/s) | Throughput - mixed precision (seq/s) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 16 | 1 | 10.71 | 27.87 | 2.60 | 1.00 | 1.00 |
| 16 | 4 | 40.28 | 99.80 | 2.48 | 3.76 | 3.58 |
| 16 | 8 | 78.23 | 193.89 | 2.48 | 7.30 | 6.96 |

Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 32GB)

| Batch size / GPU | GPUs | Throughput - FP32 (seq/s) | Throughput - mixed precision (seq/s) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 32 | 1 | 12.22 | 34.08 | 2.79 | 1.00 | 1.00 |
| 32 | 4 | 46.97 | 128.39 | 2.73 | 3.84 | 3.77 |
| 32 | 8 | 92.44 | 249.00 | 2.69 | 7.57 | 7.31 |
| 64 | 1 | N/A | 39.30 | N/A | N/A | 1.00 |
| 64 | 4 | N/A | 150.18 | N/A | N/A | 3.82 |
| 64 | 8 | N/A | 282.68 | N/A | N/A | 7.19 |

Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-2 (16x V100 32GB)

| Batch size / GPU | GPUs | Throughput - FP32 (seq/s) | Throughput - mixed precision (seq/s) | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 32 | 1 | 13.46 | 38.94 | 2.89 | 1.00 | 1.00 |
| 32 | 4 | 51.38 | 143.44 | 2.79 | 3.82 | 3.68 |
| 32 | 8 | 100.54 | 280.48 | 2.79 | 7.47 | 7.20 |
| 32 | 16 | 188.14 | 515.90 | 2.74 | 13.98 | 13.25 |
| 64 | 1 | N/A | 43.86 | N/A | N/A | 1.00 |
| 64 | 4 | N/A | 165.27 | N/A | N/A | 3.77 |
| 64 | 8 | N/A | 318.10 | N/A | N/A | 7.25 |
| 64 | 16 | N/A | 567.47 | N/A | N/A | 12.94 |

Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance results

Our results were obtained by running the scripts/inference_benchmark.sh script in the PyTorch 20.10-py3 NGC container on a single GPU of NVIDIA DGX A100, DGX-1, DGX-2, and T4 systems. Performance numbers (latency in milliseconds per batch) were averaged over 500 iterations.
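
The percentile columns summarize the distribution of per-batch latencies. A minimal sketch of how such statistics can be derived from raw timings (the variable names and placeholder data are illustrative, not the repository's benchmarking code):

```python
import numpy as np

# Placeholder: 500 per-batch latency measurements in milliseconds (assumed input).
latencies_ms = np.random.lognormal(mean=3.5, sigma=0.05, size=500)

# Report the same statistics as the tables below.
p90, p95, p99 = np.percentile(latencies_ms, [90, 95, 99])
print(f"90%: {p90:.2f} ms  95%: {p95:.2f} ms  "
      f"99%: {p99:.2f} ms  avg: {latencies_ms.mean():.2f} ms")
```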

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

| BS | Duration (s) | FP16 90% (ms) | FP16 95% (ms) | FP16 99% (ms) | FP16 Avg (ms) | TF32 90% (ms) | TF32 95% (ms) | TF32 99% (ms) | TF32 Avg (ms) | FP16/TF32 speedup (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 32.40 | 32.50 | 32.82 | 32.30 | 33.30 | 33.64 | 34.65 | 33.25 | 1.03 |
| 2 | 2.0 | 32.90 | 33.51 | 34.35 | 32.69 | 34.48 | 34.65 | 35.66 | 34.27 | 1.05 |
| 4 | 2.0 | 32.85 | 33.01 | 33.89 | 32.60 | 34.09 | 34.46 | 35.22 | 34.00 | 1.04 |
| 8 | 2.0 | 35.51 | 35.89 | 37.10 | 35.33 | 34.86 | 35.36 | 36.08 | 34.45 | 0.98 |
| 16 | 2.0 | 36.00 | 36.57 | 37.40 | 35.77 | 43.83 | 44.12 | 44.77 | 43.39 | 1.21 |
| 1 | 7.0 | 33.50 | 33.99 | 34.91 | 33.03 | 33.83 | 34.25 | 34.95 | 33.70 | 1.02 |
| 2 | 7.0 | 34.43 | 34.89 | 35.72 | 34.22 | 34.41 | 34.73 | 35.69 | 34.28 | 1.00 |
| 4 | 7.0 | 34.30 | 34.59 | 35.43 | 34.07 | 37.95 | 38.18 | 38.87 | 37.55 | 1.10 |
| 8 | 7.0 | 35.98 | 36.28 | 37.11 | 35.28 | 44.64 | 44.79 | 45.37 | 44.29 | 1.26 |
| 16 | 7.0 | 39.86 | 40.08 | 41.16 | 39.33 | 55.17 | 55.46 | 57.24 | 54.56 | 1.39 |
| 1 | 16.7 | 35.20 | 35.80 | 38.71 | 34.36 | 35.36 | 35.76 | 36.55 | 34.64 | 1.01 |
| 2 | 16.7 | 35.40 | 35.81 | 36.50 | 34.76 | 36.34 | 36.53 | 37.40 | 35.87 | 1.03 |
| 4 | 16.7 | 36.01 | 36.38 | 37.37 | 35.57 | 44.69 | 45.09 | 45.88 | 43.92 | 1.23 |
| 8 | 16.7 | 41.48 | 41.78 | 44.22 | 40.69 | 58.57 | 58.74 | 59.62 | 58.11 | 1.43 |
| 16 | 16.7 | 61.37 | 61.93 | 66.32 | 60.92 | 97.33 | 97.71 | 100.04 | 96.56 | 1.59 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

| BS | Duration (s) | FP16 90% (ms) | FP16 95% (ms) | FP16 99% (ms) | FP16 Avg (ms) | FP32 90% (ms) | FP32 95% (ms) | FP32 99% (ms) | FP32 Avg (ms) | FP16/FP32 speedup (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 45.42 | 45.62 | 49.54 | 45.02 | 48.83 | 48.99 | 51.66 | 48.44 | 1.08 |
| 2 | 2.0 | 50.31 | 50.53 | 53.66 | 49.10 | 49.87 | 50.04 | 52.99 | 49.41 | 1.01 |
| 4 | 2.0 | 49.17 | 49.48 | 52.13 | 48.73 | 52.92 | 53.21 | 55.28 | 52.31 | 1.07 |
| 8 | 2.0 | 51.20 | 51.40 | 52.32 | 49.01 | 73.02 | 73.30 | 75.00 | 71.99 | 1.47 |
| 16 | 2.0 | 51.75 | 52.24 | 56.36 | 51.27 | 83.99 | 84.57 | 86.69 | 83.24 | 1.62 |
| 1 | 7.0 | 48.13 | 48.53 | 50.95 | 46.78 | 48.52 | 48.75 | 50.89 | 48.01 | 1.03 |
| 2 | 7.0 | 49.52 | 50.10 | 52.35 | 48.00 | 65.27 | 65.41 | 66.59 | 64.79 | 1.35 |
| 4 | 7.0 | 51.75 | 52.01 | 54.39 | 50.38 | 93.75 | 94.77 | 97.04 | 92.27 | 1.83 |
| 8 | 7.0 | 54.80 | 56.27 | 66.23 | 52.95 | 130.65 | 131.09 | 132.91 | 129.82 | 2.45 |
| 16 | 7.0 | 73.02 | 73.42 | 75.83 | 71.96 | 157.53 | 158.20 | 160.73 | 155.51 | 2.16 |
| 1 | 16.7 | 48.10 | 48.52 | 52.71 | 47.20 | 73.34 | 73.56 | 74.19 | 72.69 | 1.54 |
| 2 | 16.7 | 64.21 | 64.52 | 65.56 | 56.06 | 129.48 | 129.97 | 131.78 | 126.36 | 2.25 |
| 4 | 16.7 | 60.38 | 61.03 | 63.18 | 58.87 | 183.33 | 183.85 | 185.53 | 181.90 | 3.09 |
| 8 | 16.7 | 85.88 | 86.34 | 87.70 | 84.46 | 227.42 | 228.21 | 229.63 | 225.71 | 2.67 |
| 16 | 16.7 | 135.62 | 136.40 | 137.69 | 131.58 | 276.90 | 277.59 | 281.16 | 275.08 | 2.09 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-1 (1x V100 32GB)

| BS | Duration (s) | FP16 90% (ms) | FP16 95% (ms) | FP16 99% (ms) | FP16 Avg (ms) | FP32 90% (ms) | FP32 95% (ms) | FP32 99% (ms) | FP32 Avg (ms) | FP16/FP32 speedup (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 52.74 | 53.01 | 54.40 | 51.47 | 55.97 | 56.22 | 57.93 | 54.93 | 1.07 |
| 2 | 2.0 | 51.77 | 52.15 | 54.69 | 50.98 | 56.58 | 56.87 | 58.88 | 55.35 | 1.09 |
| 4 | 2.0 | 51.41 | 51.76 | 53.47 | 50.55 | 61.56 | 61.87 | 63.81 | 60.74 | 1.20 |
| 8 | 2.0 | 51.83 | 52.15 | 54.08 | 50.85 | 80.20 | 80.69 | 81.67 | 77.69 | 1.53 |
| 16 | 2.0 | 70.48 | 70.96 | 72.11 | 62.98 | 93.00 | 93.44 | 94.17 | 89.05 | 1.41 |
| 1 | 7.0 | 49.77 | 50.21 | 51.88 | 48.73 | 52.74 | 52.99 | 54.54 | 51.67 | 1.06 |
| 2 | 7.0 | 51.12 | 51.47 | 52.84 | 49.98 | 65.33 | 65.63 | 67.07 | 64.64 | 1.29 |
| 4 | 7.0 | 53.13 | 53.56 | 55.68 | 52.15 | 93.54 | 93.85 | 94.72 | 92.76 | 1.78 |
| 8 | 7.0 | 57.67 | 58.07 | 59.89 | 56.41 | 133.93 | 134.18 | 134.88 | 133.15 | 2.36 |
| 16 | 7.0 | 76.09 | 76.48 | 79.13 | 75.27 | 162.35 | 162.77 | 164.63 | 161.30 | 2.14 |
| 1 | 16.7 | 54.78 | 55.29 | 56.83 | 52.51 | 75.37 | 76.27 | 78.05 | 74.32 | 1.42 |
| 2 | 16.7 | 56.80 | 57.20 | 59.01 | 55.49 | 130.60 | 131.36 | 132.93 | 128.55 | 2.32 |
| 4 | 16.7 | 64.19 | 64.84 | 66.47 | 62.87 | 188.09 | 188.76 | 190.07 | 185.76 | 2.95 |
| 8 | 16.7 | 87.46 | 87.86 | 89.99 | 86.47 | 232.33 | 232.89 | 234.43 | 230.44 | 2.67 |
| 16 | 16.7 | 136.02 | 136.52 | 139.44 | 134.78 | 283.87 | 284.59 | 286.70 | 282.01 | 2.09 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA DGX-2 (1x V100 32GB)

| BS | Duration (s) | FP16 90% (ms) | FP16 95% (ms) | FP16 99% (ms) | FP16 Avg (ms) | FP32 90% (ms) | FP32 95% (ms) | FP32 99% (ms) | FP32 Avg (ms) | FP16/FP32 speedup (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 35.88 | 36.12 | 39.80 | 35.20 | 42.95 | 43.67 | 46.65 | 42.23 | 1.20 |
| 2 | 2.0 | 36.36 | 36.57 | 40.97 | 35.60 | 41.83 | 42.21 | 45.60 | 40.97 | 1.15 |
| 4 | 2.0 | 36.69 | 36.89 | 41.25 | 36.05 | 48.35 | 48.52 | 52.35 | 47.80 | 1.33 |
| 8 | 2.0 | 37.49 | 37.70 | 41.37 | 36.88 | 65.41 | 65.64 | 66.50 | 64.96 | 1.76 |
| 16 | 2.0 | 41.35 | 41.79 | 45.58 | 40.91 | 77.22 | 77.51 | 79.48 | 76.54 | 1.87 |
| 1 | 7.0 | 36.07 | 36.55 | 40.31 | 35.62 | 39.52 | 39.84 | 43.07 | 38.93 | 1.09 |
| 2 | 7.0 | 37.42 | 37.66 | 41.36 | 36.79 | 55.94 | 56.19 | 58.33 | 55.60 | 1.51 |
| 4 | 7.0 | 38.51 | 38.95 | 42.55 | 37.98 | 86.62 | 87.08 | 87.50 | 86.20 | 2.27 |
| 8 | 7.0 | 42.82 | 43.00 | 47.11 | 42.55 | 122.05 | 122.29 | 122.70 | 121.59 | 2.86 |
| 16 | 7.0 | 67.74 | 67.92 | 69.05 | 65.69 | 149.92 | 150.16 | 151.03 | 149.49 | 2.28 |
| 1 | 16.7 | 39.28 | 39.78 | 43.34 | 38.35 | 66.73 | 67.16 | 69.80 | 66.01 | 1.72 |
| 2 | 16.7 | 43.05 | 43.42 | 47.18 | 42.43 | 120.04 | 121.12 | 123.32 | 118.14 | 2.78 |
| 4 | 16.7 | 52.18 | 52.49 | 56.11 | 51.63 | 176.09 | 176.51 | 178.70 | 174.60 | 3.38 |
| 8 | 16.7 | 78.55 | 78.79 | 81.66 | 78.04 | 216.19 | 216.68 | 217.63 | 214.48 | 2.75 |
| 16 | 16.7 | 125.57 | 125.92 | 128.78 | 124.33 | 264.11 | 264.49 | 266.14 | 262.80 | 2.11 |

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance: NVIDIA T4

| BS | Duration (s) | FP16 90% (ms) | FP16 95% (ms) | FP16 99% (ms) | FP16 Avg (ms) | FP32 90% (ms) | FP32 95% (ms) | FP32 99% (ms) | FP32 Avg (ms) | FP16/FP32 speedup (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.0 | 43.62 | 46.95 | 50.46 | 37.23 | 51.31 | 52.37 | 56.21 | 49.77 | 1.34 |
| 2 | 2.0 | 49.09 | 50.46 | 53.11 | 40.61 | 81.85 | 82.22 | 83.94 | 80.81 | 1.99 |
| 4 | 2.0 | 47.71 | 51.14 | 55.09 | 41.29 | 112.56 | 115.13 | 118.56 | 111.60 | 2.70 |
| 8 | 2.0 | 51.37 | 53.11 | 55.48 | 45.94 | 198.95 | 199.48 | 200.28 | 197.22 | 4.29 |
| 16 | 2.0 | 63.59 | 64.30 | 66.90 | 61.77 | 221.75 | 222.07 | 223.22 | 220.09 | 3.56 |
| 1 | 7.0 | 47.49 | 48.66 | 53.36 | 40.76 | 73.63 | 74.41 | 77.65 | 72.41 | 1.78 |
| 2 | 7.0 | 48.63 | 50.01 | 58.35 | 43.44 | 114.66 | 115.28 | 117.63 | 112.41 | 2.59 |
| 4 | 7.0 | 52.19 | 52.85 | 54.22 | 49.94 | 200.38 | 201.29 | 202.97 | 197.21 | 3.95 |
| 8 | 7.0 | 84.90 | 85.56 | 87.52 | 83.41 | 404.00 | 404.72 | 405.70 | 400.25 | 4.80 |
| 16 | 7.0 | 157.12 | 157.58 | 159.19 | 155.01 | 490.93 | 492.09 | 493.44 | 486.45 | 3.14 |
| 1 | 16.7 | 50.57 | 51.57 | 57.58 | 46.27 | 150.39 | 151.84 | 153.54 | 147.31 | 3.18 |
| 2 | 16.7 | 63.64 | 64.55 | 66.31 | 61.98 | 256.54 | 258.16 | 262.71 | 250.34 | 4.04 |
| 4 | 16.7 | 140.44 | 141.06 | 142.00 | 138.14 | 519.59 | 521.41 | 523.86 | 512.74 | 3.71 |
| 8 | 16.7 | 267.03 | 268.06 | 270.01 | 263.15 | 727.33 | 728.61 | 731.36 | 722.62 | 2.75 |
| 16 | 16.7 | 362.40 | 364.02 | 367.80 | 358.75 | 867.92 | 869.19 | 871.46 | 860.37 | 2.40 |

To achieve these same results, follow the Quick Start Guide outlined above.