Linux / amd64
This resource contains Jupyter notebooks that cover best practices for profiling and optimizing transformer-based NLP models such as BERT (Bidirectional Encoder Representations from Transformers). The notebooks cover both training and inference workflows.
The training demos show pre-training and fine-tuning on TensorFlow, quickly improving performance with automatic mixed precision (AMP) and the TensorFlow XLA compiler, and effectively profiling models with DLProf.
The inference notebook explains how to import a trained model and how to build and run optimized inference for batch and streaming use cases.
docker pull nvcr.io/nvidia/bert_workshop:20.03
docker run --gpus all --rm -it \
-p 8888:8888 \
-p 6006:6006 \
nvcr.io/nvidia/bert_workshop:20.03
It has one Python notebook that describes the workflow for developing a BERT training/inference/deployment pipeline with some of the tools that NVIDIA provides.
The Training folder contains three notebooks covering data preparation, training, and profiling/optimization.
Data Preparation: The data preparation notebook starts by downloading a pre-trained model. For this we use the Google pre-trained models trained on the Wikipedia and BookCorpus datasets. The notebook then gives some details about NGC, followed by the fine-tuning section. We use one of the fine-tuned models to demonstrate the performance that can be achieved from a model that is both fully trained and optimized for inference/deployment. We use the NGC CLI to download the corresponding model; in this example, a BERT-Large model with a sequence length of 384 and FP16 precision. These values can be changed in the existing cell to download and try out other models. The next thing we need is a dataset: we download the SQuAD v2 dataset using the bertPrep.py script. At the end of this notebook there is a link to the next notebook (the training notebook).
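As a rough sketch, the two download steps above correspond to notebook cells like the following; the exact model name, version, and script path are assumptions, not necessarily the values used in the notebook:
# Illustrative cells (model name/version and paths are assumptions)
!ngc registry model download-version "nvidia/bert_tf_v2_large_fp16_384:1" --dest /workspace/models
!python3 /workspace/bert/data/bertPrep.py --action download --dataset squad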
Training notebook: In this notebook we first check what hardware is available using the nvidia-smi command. We then make a few changes to the run_squad.sh script to make sure the paths to all of our models and data point to the correct locations; a cell in the notebook updates the model and data paths. The next step is to set the model parameters and train the model using bash. Having made the appropriate changes to the script, we run a single epoch of fine-tuning with the accompanying NGC script using the following command:
!bash ../scripts/run_squad.sh 10 5e-6 ${PRECISION} true 1 ${SEQ_LENGTH} 128 ${BERT_MODEL} 2.0 ${PRETRAINED_BERT_DIR}/bert_model.ckpt 0.1
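The ${...} variables above are defined in an earlier notebook cell. A minimal sketch, assuming they are set as Python variables that IPython substitutes into the shell command (the values are illustrative and should match the model downloaded earlier):
# Illustrative parameter cell (values are assumptions)
BERT_MODEL = "large"                                  # model size expected by run_squad.sh
PRECISION = "fp16"                                    # enables mixed-precision fine-tuning
SEQ_LENGTH = "384"                                    # sequence length of the downloaded checkpoint
PRETRAINED_BERT_DIR = "/workspace/models/bert_large"  # illustrative path to the NGC checkpoint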
The NGC BERT model scripts package everything needed to kick off this process into a single, easy-to-call line. You can see these commands and their parameters in the next cells. For example, this command runs pre-training:
#!bash scripts/run_pretraining_lamb.sh 16 2 2 7.5e-4 5e-4 fp16 true 8 2000 200 7820 100 512 2048 large
And this example runs fine-tuning:
#!bash scripts/run_squad.sh 10 5e-6 fp16 true 1 384 128 large 2.0 results/models/model.ckpt-8144 0.1 &
When training is finished, you can check the results; they are saved in a folder whose name ends with gbs10_XXXXXXXXXXXX. The next step is to explore the predictions. We run evaluation on the trained model to produce predictions for the dataset, and then use a handy helper function to print each question and its answer in a table. The cells for generating and displaying the predictions are in the Explore the prediction section.
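The display helper is supplied by the notebook; as a purely hypothetical sketch of what such a helper does (the function name and file paths are assumptions), it pairs the predictions JSON written by the evaluation run with the SQuAD dev questions:
# Hypothetical sketch of a prediction-display helper (not the notebook's actual code)
import json

def show_predictions(squad_dev_json, predictions_json, limit=5):
    """Print the first few questions with the answers predicted by the fine-tuned model."""
    with open(squad_dev_json) as f:
        data = json.load(f)["data"]
    with open(predictions_json) as f:
        preds = json.load(f)  # maps SQuAD question id -> predicted answer string
    shown = 0
    for article in data:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if shown >= limit:
                    return
                answer = preds.get(qa["id"], "")
                print("Q:", qa["question"])
                print("A:", answer or "(no answer predicted)", "\n")
                shown += 1

show_predictions("/workspace/data/squad/v2.0/dev-v2.0.json", "results/predictions.json")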
Profiling:
In the profiling notebook we explore ways that training can be accelerated while maintaining accuracy. One of the easiest ways to benefit from the Tensor Cores present in the Volta and Turing architectures is to use automatic mixed precision (AMP) during training. Mixed precision is the combined use of different numerical precisions in a computational method. Although the performance improvement from AMP is heavily dependent on the model architecture, some models train up to 3x faster simply by adding AMP to the training process. The deep learning profiling section walks through an example of using DLProf with our BERT example; you can follow the cells in that section to profile the model.
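As a minimal sketch of what enabling AMP looks like in the TensorFlow 1.x environment this container ships (the optimizer and learning rate are illustrative, not the notebook's exact code):
# Illustrative AMP enablement for TF 1.x (not the notebook's exact code)
import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=5e-6)
# The graph rewrite casts eligible ops to FP16 so they run on Tensor Cores and
# adds automatic loss scaling to preserve accuracy.
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
The NGC training scripts enable the same optimization through their precision argument (the fp16 value passed to run_squad.sh above). DLProf is typically used by prefixing the training command with dlprof (for example, dlprof python run_squad.py ...); the profiling notebook's cells show the invocation used here and how to inspect the resulting report.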
Once we have a trained model, the next logical step is to prepare that model for inference. There are many different names for this process: some say inference, some say scoring, some say prediction. We discuss NVIDIA's TensorRT, which optimizes (using many different techniques) many types of models, including, in this case, the BERT model for NLP.
After our model has been trained, there are a few ways we can think about using it for "inference". The first is simple TensorFlow inference (assuming you are using TensorFlow). This is likely the most straightforward approach, since you simply load the model and run something like model.predict(). While this may be GPU accelerated in the same way training is, it may not be the most optimized version of the model for inference on specific GPU platforms. (In this example we trained the model for only one epoch, so it is neither well trained nor optimized.) For that reason we use NVIDIA's TensorRT SDK, which creates an optimized version of your model for inference.
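As a rough sketch of the plain-TensorFlow path, assuming the fine-tuned checkpoint has been exported as a TF 1.x SavedModel (the export path and tensor names are assumptions; the checkpoint trained here is not a Keras model, so the loading code differs from a literal model.predict() call):
# Illustrative TF 1.x inference without TensorRT (paths and signature details are assumptions)
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING],
                               "/workspace/models/bert_squad_savedmodel")
    # Feed tokenized input_ids / input_mask / segment_ids into the loaded graph and
    # fetch the start/end logits; the exact tensor names depend on how the model was exported.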
Whether the model will be used for batch processing of requests in large chunks or for streaming so that a user can interact with it as quickly as possible, TensorRT assists with creating the most efficient model to run on the GPU. To experiment with models that have already been optimized with TensorRT, without going through the process of creating our own, we can grab one from the set of NGC pre-trained models built specifically for TensorRT.
To get started, we download a fine-tuned model for the question/answer task that we will use for inference. We grab this model from NVIDIA GPU Cloud (NGC); using one of the fine-tuned models demonstrates the performance that can be achieved from a model that is both fully trained and optimized for inference/deployment. We use the NGC CLI (which has been pre-installed in this container) to download the corresponding model; in this case, a BERT-Large model with a sequence length of 384 and FP16 precision. After downloading the model, we build the TensorRT engine, then provide a passage and a question (since the model has been fine-tuned for the Q&A task) for the BERT model to infer on, and finally test the model using the provided data. We have now gone through the process of taking a fine-tuned model from NGC, creating an optimized TensorRT model, and running inference with it.
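At a high level, running the optimized model looks like the following sketch (the engine filename and binding details are assumptions; the notebook's helper code handles tokenization and buffer management):
# Illustrative TensorRT engine loading (filename and details are assumptions)
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built from the fine-tuned BERT model.
with open("/workspace/models/bert_large_384_fp16.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
# Inference then consists of copying the tokenized passage/question
# (input_ids, segment_ids, input_mask) into device buffers, calling
# context.execute_v2(bindings), and copying the start/end logits back to the
# host to extract the answer span.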
This container image is impacted by the following CVEs contained in upstream packages used by the image (TensorFlow tf1):
Data Leak in TensorFlow - CRITICAL vulnerability found in non-os package type (python)
Integer Truncation in Shard API Usage - CRITICAL vulnerability found in non-os package type (python)
Denial of Service in TensorFlow - CRITICAL vulnerability found in non-os package type (python)
Heap Buffer Overflow in TensorFlow - HIGH vulnerability found in non-os package type (python)
NLTK - HIGH vulnerability found in non-os package type (python)
Data Corruption in tensorflow-lite - HIGH vulnerability found in non-os package type (python)
Segfault and data corruption in tensorflow-lite - HIGH vulnerability found in non-os package type (python)
Denial of Service in TensorFlow - HIGH vulnerability found in non-os package type (python)
Fixes to these vulnerabilities are currently being worked on and will be patched in v20.10.