This repository includes a Jupyter notebook for retraining a DeepVariant model using Parabricks. The notebook first generates a baseline VCF using the out-of-the-box DeepVariant model, then retrains the model on custom data and re-evaluates its performance. It is intended as a reference guide, and you are encouraged to try it on your own data.
The zip file provided by this resource has the following structure:
.
├── Retraining_DeepVariant.ipynb
└── scripts
    ├── download_data.sh
    └── shuffle_tfrecords_lowmem.py
Retraining_DeepVariant.ipynb is the notebook where the code will be run.
download_data.sh downloads the dataset needed to run the notebook. The dataset should be downloaded to <path_to_notebook>/data.
shuffle_tfrecords_lowmem.py is an accessory script that the notebook calls to shuffle the dataset.
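Before opening the notebook, it can help to confirm that the files described above are in place. The following is a minimal sketch, not part of the resource itself: the function name check_layout is hypothetical, and it assumes the zip was extracted into <path_to_notebook> with the dataset already downloaded to the data subdirectory.

```python
from pathlib import Path

# File paths taken from the directory tree above, relative to the
# directory that contains the notebook.
EXPECTED_FILES = [
    "Retraining_DeepVariant.ipynb",
    "scripts/download_data.sh",
    "scripts/shuffle_tfrecords_lowmem.py",
]


def check_layout(root):
    """Return a list of problems found under root (empty list means OK).

    Hypothetical helper for illustration; it only checks that the files
    exist and that the data directory is present and non-empty.
    """
    root = Path(root)
    problems = [
        f"missing file: {rel}"
        for rel in EXPECTED_FILES
        if not (root / rel).is_file()
    ]
    data_dir = root / "data"
    if not data_dir.is_dir():
        problems.append("missing directory: data")
    elif not any(data_dir.iterdir()):
        problems.append("data directory is empty")
    return problems


if __name__ == "__main__":
    # Run from the notebook directory; prints nothing if all is in place.
    for problem in check_layout("."):
        print(problem)
```

An empty result only means the expected paths exist; it does not verify the contents of the downloaded dataset.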
This notebook will run on V100, T4, or A100 GPUs. Accuracy can be improved by increasing the number of training steps. By default the number of steps is set fairly low, at 5,000, so that the code runs quickly; for full results, it should be set to 50,000.