NGC Catalog

Description

GatorTron-S is a Megatron BERT model pre-trained on synthetic clinical discharge summaries generated by SynGatorTron 5B NLG, a Megatron GPT-3 model trained on de-identified clinical free text from the University of Florida Health System.

Publisher

University of Florida Health

Latest Version

Modified

April 4, 2023

Size

5.42 GB

Overview

GatorTron-S is a 345m-parameter cased Megatron checkpoint pre-trained on a dataset consisting of:

22B words from the University of Florida SynGatorTron 5B NLG model (a Megatron GPT-3 model) prompted to produce synthetic, de-identified discharge summaries using text sampled from MIMIC-III,
6.1B words from PubMed CC0,
2.5B words from WikiText,
0.5B words from MIMIC-III itself.
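
As a quick check of the mix listed above, the four sources add up to about 31.1B words, of which roughly 71% is synthetic text. The sketch below only redoes that arithmetic from the figures on this page; the variable and function names are illustrative and not part of any release.

```python
# Word counts in billions, copied from the list above.
corpus_words_b = {
    "SynGatorTron synthetic discharge summaries": 22.0,
    "PubMed CC0": 6.1,
    "WikiText": 2.5,
    "MIMIC-III": 0.5,
}


def corpus_shares(words_b):
    """Return (total, {source: fraction of total}) for a word-count mapping."""
    total = sum(words_b.values())
    shares = {name: count / total for name, count in words_b.items()}
    return total, shares


total_b, shares = corpus_shares(corpus_words_b)
print(f"total: {total_b:.1f}B words")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```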

The model is designed to provide improved language understanding for downstream clinical tasks. It was trained with, and is released with, a customized 50K-token clinical vocabulary built from the data distribution listed above.

The model is released alongside GatorTron-OG, a similar 345m-parameter cased Megatron checkpoint that was instead pre-trained on a large selection of de-identified, real-world clinical notes from the University of Florida Health System.

More Details

SynGatorTron 5B NLG was used to generate the primary clinical dataset used to train GatorTron-S. It is a 5B-parameter Megatron GPT-3 model which was pre-trained on 82B words of de-identified clinical notes from the University of Florida Health System together with the full Pile dataset [1].

The model was prompted to produce 22B words of synthetic discharge summaries. Each prompt was built by sampling a section header and 10-15 words from the MIMIC-III corpus, for an average total prompt length of 20 words. Examples of prompts include:

Current plan: Home Home, when treatment completed, likely without services. Case Management will
DISCHARGE DIAGNOSES: 1. Prematurity. 2. Chronic lung disease. 3. Patent ductus
DISCHARGE MEDICATIONS: 1. Aspirin 325 mg po q day. 2. Plavix 75 mg
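
The exact sampling procedure is not documented here, so the following is only a minimal sketch of how a prompt of this shape could be assembled: a section header followed by the first 10-15 words of the section body. The function name `build_prompt`, the fixed seed, and the filler text after the quoted example are assumptions made for illustration.

```python
import random


def build_prompt(header, body, rng, min_words=10, max_words=15):
    """Join a section header with the first 10-15 words of its body text."""
    n_words = rng.randint(min_words, max_words)  # inclusive on both ends
    words = body.split()[:n_words]
    return f"{header} {' '.join(words)}"


rng = random.Random(0)  # fixed seed so the sketch is repeatable
header = "DISCHARGE MEDICATIONS:"
# Starts with the example prompt quoted above; the rest is invented filler.
body = (
    "1. Aspirin 325 mg po q day. 2. Plavix 75 mg "
    "filler filler filler filler filler filler filler filler"
)
prompt = build_prompt(header, body, rng)
print(prompt)
```

Truncating at a fixed word count is what leaves the prompts open-ended (for example, ending at "Plavix 75 mg"), so the generator has to continue the sentence rather than start a new one.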

A systematic investigation of hyperparameter configurations led to running inference at a temperature of 1.2 and a Top-P of 0.9. Additional technical detail will be made available in a forthcoming publication, and the documentation here will be updated.
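
To make the two inference settings concrete, here is a generic sketch of temperature scaling followed by Top-P (nucleus) filtering over a toy set of scores. It is not the SynGatorTron inference code; the function name `sample_top_p` and the four made-up logits are assumptions for illustration only.

```python
import math
import random


def sample_top_p(logits, temperature=1.2, top_p=0.9, rng=random):
    """Sample one token index with temperature scaling and nucleus filtering."""
    # Temperature scaling: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
draws = [sample_top_p(toy_logits, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With these toy scores the two most likely tokens already cover more than 90% of the probability mass, so the two least likely tokens are never sampled.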

Please be sure to download the most recent version to ensure compatibility with the latest NeMo release. The following files are provided for each release:

MegatronBERT.pt: pre-trained Megatron model weights,
config.json: the config file used to initialize model network architecture in NeMo,
vocab.txt: vocabulary file used to train the checkpoint,
hparam.yaml: model configuration used to convert the Megatron checkpoints to NeMo format,
MegatronBERT.nemo: pre-trained NeMo checkpoint.
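
After downloading, a short standard-library check like the one below can confirm that a release directory contains the files listed above. This is a sketch only: it does not use NeMo, the helper names are hypothetical, and the assumption that vocab.txt holds one token per line is not confirmed by this page. The demo runs against a temporary directory with a three-line toy vocabulary rather than a real release.

```python
import tempfile
from pathlib import Path

# File names as listed above.
EXPECTED_FILES = [
    "MegatronBERT.pt",
    "config.json",
    "vocab.txt",
    "hparam.yaml",
    "MegatronBERT.nemo",
]


def missing_release_files(release_dir):
    """Return the expected file names that are not present in release_dir."""
    root = Path(release_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def count_vocab_entries(vocab_path):
    """Count non-empty lines, assuming one token per line (an assumption)."""
    with open(vocab_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Demo on a throwaway directory that contains only a toy vocab.txt.
with tempfile.TemporaryDirectory() as tmp:
    toy_vocab = Path(tmp) / "vocab.txt"
    toy_vocab.write_text("[PAD]\n[UNK]\naspirin\n", encoding="utf-8")
    missing = missing_release_files(tmp)
    vocab_size = count_vocab_entries(toy_vocab)

print("missing:", missing)
print("vocab entries:", vocab_size)
```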

De-Identification

De-identification was performed on all University of Florida Health clinical notes using the DeepDeID tool, which treats the problem as a named entity recognition task over defined classes of PHI and then substitutes dummy replacements (e.g., [**NAME**]). Details of the method are available in [2].
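
The replacement step can be pictured with the sketch below, which swaps already-detected character spans for bracketed dummy tags. This is not the DeepDeID implementation or its API; the function names, the hand-made spans standing in for NER output, and the example sentence (which contains no real patient data) are all assumptions for illustration.

```python
def replace_phi(text, spans):
    """Replace each (start, end, label) character span with a [**LABEL**] dummy."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[**{label}**]")    # dummy replacement for the PHI
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last span
    return "".join(pieces)


def span_of(text, phrase, label):
    """Stand-in for an NER prediction: locate a phrase and label it."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


note = "Patient John Smith was seen on 01/02/2020."  # invented example text
spans = [span_of(note, "John Smith", "NAME"), span_of(note, "01/02/2020", "DATE")]
deidentified = replace_phi(note, spans)
print(deidentified)
```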

License and User Data

All necessary legal and ethical approvals were obtained from the University of Florida Health IRB. The University of Florida has granted NVIDIA and its affiliates permission to distribute and license GatorTron under the terms of the included EULA for further use in third-party products and services.

The NVIDIA Clara End User License Agreement is included with this model. By pulling and using the model, you accept the terms and conditions of this license.

References

[1] Gao, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv [cs.CL] (2020).

[2] Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19, 232 (2019).

[3] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022).