University of Florida Health
GatorTron-OG

GatorTron-OG is a Megatron BERT model pre-trained on de-identified clinical notes from the University of Florida Health System.

Overview

GatorTron-OG is a 345M-parameter cased Megatron checkpoint pre-trained on a dataset consisting of:

  • 82B words of de-identified clinical notes from the University of Florida Health System,
  • 6.1B words from PubMed CC0,
  • 2.5B words from WikiText,
  • 0.5B words from MIMIC-III.
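
As a quick sanity check on the mix listed above, the following sketch totals the word counts and computes each source's share of the corpus. The dictionary keys and the helper name `corpus_shares` are illustrative choices, not part of the release; the numbers are copied from the list.

```python
# Word counts in billions, copied from the list above.
corpus_words_b = {
    "UF Health clinical notes": 82.0,
    "PubMed CC0": 6.1,
    "WikiText": 2.5,
    "MIMIC-III": 0.5,
}


def corpus_shares(counts):
    """Return each source's fraction of the total word count."""
    total = sum(counts.values())
    return {name: words / total for name, words in counts.items()}


total_b = sum(corpus_words_b.values())
shares = corpus_shares(corpus_words_b)

print(f"total: {total_b:.1f}B words")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

By this count the corpus holds about 91.1B words, of which roughly 90% are University of Florida Health clinical notes.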

The model is designed to provide improved language understanding for downstream clinical tasks. It was trained with, and is released with, a customized 50K-token clinical vocabulary built from the data distribution listed above.

The model is released alongside GatorTron-S, a similar 345M-parameter cased Megatron checkpoint that was instead pre-trained on 22B words of text generated by the University of Florida SynGatorTron 5B NLG model (a Megatron GPT-3 model), as well as the full Pile dataset [1]. SynGatorTron was prompted with text sampled from MIMIC-III to produce synthetic, de-identified discharge summaries.

More Details

Please download the most recent version to ensure compatibility with the latest NeMo release. The following files are provided for each release:

  • MegatronBERT.pt: pre-trained Megatron model weights,
  • config.json: the config file used to initialize model network architecture in NeMo,
  • vocab.txt: vocabulary file used to train the checkpoint,
  • hparam.yaml: model configuration used to convert the Megatron checkpoints to NeMo format,
  • MegatronBERT.nemo: pre-trained NeMo checkpoint.
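
After downloading a release, it can help to confirm that all five files arrived and to peek at the vocabulary. The sketch below uses only the Python standard library. The function names are hypothetical, and `load_vocab` assumes vocab.txt holds one token per line, as is conventional for BERT-style vocabularies; the model card itself does not specify the file layout.

```python
from pathlib import Path

# File names as listed for each release.
RELEASE_FILES = (
    "MegatronBERT.pt",
    "config.json",
    "vocab.txt",
    "hparam.yaml",
    "MegatronBERT.nemo",
)


def missing_release_files(release_dir):
    """Return the listed release files that are absent from release_dir."""
    root = Path(release_dir)
    return [name for name in RELEASE_FILES if not (root / name).is_file()]


def load_vocab(vocab_path):
    """Map token -> id, ASSUMING one token per line (BERT-style layout)."""
    with open(vocab_path, encoding="utf-8") as handle:
        return {line.rstrip("\n"): idx for idx, line in enumerate(handle)}
```

For example, `missing_release_files("gatortron_og")` (where `gatortron_og` is a hypothetical download directory) returns an empty list when the release is complete, and `len(load_vocab("gatortron_og/vocab.txt"))` should then be close to the 50K vocabulary size mentioned in the overview.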

De-Identification

De-identification was performed on all University of Florida Health clinical notes using the DeepDeID (LSTM-CRFs) tool. The tool treats de-identification as a named entity recognition task over defined classes of PHI and then substitutes each detected entity with a dummy replacement (e.g., [**NAME**]). Details of the method are available in [2].
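
The replacement step can be illustrated with a minimal sketch. This is not the DeepDeID implementation: the function name `replace_phi` is hypothetical, the entity spans are written by hand where a real pipeline would take them from the NER model's predictions, and the example note is invented.

```python
def replace_phi(text, spans):
    """Replace each (start, end, label) span with a [**LABEL**] placeholder.

    Spans are assumed to be non-overlapping character offsets, in the form
    a NER model might emit them.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])   # text before the entity
        pieces.append(f"[**{label}**]")     # dummy replacement
        cursor = end
    pieces.append(text[cursor:])            # text after the last entity
    return "".join(pieces)


# Invented example note; the spans stand in for NER output.
note = "John Smith was seen on 01/02/2020."
spans = [(0, 10, "NAME"), (23, 33, "DATE")]
print(replace_phi(note, spans))
# prints: [**NAME**] was seen on [**DATE**].
```

Working from sorted character offsets keeps the replacement independent of how the entities were found, so the same step applies whatever model supplies the spans.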

License and User Data

All necessary legal and ethical approvals were obtained from the University of Florida Health IRB. The University of Florida has granted NVIDIA and its affiliates permission to distribute and license GatorTron under the terms of the included EULA for further use in third-party products and services.

The NVIDIA Clara End User License Agreement is included with this model. By pulling and using the model, you accept the terms and conditions of this license.

References

[1] Gao, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv [cs.CL] (2020)

[2] Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19, 232 (2019)

[3] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022)

Publisher: University of Florida Health
Latest Version: 1
Updated: April 4, 2023 UTC
Compressed Size: 5.42 GB
