
GatorTron-OG

Description
GatorTron-OG is a Megatron BERT model pre-trained on de-identified clinical notes from the University of Florida Health System.
Publisher
University of Florida Health
Latest Version
1
Modified
April 4, 2023
Size
5.42 GB

Overview

GatorTron-OG is a 345M-parameter cased Megatron checkpoint pre-trained on a dataset consisting of:

  • 82B words of de-identified clinical notes from the University of Florida Health System,
  • 6.1B words from PubMed CC0,
  • 2.5B words from WikiText,
  • 0.5B words from MIMIC-III.

The model is designed to provide improved language understanding for downstream clinical tasks. It was trained with a 50K-token customized clinical vocabulary, built from the data distribution listed above, which is included in the release.
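
The released vocabulary can be loaded into a standard BERT WordPiece tokenizer. The sketch below is illustrative only: it assumes the vocab.txt file follows the usual WordPiece format and that the Hugging Face transformers package is available; the local file path and example sentence are placeholders.

    # Illustrative only: load the released clinical vocabulary into a standard
    # BERT WordPiece tokenizer (assumes Hugging Face `transformers` is installed;
    # the path and example sentence are placeholders).
    from transformers import BertTokenizer

    tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)  # cased model
    print(len(tokenizer))  # expected to be on the order of 50K entries

    # Clinical text is segmented into WordPiece tokens from the custom vocabulary.
    print(tokenizer.tokenize("Patient denies chest pain; metoprolol 25 mg PO BID."))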

The model is released alongside GatorTron-S, a similar 345M-parameter cased Megatron checkpoint pre-trained instead on 22B words generated by the University of Florida SynGatorTron 5B NLG model (a Megatron GPT-3 model) as well as on the full Pile dataset [1]; SynGatorTron was prompted with text sampled from MIMIC-III to produce the synthetic, de-identified discharge summaries.

More Details

Please be sure to download the most recent version in order to ensure compatibility with the latest NeMo release. The following files are provided for each release (a minimal loading sketch follows the list):

  • MegatronBERT.pt: pre-trained Megatron model weights,
  • config.json: the config file used to initialize model network architecture in NeMo,
  • vocab.txt: vocabulary file used to train the checkpoint,
  • hparam.yaml: model configuration used to convert the Megatron checkpoints to NeMo format,
  • MegatronBERT.nemo: pre-trained NeMo checkpoint.
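
The .nemo checkpoint can be restored through NeMo's standard restore_from API. The following is a minimal sketch assuming a NeMo 1.x environment with the NLP collection installed; the exact model class and import path may vary between NeMo releases, and the file name mirrors the one listed above.

    # Minimal sketch: restore the released NeMo checkpoint (NeMo 1.x assumed;
    # model class and import path may differ between NeMo releases).
    from nemo.collections.nlp.models.language_modeling.megatron_bert_model import MegatronBertModel

    # Inspect the stored configuration without instantiating the full model.
    cfg = MegatronBertModel.restore_from(restore_path="MegatronBERT.nemo", return_config=True)
    print(cfg)

    # Full restore; Megatron-based models typically also expect a GPU and a
    # PyTorch Lightning trainer passed via the `trainer` argument.
    # model = MegatronBertModel.restore_from(restore_path="MegatronBERT.nemo")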

De-Identification

De-identification was performed on all University of Florida Health clinical notes using the DeepDeID (LSTM-CRFs) tool: a named entity recognition pass over defined classes of PHI, followed by replacement of the detected spans with dummy placeholders (e.g., [**NAME**]). Details of the method are available in [2].
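
For illustration only, the sketch below shows what the replacement step looks like once PHI spans have been detected. The note, spans, and labels here are hypothetical; in practice the spans come from the LSTM-CRF NER model described in [2].

    # Hypothetical illustration of the replacement step: character-offset PHI
    # spans produced by an NER model are swapped for dummy placeholders.
    def replace_phi(text, spans):
        """spans: non-overlapping (start, end, label) character offsets."""
        out, cursor = [], 0
        for start, end, label in sorted(spans):
            out.append(text[cursor:start])
            out.append(f"[**{label}**]")
            cursor = end
        out.append(text[cursor:])
        return "".join(out)

    note = "John Smith was admitted on 01/02/2015 to Shands Hospital."
    spans = [(0, 10, "NAME"), (27, 37, "DATE"), (41, 56, "HOSPITAL")]
    print(replace_phi(note, spans))
    # -> "[**NAME**] was admitted on [**DATE**] to [**HOSPITAL**]."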

License and User Data

All necessary legal and ethical approvals were obtained from the University of Florida Health IRB. The University of Florida has granted NVIDIA and its affiliates permission to distribute and license GatorTron under the terms of the included EULA for further use in third-party products and services.

The NVIDIA Clara End User License Agreement is included with this model. By pulling and using the model, you accept the terms and conditions of this license.

References

[1] Gao, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv [cs.CL] (2020)

[2] Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19, 232 (2019)

[3] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022)