GatorTron-OG is a 345M-parameter cased Megatron checkpoint pre-trained on a dataset consisting of:
The model is designed to provide improved language understanding for downstream clinical tasks. It was trained with, and is released alongside, a customized 50K-token clinical vocabulary built from the data distribution listed above.
The model is released alongside GatorTron-S, a similar 345M-parameter cased Megatron checkpoint that was instead pre-trained on 22B words of text generated by the University of Florida's SynGatorTron 5B NLG model (a Megatron GPT-3 model trained on the data above as well as the full Pile dataset [1]), which was prompted with text sampled from MIMIC-III to produce synthetic, de-identified discharge summaries.
Please be sure to download the most recent version to ensure compatibility with the latest NeMo release. The following files are provided for each release:
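As a rough sketch of downstream usage, the snippet below shows one way a NeMo 1.x token-classification (clinical NER) model could be fine-tuned on top of the released Megatron checkpoint and clinical vocabulary. The file names (`gatortron_og.pt`, `gatortron_og_vocab.txt`) and paths are illustrative placeholders, not the exact file names shipped with the release; the config keys follow NeMo's token classification example config and should be checked against the NeMo version you install.

```python
# Hedged sketch (NeMo 1.x-style API): fine-tune a clinical NER head on top of the
# GatorTron-OG Megatron encoder. File names below are placeholders -- substitute the
# actual checkpoint, vocabulary, and config files from the downloaded release.
from omegaconf import OmegaConf
import pytorch_lightning as pl
from nemo.collections.nlp.models import TokenClassificationModel

# NeMo example config for token classification (assumed to be available locally).
cfg = OmegaConf.load("token_classification_config.yaml")

# Point the encoder at the released Megatron checkpoint and 50K clinical vocabulary.
cfg.model.language_model.pretrained_model_name = "megatron-bert-345m-cased"
cfg.model.language_model.lm_checkpoint = "/path/to/gatortron_og.pt"        # placeholder
cfg.model.tokenizer.vocab_file = "/path/to/gatortron_og_vocab.txt"         # placeholder

# Labeled downstream clinical NER data in NeMo's expected format.
cfg.model.dataset.data_dir = "/path/to/ner_data"

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=3)
model = TokenClassificationModel(cfg.model, trainer=trainer)
trainer.fit(model)
```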
De-identification was performed on all University of Florida Health clinical notes using the DeepDeID (LSTM-CRFs) tool, which runs named entity recognition over defined PHI classes and then substitutes dummy replacements (e.g., [**NAME**]). Details of the method are available in [2].
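To make the replacement step concrete, here is a minimal, hypothetical sketch (not the DeepDeID implementation) of how PHI spans predicted by an NER model could be swapped for dummy placeholders of the form [**CLASS**]:

```python
# Illustrative only: replace predicted PHI spans with [**CLASS**] placeholders.
from typing import List, Tuple

def replace_phi(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """spans: (start, end, phi_class) character offsets, assumed non-overlapping."""
    out, cursor = [], 0
    for start, end, phi_class in sorted(spans):
        out.append(text[cursor:start])       # keep the non-PHI text as-is
        out.append(f"[**{phi_class}**]")     # substitute the dummy placeholder
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

note = "Mrs. Jane Doe was seen on 03/14/2012 at UF Health."
phi = [(5, 13, "NAME"), (26, 36, "DATE"), (40, 49, "HOSPITAL")]
print(replace_phi(note, phi))
# Mrs. [**NAME**] was seen on [**DATE**] at [**HOSPITAL**].
```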
All necessary legal and ethical approvals were obtained from the University of Florida Health IRB. The University of Florida has granted NVIDIA and its affiliates permission to distribute and license GatorTron under the terms of the included EULA for further use in third-party products and services.
The NVIDIA Clara End User License Agreement is included with this model. By pulling and using the model, you accept the terms and conditions of this license.
[1] Gao, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv [cs.CL] (2020)
[2] Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19, 232 (2019)
[3] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022)