GatorTron-S is a 345M-parameter cased Megatron checkpoint pre-trained on a dataset consisting primarily of synthetic clinical text generated by SynGatorTron 5B NLG, described below.
The model is designed to provide improved language understanding for downstream clinical tasks. It was trained, and is released, with a customized 50K-token clinical vocabulary learned from the data distribution described above.
The model is released alongside GatorTron-OG, a similar 345M-parameter cased Megatron checkpoint, but pre-trained on a large collection of de-identified, real-world clinical notes from the University of Florida Health System.
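For orientation, here is a minimal sketch of using such a checkpoint for feature extraction, assuming the Megatron weights have been converted to Hugging Face format (for example with the transformers Megatron-BERT conversion script); the local path below is hypothetical and is not part of this release.

```python
# Minimal sketch: embedding a clinical sentence with GatorTron-S.
# Assumes the Megatron checkpoint has already been converted to
# Hugging Face format; the local path is hypothetical.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "./gatortron-s"  # hypothetical path to a converted checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)  # 50K clinical vocabulary
model = AutoModel.from_pretrained(MODEL_DIR)
model.eval()

note = "Patient admitted with shortness of breath and bilateral leg edema."
inputs = tokenizer(note, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings for downstream clinical NLP tasks
# (e.g., concept extraction, relation extraction, NLI).
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```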
SynGatorTron 5B NLG was used to generate the primary clinical dataset used to train GatorTron-S. It is a 5B-parameter Megatron GPT-3 model which was pre-trained on 82B words of de-identified clinical notes from the University of Florida Health System together with the full Pile dataset [1].
The model was prompted to produce 22B words of synthetic discharge summaries by sampling section headers and 10-15 words from the MIMIC-III corpus, with an average total prompt length of 20 words. An illustrative prompt format is sketched below.
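The following sketch illustrates this prompt-construction scheme; the example record is invented (MIMIC-III access requires a PhysioNet data use agreement), and the helper function is hypothetical.

```python
# Illustrative sketch of the prompt-construction scheme described above:
# a sampled section header plus the first 10-15 words of a note section.
import random

def build_prompt(section_header: str, section_text: str) -> str:
    """Combine a section header with a 10-15 word snippet of its text."""
    n_words = random.randint(10, 15)
    snippet = " ".join(section_text.split()[:n_words])
    return f"{section_header}: {snippet}"

# Hypothetical example record; real inputs would be sampled from MIMIC-III.
example = ("HISTORY OF PRESENT ILLNESS",
           "The patient is a 67 year old male with a history of congestive "
           "heart failure who presents with worsening dyspnea on exertion.")
print(build_prompt(*example))
```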
A systematic investigation of hyperparameter configurations culminated in the choice of inference at a temperature of 1.2 and a top-p of 0.9. Additional technical detail will be made available in a forthcoming publication, and the documentation here will be updated accordingly.
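As a rough illustration of those sampling settings, the snippet below uses the Hugging Face `generate` API with temperature 1.2 and top-p 0.9. The `gpt2` stand-in model is an assumption used only so the example runs end to end; SynGatorTron itself ships as a NeMo Megatron checkpoint.

```python
# Sketch of the reported sampling configuration (temperature 1.2, top-p 0.9).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model only
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "DISCHARGE DIAGNOSIS: The patient was admitted with"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=True,      # enable stochastic decoding
    temperature=1.2,     # flatten the token distribution
    top_p=0.9,           # nucleus sampling: keep top 90% probability mass
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```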
Please be sure to download the most recent version in order to ensure compatibility with the latest NeMo release. The following files are provided for each release:
De-identification was performed on all University of Florida Health clinical notes using the DeepDeID tool, which applies named entity recognition over defined classes containing PHI followed by dummy replacements (e.g., [**NAME**]). Details of the method are available in [2].
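A minimal sketch of the dummy-replacement step is shown below; the span format and helper function are hypothetical, and the actual DeepDeID pipeline is described in [2].

```python
# Illustrative sketch of the dummy-replacement step: PHI spans found by
# a NER model are overwritten with class tags such as [**NAME**].
def replace_phi(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace (start, end, label) PHI spans with [**LABEL**] placeholders."""
    # Apply replacements right-to-left so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[**{label}**]" + text[end:]
    return text

note = "John Smith was seen on 03/14/2009 at Shands Hospital."
spans = [(0, 10, "NAME"), (23, 33, "DATE"), (37, 52, "HOSPITAL")]
print(replace_phi(note, spans))
# -> [**NAME**] was seen on [**DATE**] at [**HOSPITAL**].
```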
All necessary legal and ethical approvals were obtained from the University of Florida Health IRB. The University of Florida has granted NVIDIA and its affiliates permission to distribute and license GatorTron under the terms of the included EULA for further use in third-party products and services.
The NVIDIA Clara End User License Agreement is included with this model. By pulling and using the model, you accept the terms and conditions of this license.
[1] Gao, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv [cs.CL] (2020).
[2] Yang, X. et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med. Inform. Decis. Mak. 19, 232 (2019).
[3] Yang, X. et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv [cs.CL] (2022).