bge-large-ehr-finetune

Description: This model is BAAI/bge-large-en-v1.5 fine-tuned on ~2,500 synthetic EHR document and question pairs.
Publisher: NVIDIA
Latest Version: 1.0
Modified: November 5, 2024
Size: 1.12 GB

Model Overview

Description:

FlagEmbedding maps text to a low-dimensional dense vector that can be used for tasks such as retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

This model has been fine-tuned on domain-specific data and is intended for demonstration purposes only, not for production use.
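
A minimal retrieval sketch, assuming the fine-tuned checkpoint has been downloaded to a local directory (the path below is hypothetical) and following the standard FlagEmbedding recipe for bge v1.5 models; the EHR snippets are invented examples.

    # Minimal retrieval sketch: embed toy EHR passages and a question,
    # then rank passages by cosine similarity. The checkpoint path is a
    # hypothetical local download of this model; BAAI/bge-large-en-v1.5
    # works the same way.
    from FlagEmbedding import FlagModel

    model = FlagModel(
        "./bge-large-ehr-finetune",  # hypothetical local path
        query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
        use_fp16=True,
    )

    passages = [
        "Medication list: lisinopril 10 mg oral tablet, once daily, for hypertension.",
        "Immunization record: influenza vaccine administered 2023-10-02.",
    ]
    question = "What medication was prescribed for high blood pressure?"

    p_emb = model.encode(passages)            # shape (2, 1024)
    q_emb = model.encode_queries([question])  # shape (1, 1024)

    # Embeddings are L2-normalized, so the inner product is cosine similarity.
    scores = q_emb @ p_emb.T
    print(scores)  # the first passage should score highest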

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Hugging Face bge model.

License/Terms of Use:

https://huggingface.co/BAAI/bge-large-en#license

Reference(s):

For more details, please refer to the FlagEmbedding repository on GitHub.

Model Architecture:

  • Architecture Type: FlagEmbedding
  • Network Architecture: bge-large-en-v1.5

Input:

  • Input Type(s): Text
  • Input Format(s): String
  • Input Parameters: One-Dimensional (1D)
  • Other Properties Related to Input: No Pre-Processing

Output:

  • Output Type(s): Vector embeddings
  • Output Format: 1024-dimensional embedding vector · Max Tokens: 512
  • Output Parameters: Two-Dimensional (2D)
  • Other Properties Related to Output: No Post-Processing
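
To make the input/output contract above concrete, here is a minimal sketch using Hugging Face transformers with the public base checkpoint and the standard bge v1.5 recipe (CLS pooling plus L2 normalization); swapping in the fine-tuned weights is assumed to work the same way.

    # Verify the shapes described above: 1D text strings in, a 2D batch of
    # 1024-dimensional embeddings out, with inputs truncated to 512 tokens.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "BAAI/bge-large-en-v1.5"  # base checkpoint; fine-tuned weights assumed compatible
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    texts = ["Allergy list: penicillin, latex.", "What allergies are documented?"]
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")

    with torch.no_grad():
        out = model(**batch)
        cls = out.last_hidden_state[:, 0]                 # CLS pooling
        emb = torch.nn.functional.normalize(cls, dim=-1)  # unit-length vectors

    print(emb.shape)  # torch.Size([2, 1024])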

Software Integration

Runtime Engine(s): Holoscan SDK

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating System(s):

  • Linux
  • Linux for Tegra (L4T)

Model Version(s):

v0.1

Training, Testing, and Evaluation Datasets:

Training Dataset:

Link: https://github.com/synthetichealth/synthea
Data Collection Method by dataset: Synthetic
Labeling Method by dataset: Synthetic
Properties (Quantity, Dataset Descriptions, Sensor(s)): 2,500 synthetic electronic health record (EHR) document and question pairs
Dataset License(s): https://github.com/synthetichealth/synthea/blob/master/LICENSE
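
The exact pair format used for this fine-tune is not stated here; purely as an illustration, FlagEmbedding's fine-tuning examples expect JSONL records with query, pos, and neg fields, so Synthea-derived document/question pairs might be laid out as in the hypothetical sketch below.

    # Hypothetical layout of question/document pairs for FlagEmbedding
    # fine-tuning (JSONL with "query"/"pos"/"neg" keys). The records are
    # invented, not taken from the actual training set.
    import json

    pairs = [
        {
            "query": "What medication is the patient taking for hypertension?",
            "pos": ["Medication list: lisinopril 10 mg oral tablet, once daily."],
            "neg": ["Immunization: influenza vaccine administered 2023-10-02."],
        },
    ]

    with open("ehr_finetune_pairs.jsonl", "w") as f:
        for record in pairs:
            f.write(json.dumps(record) + "\n")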

Testing Dataset:

Link: Internal dataset generated by an LLM
Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)): 2,500 Q/A pairs from the Synthea dataset
Dataset License(s): https://github.com/synthetichealth/synthea/blob/master/LICENSE

Evaluation Dataset:

Link: Synthetic dataset generated by an LLM
Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)): 15 Q/A pairs
Dataset License(s): https://github.com/synthetichealth/synthea/blob/master/LICENSE
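
As a sketch of how a small set of Q/A pairs could be scored (not necessarily the evaluation procedure actually used), a top-1 retrieval check with the embeddings looks like this:

    # Hypothetical top-1 retrieval check: each question should rank its own
    # answer document highest. The pairs are invented and the checkpoint
    # path is a hypothetical local download.
    import numpy as np
    from FlagEmbedding import FlagModel

    model = FlagModel("./bge-large-ehr-finetune", use_fp16=True)

    questions = [
        "What allergies are documented for the patient?",
        "Which vaccine was administered in October 2023?",
    ]
    answer_docs = [
        "Allergy list: penicillin, latex.",
        "Immunization: influenza vaccine administered 2023-10-02.",
    ]

    q_emb = model.encode_queries(questions)  # (2, 1024)
    d_emb = model.encode(answer_docs)        # (2, 1024)

    scores = q_emb @ d_emb.T                 # cosine similarities
    top1 = (scores.argmax(axis=1) == np.arange(len(questions))).mean()
    print(f"top-1 accuracy: {top1:.2f}")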

Inference:

  • Engine: llama.cpp
  • Test Hardware: IGX platform
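
The card lists llama.cpp as the inference engine. A minimal sketch using the llama-cpp-python bindings, assuming the model has been converted to GGUF (the file name below is hypothetical):

    # Embedding inference via llama-cpp-python, assuming a GGUF conversion
    # of the fine-tuned model exists locally (hypothetical file name).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./bge-large-ehr-finetune.gguf",  # hypothetical GGUF file
        embedding=True,  # run the model in embedding mode
        n_ctx=512,       # matches the 512-token limit above
    )

    vec = llm.embed("Discharge summary: patient stable on metformin 500 mg.")
    print(len(vec))  # expected: 1024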

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.