bge-large-ehr-finetune

Description: This model is BAAI/bge-large-en-v1.5 fine-tuned on ~2,500 synthetic EHR document and question pairs.
Publisher: NVIDIA
Latest Version: 1.0
Modified: November 5, 2024
Size: 1.12 GB

Model Overview

Description:

FlagEmbedding maps text to a low-dimensional dense vector that can be used for tasks such as retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

This model has been fine-tuned on domain-specific data and is intended for demonstration purposes only, not for production use.
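
A minimal retrieval sketch, assuming the fine-tuned checkpoint has been downloaded to a local directory (the path below is hypothetical) and following the standard FlagEmbedding recipe for bge v1.5 models; the EHR snippets are invented examples.

    # Minimal retrieval sketch: embed toy EHR passages and a question,
    # then rank passages by cosine similarity. The checkpoint path is a
    # hypothetical local download of this model; BAAI/bge-large-en-v1.5
    # works the same way.
    from FlagEmbedding import FlagModel

    model = FlagModel(
        "./bge-large-ehr-finetune",  # hypothetical local path
        query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
        use_fp16=True,
    )

    passages = [
        "Medication list: lisinopril 10 mg oral tablet, once daily, for hypertension.",
        "Immunization record: influenza vaccine administered 2023-10-02.",
    ]
    question = "What medication was prescribed for high blood pressure?"

    p_emb = model.encode(passages)            # shape (2, 1024)
    q_emb = model.encode_queries([question])  # shape (1, 1024)

    # Embeddings are L2-normalized, so the inner product is cosine similarity.
    scores = q_emb @ p_emb.T
    print(scores)  # the first passage should score highest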

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Hugging Face bge model.

License/Terms of Use:

https://huggingface.co/BAAI/bge-large-en#license

Reference(s):

For more details, please refer to the FlagEmbedding repository on GitHub.

Model Architecture:

  • Architecture Type: FlagEmbedding
  • Network Architecture: bge-large-en-v1.5

Input:

  • Input Type(s): Text
  • Input Format(s): String
  • Input Parameters: One-Dimensional (1D)
  • Other Properties Related to Input: No Pre-Processing

Output:

  • Output Type(s): Vector embeddings
  • Output Format: 1024-dimensional embedding vector · Max Tokens: 512
  • Output Parameters: Two-Dimensional (2D)
  • Other Properties Related to Output: No Post-Processing
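
To make the input/output contract above concrete, here is a minimal sketch using Hugging Face transformers with the public base checkpoint and the standard bge v1.5 recipe (CLS pooling plus L2 normalization); swapping in the fine-tuned weights is assumed to work the same way.

    # Verify the shapes described above: 1D text strings in, a 2D batch of
    # 1024-dimensional embeddings out, with inputs truncated to 512 tokens.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "BAAI/bge-large-en-v1.5"  # base checkpoint; fine-tuned weights assumed compatible
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    texts = ["Allergy list: penicillin, latex.", "What allergies are documented?"]
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")

    with torch.no_grad():
        out = model(**batch)
        cls = out.last_hidden_state[:, 0]                 # CLS pooling
        emb = torch.nn.functional.normalize(cls, dim=-1)  # unit-length vectors

    print(emb.shape)  # torch.Size([2, 1024])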

Software Integration

Runtime Engine(s): Holoscan SDK

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating System(s):

  • Linux
  • Linux for Tegra (L4T)

Model Version(s):

v0.1

Training, Testing, and Evaluation Datasets:

Training Dataset:

Link: https://github.com/synthetichealth/synthea
Data Collection Method by dataset: Synthetic
Labeling Method by dataset: Synthetic
Properties (Quantity, Dataset Descriptions, Sensor(s)): 2,500 synthetic electronic health record (EHR) document and question pairs
Dataset License(s): https://github.com/synthetichealth/synthea/blob/master/LICENSE
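
The exact pair format used for this fine-tune is not stated here; purely as an illustration, FlagEmbedding's fine-tuning examples expect JSONL records with query, pos, and neg fields, so Synthea-derived document/question pairs might be laid out as in the hypothetical sketch below.

    # Hypothetical layout of question/document pairs for FlagEmbedding
    # fine-tuning (JSONL with "query"/"pos"/"neg" keys). The records are
    # invented, not taken from the actual training set.
    import json

    pairs = [
        {
            "query": "What medication is the patient taking for hypertension?",
            "pos": ["Medication list: lisinopril 10 mg oral tablet, once daily."],
            "neg": ["Immunization: influenza vaccine administered 2023-10-02."],
        },
    ]

    with open("ehr_finetune_pairs.jsonl", "w") as f:
        for record in pairs:
            f.write(json.dumps(record) + "\n")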

Testing Dataset:

Link: Internal dataset generated by an LLM
Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)): 2,500 Q/A pairs from the Synthea dataset
Dataset License(s): https://github.com/synthetichealth/synthea/blob/master/LICENSE

Evaluation Dataset:

Link: Synthetic dataset generated by an LLM
Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)): 15 Q/A pairs
Dataset License(s): https://github.com/synthetichealth/synthea/blob/master/LICENSE
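
As a sketch of how a small set of Q/A pairs could be scored (not necessarily the evaluation procedure actually used), a top-1 retrieval check with the embeddings looks like this:

    # Hypothetical top-1 retrieval check: each question should rank its own
    # answer document highest. The pairs are invented and the checkpoint
    # path is a hypothetical local download.
    import numpy as np
    from FlagEmbedding import FlagModel

    model = FlagModel("./bge-large-ehr-finetune", use_fp16=True)

    questions = [
        "What allergies are documented for the patient?",
        "Which vaccine was administered in October 2023?",
    ]
    answer_docs = [
        "Allergy list: penicillin, latex.",
        "Immunization: influenza vaccine administered 2023-10-02.",
    ]

    q_emb = model.encode_queries(questions)  # (2, 1024)
    d_emb = model.encode(answer_docs)        # (2, 1024)

    scores = q_emb @ d_emb.T                 # cosine similarities
    top1 = (scores.argmax(axis=1) == np.arange(len(questions))).mean()
    print(f"top-1 accuracy: {top1:.2f}")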

Inference:

  • Engine: llama.cpp
  • Test Hardware: IGX platform
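
The card lists llama.cpp as the inference engine. A minimal sketch using the llama-cpp-python bindings, assuming the model has been converted to GGUF (the file name below is hypothetical):

    # Embedding inference via llama-cpp-python, assuming a GGUF conversion
    # of the fine-tuned model exists locally (hypothetical file name).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./bge-large-ehr-finetune.gguf",  # hypothetical GGUF file
        embedding=True,  # run the model in embedding mode
        n_ctx=512,       # matches the 512-token limit above
    )

    vec = llm.embed("Discharge summary: patient stable on metformin 500 mg.")
    print(len(vec))  # expected: 1024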

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.