Supported Platform: Linux / amd64
The NV-CLIP NIM microservice is a multimodal embeddings model that transforms images and text into high-dimensional vector embeddings for your vision applications. Trained on 700M proprietary images, the NV-CLIP NIM microservice is NVIDIA's commercial version of OpenAI's CLIP (Contrastive Language-Image Pre-Training) model. NV-CLIP can be applied to areas such as multimodal search, zero-shot image classification, and downstream computer vision tasks such as object detection.
NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed to speed up generative AI deployment in enterprises. Supporting a wide range of AI models, including NVIDIA AI Foundation models and custom models, NIM ensures seamless, scalable AI inferencing, on premises or in the cloud, using industry-standard APIs.
NVIDIA NIM offers prebuilt containers for generative AI and vision AI models that can be used to develop vision applications, visual chatbots, or any application that needs to understand both vision and human language. Each NIM consists of a container and a model, and uses a CUDA-accelerated runtime on all NVIDIA GPUs, with special optimizations available for many configurations. Whether on premises or in the cloud, NIM is the fastest way to achieve accelerated generative AI inference at scale.
The NV-CLIP NIM microservice provides the most performant option available, powered by TensorRT. NVIDIA NIM abstracts away model inference internals such as execution engines and runtime operations.
- Scalable Deployment: The NV-CLIP NIM microservice is performant and scales seamlessly from a few users to millions.
- Model: Built on cutting-edge CLIP architectures, the NV-CLIP NIM microservice provides optimized, pre-generated engines for a variety of popular models.
- Flexible Integration: Easily incorporate the microservice into existing workflows and applications. The NV-CLIP NIM microservice provides an OpenAI API-compatible programming model and custom NVIDIA extensions for additional functionality.
- Enterprise-Grade Security: Data privacy is paramount. NVIDIA NIM emphasizes security by using safetensors, constantly monitoring and patching CVEs in our stack, and conducting internal penetration tests.
- Multimodal search: Enable accurate image and text search to quickly query a database of images and videos.
- Zero-shot and few-shot inference: Classify images without retraining or fine-tuning.
- Downstream vision tasks: Use the embeddings to enable complex downstream vision AI tasks such as segmentation, detection, VLMs, and more.
Deploying and integrating the NV-CLIP NIM microservice is straightforward and based on industry-standard APIs, as the sketch below illustrates. See the NV-CLIP NIM microservice documentation to get started.
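As an illustration, here is a minimal sketch of requesting embeddings from a locally deployed instance. The endpoint URL, model identifier ("nvidia/nvclip"), and data-URL image encoding are assumptions; consult the NIM documentation for the exact values your deployment expects.

```python
import base64
import requests

# Assumed local NIM deployment exposing the OpenAI-compatible
# /v1/embeddings route; the URL and model id are illustrative.
NIM_URL = "http://localhost:8000/v1/embeddings"

# Encode an image as a base64 data URL so it can travel in the same
# "input" list as plain text strings.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "nvidia/nvclip",
    "input": [
        f"data:image/jpeg;base64,{image_b64}",  # image input
        "a photo of a golden retriever",        # text input
    ],
    "encoding_format": "float",
}

resp = requests.post(NIM_URL, json=payload, timeout=60)
resp.raise_for_status()
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))  # 2 embeddings; 1024-dim for ViT-H
```

Because the route follows the OpenAI embeddings API shape, existing OpenAI client libraries can typically be pointed at the NIM base URL instead of a custom SDK.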
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Architecture Type: Transformer-Based
As a backbone, NV-CLIP can be used for various downstream tasks such as classification, detection, segmentation, and text-based image retrieval.
Input Type(s): Images, Texts
Input Format(s): List of Red, Green, Blue (RGB) Images or Strings
Other Properties Related to Input: Channel ordering of the input is NCHW, where N = batch size, C = number of channels (3), H = image height (224), W = image width (224)
Output Type(s): Float tensor
Output Format: 3D Tensor
Other Properties Related to Output: The output of this model is an embedding of the input image or text, of size 1024 for the ViT-H variant.
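To make the input specification concrete, the following sketch builds an NCHW batch from an RGB image. The CLIP-style normalization constants are an assumption, and the deployed microservice performs preprocessing itself, so this only applies when driving the underlying model directly (for example, through a TensorRT engine).

```python
import numpy as np
from PIL import Image

# Build an NCHW float input matching the spec above: N x 3 x 224 x 224.
# The mean/std values are assumed (standard CLIP constants); verify them
# against the documentation for your model variant.
img = Image.open("example.jpg").convert("RGB").resize((224, 224))
x = np.asarray(img, dtype=np.float32) / 255.0                 # HWC in [0, 1]
mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
x = (x - mean) / std                                          # normalize per channel
x = np.transpose(x, (2, 0, 1))[np.newaxis, ...]               # NCHW: 1 x 3 x 224 x 224
print(x.shape)  # (1, 3, 224, 224)
```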
Supported Operating System(s): Linux
These models must be used with NVIDIA hardware and software, and can only be used with the NV-CLIP NIM microservice.
The primary use case for these models is extracting feature embeddings from images and text. These embeddings can then be used for curation, clustering, and zero-shot or few-shot downstream tasks such as classification. They can also be used for text- and image-based image retrieval.
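As a sketch of the zero-shot classification use case: embed one prompt per candidate class, embed the query image, and predict the class whose prompt embedding has the highest cosine similarity to the image embedding. The vectors below are random placeholders standing in for embeddings returned by the service.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose prompt embedding is most cosine-similar
    to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(txt @ img))]

# Placeholder 1024-dim vectors; in practice these would come from the
# embeddings endpoint (one prompt like "a photo of a {label}" per class,
# plus the query image).
labels = ["cat", "dog", "car"]
rng = np.random.default_rng(0)
text_embs = rng.standard_normal((len(labels), 1024))
image_emb = rng.standard_normal(1024)

print(zero_shot_classify(image_emb, text_embs, labels))
```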
Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated
Properties:
Dataset | No. of Images
---|---
NV Internal Data | 700M
Link: https://www.image-net.org/
Data Collection Method by dataset: Unknown
Labeling Method by dataset: Unknown
Properties: 50,000 validation images from the ImageNet dataset
Methodology and KPI: Zero-shot top-1 accuracy of NV-CLIP on the ImageNet validation dataset.
Model | Top-1 Accuracy
---|---
ViT-H-224 | 0.7786
Bias, Safety & Security, and Privacy
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their supporting model team to ensure it meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy subcards. Please report security vulnerabilities or NVIDIA AI concerns to NVIDIA.
Special Training Data Considerations
The model was trained on publicly available data, which may contain toxic language and societal biases. Therefore, the model may amplify those biases, such as associating certain genders with specific social stereotypes.
Governing Terms
The NIM container is governed by the NVIDIA AI Enterprise Software License Agreement; the use of this model is governed by the ai-foundation-models-community-license.pdf (nvidia.com).
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.