AI Blueprint for Video Search and Summarization

Description: Blueprint for the Video Search and Summarization Agent
Publisher: NVIDIA
Latest Version: 2.3.0
Compressed Size: 820.6 KB
Modified: April 28, 2025

Introduction

Advances in AI video understanding and interaction have the potential to change how we access, analyze, and interact with video content across many domains. AI models for video understanding are capable of:

Video captioning: generating text descriptions or summaries of videos.
Question answering: answering questions about a video's content.
Video retrieval: finding specific videos (highlights) based on text queries.
Action recognition: identifying actions happening in the video.

The current release of the Video Search and Summarization Agent (VSS) demonstrates video summarization, Q&A, and alerts with accelerated performance on NVIDIA hardware.

Features

VSS supports video upload and live-stream ingestion, and summarizes video files, image files, and live streams with various configuration options. Features:

Fast processing of long videos
Image and multi-image support
Live Stream (RTSP) support
Supported file formats: mp4, mkv, jpg, png
Supported codecs: h264/h265 video and Opus/Vorbis audio
Summarization for videos, images, and live streams
Q&A for files, images, and live streams
Events and alerts
TRT-LLM acceleration for VILA-1.5 and NVILA
Multi-node, multi-GPU support
Context-aware RAG support for enhanced accuracy and Q&A
- Graph RAG
- Vector RAG
Support for GPT-4o as the VLM and LLM
Support for hosted, OpenAI-compatible VLMs
Drop-in support for custom VLMs
Guardrails support
OpenAI-compatible REST API
Multi-stream support
Riva ASR-based audio transcription in summarization, Q&A, and alerts
CV pipeline to generate CV metadata and Set of Marks (SOM) prompting for videos and live streams
Support for fine-tuned NVILA: recipe to fuse a LoRA checkpoint with the base NVILA model
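
The supported inputs listed above (mp4 and mkv video, jpg and png images, RTSP live streams) can be checked before a file or URL is handed to VSS. The following is a minimal sketch, not part of the VSS API: the function name `classify_media` and the returned labels are hypothetical, and treating extensions case-insensitively is an assumption. The extension alone says nothing about the codec inside the container, so an mp4 file can still be rejected if it is not h264/h265.

```python
from pathlib import Path

# Extensions taken from the "Supported file formats" line above.
VIDEO_EXTENSIONS = {".mp4", ".mkv"}
IMAGE_EXTENSIONS = {".jpg", ".png"}


def classify_media(source):
    """Return "live_stream", "video", "image", or None for an unsupported input.

    Hypothetical helper: only the URL scheme and the file extension are
    inspected; the codec inside the container is not checked.
    """
    if source.lower().startswith("rtsp://"):
        return "live_stream"
    extension = Path(source).suffix.lower()  # assumption: case-insensitive match
    if extension in VIDEO_EXTENSIONS:
        return "video"
    if extension in IMAGE_EXTENSIONS:
        return "image"
    return None


if __name__ == "__main__":
    for candidate in ["warehouse.mp4", "frame.PNG", "rtsp://camera-1/stream", "clip.avi"]:
        print(candidate, "->", classify_media(candidate))
```

A `None` result is a cheap way to reject an input early, before any upload or model time is spent on it.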

Architecture

User Guide

The user guide is available at: https://docs.nvidia.com/vss/index.html

NOTE: This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

Deployment Note

The Video Search and Summarization Blueprint is shared as a reference and is provided "as is". Security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review any potential risks and threats, define the trust boundaries, implement logging and monitoring capabilities, secure the communication channels, integrate AuthN and AuthZ with appropriate access controls, keep the deployment up to date, and ensure the containers and source code are secure and free of known vulnerabilities. End users are also responsible for ensuring the integrity and authenticity of the models and containers.

Known CVEs

VSS Engine 2.3.0 Container has the following known CVEs:

CVE	Description
CVE-2024-8966	This impacts the gradio <= 5.22.0 Python package, specifically the file upload functionality of the Gradio UI, where an attacker can cause a Denial-of-Service (DoS) attack by appending a large number of characters to the end of a multipart boundary. This affects the Gradio UI of VSS.
CVE-2025-32434	This impacts the torch v2.5.1 Python package, specifically the loading of saved model weights from a tar file using the torch.load() API, which can result in remote code execution if the weights are malicious. The default weights for the models used by VSS are in safetensors format and are not affected by this vulnerability because torch.load() is not used. However, users must ensure the safety of the weights if using other formats.
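
Because the torch.load() exposure depends on the weight file format, it can help to confirm that a custom checkpoint really is a safetensors file before loading it. The sketch below uses only the Python standard library and relies on the published safetensors layout: an 8-byte little-endian header length followed by a JSON header. The function name `looks_like_safetensors` and the 100 MB header cap are assumptions for illustration, and this is a format sniff only, not a security scan of the tensor data.

```python
import json
import struct


def looks_like_safetensors(path, max_header_bytes=100_000_000):
    """Return True if the file begins with a plausible safetensors header.

    Hypothetical helper: it checks the container layout only and does not
    validate tensor shapes, offsets, or contents.
    """
    with open(path, "rb") as handle:
        prefix = handle.read(8)
        if len(prefix) != 8:
            return False
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_length,) = struct.unpack("<Q", prefix)
        if header_length == 0 or header_length > max_header_bytes:
            return False
        raw_header = handle.read(header_length)
    if len(raw_header) != header_length:
        return False
    try:
        header = json.loads(raw_header.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    # The header is a JSON object mapping tensor names to their metadata.
    return isinstance(header, dict)
```

A file that fails this check (for example a pickle or archive produced by torch.save()) would go through torch.load() and should be treated as untrusted unless its origin is verified.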

VSS Engine 2.3.0 Source Code has the following known CVEs:

CVE	Description
CVE-2024-7246	This affects the gRPC Python package. It is possible for a gRPC client communicating with an HTTP/2 proxy to poison the HPACK table between the proxy and the backend such that other clients see failed requests. By default, VSS does not use an HTTP/2 proxy.
CVE-2024-27444	This issue is reported for the langchain-milvus 0.1.5 dependency on the older langchain version 0.1.5. However, VSS explicitly uses langchain 0.3.3, so the CVE is not applicable.
CVE-2024-28088	This issue is reported for the langchain-milvus 0.1.5 dependency on the older langchain version 0.1.5. However, VSS explicitly uses langchain 0.3.3, so the CVE is not applicable.
CVE-2024-38459	This issue is reported for the langchain-milvus 0.1.5 dependency on the older langchain version 0.1.5. However, VSS explicitly uses langchain 0.3.3, so the CVE is not applicable.
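
Several of the entries above hinge on which package version is actually installed (gradio <= 5.22.0, langchain 0.3.3 rather than 0.1.5). A rough way to check a local environment is sketched below using `importlib.metadata` from the standard library. The helper names are hypothetical, the version parser is deliberately simplistic (numeric dotted versions only), and using 0.1.5 as the langchain ceiling is a simplification drawn from the table rather than from the CVE records. For real audits, a proper version parser such as `packaging.version` is the safer choice.

```python
from importlib import metadata

# Ceilings taken from the tables above; illustrative only.
FLAGGED_AT_OR_BELOW = {"gradio": "5.22.0", "langchain": "0.1.5"}


def parse_version(text):
    """Parse "5.22.0" into (5, 22, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def at_most(version, ceiling):
    """Return True if version <= ceiling, as in the "gradio <= 5.22.0" condition."""
    return parse_version(version) <= parse_version(ceiling)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for package, ceiling in FLAGGED_AT_OR_BELOW.items():
        version = installed_version(package)
        if version is None:
            print(f"{package}: not installed")
        elif at_most(version, ceiling):
            print(f"{package} {version}: at or below {ceiling}, review the CVE notes")
        else:
            print(f"{package} {version}: above {ceiling}")
```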

VSS 2.2.0 (Previous Release) has the following known CVEs:

CVE	Description
CVE-2024-11393	This impacts the transformers v4.47.0 Python package, specifically Hugging Face Transformers MaskFormer model deserialization, and allows remote attackers to execute arbitrary code. User interaction is required to exploit this vulnerability in that the target must visit a malicious page or open a malicious file. However, this does not affect VSS because the MaskFormer model is not used in VSS.
CVE-2024-11392	This impacts the transformers v4.47.0 Python package, specifically Hugging Face Transformers MobileViTV2 model deserialization, and allows remote attackers to execute arbitrary code. User interaction is required to exploit this vulnerability in that the target must visit a malicious page or open a malicious file. However, this does not affect VSS because the MobileViTV2 model is not used in VSS.
CVE-2024-11394	This impacts the transformers v4.47.0 Python package, specifically Hugging Face Transformers Trax model deserialization, and allows remote attackers to execute arbitrary code. User interaction is required to exploit this vulnerability in that the target must visit a malicious page or open a malicious file. However, this does not affect VSS because the Trax model is not used in VSS.

GOVERNING TERMS

The software and materials are governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products, except for models which are governed by the NVIDIA Community Model License.

Additional information: Llama 3.1 Community License Agreement for Llama-3.1-70b-instruct; Llama 3.2 Community License Agreement for NVIDIA Retrieval QA Llama 3.2 1B Embedding v2 and NVIDIA Retrieval QA Llama 3.2 1B Reranking v2; Apache License, Version 2.0 for https://github.com/google-research/big_vision/blob/main/LICENSE and Apache License, Version 2.0 for https://github.com/01-ai/Yi/blob/main/LICENSE. Built with Llama.