Description: The Cosmos diffusion models are a collection of diffusion-based world foundation models that generate dynamic, high-quality videos from text, image, or video inputs.
Publisher: NVIDIA
Latest Version: 1.0
Modified: January 7, 2025
Size: 13.48 GB

Cosmos-1.0-Diffusion: A Suite of Diffusion-based World Foundation Models

Cosmos | Code | Paper

Model Overview

Description:

Cosmos World Foundation Models: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware videos and world states for physical AI development.

The Cosmos diffusion models are a collection of diffusion-based world foundation models that generate dynamic, high-quality videos from text, image, or video inputs. They can serve as building blocks for applications and research related to world generation. The models are ready for commercial use under the NVIDIA Open Model License Agreement.

Model Developer: NVIDIA

Model Versions

In the Cosmos 1.0 release, the Cosmos Diffusion WFM family includes the following models:

  • Cosmos-1.0-Diffusion-7B-Text2World
    • Given a text description, predict an output video of 121 frames.
  • Cosmos-1.0-Diffusion-14B-Text2World
    • Given a text description, predict an output video of 121 frames.
  • Cosmos-1.0-Diffusion-7B-Video2World
    • Given a text description and an image as the first frame, predict the future 120 frames.
  • Cosmos-1.0-Diffusion-14B-Video2World
    • Given a text description and an image as the first frame, predict the future 120 frames.

License:

This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.

Under the NVIDIA Open Model License, NVIDIA confirms:

  • Models are commercially usable.
  • You are free to create and distribute Derivative Models.
  • NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.

Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under NVIDIA Open Model License Agreement will automatically terminate.

Model Architecture:

Cosmos-1.0-Diffusion-7B-Video2World is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on the input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the denoising time step. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Noise augmentation is applied to the conditional latent frames to bridge the gap between training and inference.
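The interleaved block structure described above can be sketched in NumPy. This is an illustrative toy (tiny dimensions, random weights, single-head attention, hypothetical parameter names), not the actual Cosmos implementation; it only shows how self-attention, cross-attention, and a feedforward layer are interleaved, each preceded by adaptive layer normalization conditioned on a denoising-time embedding:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # scaled dot-product attention with a numerically stable softmax
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def adaln(x, t_emb, scale_w, shift_w):
    # adaptive layer norm: scale/shift predicted from the time embedding
    return layer_norm(x) * (1 + t_emb @ scale_w) + t_emb @ shift_w

def block(x, text, t_emb, params):
    # self-attention over the video latent tokens
    h = adaln(x, t_emb, *params["adaln1"])
    x = x + attention(h @ params["wq1"], h @ params["wk1"], h @ params["wv1"])
    # cross-attention conditioning on the text tokens
    h = adaln(x, t_emb, *params["adaln2"])
    x = x + attention(h @ params["wq2"], text @ params["wk2"], text @ params["wv2"])
    # position-wise feedforward layer
    h = adaln(x, t_emb, *params["adaln3"])
    x = x + np.maximum(h @ params["w1"], 0) @ params["w2"]
    return x

rng = np.random.default_rng(0)
d = 16
p = lambda *s: rng.normal(0, 0.02, s)
params = {
    "adaln1": (p(d, d), p(d, d)),
    "adaln2": (p(d, d), p(d, d)),
    "adaln3": (p(d, d), p(d, d)),
    "wq1": p(d, d), "wk1": p(d, d), "wv1": p(d, d),
    "wq2": p(d, d), "wk2": p(d, d), "wv2": p(d, d),
    "w1": p(d, 4 * d), "w2": p(4 * d, d),
}
x = rng.normal(size=(8, d))      # 8 video latent tokens
text = rng.normal(size=(4, d))   # 4 text tokens
t_emb = rng.normal(size=(d,))    # denoising-time embedding
y = block(x, text, t_emb, params)
print(y.shape)
```

In the full model this block is stacked many times, and the output keeps the shape of the video latent sequence, so blocks compose freely.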

Input/Output Specifications

  • Input

    • Input Type(s): Text+Image, Text+Video
    • Input Format(s):
      • Text: String
      • Image: jpg, png, jpeg, webp
      • Video: mp4
    • Input Parameters:
      • Text: One-dimensional (1D)
      • Image: Two-dimensional (2D)
      • Video: Three-dimensional (3D)
    • Other Properties Related to Input:
      • The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
      • The input image should be of 1280x704 resolution.
      • The input video should be of 1280x704 resolution and 9 input frames.
  • Output

    • Output Type(s): Video
    • Output Format(s): mp4
    • Output Parameters: Three-dimensional (3D)
    • Other Properties Related to Output: The generated video will be a 5-second clip with a resolution of 1280x704 pixels at 24 frames per second (fps). The content of the video will use the provided image as the first frame and visualize the input text description as a short animated scene, capturing the main elements mentioned in the input within the time constraints.
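The input constraints and frame/duration arithmetic above are easy to check up front. A minimal sketch (the helper names are hypothetical; the numbers come from the specifications above):

```python
def validate_prompt(text: str) -> bool:
    # model card: the prompt should contain fewer than 300 words
    return len(text.split()) < 300

def validate_frame(width: int, height: int) -> bool:
    # expected input resolution for images and video frames
    return (width, height) == (1280, 704)

FPS = 24
TOTAL_FRAMES = 121  # first (conditioning) frame + 120 predicted frames

# 121 frames at 24 fps is roughly the advertised 5-second clip
print(f"{TOTAL_FRAMES / FPS:.2f} s")

print(validate_prompt("A robot arm picks up a red cube from a cluttered workbench."))
print(validate_frame(1280, 704))
```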

Software Integration

Runtime Engine(s):

  • Cosmos

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Ampere

Note: We have tested inference only with BF16 precision.

Operating System(s):

  • Linux (We have not tested on other operating systems.)

Usage

  • See Cosmos for details.

Evaluation

Please see our technical paper for detailed evaluations.

Inference Time and VRAM Requirements

For a single GPU:

| Model | GPU | Inference Time (seconds) | VRAM |
| --- | --- | --- | --- |
| Cosmos-1.0-Diffusion-7B-Text2World | H100 | 411.83 | 42 GiB |
| Cosmos-1.0-Diffusion-14B-Text2World | H100 | 723.12 | 68 GiB |
| Cosmos-1.0-Diffusion-7B-Video2World | H100 | 428.13 | 42 GiB |
| Cosmos-1.0-Diffusion-14B-Video2World | H100 | 734.57 | 68 GiB |
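Assuming each run produces the 5-second clip described earlier, the timings above translate into a rough wall-clock cost per second of generated video (a back-of-envelope sketch; the numbers are copied from the table):

```python
# single-GPU inference times from the table above, in seconds
times = {
    "7B-Text2World": 411.83,
    "14B-Text2World": 723.12,
    "7B-Video2World": 428.13,
    "14B-Video2World": 734.57,
}
CLIP_SECONDS = 5.0  # assumed output clip length

for name, t in times.items():
    print(f"{name}: {t / CLIP_SECONDS:.1f} s of H100 time per second of video")
```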

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns here.