Cosmos World Foundation Models: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware videos and world states for physical AI development.
The Cosmos diffusion models are a collection of diffusion-based world foundation models that generate dynamic, high-quality videos from text, image, or video inputs. They can serve as building blocks for various applications or research related to world generation. The models are ready for commercial use under the NVIDIA Open Model License Agreement.
Model Developer: NVIDIA
In the Cosmos 1.0 release, the Cosmos Diffusion WFM family includes the following models:

- Cosmos-1.0-Diffusion-7B-Text2World: Text to visual world generation
- Cosmos-1.0-Diffusion-14B-Text2World: Text to visual world generation
- Cosmos-1.0-Diffusion-7B-Video2World: Video + text based future visual world generation
- Cosmos-1.0-Diffusion-14B-Video2World: Video + text based future visual world generation
This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA Open Model License, NVIDIA confirms:

- Models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under the NVIDIA Open Model License Agreement will automatically terminate.
Cosmos-1.0-Diffusion-14B-Video2World is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on the input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to the conditional latent frames to bridge the gap between training and inference.
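To make the block structure concrete, here is a minimal, illustrative PyTorch sketch of one such transformer block and of the conditional frame concatenation. The module names, dimensions, adaLN parameterization, and the augment-noise scale are assumptions for illustration, not NVIDIA's implementation.

```python
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    """Layer norm whose scale/shift are predicted from the timestep
    embedding (hypothetical parameterization; the real model may differ)."""

    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DiTBlock(nn.Module):
    """One interleaved self-attention / cross-attention / feedforward block,
    with adaLN applied before each layer as described above."""

    def __init__(self, dim: int, t_dim: int, n_heads: int):
        super().__init__()
        self.norm1 = AdaLN(dim, t_dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = AdaLN(dim, t_dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = AdaLN(dim, t_dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, text_ctx, t_emb):
        # Self-attention over the video latent tokens.
        h = self.norm1(x, t_emb)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention conditions the denoising on the text embeddings.
        h = self.norm2(x, t_emb)
        x = x + self.cross_attn(h, text_ctx, text_ctx, need_weights=False)[0]
        # Position-wise feedforward.
        return x + self.ffn(self.norm3(x, t_emb))


def concat_conditional_frames(cond_latents, gen_latents, sigma_aug=0.05):
    """Concatenate input (conditional) latent frames with the frames being
    generated along the temporal dimension, adding augment noise to the
    conditional frames to bridge the train/inference gap.
    Tensors are (B, C, T, H, W); sigma_aug is an illustrative value."""
    noisy_cond = cond_latents + sigma_aug * torch.randn_like(cond_latents)
    return torch.cat([noisy_cond, gen_latents], dim=2)


# Example with illustrative shapes:
block = DiTBlock(dim=512, t_dim=256, n_heads=8)
tokens = torch.randn(1, 1024, 512)  # flattened spatio-temporal latent tokens
text = torch.randn(1, 77, 512)      # text embeddings (e.g., from a T5 encoder)
t_emb = torch.randn(1, 256)         # denoising timestep embedding
out = block(tokens, text, t_emb)    # -> (1, 1024, 512)
```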
Input
Output
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Note: We have only tested inference with BF16 precision.
Operating System(s):
Please see our technical paper for detailed evaluations.
The peak GPU memory usage numbers provided below may vary depending on system specs and are for reference only.
| Offloading Strategy | 7B Video2World | 14B Video2World |
|---|---|---|
| Offload prompt upsampler | 76.5 GB | > 80.0 GB |
| Offload prompt upsampler & guardrails | 59.9 GB | 73.3 GB |
| Offload prompt upsampler & guardrails & T5 encoder | 41.3 GB | 54.8 GB |
| Offload prompt upsampler & guardrails & T5 encoder & tokenizer | 41.1 GB | 54.5 GB |
| Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model | 27.3 GB | 39.0 GB |
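Each additional offload in the table trades runtime for peak memory by keeping only the active pipeline stage resident on the GPU. Below is a minimal sketch of this pattern, assuming hypothetical stage modules (`upsampler`, `t5_encoder`, `dit`, `tokenizer`, `guardrail`); it is not the actual Cosmos inference code, which exposes offloading through its own inference options in the Cosmos repository.

```python
import torch


def run_offloaded(module: torch.nn.Module, fn, *args, device: str = "cuda"):
    """Move one pipeline stage to the GPU, run it, then park it back on the
    CPU so that only the active stage occupies GPU memory at a time."""
    module.to(device)
    try:
        return fn(*args)
    finally:
        module.to("cpu")          # offload weights back to host memory
        torch.cuda.empty_cache()  # return freed blocks to the driver


# Hypothetical pipeline order (names are not the actual Cosmos API):
#   prompt   = run_offloaded(upsampler, upsampler.upsample, raw_prompt)
#   text_emb = run_offloaded(t5_encoder, t5_encoder.encode, prompt)
#   latents  = run_offloaded(dit, dit.denoise, text_emb, cond_latents)
#   video    = run_offloaded(tokenizer, tokenizer.decode, latents)
#   run_offloaded(guardrail, guardrail.check, video)
```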
The following table shows the end-to-end inference runtime on a single H100 GPU, excluding model initialization time:
| 7B Video2World (offload prompt upsampler) | 14B Video2World (offload prompt upsampler, guardrails) |
|---|---|
| ~383 seconds | ~593 seconds |
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns here.