NVIDIA Cosmos

Cosmos World Foundation Models come in three model types, all of which can be customized in post-training: cosmos-predict, cosmos-transfer, and cosmos-reason:

| | Predict | Transfer | Reason |
|---|---|---|---|
| Type | World Generation | Multi-Controlnet | Reasoning VLM |
| Function | Predict novel future frames given initial frames | Transfer existing control frames into photoreal frames within a video clip | Reason against frames within a video clip |
| Use Cases | Data Generation & Policy Evaluation | Data Augmentation | Data Curation |
| Inputs | Text, Image, Video | Multiple Video Modalities such as RGB, Depth, Segmentation, and more | Video & Text |
| Outputs | Video | Video | Text |


Cosmos-Predict2 is a key branch of Cosmos World Foundation Models (WFMs). It is specialized for future state prediction; models of this kind are often referred to as world models. The three main branches of Cosmos WFMs are cosmos-predict, cosmos-transfer, and cosmos-reason. We visualize the architecture of Cosmos-Predict2 in the following figure.

Key Features

Cosmos-Predict2 includes the following:

Diffusion-based world foundation models for Text2Image and Video2World generation, with which a user can generate visual simulations from text prompts or video prompts.

System Requirements

Cosmos-Predict2 has the following system requirements:

NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer architectures. For detailed hardware requirements and recommendations, please refer to our performance benchmarks.
Linux operating system (Ubuntu 20.04, 22.04, or 24.04 LTS)
CUDA version 12.4 or later
Python version 3.10 or later
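
The version requirements above can be checked before installation. The following is a minimal sketch and not part of Cosmos-Predict2: the names `parse_version` and `check_requirements` are hypothetical, and the caller is assumed to supply the CUDA version and the GPU compute capability (for example, from `torch.version.cuda` and `torch.cuda.get_device_capability()` if PyTorch is already installed). It relies on Ampere GPUs reporting compute capability 8.0 or higher. The operating system requirement is not checked here.

```python
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
# Ampere GPUs (A100, RTX 30 Series) report compute capability 8.x.
MIN_COMPUTE_CAPABILITY = (8, 0)


def parse_version(text):
    """Turn a version string such as '12.4' into a comparable tuple (12, 4)."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_requirements(cuda_version, compute_capability, python_version=None):
    """Return a list of problems found; an empty list means every check passed."""
    if python_version is None:
        # Default to the interpreter running this script.
        python_version = sys.version_info[:2]
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python version is older than 3.10")
    if parse_version(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.4")
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        problems.append("GPU architecture is older than Ampere")
    return problems
```

Returning a list of problems rather than a single boolean lets a setup script report every unmet requirement in one pass.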

Download checkpoints

Generate a Hugging Face access token (if you haven't done so already). Set the access token to Read permission (default is Fine-grained).
Log in to Hugging Face with the access token:
```
huggingface-cli login
```
Accept the Llama-Guard-3-8B terms.

Download the Cosmos model weights from Hugging Face:

```
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B 14B --model_types Text2Image --checkpoint_dir checkpoints
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B 14B --model_types Video2World --checkpoint_dir checkpoints
```
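
The two download commands differ only in the `--model_types` value, so they can be driven from a small wrapper. The following is a sketch under stated assumptions: the helper names `build_download_command`, `build_environment`, and `download_all` are hypothetical and not part of the repository, only the flags shown in the commands above are used, and the wrapper is assumed to run from the repository root.

```python
import os
import subprocess
import sys


def build_download_command(model_type, model_sizes=("2B", "14B"), checkpoint_dir="checkpoints"):
    """Build the argument list for one download run, using only the flags shown above."""
    return [
        sys.executable,
        "scripts/download_diffusion_checkpoints.py",
        "--model_sizes", *model_sizes,
        "--model_types", model_type,
        "--checkpoint_dir", checkpoint_dir,
    ]


def build_environment(base_env=None):
    """Copy the environment and set CUDA_HOME and PYTHONPATH as the shell commands do."""
    env = dict(os.environ if base_env is None else base_env)
    # Mirrors CUDA_HOME=$CONDA_PREFIX; CUDA_HOME is left unchanged
    # when no conda environment is active.
    if "CONDA_PREFIX" in env:
        env["CUDA_HOME"] = env["CONDA_PREFIX"]
    # Mirrors PYTHONPATH=$(pwd).
    env["PYTHONPATH"] = os.getcwd()
    return env


def download_all(model_types=("Text2Image", "Video2World")):
    """Run the download script once per model type, stopping at the first failure."""
    for model_type in model_types:
        subprocess.run(build_download_command(model_type), env=build_environment(), check=True)
```

Calling `download_all()` from the repository root would issue the same two invocations as the shell commands; passing `check=True` makes a failed download raise `subprocess.CalledProcessError` instead of continuing silently.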

Inference with pre-trained Cosmos-Predict2 models

Models

Cosmos-Predict2 includes the following models:

Cosmos-Predict2-2B-Text2Image: Text to image generation
Cosmos-Predict2-14B-Text2Image: Text to image generation
Cosmos-Predict2-2B-Video2World: Video + Text based future visual world generation
Cosmos-Predict2-14B-Video2World: Video + Text based future visual world generation
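
The four names above follow a `Cosmos-Predict2-<size>-<type>` pattern, and the size and type parts appear to correspond to the `--model_sizes` and `--model_types` values passed to the download script. As a minimal sketch (the constants and the `model_names` helper are hypothetical, introduced only for illustration), the list can be generated rather than typed out:

```python
from itertools import product

# Sizes and types as they appear in the model list above.
MODEL_SIZES = ("2B", "14B")
MODEL_TYPES = ("Text2Image", "Video2World")


def model_names():
    """List the model names grouped by type, then by size, as in the list above."""
    return [
        f"Cosmos-Predict2-{size}-{kind}"
        for kind, size in product(MODEL_TYPES, MODEL_SIZES)
    ]
```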

License and Contact

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

This model includes safety and content moderation features powered by Llama Guard 3. Llama Guard 3 is used solely as a content input filter and is subject to its own license.

NVIDIA Cosmos source code is released under the Apache 2 License.

NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.

Cosmos Predict2 Container