
NVIDIA’s Cosmos Video Generation: The Ultimate Creation?

The Cosmos diffusion models released by the NVIDIA team are capable of generating dynamic, high-quality videos from text, images, or even other videos, as we explain below.

These pre-trained models are like generalists. They have been trained on massive video datasets that cover a wide range of real-world physical scenarios. This makes them incredibly versatile for tasks that require an understanding of physics.

These models are released under the NVIDIA Open Model License, which gives you the freedom to use them for commercial purposes within its stated limitations. For deeper insights, see the related information in their research paper.

Installation

1. First, install ComfyUI if you are new to it.

2. Existing users need to update ComfyUI from the Manager section.

3. Now, download the NVIDIA Cosmos models from the Hugging Face repository and save them into your “ComfyUI/models/diffusion_models” folder (a scripted download sketch follows this list). Make sure to use the correct model variant: the 7B variant is for lower-end GPUs and the 14B variant is for higher-end GPUs.

We have observed that many people are confused by the naming convention. Here, “Text-to-World” simply refers to the Text-to-Video flow, and “Video-to-World” refers to the Image/Video-to-Video flow. You can also get the raw models from their GitHub repository.

4. Download the text encoder (oldt5_xxl_fp8_e4m3fn_scaled.safetensors) from Hugging Face and save it into your “ComfyUI/models/text_encoders” folder.

5. Download the VAE model (cosmos_cv8x8x8_1.0.safetensors) from Hugging Face and place it inside your “ComfyUI/models/vae” folder.

6. Restart ComfyUI for the changes to take effect.
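
If you prefer to script the downloads rather than grabbing the files by hand, here is a minimal sketch using the huggingface_hub library. The repo IDs and the diffusion-model filename below are placeholders for illustration; substitute the actual repositories linked above and adjust the path to your ComfyUI install.

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# NOTE: the repo IDs and the diffusion-model filename are placeholders --
# replace them with the actual repositories/files linked in this post.
from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI = Path("ComfyUI")  # adjust to your ComfyUI install path

downloads = [
    # (repo_id, filename, target folder)
    ("your-org/cosmos-1.0-diffusion-7b", "cosmos-1.0-diffusion-7b-text2world.safetensors",
     COMFYUI / "models" / "diffusion_models"),
    ("your-org/cosmos-text-encoder-and-vae", "oldt5_xxl_fp8_e4m3fn_scaled.safetensors",
     COMFYUI / "models" / "text_encoders"),
    ("your-org/cosmos-text-encoder-and-vae", "cosmos_cv8x8x8_1.0.safetensors",
     COMFYUI / "models" / "vae"),
]

for repo_id, filename, target in downloads:
    target.mkdir(parents=True, exist_ok=True)
    # local_dir drops the file directly into the ComfyUI model folder
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=target)
    print(f"Saved {filename} -> {path}")
```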

Workflow

1. Get the workflow from our Hugging Face repository page:
(a) Text-to-Video workflow
(b) Image-to-Video workflow
2. Drag and drop the workflow into your ComfyUI.
3. Load the NVIDIA Cosmos (Text-to-Video or Image-to-Video) model in the UNet/Diffusion model node.
4. Load the text encoder model in the CLIP model node.
5. Load the VAE model in the Load VAE node.
6. Add positive and negative prompts. Your prompts should be long and descriptive enough for the Cosmos model to understand them; shorter prompts will give you poor results.
7. Configure the KSampler settings.
8. Set your video dimensions to the default 704 x 704 pixels (the minimum limit).
9. Hit the Queue button to generate (a scripted way to queue the workflow is sketched after this list).
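
If you want repeatable runs instead of clicking Queue, you can also submit the workflow through ComfyUI's HTTP API. The sketch below assumes ComfyUI is running on the default port 8188, that you exported the workflow in API format (Save (API Format) with dev mode enabled) as cosmos_t2v_api.json, and that node “6” is the positive CLIPTextEncode node; the filename and node ID are assumptions, so check your own export.

```python
# Sketch: queue a Cosmos text-to-video workflow through ComfyUI's HTTP API.
# Assumes a local ComfyUI server on the default port 8188 and a workflow
# exported with "Save (API Format)". The node ID below is an assumption.
import json
import urllib.request

WORKFLOW_FILE = "cosmos_t2v_api.json"   # assumed name of your API-format export
POSITIVE_NODE = "6"                     # assumed ID of the positive CLIPTextEncode node

with open(WORKFLOW_FILE) as f:
    workflow = json.load(f)

# Cosmos responds best to long, descriptive prompts (see step 6 above).
workflow[POSITIVE_NODE]["inputs"]["text"] = (
    "A slow cinematic dolly shot through a rain-soaked neon street at night, "
    "reflections shimmering on wet asphalt, pedestrians with umbrellas, "
    "volumetric fog, realistic physics, highly detailed, 35mm film look"
)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # ComfyUI returns the prompt_id for this job
```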

We ran the model on an RTX 3090 with 16GB VRAM. Generating a video at 704×704 resolution took around 7-8 minutes.
To get good results, you need to make multiple attempts and pick the best one. The model is not as flawless as claimed on its official page. Compared with other video generation models like LTX-Video or HunyuanVideo, the resulting quality is not quite up to the mark.
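
Because several attempts are usually needed, a small loop that re-queues the same workflow with different seeds can save some clicking. This builds on the API sketch above; the KSampler node ID “3” is again an assumption taken from a typical export, so verify it against your own workflow JSON.

```python
# Sketch: queue the same workflow several times with random seeds, then pick
# the best output by eye. Builds on the previous API example; the KSampler
# node ID is an assumption -- check your own API-format export.
import json
import random
import urllib.request

KSAMPLER_NODE = "3"  # assumed ID of the KSampler node

with open("cosmos_t2v_api.json") as f:
    workflow = json.load(f)

for attempt in range(4):
    workflow[KSAMPLER_NODE]["inputs"]["seed"] = random.randint(0, 2**32 - 1)
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(f"Attempt {attempt + 1}:", resp.read().decode())
```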