
NVIDIA’s Cosmos Video Generation: The Ultimate Creation?

The Cosmos diffusion models released by the NVIDIA team are capable of generating dynamic, high-quality videos from text, images, or even other videos, as we explain below.

These pre-trained models are like generalists. They have been trained on massive video datasets that cover a wide range of real-world physical scenarios. This makes them incredibly versatile for tasks that require an understanding of physics.

These models are released under the NVIDIA Open Model License, which gives you the freedom to use them for commercial purposes within its stated limitations. For deeper insights, see the related information in their research paper.

Installation

1. First, install ComfyUI if you are new to it.

2. Existing users need to update ComfyUI from the Manager section.

3. Now, download the NVIDIA Cosmos models from the Hugging Face repository and save them into your “ComfyUI/models/diffusion_models” folder (a scripted download sketch follows this list). Make sure to use the correct model variant: the 7B variant is for lower-end GPUs and the 14B variant is for higher-end GPUs.

We have observed that many people are confused by the naming convention. Here, “Text-to-World” simply refers to the Text-to-Video flow, and “Video-to-World” refers to the Image/Video-to-Video flow. You can also get the raw models from their GitHub repository.

4. Download the text encoder (oldt5_xxl_fp8_e4m3fn_scaled.safetensors) from Hugging Face and save it into your “ComfyUI/models/text_encoders” folder.

5. Download the VAE model (cosmos_cv8x8x8_1.0.safetensors) from Hugging Face and place it inside your “ComfyUI/models/vae” folder.

6. Restart ComfyUI for the changes to take effect.
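
If you prefer to script the downloads rather than grabbing the files by hand, here is a minimal sketch using the huggingface_hub library. The repo IDs and the diffusion-model filename below are placeholders for illustration; substitute the actual repositories linked above and adjust the path to your ComfyUI install.

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# NOTE: the repo IDs and the diffusion-model filename are placeholders --
# replace them with the actual repositories/files linked in this post.
from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI = Path("ComfyUI")  # adjust to your ComfyUI install path

downloads = [
    # (repo_id, filename, target folder)
    ("your-org/cosmos-1.0-diffusion-7b", "cosmos-1.0-diffusion-7b-text2world.safetensors",
     COMFYUI / "models" / "diffusion_models"),
    ("your-org/cosmos-text-encoder-and-vae", "oldt5_xxl_fp8_e4m3fn_scaled.safetensors",
     COMFYUI / "models" / "text_encoders"),
    ("your-org/cosmos-text-encoder-and-vae", "cosmos_cv8x8x8_1.0.safetensors",
     COMFYUI / "models" / "vae"),
]

for repo_id, filename, target in downloads:
    target.mkdir(parents=True, exist_ok=True)
    # local_dir drops the file directly into the ComfyUI model folder
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=target)
    print(f"Saved {filename} -> {path}")
```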

Workflow

1. Get the workflow from our Hugging Face repository page:
(a) Text-to-Video workflow
(b) Image-to-Video workflow
2. Drag and drop the workflow into your ComfyUI.
3. Load the NVIDIA Cosmos (Text-to-Video or Image-to-Video) model in the UNet/Diffusion model node.
4. Load the text encoder model in the CLIP model node.
5. Load the VAE model in the Load VAE node.
6. Add positive and negative prompts. Your prompts should be long and descriptive enough for the Cosmos model to understand them; shorter prompts will give you poor results.
7. Configure the KSampler settings.
8. Set your video dimensions to the default 704 x 704 pixels (the minimum limit).
9. Hit the Queue button to generate (a scripted way to queue the workflow is sketched after this list).
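
If you want repeatable runs instead of clicking Queue, you can also submit the workflow through ComfyUI's HTTP API. The sketch below assumes ComfyUI is running on the default port 8188, that you exported the workflow in API format (Save (API Format) with dev mode enabled) as cosmos_t2v_api.json, and that node “6” is the positive CLIPTextEncode node; the filename and node ID are assumptions, so check your own export.

```python
# Sketch: queue a Cosmos text-to-video workflow through ComfyUI's HTTP API.
# Assumes a local ComfyUI server on the default port 8188 and a workflow
# exported with "Save (API Format)". The node ID below is an assumption.
import json
import urllib.request

WORKFLOW_FILE = "cosmos_t2v_api.json"   # assumed name of your API-format export
POSITIVE_NODE = "6"                     # assumed ID of the positive CLIPTextEncode node

with open(WORKFLOW_FILE) as f:
    workflow = json.load(f)

# Cosmos responds best to long, descriptive prompts (see step 6 above).
workflow[POSITIVE_NODE]["inputs"]["text"] = (
    "A slow cinematic dolly shot through a rain-soaked neon street at night, "
    "reflections shimmering on wet asphalt, pedestrians with umbrellas, "
    "volumetric fog, realistic physics, highly detailed, 35mm film look"
)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # ComfyUI returns the prompt_id for this job
```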

We ran the model on an RTX 3090 with 16GB VRAM. Generating a video at 704×704 resolution took around 7-8 minutes.
To get good results, you need to make multiple attempts and pick the best one. The model is not as flawless as claimed on its official page. Compared with other video generation models like LTX-Video or HunyuanVideo, the resulting quality is not quite up to the mark.
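
Because several attempts are usually needed, a small loop that re-queues the same workflow with different seeds can save some clicking. This builds on the API sketch above; the KSampler node ID “3” is again an assumption taken from a typical export, so verify it against your own workflow JSON.

```python
# Sketch: queue the same workflow several times with random seeds, then pick
# the best output by eye. Builds on the previous API example; the KSampler
# node ID is an assumption -- check your own API-format export.
import json
import random
import urllib.request

KSAMPLER_NODE = "3"  # assumed ID of the KSampler node

with open("cosmos_t2v_api.json") as f:
    workflow = json.load(f)

for attempt in range(4):
    workflow[KSAMPLER_NODE]["inputs"]["seed"] = random.randint(0, 2**32 - 1)
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(f"Attempt {attempt + 1}:", resp.read().decode())
```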