ComfyUI: Transform Images/Text into Lengthy Videos with Pyramid Flow

Generating longer videos with consistent content is a challenging task, but Pyramid Flow makes it more achievable. It is an open-source text-to-video model that builds on projects, datasets, and methods including Stable Diffusion 3 Medium, CogVideoX, Flux 1.0, WebVid-10M, OpenVid-1M, Diffusion Forcing, GameNGen, Open-Sora Plan, and VideoLLaMA2.

The entire framework is optimized end to end with a single unified Diffusion Transformer (DiT). According to the experiments in the paper, the method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS, with training taking 20.7k A100 GPU hours. See the research paper for an in-depth explanation.

Pyramid Flow is now supported in ComfyUI. Let's move on to the installation.

Installation:

1. Install ComfyUI on your machine.

2. Update it if it is already installed by selecting “Update all” in ComfyUI Manager.

3. Go to the “ComfyUI/custom_nodes” folder. On Windows, click the folder's address bar, type “cmd”, and press Enter to open a command prompt in that folder. Then clone the repository with the following command:

git clone https://github.com/kijai/ComfyUI-PyramidFlowWrapper.git

4. All the required models are downloaded automatically from Pyramid Flow's Hugging Face repository. These are the raw, unoptimized releases, so GPU-optimized variants are not available yet.
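The installation steps above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names `clone_plan` and `is_installed` are hypothetical, and the script only builds the `git clone` command and checks whether the target folder exists. It does not run git for you.

```python
from pathlib import Path

# Repository URL taken from the clone command in step 3.
REPO_URL = "https://github.com/kijai/ComfyUI-PyramidFlowWrapper.git"


def clone_plan(comfy_root):
    """Return (target_dir, git_command) for installing the wrapper.

    comfy_root is the path to your ComfyUI folder (an assumption:
    adjust it to wherever ComfyUI lives on your machine).
    """
    custom_nodes = Path(comfy_root) / "custom_nodes"
    target = custom_nodes / "ComfyUI-PyramidFlowWrapper"
    command = ["git", "clone", REPO_URL, str(target)]
    return target, command


def is_installed(comfy_root):
    """True if the wrapper folder already exists under custom_nodes."""
    target, _ = clone_plan(comfy_root)
    return target.is_dir()


if __name__ == "__main__":
    target, command = clone_plan("ComfyUI")
    print("Command to run:", " ".join(command))
    print("Already installed:", is_installed("ComfyUI"))
```

If the folder already exists, skip the clone and update the node from ComfyUI Manager instead.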

Workflow:

1. The workflow can be found inside your “ComfyUI/custom_nodes/ComfyUI-PyramidFlowWrapper/examples” folder.

There are two workflows you can choose from:

(a) Text to Video generation

(b) Image to Video generation
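To check which example workflows shipped with your copy of the node, a short script can list the files in the examples folder. This is a minimal sketch that assumes the workflows are stored as `.json` files. The function name `list_example_workflows` is hypothetical, and the actual file names depend on the version of the repository you cloned.

```python
from pathlib import Path


def list_example_workflows(comfy_root):
    """Return the sorted names of .json files in the wrapper's examples folder.

    Assumes workflows are saved as .json; returns an empty list if the
    folder does not exist (for example, if the node is not installed).
    """
    examples = (
        Path(comfy_root)
        / "custom_nodes"
        / "ComfyUI-PyramidFlowWrapper"
        / "examples"
    )
    if not examples.is_dir():
        return []
    return sorted(p.name for p in examples.glob("*.json"))


if __name__ == "__main__":
    for name in list_example_workflows("ComfyUI"):
        print(name)
```

Drag the workflow file you want onto the ComfyUI canvas to load it.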

2. There are two model checkpoints, which differ in resolution and maximum video length:

(a) 384p checkpoint – supports up to 5 seconds of video at 24 FPS and runs in under 10 GB of VRAM

(b) 768p checkpoint – supports up to 10 seconds of video at 24 FPS and needs 10-12 GB of VRAM
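The choice between the two checkpoints can be expressed as a small lookup. The sketch below only encodes the limits listed above. The names `pick_checkpoint` and `frame_count` are hypothetical, and the frame count is the nominal seconds × FPS figure, which may differ slightly from what the model actually produces.

```python
FPS = 24  # both checkpoints generate video at 24 FPS

# Limits as listed in this article; the VRAM figures are the article's estimates.
CHECKPOINTS = {
    "384p": {"max_seconds": 5, "vram": "under 10 GB"},
    "768p": {"max_seconds": 10, "vram": "10-12 GB"},
}


def pick_checkpoint(seconds):
    """Return the smallest checkpoint that covers the requested length."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    for name in ("384p", "768p"):
        if seconds <= CHECKPOINTS[name]["max_seconds"]:
            return name
    raise ValueError("neither checkpoint supports more than 10 seconds")


def frame_count(seconds, fps=FPS):
    """Nominal number of frames for a clip of the given length."""
    return seconds * fps


if __name__ == "__main__":
    for seconds in (5, 10):
        name = pick_checkpoint(seconds)
        print(seconds, "s ->", name, "checkpoint,",
              frame_count(seconds), "frames,",
              CHECKPOINTS[name]["vram"], "VRAM")
```

Note that the 384p checkpoint is picked whenever it is long enough. If you want the higher resolution for a short clip, select the 768p checkpoint manually.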

Recommended settings:

(a) Text to Video generation 

num_inference_steps=[20, 20, 20]

video_num_inference_steps=[10, 10, 10]

height=768, width=1280

guidance_scale=9.0

video_guidance_scale=5.0

temp=16

(b) Image to Video generation

num_inference_steps=[10, 10, 10]

temp=16

video_guidance_scale=4.0
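For reference, the recommended settings above can be kept as two Python dictionaries, for example to compare them against the values in the ComfyUI sampler node. This is a minimal sketch: the dictionary names and the `check_settings` helper are hypothetical, and the sanity checks are illustrative rather than rules enforced by Pyramid Flow.

```python
# Recommended values copied from the lists above.
TEXT_TO_VIDEO = {
    "num_inference_steps": [20, 20, 20],
    "video_num_inference_steps": [10, 10, 10],
    "height": 768,
    "width": 1280,
    "guidance_scale": 9.0,
    "video_guidance_scale": 5.0,
    "temp": 16,
}

IMAGE_TO_VIDEO = {
    "num_inference_steps": [10, 10, 10],
    "temp": 16,
    "video_guidance_scale": 4.0,
}


def check_settings(settings):
    """Return a list of problems found in a settings dict (empty if none).

    These checks are illustrative assumptions, not limits documented
    by the model authors.
    """
    problems = []
    for key in ("num_inference_steps", "video_num_inference_steps"):
        steps = settings.get(key)
        if steps is None:
            continue  # the key is optional in the image-to-video settings
        if not all(isinstance(s, int) and s > 0 for s in steps):
            problems.append(key + " must contain positive integers")
    for key in ("guidance_scale", "video_guidance_scale"):
        value = settings.get(key)
        if value is not None and value <= 0:
            problems.append(key + " must be positive")
    if settings.get("temp", 1) < 1:
        problems.append("temp must be at least 1")
    return problems


if __name__ == "__main__":
    for label, cfg in (("text to video", TEXT_TO_VIDEO),
                       ("image to video", IMAGE_TO_VIDEO)):
        print(label, "->", check_settings(cfg) or "ok")
```

Treat these values as a starting point and adjust the step counts and guidance scales to trade speed against quality.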