Local Video Installation and Generation (Native/GGUFs)

Alibaba Cloud has released another diffusion-based video generation model: Wan2.1, an open-source suite of video foundation models licensed under Apache 2.0. It delivers state-of-the-art performance while remaining accessible on consumer hardware. You can find more details in their research paper.

It outperforms existing open-source models and rivals commercial solutions. The Text-to-Video model generates a 5-second 480P video from Chinese or English prompts on an RTX 4090 in about 4 minutes, using 8.19 GB of VRAM without optimization.

Model          Resolution     Features
T2V-14B        480P & 720P    Best overall quality
I2V-14B-720P   720P           Higher-resolution image-to-video
I2V-14B-480P   480P           Standard-resolution image-to-video
T2V-1.3B       480P           Lightweight for consumer hardware

Installation

Whichever workflow you want to use, start by installing ComfyUI if you are new to it. Existing users should update ComfyUI from the Manager section by selecting “Update ComfyUI“.
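If you prefer the command line to the Manager, a minimal sketch of a fresh manual install looks like this (assuming Git and Python are already on your system; portable-build users can skip this and use the bundled update script instead):

REM clone ComfyUI, install its dependencies, and launch it
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py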

Type A: Native Support

1. Download the diffusion model (Text-to-Video or Image-to-Video) from Hugging Face and save it into your “ComfyUI/models/diffusion_models” directory.
2. Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the “ComfyUI/models/text_encoders” folder.
3. Download the CLIP vision model and put it into the “ComfyUI/models/clip_vision” folder.
4. Finally, download the VAE model and put it into your “ComfyUI/models/vae” folder. (Example download commands follow this list.)
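As an illustration, the files can also be fetched from the command line with huggingface-cli. The repository and file names below follow Comfy-Org's repackaged Wan 2.1 release and are assumptions, so swap in the exact files you want; note that huggingface-cli keeps the repository's subfolder layout under --local-dir, so you may need to move the files into the final folders afterwards.

REM example only; repo and file names are assumptions, adjust to the variant you chose
huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged split_files/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors --local-dir ComfyUI/models/diffusion_models
huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors --local-dir ComfyUI/models/text_encoders
huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged split_files/clip_vision/clip_vision_h.safetensors --local-dir ComfyUI/models/clip_vision
huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged split_files/vae/wan_2.1_vae.safetensors --local-dir ComfyUI/models/vae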

Workflow

1. Get the required workflow from our Hugging Face repository.
2. Drag and drop it into ComfyUI.
(a) Load the Wan model (Text-to-Video or Image-to-Video) in the UNet loader node.
(b) Load the text encoder in the CLIP node.
(c) Select the VAE model.
(d) Add your positive and negative prompts.
(e) Set the KSampler settings.
(f) Click “Queue” to start the generation.
Text-to-Video 14B generation example
Shift test results
Wan2.1 CFG test results

Type B: WanVideo Wrapper (Quantized by Kijai)

1. Clone the WanVideo Wrapper repository into your “custom_nodes” folder by typing the following command into a command prompt.
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
2. Open a command prompt and install the required dependencies by typing the commands below (for the portable build, run them from inside the “ComfyUI_windows_portable” folder).
For normal ComfyUI users (from inside the ComfyUI-WanVideoWrapper folder):
pip install -r requirements.txt
For portable ComfyUI users:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt

3. Download the model (Text-to-Video or Image-to-Video) from Hugging Face and put it into the “ComfyUI/models/diffusion_models” folder.

There are two precision options (BF16 and FP8) and two resolution variants (480P and 720P) to choose from. Select the one that suits your machine and use case: BF16 is for higher VRAM (more than 12GB), and FP8 is for lower VRAM (12GB or less).

4. Download the relevant text encoder and save it into the “ComfyUI/models/text_encoders” folder. Select the BF16 or FP32 variant.
5. Download the relevant VAE model and place it into your “ComfyUI/models/vae” directory. Select the BF16 or FP32 variant. (Example download commands follow this list.)
6. Restart ComfyUI.
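For example, the quantized files can be pulled with huggingface-cli. The repository name (Kijai/WanVideo_comfy) and the file names below are assumptions based on Kijai's Hugging Face page, so verify them against the actual file list before downloading:

REM example only; repo and file names are assumptions, pick the variant matching your VRAM
huggingface-cli download Kijai/WanVideo_comfy Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors --local-dir ComfyUI/models/diffusion_models
huggingface-cli download Kijai/WanVideo_comfy umt5-xxl-enc-bf16.safetensors --local-dir ComfyUI/models/text_encoders
huggingface-cli download Kijai/WanVideo_comfy Wan2_1_VAE_bf16.safetensors --local-dir ComfyUI/models/vae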

Workflow

1. You can find the workflow inside your “ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows” folder.
2. Drag and drop it into ComfyUI.
We tested this with Image-to-Video on an RTX 3080 (10GB VRAM) with SageAttention enabled, and the generation time was around 467 seconds.
Note: This workflow uses Triton and SageAttention in the background to speed up inference, but they are optional; you can enable or disable them as needed.

If you want to install these, make sure you have the Visual C++ Redistributable, CUDA 12.x, and Visual Studio installed on your system. There is a lot of confusion around setting up Triton on Windows machines; you can find a detailed walkthrough in the Triton-windows GitHub repository.

Install the Windows Triton .whl file for your Python version. To check your Python version, run “python --version” (without quotes) in a command prompt. We have Python 3.10 installed. For other Python versions, check the Windows Triton releases section.
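A rough sketch for the portable build is shown below. The Triton wheel name is only a placeholder for Python 3.10, so use the file you actually downloaded from the Triton-windows releases page; the sageattention package is assumed to be installable from PyPI, otherwise build it from its GitHub repository.

REM placeholder wheel name; install the wheel you downloaded for your Python version
python_embeded\python.exe -m pip install path\to\triton-3.x.x-cp310-cp310-win_amd64.whl
python_embeded\python.exe -m pip install sageattention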

Type C: GGUF variant (By City96)

1. Install the GGUF custom nodes from the Manager section by selecting the “Custom nodes manager” option. Search for “ComfyUI-GGUF” (by author city96) and hit install.

Users who have already used the Flux GGUF, Stable Diffusion 3.5 GGUF, or HunyuanVideo GGUF variants only need to update this custom node from the Manager by selecting the “Update” option.

2. Download the relevant GGUF model from the Hugging Face repository (example download commands follow this list):
For Image-to-Video: save the Img2Vid GGUF model into the “ComfyUI/models/unet” folder, then download the rest of the models from Comfy’s Hugging Face repository, placing CLIP vision into “ComfyUI/models/clip_vision”, the text encoder into “ComfyUI/models/text_encoders”, and the VAE into “ComfyUI/models/vae”.
For Text-to-Video: save the Txt2Vid GGUF model into the “ComfyUI/models/unet” folder, then download the rest of the model files (text encoder, VAE, etc.) from the Kijai Wan repository explained above, placing them into the same folders.
Here you have various quantization levels, from Q3 (very lightweight, faster, lower-quality generation) to Q8 (very heavy, slower, higher precision). Choose as per your system VRAM and use case.
3. Restart ComfyUI for the changes to take effect.
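As an example, a single quantized file can be fetched with huggingface-cli. The repository and file names below are assumptions, so browse city96's Hugging Face page for the exact quantization you want:

REM example only; repo and file names are assumptions, choose your quantization level
huggingface-cli download city96/Wan2.1-T2V-14B-gguf wan2.1-t2v-14b-Q4_K_M.gguf --local-dir ComfyUI/models/unet
huggingface-cli download city96/Wan2.1-I2V-14B-480P-gguf wan2.1-i2v-14b-480p-Q4_K_M.gguf --local-dir ComfyUI/models/unet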

Workflow

1. Download the same workflow (from ComfyUI’s repository) referenced in Type B’s workflow section.
2. Everything else stays the same; just replace the “Load Diffusion Model” node with the “Unet Loader (GGUF)” node.