Alibaba Cloud has released another diffusion-based video generation model: Wan2.1, an open-source suite of video foundation models licensed under Apache 2.0. It delivers state-of-the-art performance while remaining accessible on consumer hardware. You can read more in their research paper.
It outperforms existing open-source models and rivals commercial solutions on the market. The text-to-video model generates a 5-second 480P video on an RTX 4090 in about 4 minutes, using 8.19 GB of VRAM without optimization, from both Chinese and English text prompts.
Model | Resolution | Features
--- | --- | ---
T2V-14B | 480P & 720P | Best overall quality
I2V-14B-720P | 720P | Higher-resolution image-to-video
I2V-14B-480P | 480P | Standard-resolution image-to-video
T2V-1.3B | 480P | Lightweight for consumer hardware
Installation
Whichever workflow you want to use, start by installing ComfyUI if you are new to it. Existing users should update ComfyUI from the Manager section by selecting “Update ComfyUI“.
Type A: Native Support
1. Download the models (text-to-video or image-to-video) from Hugging Face and save them into your “ComfyUI/models/diffusion_models” directory.
2. Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the “ComfyUI/models/text_encoders” folder.
3. Download the CLIP vision model and put it into the “ComfyUI/models/clip_vision” folder.
4. Finally, download the VAE model and put it into your “ComfyUI/models/vae” folder.
Workflow
Drag and drop the workflow into ComfyUI, then configure the nodes:
(a) Load the Wan model (text-to-video or image-to-video) in the UNet loader node.
(b) Load the text encoder in the CLIP node.
(c) Select the VAE model.
(d) Enter your positive/negative prompts.
(e) Set the KSampler settings.
(f) Click “Queue” to start generation (or queue the workflow from a script; see the sketch below).
[Images: Shift test and CFG test comparison results]
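Step (f) can also be done from a script: ComfyUI exposes a small HTTP API on its local port, so a workflow exported in API format (via the “Save (API Format)” option, which may require enabling dev mode) can be queued without touching the UI. Below is a minimal sketch, assuming ComfyUI is running on the default port 8188 and the exported file is named wan_workflow_api.json.

```python
# Minimal sketch: queue an exported (API-format) ComfyUI workflow over HTTP.
# Assumes ComfyUI is running locally on the default port 8188 and that
# "wan_workflow_api.json" was exported via "Save (API Format)".
import json
import uuid
import urllib.request

with open("wan_workflow_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({
    "prompt": workflow,              # the node graph to execute
    "client_id": str(uuid.uuid4()),  # identifies this client in the queue
}).encode("utf-8")

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # response includes the prompt_id of the queued job
```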
Type B: Wan Wrapper (Quantized by Kijai)
1. Clone the Wan Wrapper repository into your “custom_nodes” folder by typing the following command into the command prompt.
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
2. Move inside the “ComfyUI_windows_portable” folder, open a command prompt, and install the required dependencies with the following commands.
For normal ComfyUI users (run from inside the ComfyUI-WanVideoWrapper folder):
pip install -r requirements.txt
For portable ComfyUI users:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt
3. Download the models (text-to-video or image-to-video) from Hugging Face and put them into the “ComfyUI/models/diffusion_models” folder.
Here there are two precision options (BF16 and FP8) to choose from, each available for different video resolutions (480p and 720p). Select the one that matches your machine and use case: BF16 is for higher-VRAM cards (more than 12 GB) and FP8 for lower-VRAM cards (12 GB or less); see the VRAM check sketch after this list.
4. Download the relevant text encoder and save it into the “ComfyUI/models/text_encoders” folder. Select the bf16 or fp32 variant.
5. You also need to download the relevant VAE model and place it into your “ComfyUI/models/vae” directory. Select the bf16 or fp32 variant.
6. Restart ComfyUI.
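The BF16-versus-FP8 choice above comes down to how much VRAM your card has. Below is a minimal sketch, assuming PyTorch is installed (it already is in any working ComfyUI environment), that reads the GPU's memory and suggests a variant along the 12 GB line from step 3; the threshold is only the guideline from this guide, not a hard rule.

```python
# Minimal sketch: suggest a Wan 2.1 precision variant from available VRAM.
# The 12 GB threshold follows the guideline in step 3 above; adjust to taste.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; these workflows need an NVIDIA card.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / (1024 ** 3)

suggestion = "BF16" if vram_gb > 12 else "FP8"
print(f"{props.name}: {vram_gb:.1f} GB VRAM -> suggested variant: {suggestion}")
```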
Workflow
1. You can get the workflow inside your “ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows” folder.
2. Drag and drop it into ComfyUI.
We tested this with image-to-video on an RTX 3080 (10 GB VRAM) with sage attention enabled, and the generation time was around 467 seconds.
Note: This workflow uses Triton and SageAttention in the background to speed up inference, but they are optional. You can disable them if you do not need them.
If you want to install these, make sure you have the Visual C++ redistributable, CUDA 12.x, and Visual Studio installed on your system. There is a lot of confusion around setting up Triton on Windows machines; you can get a detailed explanation from the Triton-windows GitHub repository.
Install the Triton-windows .whl file for your Python version. To check your Python version, run “python --version” (without quotes) in a command prompt. We have Python 3.10 installed. For other Python versions, check the Triton-windows releases section.
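Before downloading wheels, it helps to confirm what your environment already has. The following minimal sketch prints the Python, PyTorch, and CUDA versions (which determine the correct Triton-windows wheel) and reports whether triton and sageattention can already be imported. Portable ComfyUI users should run it with the bundled interpreter (python_embeded\python.exe) so the check reflects the environment ComfyUI actually uses.

```python
# Minimal sketch: report the versions that decide which Triton wheel you need,
# and check whether triton / sageattention are already importable.
import importlib
import platform

import torch

print("Python :", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda)  # None means a CPU-only PyTorch build

for pkg in ("triton", "sageattention"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: installed ({getattr(mod, '__version__', 'unknown version')})")
    except ImportError:
        print(f"{pkg}: not installed")
```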
Type C: GGUF variant (By City96)
1. Install the GGUF custom nodes from the Manager section by selecting the “Custom nodes manager” option. Now search for “ComfyUI-GGUF” (by City96) and hit install.
Users who have already used the Flux GGUF, Stable Diffusion 3.5 GGUF, or HunyuanVideo GGUF variants only need to update this custom node from the Manager by selecting the “Update” option.
2. Download any of the relevant models from the Hugging Face repository:
Save the Img2Vid model into the “ComfyUI/models/unet” folder, the CLIP vision model into “ComfyUI/models/clip_vision”, the text encoder into “ComfyUI/models/text_encoders”, and the VAE into “ComfyUI/models/vae”.
Download the rest of the model files (text encoder, VAE, etc.) from the Kijai Wan repository explained above.
Save the Txt2Vid model into the “ComfyUI/models/unet” folder, the CLIP vision model into “ComfyUI/models/clip_vision”, the text encoder into “ComfyUI/models/text_encoders”, and the VAE into “ComfyUI/models/vae”.
Here you have various quantization levels, from Q3 (very lightweight, faster, with lower-quality generation) to Q8 (very heavyweight, slower, with higher precision). Choose according to your system VRAM and use case; see the sketch after this list for comparing file sizes before downloading.
3. Restart ComfyUI for the changes to take effect.
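If you are unsure which quantization will fit, you can list the GGUF files in the Hugging Face repo together with their sizes before downloading anything. The sketch below uses the huggingface_hub library; the repo id is a placeholder, so substitute the actual City96 Wan 2.1 repository you picked in step 2. Roughly speaking, a file that is clearly smaller than your free VRAM leaves room for the text encoder and activations.

```python
# Minimal sketch: list the GGUF quantizations in a Hugging Face repo with their
# file sizes, so you can judge which Q-level fits your VRAM.
# The repo id is a placeholder; replace it with the City96 Wan repo you chose.
from huggingface_hub import HfApi

repo_id = "<city96-wan2.1-gguf-repo>"  # placeholder, not a real repo id

info = HfApi().model_info(repo_id, files_metadata=True)
for sibling in info.siblings:
    if sibling.rfilename.endswith(".gguf"):
        size_gb = (sibling.size or 0) / (1024 ** 3)
        print(f"{sibling.rfilename}  {size_gb:.1f} GB")
```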
Workflow
1. Use the same native ComfyUI workflow described in the Type A workflow section above.
2. Everything else stays the same; just replace the “Load Diffusion Model” node with the “UNet Loader (GGUF)” node.