14B Mixture-of-Experts model for cinematic video from static images. Open-source, 720P/24fps, trained on 65% more image data than Wan 2.1.
Turn a single image into a multi-second video clip with natural motion and camera movement.
Two-expert design: high-noise expert handles layout structure, low-noise expert refines texture and detail.
Trained with curated data for lighting, composition, and contrast. Reduced unrealistic camera motion.
Available via Hugging Face, Replicate, and Diffusers. Pre-built inference pipelines and Docker images.
Generate at 480P for fast previews or 720P for final production. Aspect ratios: 16:9, 9:16, 1:1.
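The resolution and aspect-ratio options above can be expressed as a small helper (a sketch; the function name and exact pixel dimensions are assumptions — check the model card for the officially supported sizes):

```python
# Sketch: map a resolution tier and aspect ratio to output dimensions.
# Exact supported sizes are assumptions; consult the Wan 2.2 model card.

SHORT_SIDE = {"480P": 480, "720P": 720}
ASPECT = {"16:9": (16, 9), "9:16": (9, 16), "1:1": (1, 1)}

def output_size(tier: str, aspect: str) -> tuple[int, int]:
    """Return (width, height), scaling the short side to the tier."""
    short = SHORT_SIDE[tier]
    w, h = ASPECT[aspect]
    if w >= h:  # landscape or square: height is the short side
        return (short * w // h, short)
    return (short, short * h // w)

print(output_size("720P", "16:9"))  # (1280, 720)
print(output_size("480P", "9:16"))  # (480, 853)
```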
Combine a reference image with a text prompt to steer motion direction, style, and narrative.
Wan 2.2 is built on a DiT (Diffusion Transformer) backbone augmented with a Mixture-of-Experts (MoE) routing mechanism. Unlike standard diffusion models that use a single U-Net or Transformer for all denoising steps, Wan 2.2 routes computation through two specialized experts:

- A high-noise expert that handles the early denoising steps, establishing layout and structure.
- A low-noise expert that takes over in the later steps, refining texture and fine detail.
This separation mirrors how human artists work: sketch the composition first, then add detail. The result is 14 billion active parameters per step (out of 28B total), achieving better quality than a single 14B model because each expert specializes.
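The two-expert split can be sketched as timestep-based routing: the high-noise expert denoises the early steps, the low-noise expert the rest. This is a toy illustration only; the boundary fraction, function names, and stand-in experts are assumptions, not the released implementation:

```python
# Toy sketch of Wan 2.2's two-expert MoE routing: one expert handles
# high-noise (early) timesteps, the other low-noise (late) timesteps.
# The boundary and the "expert" stand-ins are illustrative assumptions.

HIGH_NOISE_BOUNDARY = 0.5  # fraction of steps handled by the high-noise expert

def route(step: int, total_steps: int) -> str:
    """Pick which expert denoises this step (early = high noise)."""
    return "high_noise" if step < total_steps * HIGH_NOISE_BOUNDARY else "low_noise"

def denoise(latent, total_steps=50):
    experts = {
        "high_noise": lambda x: x,  # stand-in: would establish layout/structure
        "low_noise": lambda x: x,   # stand-in: would refine texture/detail
    }
    # Only one expert is active per step, so 14B of the 28B total
    # parameters do work at any given time.
    for step in range(total_steps):
        latent = experts[route(step, total_steps)](latent)
    return latent

print([route(s, 10) for s in range(10)])
```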
Compared to Wan 2.1, the training dataset was expanded by 65% for images and 83% for video clips. A key change was aesthetic curation: the team filtered training videos for lighting quality, stable camera motion, and composition. This directly addresses the common complaint about AI-generated video: unrealistic camera zoom or rotation. Wan 2.2 produces noticeably more stable, cinematic camera behavior.
The model requires an NVIDIA GPU with at least 24 GB VRAM for 720P generation (a single RTX 4090 or A100). For 480P, 16 GB is sufficient. The recommended setup uses the diffusers library:
```shell
pip install diffusers accelerate
huggingface-cli download Wan-AI/Wan2.2-I2V-14B
```

Cloud options include Replicate (pay-per-generation) and Hugging Face Inference Endpoints (dedicated GPU). For researchers who need to present video generation results in academic papers, SciDraw can help create publication-quality figures showing frame sequences, architecture diagrams, and quantitative comparison charts.
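The VRAM guidance above can be turned into a simple preflight check before launching a job (a sketch using the figures on this page; the function name is hypothetical, and the thresholds should be tuned for your own setup):

```python
# Preflight check based on this page's VRAM guidance:
# 720P needs ~24 GB (RTX 4090 / A100), 480P needs ~16 GB.

VRAM_REQUIRED_GB = {"480P": 16, "720P": 24}

def check_vram(resolution: str, available_gb: float) -> bool:
    """Return True if the GPU has enough memory for this resolution."""
    return available_gb >= VRAM_REQUIRED_GB[resolution]

print(check_vram("720P", 24))  # True  (e.g. a single RTX 4090)
print(check_vram("720P", 16))  # False (fall back to 480P previews)
print(check_vram("480P", 16))  # True
```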
Wan 2.2 I2V-A14B outperforms Stable Video Diffusion (SVD) in FVD and CLIPSIM metrics on the VBench benchmark. Compared to Runway Gen-3 and Kling, it offers the advantage of being fully open-source with reproducible results. The MoE design gives it a quality advantage over single-model approaches at similar parameter counts, particularly in temporal consistency and camera motion realism.
480P or 720P at 24 frames per second, depending on configuration. 720P is recommended for cinematic quality.
Yes. Wan 2.2 supports image-to-video and text+image conditioning. The reference image anchors the first frame.
24 GB for 720P (RTX 4090 / A100). 16 GB for 480P (RTX 4080 / L4).
Mixture of Experts. Two specialized sub-networks: one for layout (high noise), one for detail (low noise). Only 14B parameters are active per step.
Yes. Weights and code on Hugging Face and GitHub. Compatible with Diffusers and Replicate.
Wans2V provides tools, guides, and resources for the Wan-AI video generation ecosystem. We focus on practical deployment, benchmarking, and integration so you can go from a static image to a cinematic video clip in minutes.