Why MoE for Video?
Video generation models face a fundamental scaling challenge: they need massive parameter counts to capture the complexity of temporal dynamics, but inference cost scales linearly with active parameters. Mixture-of-Experts (MoE) architectures solve this by keeping total parameters high (14.4B in Wan's case) while only activating a small subset (5.3B) for each input token. This provides the representational capacity of a 14B model at the computational cost of a 5B model.
Architecture Overview
| Component | Configuration | Notes |
|---|---|---|
| Total parameters | 14.4B | Sum of all expert weights + shared layers |
| Active parameters | 5.3B per token | ~37% activation ratio |
| Number of experts | 16 per MoE layer | Each expert is a feed-forward network |
| Active experts per token | 2 (top-k=2) | Router selects 2 of 16 experts |
| MoE layers | Alternating with dense layers | Every other transformer block uses MoE |
| Router type | Learned linear + softmax | Token-level routing decisions |
Expert Routing Mechanism
The router is a learned linear layer that maps each token's hidden state to a 16-dimensional logit vector. After a softmax over the logits, the top-2 experts are selected, and their outputs are weighted by their routing probabilities before being summed:

output = Σ_{i ∈ top-2} p_i · E_i(x)

where p_i is expert i's softmax probability and E_i(x) is the output of expert i's feed-forward network.
This means each token can follow a different computational path through the network. In practice, we observe that motion-heavy tokens tend to activate different experts than static background tokens, suggesting the model learns to specialize experts for different aspects of video generation.
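The routing mechanism described above can be sketched in a few lines. This is a minimal illustration with toy shapes and randomly initialized experts, not Wan's actual implementation; the function and variable names are hypothetical, and some implementations additionally renormalize the top-k probabilities to sum to 1.

```python
import numpy as np

def top2_route(x, W_router, experts, k=2):
    """Top-k MoE routing sketch (hypothetical, not Wan's code).

    x:        (d,) hidden state for one token
    W_router: (n_experts, d) learned router weights
    experts:  list of callables, each mapping (d,) -> (d,)
    """
    logits = W_router @ x                       # (n_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all experts
    top = np.argsort(probs)[-k:]                # indices of the top-k experts
    # weight each selected expert's output by its routing probability, then sum
    y = sum(probs[i] * experts[i](x) for i in top)
    return y, top

# toy usage: 16 experts, hidden size 8, linear "experts" for illustration
rng = np.random.default_rng(0)
d, n = 8, 16
W_router = rng.normal(size=(n, d))
experts = [lambda v, M=rng.normal(size=(d, d)): M @ v for _ in range(n)]
y, chosen = top2_route(rng.normal(size=d), W_router, experts)
```

Because only the two selected experts run per token, the compute per token stays close to that of a dense model with two expert-sized feed-forward blocks, regardless of the total expert count.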
Load Balancing
Without explicit balancing, some experts receive most tokens while others are rarely activated (the "rich get richer" problem). Wan uses an auxiliary load-balancing loss that penalizes uneven expert utilization:
L_balance = α · N · Σ(f_i · P_i)
where:
f_i = fraction of tokens routed to expert i
P_i = average routing probability for expert i
N = number of experts (16)
α = balance coefficient (typically 0.01)
Sparse vs Dense: Performance Comparison
| Metric | Dense 5.3B | MoE 14.4B (5.3B active) | Dense 14B (hypothetical) |
|---|---|---|---|
| FVD (↓ better) | 285 | 198 | ~180 (estimated) |
| CLIPSIM (↑ better) | 0.282 | 0.311 | ~0.315 (estimated) |
| Inference TFLOPs | 18.2 | 19.8 | 52.4 |
| Training tokens/sec | 2,450 | 2,180 | 890 |
The MoE model achieves roughly 95% of the estimated dense-14B quality at about 38% of the inference compute (19.8 vs 52.4 TFLOPs). This is the core value proposition of sparse architectures in video generation.
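The headline ratios follow directly from the table's own numbers; a quick back-of-the-envelope check:

```python
# Ratios derived from the table above (parameters in billions, compute in TFLOPs)
active_params, total_params = 5.3, 14.4
moe_tflops, dense14b_tflops = 19.8, 52.4

activation_ratio = active_params / total_params      # ~0.37, i.e. ~37% active
compute_fraction = moe_tflops / dense14b_tflops      # ~0.38 of dense-14B compute
```

Note that the MoE model's 19.8 TFLOPs is only slightly above the dense 5.3B model's 18.2, since the extra cost is mostly the router itself.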
FAQ
Can I run the MoE model on consumer GPUs?
Yes, with quantization. All 14.4B parameters must reside in memory even though only 5.3B are active per token, so VRAM is governed by total parameters while compute is governed by active ones. The model requires ~30GB in FP16; with AWQ 4-bit quantization, it fits in ~12GB VRAM, and the per-token compute is manageable on an RTX 4090.
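A quick weight-memory estimate supports these figures (weights only; activation memory and framework overhead, which account for the gap between raw 4-bit weights and the ~12GB figure, are assumptions here, not numbers from the source):

```python
def weight_gb(n_params, bits):
    """Rough weight-memory estimate in GB; ignores activations and runtime overhead."""
    return n_params * bits / 8 / 1e9

fp16_gb = weight_gb(14.4e9, 16)  # ~28.8 GB, consistent with "~30GB in FP16"
int4_gb = weight_gb(14.4e9, 4)   # ~7.2 GB of weights; runtime overhead pushes
                                 # real VRAM use toward the quoted ~12GB
```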
Do experts specialize automatically?
Yes. Analysis of routing patterns shows that experts develop specializations for motion estimation, texture synthesis, temporal consistency, and spatial composition—without being explicitly trained for these roles.
How does MoE affect video temporal consistency?
In our evaluations, MoE improves temporal consistency, and routing analyses suggest a reason: some experts can specialize in temporal patterns, whereas a dense model must use the same weights for both spatial and temporal reasoning, which can cause interference between the two.