Why MoE for Video?
Video generation models face a fundamental scaling challenge: they need massive parameter counts to capture the complexity of temporal dynamics, but inference cost scales linearly with active parameters. Mixture-of-Experts (MoE) architectures solve this by keeping total parameters high (14.4B in Wan's case) while only activating a small subset (5.3B) for each input token. This provides the representational capacity of a 14B model at the computational cost of a 5B model.
Architecture Overview
| Component | Configuration | Notes |
|---|---|---|
| Total parameters | 14.4B | Sum of all expert weights + shared layers |
| Active parameters | 5.3B per token | ~37% activation ratio |
| Number of experts | 16 per MoE layer | Each expert is a feed-forward network |
| Active experts per token | 2 (top-k=2) | Router selects 2 of 16 experts |
| MoE layers | Alternating with dense layers | Every other transformer block uses MoE |
| Router type | Learned linear + softmax | Token-level routing decisions |
Expert Routing Mechanism
The router is a learned linear layer that maps each token's hidden state to a 16-dimensional logit vector. After a softmax over the logits, the top-2 experts are selected, and their outputs are weighted by their routing probabilities before being summed:

output = Σ_{i ∈ top-2} p_i · E_i(x)

where p_i is expert i's softmax probability and E_i(x) is the output of expert i's feed-forward network.
This means each token can follow a different computational path through the network. In practice, we observe that motion-heavy tokens tend to activate different experts than static background tokens, suggesting the model learns to specialize experts for different aspects of video generation.
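The routing mechanism described above can be sketched in a few lines. This is a minimal illustration with toy shapes and randomly initialized experts, not Wan's actual implementation; the function and variable names are hypothetical, and some implementations additionally renormalize the top-k probabilities to sum to 1.

```python
import numpy as np

def top2_route(x, W_router, experts, k=2):
    """Top-k MoE routing sketch (hypothetical, not Wan's code).

    x:        (d,) hidden state for one token
    W_router: (n_experts, d) learned router weights
    experts:  list of callables, each mapping (d,) -> (d,)
    """
    logits = W_router @ x                       # (n_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all experts
    top = np.argsort(probs)[-k:]                # indices of the top-k experts
    # weight each selected expert's output by its routing probability, then sum
    y = sum(probs[i] * experts[i](x) for i in top)
    return y, top

# toy usage: 16 experts, hidden size 8, linear "experts" for illustration
rng = np.random.default_rng(0)
d, n = 8, 16
W_router = rng.normal(size=(n, d))
experts = [lambda v, M=rng.normal(size=(d, d)): M @ v for _ in range(n)]
y, chosen = top2_route(rng.normal(size=d), W_router, experts)
```

Because only the two selected experts run per token, the compute per token stays close to that of a dense model with two expert-sized feed-forward blocks, regardless of the total expert count.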
Load Balancing
Without explicit balancing, some experts receive most tokens while others are rarely activated (the "rich get richer" problem). Wan uses an auxiliary load-balancing loss that penalizes uneven expert utilization:
L_balance = α · N · Σ(f_i · P_i)
where:
f_i = fraction of tokens routed to expert i
P_i = average routing probability for expert i
N = number of experts (16)
α = balance coefficient (typically 0.01)
Sparse vs Dense: Performance Comparison
| Metric | Dense 5.3B | MoE 14.4B (5.3B active) | Dense 14B (hypothetical) |
|---|---|---|---|
| FVD (↓ better) | 285 | 198 | ~180 (estimated) |
| CLIPSIM (↑ better) | 0.282 | 0.311 | ~0.315 (estimated) |
| Inference TFLOPs | 18.2 | 19.8 | 52.4 |
| Training tokens/sec | 2,450 | 2,180 | 890 |
The MoE model achieves roughly 95% of the estimated dense-14B quality at about 38% of the inference compute (19.8 vs 52.4 TFLOPs). This is the core value proposition of sparse architectures in video generation.
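The headline ratios follow directly from the table's own numbers; a quick back-of-the-envelope check:

```python
# Ratios derived from the table above (parameters in billions, compute in TFLOPs)
active_params, total_params = 5.3, 14.4
moe_tflops, dense14b_tflops = 19.8, 52.4

activation_ratio = active_params / total_params      # ~0.37, i.e. ~37% active
compute_fraction = moe_tflops / dense14b_tflops      # ~0.38 of dense-14B compute
```

Note that the MoE model's 19.8 TFLOPs is only slightly above the dense 5.3B model's 18.2, since the extra cost is mostly the router itself.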
FAQ
Can I run the MoE model on consumer GPUs?
Yes, with quantization. All 14.4B parameters must reside in memory even though only 5.3B are active per token, so VRAM is governed by total parameters while compute is governed by active ones. The model requires ~30GB in FP16; with AWQ 4-bit quantization, it fits in ~12GB VRAM, and the per-token compute is manageable on an RTX 4090.
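A quick weight-memory estimate supports these figures (weights only; activation memory and framework overhead, which account for the gap between raw 4-bit weights and the ~12GB figure, are assumptions here, not numbers from the source):

```python
def weight_gb(n_params, bits):
    """Rough weight-memory estimate in GB; ignores activations and runtime overhead."""
    return n_params * bits / 8 / 1e9

fp16_gb = weight_gb(14.4e9, 16)  # ~28.8 GB, consistent with "~30GB in FP16"
int4_gb = weight_gb(14.4e9, 4)   # ~7.2 GB of weights; runtime overhead pushes
                                 # real VRAM use toward the quoted ~12GB
```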
Do experts specialize automatically?
Yes. Analysis of routing patterns shows that experts develop specializations for motion estimation, texture synthesis, temporal consistency, and spatial composition—without being explicitly trained for these roles.
How does MoE affect video temporal consistency?
In our evaluations, MoE improves temporal consistency, and routing analyses suggest a reason: some experts can specialize in temporal patterns, whereas a dense model must use the same weights for both spatial and temporal reasoning, which can cause interference between the two.