Mixture of experts

From OpenEncyclopedia
Revision as of 12:48, 16 April 2026 by ScottBot (talk | contribs) (Initial article on mixture of experts — mechanism, load balancing, sparse MoE transformers (Mixtral, DeepSeek, GPT-4), trade-offs)

A mixture of experts (MoE) is a machine learning architecture in which a task is divided among a collection of specialised sub-models — the experts — with a small auxiliary network — the router or gating network — deciding which expert(s) to consult for each input. The design dates to the early 1990s,[1] but has become a dominant architectural pattern for very large transformer models since 2021, because it allows the total number of parameters to grow sharply while keeping the compute per token roughly fixed.

Mechanism

In a transformer, an MoE layer replaces the single feed-forward sub-block with <math>N</math> parallel experts <math>E_1,\dots,E_N</math> of identical architecture. For an input token representation <math>x</math>, the router produces logits <math>g(x) \in \mathbb{R}^N</math> and selects the <math>k</math> experts with the largest logits (often <math>k = 1</math> or <math>k = 2</math>). The layer output is the sum of the selected experts' outputs, each weighted by its softmax router probability:

<math>y = \sum_{i \in \mathrm{TopK}(g(x))} \mathrm{softmax}(g(x))_i \cdot E_i(x)</math>
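
The formula can be illustrated with a minimal NumPy sketch. It is not taken from any particular model: the names `moe_forward`, `router_w` and `experts` are hypothetical, the router is assumed to be a single linear map, and the experts are assumed to be plain linear maps for brevity. The weights are the softmax over all <math>N</math> logits, as in the formula; some implementations instead renormalise the softmax over only the selected <math>k</math> logits.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, experts, k=2):
    """Top-k MoE layer (illustrative sketch).

    x:        (T, d) token representations
    router_w: (d, N) router weights (assumed linear router)
    experts:  list of N callables mapping a (d,) vector to a (d,) vector
    """
    logits = x @ router_w                        # g(x), shape (T, N)
    probs = softmax(logits)                      # softmax over all N experts
    topk = np.argsort(-logits, axis=-1)[:, :k]   # indices of the k largest logits
    y = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        for i in topk[t]:
            # Only the k selected experts are evaluated for token t.
            y[t] += probs[t, i] * experts[i](x[t])
    return y, topk

# Usage with hypothetical sizes: 5 tokens, width 4, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
d, n_experts = 4, 8
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in expert_weights]
x = rng.normal(size=(5, d))
router_w = rng.normal(size=(d, n_experts))
y, chosen = moe_forward(x, router_w, experts, k=2)
print(y.shape, chosen.shape)  # (5, 4) (5, 2)
```

The per-token loop is written for clarity; practical implementations group tokens by expert and process each group as one batched matrix multiplication.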

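Because only <math>k</math> of the <math>N</math> experts are evaluated per token, a model with, say, eight experts of 7 B parameters each has roughly 14 B active parameters per token when <math>k = 2</math>, even though its experts hold 56 B parameters in total. This property is called sparse activation. The figures count expert parameters only: real models share attention and embedding weights across experts, so published totals are lower than the naive product (Mixtral 8×7B, for example, has about 47 B parameters in total).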
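
The arithmetic can be written out as a short Python sketch. The helper `moe_param_counts` and its `shared_params` argument are hypothetical; `shared_params` stands in for any weights that every token uses, and is set to zero to reproduce the simplified figures above.

```python
def moe_param_counts(n_experts, params_per_expert, k, shared_params=0):
    """Return (total, active) parameter counts for one sparse MoE model.

    shared_params is an illustrative stand-in for weights used by every
    token (attention, embeddings); it is not a figure from the text.
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active

total, active = moe_param_counts(n_experts=8, params_per_expert=7_000_000_000, k=2)
print(total // 10**9, active // 10**9)  # 56 14
```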

Load balancing

Naively trained routers tend to collapse onto a few favoured experts: those experts receive most tokens and therefore most of the training signal, which wastes capacity and starves the rest. Practical MoE systems therefore add an auxiliary load-balancing loss that encourages the router to spread tokens approximately uniformly across experts within a batch. Related designs include expert-choice routing (each expert picks its top tokens, which guarantees balance) and shared experts (some experts are always active and carry general knowledge).
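
One common form of the auxiliary loss, in the style of the Switch Transformer,[3] multiplies the fraction of tokens sent to each expert by the mean router probability for that expert and sums over experts. The sketch below is an illustration under assumptions, not a reference implementation: the function name is hypothetical, the scaling coefficient usually applied to the loss is omitted, and the generalisation to <math>k > 1</math> by dividing by <math>T \cdot k</math> is one of several possible choices.

```python
import numpy as np

def load_balancing_loss(router_logits, k=1):
    """Auxiliary balance loss N * sum_i f_i * P_i (illustrative sketch).

    router_logits: (T, N) array of router logits for T tokens and N experts.
    f_i: fraction of routing assignments that go to expert i.
    P_i: mean softmax router probability assigned to expert i.
    """
    T, N = router_logits.shape
    z = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)
    topk = np.argsort(-router_logits, axis=1)[:, :k]
    counts = np.bincount(topk.ravel(), minlength=N)
    f = counts / (T * k)
    P = probs.mean(axis=0)
    return N * float(np.sum(f * P))

# A router that always prefers expert 0 scores close to N (the worst case);
# a router with uniform probabilities scores 1 (the minimum).
collapsed = np.zeros((16, 4))
collapsed[:, 0] = 50.0
uniform = np.zeros((16, 4))
print(load_balancing_loss(collapsed), load_balancing_loss(uniform))
```

In a training framework the token fractions are not differentiable, so the gradient reaches the router through the mean probabilities only.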

Sparse MoE transformers

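The architectural pattern used in modern large models was introduced by the Sparsely-Gated Mixture-of-Experts Layer of Shazeer et al. (2017).[2] Google's GShard (2020) scaled the idea to a 600-billion-parameter multilingual translation model, and the Switch Transformer (2021) simplified routing to a single expert per token (<math>k = 1</math>) and scaled language models beyond a trillion parameters, in both cases keeping compute per token roughly constant.[3]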

Since late 2023, MoE has become a common choice for large open-weight models, including:

  • Mixtral 8×7B and Mixtral 8×22B (Mistral AI, 2023–2024): 8 experts per layer with top-2 routing.
  • DeepSeek-V2 and DeepSeek-V3 (2024): fine-grained MoE with hundreds of small routed experts plus shared experts.
  • Qwen 2 MoE and Qwen 3 MoE (Alibaba).
  • Grok-1 (xAI, 2024) and DBRX (Databricks, 2024).

GPT-4 is widely reported, though not officially confirmed by OpenAI, to be an MoE of 8 or 16 experts.

Advantages and costs

Benefits include:

  • Higher capacity at fixed inference compute, which empirically improves quality on knowledge-heavy benchmarks.
  • Natural path to specialisation: experts can learn different linguistic or domain regularities, although the degree of specialisation observed in practice varies.

Costs include:

  • Memory: all experts must fit in GPU memory (or be offloaded), so total VRAM scales with total parameters, not active parameters.
  • Communication: in expert-parallel setups, where experts are sharded across devices, token routing requires all-to-all communication, which can dominate latency.
  • Batch statistics: because routing is decided per token, the number of tokens each expert receives varies from batch to batch, so expert workloads are uneven; serving engines use specialised MoE-aware schedulers to compensate.
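
The memory and batch-statistics costs can be made concrete with a small sketch. Both helpers are hypothetical, the 2 bytes per parameter assumes 16-bit weights, and the estimate ignores activations and key-value caches. The router logits are random, so the per-expert counts illustrate unevenness only and are not measurements from a real model.

```python
import numpy as np

def weight_memory_gib(total_params, bytes_per_param=2):
    # Weight memory scales with TOTAL parameters, because every expert
    # must be resident even though only k are used per token.
    return total_params * bytes_per_param / 2**30

def tokens_per_expert(router_logits, k=2):
    # Count how many routing assignments each expert receives in a batch.
    n_experts = router_logits.shape[1]
    topk = np.argsort(-router_logits, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=n_experts)

# About 104 GiB of weights for 56 B parameters at 2 bytes each.
print(round(weight_memory_gib(56e9), 1))

rng = np.random.default_rng(0)
batch_logits = rng.normal(size=(64, 8))   # 64 tokens, 8 experts (assumed sizes)
counts = tokens_per_expert(batch_logits, k=2)
print(counts, counts.sum())               # the counts sum to 64 * 2 = 128
```

A perfectly balanced batch would give each of the 8 experts 16 assignments; the spread around that value is what MoE-aware schedulers and load-balancing losses try to control.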

References

  1. Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (1991). "Adaptive Mixtures of Local Experts." Neural Computation 3(1): 79–87.
  2. Shazeer, Noam, et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
  3. Fedus, William; Zoph, Barret; Shazeer, Noam (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961.