Mixture of experts

A mixture of experts (MoE) is a machine learning architecture in which a task is divided among a collection of specialised sub-models — the experts — with a small auxiliary network — the router or gating network — deciding which expert(s) to consult for each input. The design dates to the early 1990s,[1] but has become a dominant architectural pattern for very large transformer models since 2021, because it allows the total number of parameters to grow sharply while keeping the compute per token roughly fixed.

History

Origins (1991–2000s)

The MoE concept was introduced by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in 1991.[1] Their paper proposed a system of specialist networks, each handling a different region of the input space, coordinated by a gating network that was trained jointly with the experts by gradient descent on a competitive, mixture-style objective. An expectation–maximisation training procedure followed in Jordan and Jacobs's 1994 work on hierarchical mixtures of experts. The idea drew on the divide-and-conquer principle: rather than forcing one monolithic model to handle all inputs, let specialised modules each master a subset.

Through the 1990s and 2000s, MoE was primarily studied in the context of ensemble methods, Gaussian mixture models, and small-scale classification tasks. The approach remained a niche technique because contemporary models were small enough that dense networks sufficed.

Revival with scale (2017–2021)

The idea was revived for large neural networks by Noam Shazeer et al. in their 2017 paper "Outrageously Large Neural Networks," which introduced the Sparsely-Gated Mixture-of-Experts Layer. Inserted between stacked LSTM layers, it scaled language-modelling and translation models to as many as 137 billion parameters while using only a fraction of them per input.[2] Later work adopted the same kind of layer as a replacement for the transformer's feed-forward sub-block.

Google scaled the idea further with GShard (2020), which distributed MoE layers across thousands of TPU cores for translation, and the Switch Transformer (2021), which simplified routing to top-1 expert selection and scaled to over one trillion parameters.[3]

The MoE era (2023–present)

Since 2023, MoE has become a common architecture for frontier open-weight models, driven by evidence that sparse models can reach better quality than dense models trained with the same compute budget.

Mechanism

A classical MoE layer replaces a single feed-forward sub-block with <math>N</math> parallel experts <math>E_1,\dots,E_N</math> of the same architecture. For an input token representation <math>x</math>, the router produces logits <math>g(x) \in \mathbb{R}^N</math> and selects the top-<math>k</math> experts (often <math>k = 1</math> or <math>k = 2</math>). The layer output is the softmax-weighted sum of the chosen experts' outputs:

<math>y = \sum_{i \in \mathrm{TopK}(g(x))} \mathrm{softmax}(g(x))_i \cdot E_i(x)</math>

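Because only <math>k</math> of the <math>N</math> experts are evaluated per token, a model with, say, 8 experts of 7 B parameters each has an active parameter count of roughly 14 B when <math>k = 2</math> even though its total parameter count is 56 B, a property called sparse activation. This arithmetic ignores parameters that every token uses, such as attention layers and embeddings, which are counted only once: Mixtral 8×7B, for example, has about 47 B total and about 13 B active parameters.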
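
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the linear router, the toy experts, and the names `moe_layer` and `router_weights` are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, k=2):
    """Sparse MoE forward pass for one token vector x.

    router_weights: one weight vector per expert (hypothetical linear router).
    experts: list of callables mapping a vector to a vector.
    """
    # Router logits g(x): one dot product per expert.
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    probs = softmax(logits)  # softmax over all N experts, as in the formula
    # Indices of the k largest logits.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for i in top:  # only k of the N experts are evaluated
        out = experts[i](x)
        y = [acc + probs[i] * o for acc, o in zip(y, out)]
    return y, top

# Toy experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y, chosen = moe_layer([2.0, 1.0], router_weights, experts, k=2)
print(chosen)  # indices of the two selected experts
```

Because the softmax here is taken over all <math>N</math> logits, the weights of the selected experts do not sum to one. Some implementations, Mixtral among them, instead apply the softmax to the <math>k</math> selected logits only.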

Routing strategies

The choice of routing algorithm profoundly affects model quality, training stability, and hardware efficiency.

Top-k routing

The standard approach: the gating network scores all experts and selects the <math>k</math> highest-scoring ones. Top-1 (Switch Transformer) minimises compute but can be unstable; top-2 (Mixtral) balances quality and cost.

Expert-choice routing

Introduced by Zhou et al. (2022), expert-choice routing inverts the selection: each expert selects its top-<math>c</math> tokens from the batch, guaranteeing perfect load balance by construction.[4] This eliminates the need for auxiliary balancing losses but requires fixed-size expert buffers.
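
The inversion can be illustrated with a small sketch in which every expert picks a fixed number of tokens. The score matrix and the function name `expert_choice` are hypothetical, and a real implementation would operate on batched tensors.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Returns assignment[e], the list of token indices picked by expert e.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignment = []
    for e in range(n_experts):
        # Each expert ranks all tokens by its own score column.
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment.append(ranked[:capacity])  # exactly `capacity` tokens each
    return assignment

# Four tokens, two experts (illustrative values only).
scores = [
    [0.9, 0.8],
    [0.8, 0.1],
    [0.1, 0.7],
    [0.2, 0.3],
]
assignment = expert_choice(scores, capacity=2)
print(assignment)
```

In this toy batch every expert receives exactly two tokens, but token 0 is processed by both experts and token 3 by none. The number of experts per token is variable under this scheme.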

Shared experts

Shared experts, introduced in DeepSeekMoE (2024) and carried into DeepSeek-V2 (2024), are a subset of experts that are always active for every token and carry general-purpose knowledge, while the remaining experts are routed sparsely. This hybrid approach is reported to stabilise training and improve quality on knowledge-heavy tasks.
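
A minimal sketch of the idea follows, assuming the router logits for the routed experts are already available. The function `shared_plus_routed` and the scaling experts are hypothetical and omit details such as residual connections.

```python
import math

def shared_plus_routed(x, shared, routed, logits, k):
    """Combine always-on shared experts with top-k routed experts.

    logits are router scores for the routed experts only (assumed given).
    """
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(routed)), key=lambda i: logits[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for expert in shared:   # every shared expert sees every token, ungated
        y = [a + o for a, o in zip(y, expert(x))]
    for i in top:           # routed experts are gated as in top-k routing
        y = [a + probs[i] * o for a, o in zip(y, routed[i](x))]
    return y, top

def scale(s):
    """Toy expert that multiplies its input by s."""
    return lambda v: [s * t for t in v]

shared = [scale(1.0)]
routed = [scale(10.0), scale(20.0), scale(30.0)]
y, top = shared_plus_routed([1.0], shared, routed, logits=[0.0, 5.0, 1.0], k=1)
print(top, y)
```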

Soft MoE

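Soft MoE (Puigcerver et al., 2023) replaces discrete top-k routing with a fully differentiable soft assignment: each expert processes a fixed number of slots, each slot is a weighted combination of all input tokens, and each output token is a weighted combination of all slot outputs.[5] Load imbalance and token dropping cannot occur, and per-expert compute is fixed by the slot count. The cost is that every slot mixes information from all tokens, which makes the method difficult to apply to autoregressive decoding.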
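
The soft assignment can be sketched as two softmaxes over the same token-slot score matrix, one across tokens (dispatch) and one across slots (combine). The sketch below assumes one slot per expert and takes the scores as given; the names `soft_moe` and `slot_logits` are illustrative.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe(tokens, slot_logits, experts):
    """tokens: list of m vectors; slot_logits[t][s]: token-slot affinity;
    experts: one callable per slot (one slot per expert, for brevity)."""
    m, n_slots, d = len(tokens), len(experts), len(tokens[0])
    # Dispatch: normalise over tokens, so each slot is a convex mix of tokens.
    dispatch = [softmax([slot_logits[t][s] for t in range(m)])
                for s in range(n_slots)]
    slots = [[sum(dispatch[s][t] * tokens[t][j] for t in range(m))
              for j in range(d)] for s in range(n_slots)]
    outs = [experts[s](slots[s]) for s in range(n_slots)]  # one call per slot
    # Combine: normalise over slots, so each output mixes all slot outputs.
    result = []
    for t in range(m):
        combine = softmax(slot_logits[t])
        result.append([sum(combine[s] * outs[s][j] for s in range(n_slots))
                       for j in range(d)])
    return result

tokens = [[0.0], [2.0], [4.0]]
slot_logits = [[0.0, 0.0] for _ in tokens]      # uniform affinities
experts = [lambda v: list(v), lambda v: list(v)]  # identity experts
result = soft_moe(tokens, slot_logits, experts)
print(result)
```

With uniform affinities and identity experts, every slot holds the mean of the tokens and every output equals that mean, which makes the mixing easy to verify by hand.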

Fine-grained routing

DeepSeek-V3 (released December 2024) uses fine-grained MoE with 256 small routed experts per layer (rather than 8–16 large ones) and top-8 routing, which allows more specialised experts and a smoother load distribution.
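
One way to see why finer granularity helps is to count the distinct expert combinations a token can be routed to. The short calculation below compares the two configurations mentioned above. It is a counting argument only and says nothing about how many combinations a trained router actually uses.

```python
import math

coarse = math.comb(8, 2)    # top-2 out of 8 large experts
fine = math.comb(256, 8)    # top-8 out of 256 small experts

print(coarse)               # 28
print(fine > 10**14)        # True
```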

Load balancing

Naive training tends to collapse to a few favoured experts, wasting capacity and starving the rest. Practical MoE systems therefore add an auxiliary load-balancing loss that encourages the router to spread tokens approximately uniformly across experts within a batch.

The standard formulation (from Switch Transformer) adds a penalty proportional to the sum, over experts, of the product of each expert's fraction of tokens received and its average routing probability. The penalty is smallest when tokens are spread uniformly and grows when a few experts receive disproportionately many tokens. The loss weight is a hyperparameter: too large a value degrades quality, too small a value allows collapse.
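
A sketch of this auxiliary loss for top-1 routing is shown below. It takes per-token router probabilities as input; the function name and the default weight `alpha=0.01` are illustrative choices, not prescribed values.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs[t][i]: softmax probability of expert i for token t.

    Returns alpha * N * sum_i f_i * P_i, where f_i is the fraction of
    tokens whose top-1 expert is i and P_i is the mean probability of i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    for p in router_probs:
        counts[max(range(n_experts), key=lambda i: p[i])] += 1
    f = [c / n_tokens for c in counts]
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced < collapsed)  # True
```

In the balanced batch each expert receives half the tokens and the loss equals `alpha`; in the collapsed batch one expert receives everything and the loss is higher.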

Sparse MoE transformers

Notable sparse MoE transformer releases include:

  • Mixtral 8×7B and Mixtral 8×22B (Mistral AI, 2023–2024): 8 experts per layer with top-2 routing. Mixtral 8×7B matched or exceeded Llama 2 70B on most benchmarks while using only ~13B active parameters.
  • DeepSeek-V2 (2024): 160 fine-grained routed experts per layer plus shared experts and multi-head latent attention; its authors report top-tier results among open-weight models of the time at a fraction of the training cost of a comparable dense model.
  • DeepSeek-V3 (2024): 256 routed experts per layer, top-8 routing, multi-token prediction objective, with a reported cost of about $5.6M for the final training run, widely regarded as a landmark in cost-efficient frontier model training.
  • Qwen 2 MoE and Qwen 3 MoE (Alibaba, 2024–2025): production-grade MoE models with open weights.
  • Grok-1 (xAI, 2024): 314B total parameters, 8 experts, open-weights under Apache 2.0.
  • DBRX (Databricks, 2024): 132B total, 16 experts with top-4 routing.
  • Llama 4 Maverick and Llama 4 Scout (Meta, 2025): Meta's first MoE releases, with Scout using 16 experts and a 10-million-token context window.

GPT-4 is widely believed — though not officially confirmed — to be an MoE of 8 or 16 experts, with a rumoured total parameter count of ~1.76 trillion.

Inference and serving

MoE models present unique challenges for inference:

Memory requirements

All experts must fit in memory (or be available for rapid loading), so total VRAM scales with total parameters, not active parameters. A 56B-total MoE model requires roughly the same memory as a 56B dense model, despite computing like a 14B model.
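
The gap between stored and active parameters can be made concrete with a back-of-the-envelope calculation. The helper below is a rough sketch that counts weight storage only and ignores the KV cache, activations, and runtime overhead; the function name is hypothetical.

```python
def weight_memory_gb(total_params, bits_per_param=16):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

moe_total, moe_active = 56e9, 14e9
resident = weight_memory_gb(moe_total)   # weights that must be stored
touched = weight_memory_gb(moe_active)   # weights one token actually reads

print(resident, touched)  # 112.0 28.0
```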

Expert parallelism

In multi-GPU serving, expert parallelism distributes different experts across different devices. Each token's routing decision triggers all-to-all communication — tokens must be sent to whichever device holds their assigned expert, and results must be returned. This communication overhead can dominate latency, especially at low batch sizes.
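
The dispatch step can be illustrated without any distributed runtime by grouping each token's expert choices by the device that hosts the expert. The placement map and the function `dispatch_plan` below are hypothetical; a real system would perform the exchange with a collective all-to-all operation.

```python
from collections import defaultdict

def dispatch_plan(token_experts, expert_to_device, source_device):
    """Group (token, expert) pairs by the device that hosts the expert.

    Returns the per-device plan and the number of pairs that must leave
    source_device (each implies a send and a matching return).
    """
    plan = defaultdict(list)
    for token, experts in enumerate(token_experts):
        for e in experts:
            plan[expert_to_device[e]].append((token, e))
    remote = sum(len(v) for dev, v in plan.items() if dev != source_device)
    return dict(plan), remote

expert_to_device = {0: 0, 1: 0, 2: 1, 3: 1}  # 4 experts on 2 devices (assumed)
token_experts = [[0, 2], [1, 3], [2, 3]]      # top-2 choices for three tokens
plan, remote = dispatch_plan(token_experts, expert_to_device, source_device=0)
print(remote)  # 4
```

Four of the six token-expert pairs in this toy batch cross the device boundary, which is why communication cost matters even for small batches.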

Offloading

For consumer hardware, expert offloading keeps only some experts in GPU VRAM and loads the others from CPU RAM or SSD on demand. Runtimes such as llama.cpp can hold part of a model's weights in CPU memory while the rest runs on the GPU; research systems additionally cache recently used experts and pre-fetch those predicted to be needed next, reducing the latency penalty.
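
One simple policy for deciding which experts stay resident is a least-recently-used cache. The sketch below is a generic illustration, not the mechanism of any named library; `ExpertCache` and its `load_fn` callback are hypothetical, and no prefetching is modelled.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps at most `capacity` experts resident; evicts least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader: expert id -> weights
        self.resident = OrderedDict()
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
        else:
            self.misses += 1                       # slow path: fetch from host
            self.resident[expert_id] = self.load_fn(expert_id)
            if len(self.resident) > self.capacity:
                self.resident.popitem(last=False)  # evict the oldest entry
        return self.resident[expert_id]

cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 0, 1]:
    cache.get(expert_id)
print(cache.misses)  # 4
```

Each miss stands for a transfer from CPU RAM or SSD, so the miss count is a rough proxy for the latency penalty of a given routing trace.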

Quantisation

MoE models benefit particularly from quantisation (reducing parameter precision from 16-bit to 4-bit or lower), because the memory savings apply to the large total parameter count while active compute remains sparse. This makes models like Mixtral 8×7B runnable on consumer GPUs in quantised form.
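
As a rough illustration of what reducing precision involves, the sketch below applies symmetric absmax quantisation to a handful of weights using a single scale. Production quantisers typically use per-group scales and more elaborate schemes; the function names here are hypothetical.

```python
def quantise_int4(weights):
    """Symmetric absmax quantisation to integers in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Map quantised integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.7, -0.35, 0.1, 0.0]
q, scale = quantise_int4(weights)
restored = dequantise(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each weight is stored as a 4-bit integer plus a shared scale, a quarter of the 16-bit footprint, at the price of a rounding error of at most half a quantisation step.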

Advantages and costs

Benefits include:

  • Higher capacity at fixed inference compute: empirically improves quality on knowledge-heavy benchmarks, because the total parameter count acts as a knowledge store.
  • Natural specialisation: experts learn different linguistic, domain, or syntactic regularities without explicit supervision.
  • Training efficiency: MoE models achieve a given quality level with fewer training FLOPs than equivalent dense models, because each token trains only a subset of parameters.

Costs include:

  • Memory: total parameters, not active parameters, determine memory requirements.
  • Communication: expert parallelism requires all-to-all communication, which can bottleneck throughput.
  • Batch sensitivity: per-token routing makes batch composition uneven; serving engines need specialised MoE-aware schedulers.
  • Fine-tuning difficulty: fine-tuning MoE models can be unstable because gradient updates are sparse (each example only updates the activated experts), and routing decisions may shift during fine-tuning.

Scaling laws

Empirical studies suggest that MoE models follow modified scaling laws: for a fixed compute budget, increasing the number of experts (and thus total parameters) improves performance, but with diminishing returns beyond a certain expert count. The optimal ratio of total-to-active parameters depends on the task distribution and available memory.[6]

References

  1. Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (1991). "Adaptive Mixtures of Local Experts." Neural Computation 3(1): 79–87.
  2. Shazeer, Noam, et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
  3. Fedus, William; Zoph, Barret; Shazeer, Noam (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961.
  4. Zhou, Yanqi, et al. (2022). "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022.
  5. Puigcerver, Joan, et al. (2023). "From Sparse to Soft Mixtures of Experts." ICLR 2024.
  6. Clark, Aidan, et al. (2022). "Unified Scaling Laws for Routed Language Models." ICML 2022.