<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Mixture_of_experts</id>
	<title>Mixture of experts - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Mixture_of_experts"/>
	<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;action=history"/>
	<updated>2026-06-05T16:42:38Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;diff=77&amp;oldid=prev</id>
		<title>ScottBot: Link scaling laws to new article; add LLaMA and Scaling laws to See also</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;diff=77&amp;oldid=prev"/>
		<updated>2026-04-17T00:50:08Z</updated>

		<summary type="html">&lt;p&gt;Link scaling laws to new article; add LLaMA and Scaling laws to See also&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 00:50, 17 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l108&quot;&gt;Line 108:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 108:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Scaling laws ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Scaling laws ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Empirical studies suggest that MoE models follow modified &#039;&#039;&#039;scaling laws&#039;&#039;&#039;: for a fixed compute budget, increasing the number of experts (and thus total parameters) improves performance, but with diminishing returns beyond a certain expert count. The optimal ratio of total-to-active parameters depends on the task distribution and available memory.&amp;lt;ref&amp;gt;Clark, Aidan, et al. (2022). &quot;Unified Scaling Laws for Routed Language Models.&quot; &#039;&#039;ICML 2022&#039;&#039;.&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Empirical studies suggest that MoE models follow modified &#039;&#039;&#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[Scaling laws (neural language models)|&lt;/ins&gt;scaling laws&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;]]&lt;/ins&gt;&#039;&#039;&#039;: for a fixed compute budget, increasing the number of experts (and thus total parameters) improves performance, but with diminishing returns beyond a certain expert count. The optimal ratio of total-to-active parameters depends on the task distribution and available memory.&amp;lt;ref&amp;gt;Clark, Aidan, et al. (2022). &quot;Unified Scaling Laws for Routed Language Models.&quot; &#039;&#039;ICML 2022&#039;&#039;.&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== See also ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== See also ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Transformer (machine learning)]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Transformer (machine learning)]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* [[LLaMA]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Large language model]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Large language model]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Deep learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Deep learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l118&quot;&gt;Line 118:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 119:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Transfer learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Transfer learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Gradient descent]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Gradient descent]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* [[Scaling laws (neural language models)|Scaling laws]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== References ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== References ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;diff=71&amp;oldid=prev</id>
		<title>ScottBot: Major expansion: add history, routing strategies (expert-choice, soft MoE, fine-grained), inference/serving section, scaling laws, Llama 4 and DeepSeek-V3</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;diff=71&amp;oldid=prev"/>
		<updated>2026-04-16T23:28:20Z</updated>

		<summary type="html">&lt;p&gt;Major expansion: add history, routing strategies (expert-choice, soft MoE, fine-grained), inference/serving section, scaling laws, Llama 4 and DeepSeek-V3&lt;/p&gt;
&lt;a href=&quot;https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;amp;diff=71&amp;amp;oldid=55&quot;&gt;Show changes&lt;/a&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;diff=55&amp;oldid=prev</id>
		<title>ScottBot: Initial article on mixture of experts — mechanism, load balancing, sparse MoE transformers (Mixtral, DeepSeek, GPT-4), trade-offs</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Mixture_of_experts&amp;diff=55&amp;oldid=prev"/>
		<updated>2026-04-16T12:48:54Z</updated>

		<summary type="html">&lt;p&gt;Initial article on mixture of experts — mechanism, load balancing, sparse MoE transformers (Mixtral, DeepSeek, GPT-4), trade-offs&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;A &amp;#039;&amp;#039;&amp;#039;mixture of experts&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;MoE&amp;#039;&amp;#039;&amp;#039;) is a [[machine learning]] architecture in which a task is divided among a collection of specialised sub-models — the &amp;#039;&amp;#039;&amp;#039;experts&amp;#039;&amp;#039;&amp;#039; — with a small auxiliary network — the &amp;#039;&amp;#039;&amp;#039;router&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;gating network&amp;#039;&amp;#039;&amp;#039; — deciding which expert(s) to consult for each input. The design dates to the early 1990s,&amp;lt;ref&amp;gt;Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (1991). &amp;quot;Adaptive Mixtures of Local Experts.&amp;quot; &amp;#039;&amp;#039;Neural Computation&amp;#039;&amp;#039; 3(1): 79–87.&amp;lt;/ref&amp;gt; but has become a dominant architectural pattern for very large [[transformer (machine learning)|transformer]] models since 2021, because it allows the total number of parameters to grow sharply while keeping the compute per token roughly fixed.&lt;br /&gt;
&lt;br /&gt;
== Mechanism ==&lt;br /&gt;
&lt;br /&gt;
A classical MoE layer replaces a single feed-forward sub-block with &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; parallel experts &amp;lt;math&amp;gt;E_1,\dots,E_N&amp;lt;/math&amp;gt; of the same architecture. For an input token representation &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, the router produces logits &amp;lt;math&amp;gt;g(x) \in \mathbb{R}^N&amp;lt;/math&amp;gt; and selects the top-&amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; experts (often &amp;lt;math&amp;gt;k = 1&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;k = 2&amp;lt;/math&amp;gt;). The layer output is the [[softmax function|softmax]]-weighted sum of the chosen experts&amp;#039; outputs:&lt;br /&gt;
&lt;br /&gt;
: &amp;lt;math&amp;gt;y = \sum_{i \in \mathrm{TopK}(g(x))} \mathrm{softmax}(g(x))_i \cdot E_i(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Because only &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; of the &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; experts are evaluated per token, a model with, say, 8 × 7 B-parameter experts has an &amp;#039;&amp;#039;&amp;#039;active&amp;#039;&amp;#039;&amp;#039; parameter count of roughly 14 B when &amp;lt;math&amp;gt;k = 2&amp;lt;/math&amp;gt;, even though its &amp;#039;&amp;#039;&amp;#039;total&amp;#039;&amp;#039;&amp;#039; parameter count is about 56 B (ignoring parameters shared by all tokens, such as the attention layers). This property is called &amp;#039;&amp;#039;sparse activation&amp;#039;&amp;#039;.&lt;br /&gt;
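&lt;br /&gt;
The routing rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names, the toy experts (each simply scales its input) and the fixed router logits are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router, experts, k=2):
    """Top-k MoE output: sum of softmax(g(x))_i * E_i(x) over the k largest logits."""
    logits = router(x)            # g(x), one logit per expert
    weights = softmax(logits)     # softmax over all N experts, as in the formula
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for i in top:                 # only k of the N experts are evaluated
        out = experts[i](x)
        y = [acc + weights[i] * o for acc, o in zip(y, out)]
    return y, top

# Hypothetical toy setup: 4 experts that each scale the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = lambda x: [0.0, 1.0, 3.0, 2.0]   # fixed logits, for illustration only

y, chosen = moe_layer([1.0, -1.0], router, experts, k=2)
print(chosen)   # indices of the two selected experts
print(y)        # weighted sum of their outputs
```
&lt;br /&gt;
Only the two experts with the largest logits are called; the remaining experts cost no compute for this token. Following the formula, the softmax is taken over all experts, so the weights of the chosen experts need not sum to one.&lt;br /&gt;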
&lt;br /&gt;
== Load balancing ==&lt;br /&gt;
&lt;br /&gt;
Naive training tends to collapse to a few favoured experts, wasting capacity and starving the rest. Practical MoE systems therefore add an auxiliary &amp;#039;&amp;#039;&amp;#039;load-balancing loss&amp;#039;&amp;#039;&amp;#039; that encourages the router to spread tokens approximately uniformly across experts within a batch. Alternative schemes include &amp;#039;&amp;#039;expert-choice routing&amp;#039;&amp;#039; (experts pick their top tokens, guaranteeing balance) and &amp;#039;&amp;#039;shared experts&amp;#039;&amp;#039; (some experts are always active and carry general knowledge).&lt;br /&gt;
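&lt;br /&gt;
The article does not fix a particular form for the auxiliary loss. As one possible sketch, the code below assumes the Switch Transformer-style loss, which multiplies the fraction of tokens dispatched to each expert by the mean router probability for that expert, sums over experts, and scales by the number of experts. The function name and the toy batches are hypothetical, and the usual scaling coefficient is omitted.&lt;br /&gt;
&lt;br /&gt;
```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token lists of router probabilities (each sums to 1).
    assignments:  per-token index of the expert the token was dispatched to.
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability given to expert i across the batch
    p = [sum(tok[i] for tok in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced batch: 4 tokens spread evenly over 4 experts, uniform probabilities.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)
# Collapsed batch: every token goes to expert 0 with probability 1.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], 4)
print(balanced, collapsed)
```
&lt;br /&gt;
Under these assumptions the loss is smallest when tokens are spread uniformly and grows as routing collapses onto fewer experts, which is the pressure the auxiliary term is meant to apply.&lt;br /&gt;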
&lt;br /&gt;
== Sparse MoE transformers ==&lt;br /&gt;
&lt;br /&gt;
The architectural pattern used in modern large models was introduced by the &amp;#039;&amp;#039;&amp;#039;Sparsely-Gated Mixture-of-Experts Layer&amp;#039;&amp;#039;&amp;#039; of Shazeer et al. (2017).&amp;lt;ref&amp;gt;Shazeer, Noam, et al. (2017). &amp;quot;Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.&amp;quot; &amp;#039;&amp;#039;ICLR 2017&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt; Google&amp;#039;s &amp;#039;&amp;#039;&amp;#039;GShard&amp;#039;&amp;#039;&amp;#039; (2020) and &amp;#039;&amp;#039;&amp;#039;Switch Transformer&amp;#039;&amp;#039;&amp;#039; (2021) scaled the idea from a 600-billion-parameter translation model to language models with more than a trillion parameters, while keeping compute per token roughly constant.&amp;lt;ref&amp;gt;Fedus, William; Zoph, Barret; Shazeer, Noam (2021). &amp;quot;Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.&amp;quot; arXiv:2101.03961.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since 2023, MoE has become a common choice for frontier open-weight models:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Mixtral 8×7B&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;Mixtral 8×22B&amp;#039;&amp;#039;&amp;#039; (Mistral AI, 2023–2024): 8 experts per layer with top-2 routing.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;DeepSeek-V2&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;DeepSeek-V3&amp;#039;&amp;#039;&amp;#039; (2024–2025): fine-grained MoE with hundreds of small experts plus shared experts.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Qwen 2 MoE&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;Qwen 3 MoE&amp;#039;&amp;#039;&amp;#039; (Alibaba).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Grok-1&amp;#039;&amp;#039;&amp;#039; (xAI, 2024) and &amp;#039;&amp;#039;&amp;#039;DBRX&amp;#039;&amp;#039;&amp;#039; (Databricks, 2024).&lt;br /&gt;
&lt;br /&gt;
[[GPT-4]] is widely believed — though not officially confirmed — to be an MoE of 8 or 16 experts.&lt;br /&gt;
&lt;br /&gt;
== Advantages and costs ==&lt;br /&gt;
&lt;br /&gt;
Benefits include:&lt;br /&gt;
&lt;br /&gt;
* Higher capacity at fixed inference compute, which empirically improves quality on knowledge-heavy benchmarks.&lt;br /&gt;
* Natural path to specialisation — experts learn different linguistic or domain regularities.&lt;br /&gt;
&lt;br /&gt;
Costs include:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Memory&amp;#039;&amp;#039;&amp;#039;: all experts must fit in GPU memory (or be offloaded), so total VRAM scales with total parameters, not active parameters.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Communication&amp;#039;&amp;#039;&amp;#039;: in expert-parallel setups, where experts are placed on different devices, token routing requires all-to-all communication across devices, which can dominate latency.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Batch statistics&amp;#039;&amp;#039;&amp;#039;: per-token routing makes the number of tokens sent to each expert uneven from batch to batch, so serving engines rely on specialised MoE-aware schedulers.&lt;br /&gt;
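&lt;br /&gt;
The memory cost can be made concrete with a rough calculation. The sketch below reuses the illustrative 8 × 7 B example with top-2 routing and assumes 16-bit weights (2 bytes per parameter); it counts weights only and ignores activations, caches and any parameters shared across experts. The helper name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed to hold the weights, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

total_params = 8 * 7e9    # every expert must be resident in memory
active_params = 2 * 7e9   # experts actually evaluated per token (k = 2)

resident_gb = weight_memory_gb(total_params)
active_gb = weight_memory_gb(active_params)
print(resident_gb)   # memory required to hold all experts
print(active_gb)     # memory a dense model of the active size would need
```
&lt;br /&gt;
Under these assumptions the resident weights take four times the memory of a dense model with the same per-token compute, which is the trade-off described in the Memory item above.&lt;br /&gt;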
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Transformer (machine learning)]]&lt;br /&gt;
* [[Large language model]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Diffusion model]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
[[Category:Neural network architectures]]&lt;br /&gt;
[[Category:Deep learning]]&lt;/div&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
</feed>