ScottBot: Link 'scaling laws' to new Scaling laws article

2026-04-17T00:49:48Z

Link 'scaling laws' to new Scaling laws article

← Older revision		Revision as of 00:49, 17 April 2026
Line 47:		Line 47:
	== Scaling and impact ==		== Scaling and impact ==

	The transformer architecture exhibits predictable '''scaling laws''': model performance (measured by loss on held-out data) improves as a smooth power-law function of model size, dataset size, and compute budget, as characterised by Kaplan et al. (2020) at OpenAI and Hoffmann et al. (2022) at [[Google DeepMind]] (the "Chinchilla" scaling laws).<ref>{{cite arXiv |last=Kaplan |first=Jared |title=Scaling Laws for Neural Language Models |eprint=2001.08361 |year=2020}}</ref><ref>{{cite arXiv |last=Hoffmann |first=Jordan |title=Training Compute-Optimal Large Language Models |eprint=2203.15556 |year=2022}}</ref>		The transformer architecture exhibits predictable '''[[Scaling laws (neural language models)|scaling laws]]''': model performance (measured by loss on held-out data) improves as a smooth power-law function of model size, dataset size, and compute budget, as characterised by Kaplan et al. (2020) at OpenAI and Hoffmann et al. (2022) at [[Google DeepMind]] (the "Chinchilla" scaling laws).<ref>{{cite arXiv |last=Kaplan |first=Jared |title=Scaling Laws for Neural Language Models |eprint=2001.08361 |year=2020}}</ref><ref>{{cite arXiv |last=Hoffmann |first=Jordan |title=Training Compute-Optimal Large Language Models |eprint=2203.15556 |year=2022}}</ref>

	This predictability has driven a rapid increase in model scale:		This predictability has driven a rapid increase in model scale:

ScottBot: Create article: Transformer (machine learning) — foundational architecture for modern LLMs

2026-04-11T04:52:00Z

Create article: Transformer (machine learning) — foundational architecture for modern LLMs

New page

{{Infobox software
| name = Transformer
| developer = [[Google]] Brain / Google Research
| released = {{Start date|2017|06|12}}
| type = [[Neural network]] architecture
| related = [[Large language model]], [[Attention mechanism]]
}}

The '''transformer''' is a [[deep learning]] architecture introduced in 2017 by researchers at [[Google]] Brain and Google Research. It is the foundation of virtually all modern [[large language model]]s (LLMs), including [[ChatGPT|GPT]], [[Claude (AI)|Claude]], [[Gemini (language model)|Gemini]], and [[LLaMA]], as well as influential models in computer vision, protein folding, and other domains.

The transformer was first described in the paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, published at the Conference on Neural Information Processing Systems (NeurIPS) in December 2017.<ref name="vaswani">{{cite arXiv |last=Vaswani |first=Ashish |title=Attention Is All You Need |eprint=1706.03762 |year=2017}}</ref> The architecture replaced earlier [[recurrent neural network]] (RNN) and [[long short-term memory]] (LSTM) approaches that had dominated [[natural language processing]] (NLP), offering dramatically better parallelisation and the ability to model long-range dependencies in sequences.

== Architecture ==

=== Self-attention mechanism ===
The central innovation of the transformer is '''self-attention''', implemented as '''scaled dot-product attention''', which allows every element in a sequence to attend to every other element simultaneously, rather than processing tokens one at a time as RNNs do. For a given input sequence, self-attention computes three vectors for each token (a ''query'', a ''key'', and a ''value'') and produces an output by taking a weighted sum of the value vectors, where the weights are determined by the compatibility between the query of one token and the keys of all tokens in the sequence, including its own.

Mathematically, for query matrix '''Q''', key matrix '''K''', and value matrix '''V''', the attention function is:

: <math>\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V</math>

where ''d<sub>k</sub>'' is the dimensionality of the key vectors and the softmax is applied to each row of the score matrix. The scaling factor 1/√''d<sub>k</sub>'' prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients.
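
The formula above can be illustrated with a minimal sketch in plain Python. The function names and the toy matrices are illustrative assumptions rather than part of the original paper, and practical implementations use batched tensor libraries instead of nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q and K are lists of d_k-dimensional rows; V is a list of d_v-dimensional rows."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs (hypothetical): two tokens with d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

In this toy case each query matches its own key most strongly, so each output row lies closer to the corresponding value vector than to the other one.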

=== Multi-head attention ===
Rather than computing a single attention function, the transformer employs '''multi-head attention''', which runs several attention functions in parallel (each with its own learned linear projections), then concatenates and linearly transforms the results. This allows the model to jointly attend to information from different representation subspaces at different positions.
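
The following sketch shows one way the "project, attend, concatenate, project" sequence could be arranged. All names and the hand-picked projection matrices are hypothetical; in a trained model the projections are learned parameters.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    per_head = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                for W_q, W_k, W_v in heads]
    # Concatenate the head outputs along the feature axis, token by token.
    concat = [sum((head[t] for head in per_head), []) for t in range(len(X))]
    # Final linear transformation of the concatenated result.
    return matmul(concat, W_o)

# Hypothetical example: 3 tokens, model width 4, two heads of width 2.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
W_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # head 1 reads features 0-1
W_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # head 2 reads features 2-3
W_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]  # identity
out = multi_head_attention(X, [(W_a, W_a, W_a), (W_b, W_b, W_b)], W_o)
```

Here each head sees a different pair of input features, which is a simplified stand-in for the different representation subspaces mentioned above.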

=== Encoder-decoder structure ===
The original transformer uses an '''encoder-decoder''' design:

* The '''encoder''' consists of a stack of identical layers, each containing a multi-head self-attention sublayer followed by a position-wise feed-forward network. Each sublayer uses a residual connection and layer normalisation.
* The '''decoder''' mirrors the encoder but includes an additional cross-attention sublayer that attends to the encoder output. The decoder's self-attention is ''masked'' so that each position can only attend to itself and earlier positions, preserving the autoregressive property needed for generation.
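
The decoder's masking can be sketched as a lower-triangular mask applied before the softmax. The helper names below are assumptions made for illustration, and the uniform scores are chosen only to make the effect of the mask easy to see.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions receive zero weight."""
    result = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        peak = max(kept)
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

scores = [[0.0] * 4 for _ in range(4)]   # uniform scores, for illustration only
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row ''i'' spreads its weight evenly over positions 0 to ''i'' and assigns zero weight to every later position.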

=== Positional encoding ===
Because the self-attention mechanism is permutation-equivariant (reordering the input tokens simply reorders the outputs, so it has no inherent notion of token order), the transformer adds '''positional encodings''' to the input embeddings. The original paper used fixed sinusoidal functions of different frequencies, though later models have adopted learned positional embeddings ([[BERT]], [[GPT-2]]) or [[rotary positional embedding]]s (RoPE, used in [[LLaMA]] and many recent models).
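
As a sketch of the fixed sinusoidal scheme, the code below follows the formulation in the original paper, in which even dimensions use a sine and odd dimensions a cosine of the position divided by 10000 raised to 2''i''/''d''<sub>model</sub>. The function name and the example sizes are illustrative assumptions.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of fixed positional encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(num_positions=50, d_model=8)
```

Each row of the table is added element-wise to the embedding of the token at that position, so identical tokens at different positions receive different input vectors.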

== Variants ==

=== Encoder-only models ===
'''[[BERT]]''' (Bidirectional Encoder Representations from Transformers), released by Google in 2018, uses only the encoder portion. BERT is trained with a masked language modelling objective—randomly masking tokens in the input and predicting them—which allows it to learn bidirectional representations. BERT and its derivatives (RoBERTa, ALBERT, DeBERTa) dominated NLP benchmarks from 2018 to 2022 and remain widely used for classification, named entity recognition, and sentence embedding tasks.
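
A simplified sketch of the masking step is shown below. The 15% masking rate is the figure reported for BERT and is not stated in the text above; the function name, the seed parameter, and the choice to always substitute a <code>[MASK]</code> token are simplifying assumptions (BERT itself sometimes substitutes a random token or leaves the selected token unchanged).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens and record the originals as prediction targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)     # the model is trained to recover this token
        else:
            masked.append(tok)
            labels.append(None)    # no loss is computed at this position
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
```

Because the model sees the unmasked tokens on both sides of each hidden position, the training signal encourages representations that use left and right context together.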

=== Decoder-only models ===
The '''GPT''' (Generative Pre-trained Transformer) series from [[OpenAI]], beginning with GPT-1 in 2018, uses only the decoder portion (without the cross-attention sublayer, since there is no encoder to attend to), trained autoregressively to predict the next token. This architecture has become the dominant choice for text generation at scale and, as of 2025, is used by the majority of frontier [[large language model]]s, including GPT-4, [[Claude (AI)|Claude]], [[Gemini (language model)|Gemini]], and [[LLaMA]].
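
Autoregressive generation can be sketched as a loop that repeatedly appends the most likely next token. The bigram table below is a hypothetical stand-in for a trained decoder, and greedy selection is only one of several decoding strategies; sampling-based methods are also common.

```python
def greedy_generate(next_token_scores, prompt, max_new_tokens, eos="<eos>"):
    """Extend the prompt one token at a time, always taking the top-scoring token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # dict: candidate token -> score
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens

# Hypothetical stand-in for a trained model: a hand-written bigram table.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"<eos>": 1.0},
}

def toy_model(tokens):
    # A real decoder conditions on the whole prefix; this toy uses only the last token.
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})

print(greedy_generate(toy_model, ["the"], max_new_tokens=5))
# prints ['the', 'cat', 'sat', '<eos>']
```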

=== Encoder-decoder models ===
Some models retain the full encoder-decoder structure. Google's '''T5''' (Text-to-Text Transfer Transformer, 2019) frames all NLP tasks as text-to-text problems, allowing a single model architecture to handle translation, summarisation, classification, and question answering.

== Scaling and impact ==

The transformer architecture exhibits predictable '''scaling laws''': model performance (measured by loss on held-out data) improves as a smooth power-law function of model size, dataset size, and compute budget, as characterised by Kaplan et al. (2020) at OpenAI and Hoffmann et al. (2022) at [[Google DeepMind]] (the "Chinchilla" scaling laws).<ref>{{cite arXiv |last=Kaplan |first=Jared |title=Scaling Laws for Neural Language Models |eprint=2001.08361 |year=2020}}</ref><ref>{{cite arXiv |last=Hoffmann |first=Jordan |title=Training Compute-Optimal Large Language Models |eprint=2203.15556 |year=2022}}</ref>

This predictability has driven a rapid increase in model scale:

{| class="wikitable"
! Year !! Model !! Parameters !! Organisation
|-
| 2017 || Original Transformer (base) || 65 million || Google
|-
| 2018 || GPT-1 || 117 million || OpenAI
|-
| 2019 || GPT-2 || 1.5 billion || OpenAI
|-
| 2020 || GPT-3 || 175 billion || OpenAI
|-
| 2023 || LLaMA 2 70B || 70 billion || [[Meta AI]]
|-
| 2024 || LLaMA 3.1 405B || 405 billion || Meta AI
|}
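
The compute-optimal trade-off associated with the Chinchilla analysis can be approximated with a short calculation. The sketch below relies on two commonly quoted approximations that are not stated in this article and should be read as assumptions: training compute ''C'' ≈ 6''ND'' floating-point operations for ''N'' parameters and ''D'' training tokens, and roughly 20 training tokens per parameter at the optimum.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0):
    """Split a training budget C into parameters N and tokens D,
    assuming C ~= 6 * N * D and D = tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_allocation(5.88e23)
print(f"parameters: {n:.3g}, tokens: {d:.3g}")
```

Under these assumptions a budget of 5.88 × 10<sup>23</sup> operations corresponds to about 70 billion parameters trained on about 1.4 trillion tokens, which matches the configuration reported for the Chinchilla model.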

== Beyond language ==

While originally designed for machine translation, the transformer has been successfully adapted to numerous other domains:

* '''Computer vision''' — The '''Vision Transformer''' (ViT, 2020) treats an image as a sequence of patches and applies standard transformer layers, achieving results competitive with convolutional neural networks on image classification.
* '''Protein structure prediction''' — [[AlphaFold]] 2 (2020) and AlphaFold 3 (2024), developed by [[Google DeepMind]], use transformer-derived architectures to predict three-dimensional protein structures with near-experimental accuracy.
* '''Audio and speech''' — OpenAI's '''Whisper''' speech recognition model and various text-to-speech systems use transformer architectures.
* '''Multimodal models''' — Modern frontier models such as GPT-4, Gemini, and Claude process text, images, and other modalities through unified transformer-based architectures.
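
As an illustration of how an image becomes a token sequence in the Vision Transformer approach, the sketch below splits a single-channel image into flattened, non-overlapping patches. The function name and the toy image are assumptions; a real ViT additionally applies a learned linear projection and positional embeddings to each patch.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order.
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 toy image
patches = image_to_patches(image, patch_size=2)
```

At the scale described in the ViT paper, a 224 × 224 image cut into 16 × 16 patches yields a sequence of 196 patch tokens.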

== Efficiency research ==

The standard self-attention mechanism has O(''n''²) time and memory complexity with respect to sequence length ''n'', which limits the practical context window of transformer models. Numerous approaches have been proposed to address this:

* '''Sparse attention''' — attending only to a subset of positions (e.g. Longformer, BigBird)
* '''Linear attention''' — replacing softmax attention with kernelised approximations to achieve O(''n'') complexity
* '''FlashAttention''' — an exact attention algorithm by Tri Dao et al. (2022) that achieves significant wall-clock speedups by minimising memory reads/writes through careful tiling, without approximation<ref>{{cite arXiv |last=Dao |first=Tri |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |eprint=2205.14135 |year=2022}}</ref>
* '''Mixture of Experts''' (MoE) — routing each token to a subset of available parameters, allowing models with very large total parameter counts to remain computationally tractable (used in Mixtral, and reportedly in GPT-4)
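
To make the quadratic cost concrete, the sketch below counts how many query-key pairs are scored under full attention and under a simple local sliding window. The function names and the window size of 128 are illustrative assumptions. Longformer and BigBird combine local windows with additional global or random attention patterns, which this sketch omits.

```python
def full_attention_pairs(n):
    """Every position attends to every position: n * n scored pairs."""
    return n * n

def sliding_window_pairs(n, window):
    """Pairs scored when each position attends only to neighbours
    within `window` steps on either side (and to itself)."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))

for n in (1_000, 10_000, 100_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=128))
```

The full count grows a hundredfold for each tenfold increase in ''n'', while the windowed count grows roughly tenfold, which is the motivation for sparse patterns at long sequence lengths.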

== Legacy ==

The transformer is arguably the single most influential machine learning architecture of the 2020s. Its combination of parallelisable training, effective scaling behaviour, and adaptability across modalities has made it the default backbone for virtually all frontier AI systems. The paper "Attention Is All You Need" had accumulated over 140,000 citations on Google Scholar by early 2026, making it one of the most cited computer science papers in history.

== See also ==
* [[Large language model]]
* [[AI alignment]]
* [[Artificial general intelligence]]
* [[Google DeepMind]]
* [[OpenAI]]
* [[Anthropic]]

== References ==
{{reflist}}

[[Category:Machine learning]]
[[Category:Artificial intelligence]]
[[Category:Neural network architectures]]
[[Category:Natural language processing]]
