Transformer (machine learning)

The transformer is a deep learning architecture introduced in 2017 by researchers at Google Brain and Google Research. It is the foundation of virtually all modern large language models (LLMs), including GPT, Claude, Gemini, and LLaMA, as well as influential models in computer vision, protein folding, and other domains.

The transformer was first described in the paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, published at the Conference on Neural Information Processing Systems (NeurIPS) in December 2017.[1] The architecture replaced earlier recurrent neural network (RNN) and long short-term memory (LSTM) approaches that had dominated natural language processing (NLP), offering dramatically better parallelisation and the ability to model long-range dependencies in sequences.

Architecture

Self-attention mechanism

The central innovation of the transformer is the self-attention mechanism, implemented as scaled dot-product attention, which allows every element in a sequence to attend to every element of that sequence (itself included) simultaneously, rather than processing tokens one at a time as RNNs do. For a given input sequence, self-attention computes three vectors for each token (a query, a key, and a value) and produces an output by taking a weighted sum of the value vectors, where the weights are determined by the compatibility between the query of one token and the keys of all tokens in the sequence.

Mathematically, for query matrix Q, key matrix K, and value matrix V, the attention function is:

<math>\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V</math>

where <math>d_k</math> is the dimensionality of the key vectors. The scaling factor <math>1/\sqrt{d_k}</math> prevents the dot products from growing too large in magnitude as <math>d_k</math> increases, which would push the softmax into regions with extremely small gradients.
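
The formula above can be illustrated with a minimal sketch in plain Python. The function names and the toy matrices are assumptions made for this example and do not come from the original paper; practical implementations use batched tensor libraries rather than nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, all_weights = [], []
    for q in Q:
        # Scaled compatibility score between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the rows of `V`.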

Multi-head attention

Rather than computing a single attention function, the transformer employs multi-head attention, which runs several attention functions in parallel (each with its own learned linear projections), then concatenates and linearly transforms the results. This allows the model to jointly attend to information from different representation subspaces at different positions.
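
A rough sketch of the project, attend, concatenate and project-again pattern is given below in plain Python. The dimensions (`seq_len`, `d_model`, `num_heads`) and the randomly initialised matrices are illustrative assumptions; in a trained model the projection matrices are learned parameters.

```python
import math
import random

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (assumption for this sketch)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one per attention head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # A final linear projection mixes information across heads.
    return matmul(concat, W_o)

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads
X = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)  # same shape as X: seq_len x d_model
```

Giving each head a dimensionality of `d_model // num_heads` keeps the total cost close to that of a single full-width head, which is the choice made in the original paper.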

Encoder-decoder structure

The original transformer uses an encoder-decoder design:

  • The encoder consists of a stack of identical layers, each containing a multi-head self-attention sublayer followed by a position-wise feed-forward network. Each sublayer uses a residual connection and layer normalisation.
  • The decoder mirrors the encoder but includes an additional cross-attention sublayer that attends to the encoder output. The decoder's self-attention is masked so that each position can only attend to earlier positions, preserving the autoregressive property needed for generation.
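
The masking used in the decoder's self-attention can be sketched as follows. This is a simplified illustration in plain Python; the function name and the all-zero score matrix are assumptions chosen so that the effect of the mask is easy to read off.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax in which position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) are masked by setting their score to -inf,
        # so they receive exactly zero weight after the softmax.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal raw scores, each position spreads its attention uniformly
# over itself and the positions before it.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
# weights[0] == [1.0, 0.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0, 0.0]
```

Because no weight is ever placed on later positions, the prediction at each position depends only on earlier tokens, which is what allows all positions to be trained in parallel while still generating one token at a time at inference.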

Positional encoding

Because the self-attention mechanism is permutation-equivariant (reordering the input tokens merely reorders the outputs in the same way, so the mechanism has no inherent notion of token order), the transformer adds positional encodings to the input embeddings. The original paper used fixed sinusoidal functions of different frequencies, though later models have adopted learned positional embeddings (BERT, GPT-2) or rotary positional embeddings (RoPE, used in LLaMA and many recent models).
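
The fixed sinusoidal scheme of the original paper assigns sine values to even dimensions and cosine values to odd dimensions, with wavelengths growing geometrically across the embedding. A minimal sketch in plain Python follows; the function name and the small example sizes are assumptions made for illustration.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of fixed sinusoidal encodings (d_model even)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Dimensions i and i + 1 share one frequency: pos / base^(i / d_model).
            angle = pos / (base ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
# Position 0 encodes as alternating sin(0) = 0 and cos(0) = 1.
```

Each row of `pe` would be added element-wise to the embedding of the token at that position before the first layer.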

Variants

Encoder-only models

BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018, uses only the encoder portion. BERT is trained with a masked language modelling objective—randomly masking tokens in the input and predicting them—which allows it to learn bidirectional representations. BERT and its derivatives (RoBERTa, ALBERT, DeBERTa) dominated NLP benchmarks from 2018 to 2022 and remain widely used for classification, named entity recognition, and sentence embedding tasks.
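
The masking step of this objective can be sketched as below. The example is deliberately simplified: the function name, the `[MASK]` placeholder string, the default `mask_prob` and the fixed seed are assumptions for illustration, and BERT's actual procedure includes refinements (such as occasionally keeping or randomly replacing a selected token) that are omitted here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns the corrupted input and a dict mapping each masked position
    to the original token the model is trained to recover.
    """
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets

sentence = ["the", "cat", "sat", "on", "the", "mat"]
# A high mask_prob is used here only so the toy sentence shows some masking.
inputs, targets = mask_tokens(sentence, mask_prob=0.5)
```

Since the unmasked tokens on both sides of a masked position remain visible, the model can use left and right context together when predicting it.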

Decoder-only models

The GPT (Generative Pre-trained Transformer) series from OpenAI, beginning with GPT-1 in 2018, uses only the decoder portion, trained autoregressively to predict the next token. This architecture has proven to be the most effective for text generation at scale and is used by the majority of frontier large language models in 2025, including GPT-4, Claude, Gemini, and LLaMA.
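
Autoregressive training and generation can be illustrated with a small sketch. The helper names and the lookup-table "model" below are hypothetical stand-ins: a real decoder-only transformer would produce a probability distribution over its vocabulary at each step rather than a fixed answer.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def generate(next_token, prompt, max_new_tokens):
    """Autoregressive decoding: repeatedly append the predicted next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "down", "down": "."}

def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], ".")

examples = next_token_examples(["the", "cat", "sat"])
# examples == [(["the"], "cat"), (["the", "cat"], "sat")]
generated = generate(toy_model, ["the"], max_new_tokens=3)
# generated == ["the", "cat", "sat", "down"]
```

The same loop applies to a real model: each newly generated token is appended to the context and fed back in to predict the next one.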

Encoder-decoder models

Some models retain the full encoder-decoder structure. Google's T5 (Text-to-Text Transfer Transformer, 2019) frames all NLP tasks as text-to-text problems, allowing a single model architecture to handle translation, summarisation, classification, and question answering.

Scaling and impact

The transformer architecture exhibits predictable scaling laws: model performance (measured by loss on held-out data) improves as a smooth power-law function of model size, dataset size, and compute budget, as characterised by Kaplan et al. (2020) at OpenAI and Hoffmann et al. (2022) at DeepMind (the "Chinchilla" scaling laws).[2][3]
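
What "smooth power law" means in practice can be seen in a short numerical sketch. Only the dependence on model size is shown, and the constants `n_c` and `alpha` are placeholder values chosen for illustration, not the fitted values reported in either paper.

```python
import math

def power_law_loss(n_params, n_c=1.0e14, alpha=0.08):
    """Loss as a power law in parameter count: L(N) = (N_c / N) ** alpha.

    n_c and alpha are illustrative placeholders, not published fits.
    """
    return (n_c / n_params) ** alpha

sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law_loss(n) for n in sizes]

# On log-log axes a power law is a straight line with slope -alpha,
# so the slope between any two consecutive points is the same.
slopes = [
    (math.log(losses[i + 1]) - math.log(losses[i]))
    / (math.log(sizes[i + 1]) - math.log(sizes[i]))
    for i in range(len(sizes) - 1)
]
```

Under such a law every tenfold increase in model size multiplies the loss by the same constant factor, which is what makes extrapolation to larger models feasible.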

This predictability has driven a rapid increase in model scale:

Year   Model                  Parameters    Organisation
2017   Original Transformer   65 million    Google
2018   GPT-1                  117 million   OpenAI
2019   GPT-2                  1.5 billion   OpenAI
2020   GPT-3                  175 billion   OpenAI
2023   LLaMA 2 70B            70 billion    Meta AI
2024   LLaMA 3.1 405B         405 billion   Meta AI

Beyond language

While originally designed for machine translation, the transformer has been successfully adapted to numerous other domains:

  • Computer vision: The Vision Transformer (ViT, 2020) treats an image as a sequence of patches and applies standard transformer layers, achieving results competitive with convolutional neural networks on image classification.
  • Protein structure prediction: AlphaFold 2 (2020) and AlphaFold 3 (2024), developed by Google DeepMind, use transformer-derived architectures to predict three-dimensional protein structures with near-experimental accuracy.
  • Audio and speech: OpenAI's Whisper speech recognition model and various text-to-speech systems use transformer architectures.
  • Multimodal models: Modern frontier models such as GPT-4, Gemini, and Claude process text, images, and other modalities through unified transformer-based architectures.
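
The way a vision transformer turns an image into a token sequence can be sketched as follows. The example handles a single-channel image stored as a list of rows; the function name and the 4 x 4 toy image are assumptions for illustration. In ViT each flattened patch is then linearly projected to the model dimension, a step not shown here.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Patches are read in raster order: left to right, then top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 4 x 4 image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
# patches == [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

The resulting list plays the same role as a sequence of token embeddings in a language model, with positional encodings recording where each patch came from.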

Efficiency research

The standard self-attention mechanism has O(n²) time and memory complexity with respect to sequence length n, which limits the practical context window of transformer models. Numerous approaches have been proposed to address this:

  • Sparse attention — attending only to a subset of positions (e.g. Longformer, BigBird)
  • Linear attention — replacing softmax attention with kernelised approximations to achieve O(n) complexity
  • FlashAttention — an exact attention algorithm by Tri Dao et al. (2022) that achieves significant wall-clock speedups by minimising memory reads/writes through careful tiling, without approximation[4]
  • Mixture of Experts (MoE) — routing each token to a subset of available parameters, allowing models with very large total parameter counts to remain computationally tractable (used in Mixtral, and reportedly in GPT-4)
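
The difference between quadratic and near-linear attention cost can be made concrete by counting the query-key pairs that must be scored. The sketch below compares full attention with a simple local sliding window, one of the patterns used by sparse-attention models such as Longformer; the function names, sequence lengths and window size are assumptions chosen for illustration.

```python
def full_attention_pairs(n):
    """Number of query-key pairs scored by standard self-attention."""
    return n * n

def sliding_window_pairs(n, window):
    """Pairs scored when each position attends only to positions
    within `window` steps on either side of itself."""
    return sum(
        min(n - 1, i + window) - max(0, i - window) + 1
        for i in range(n)
    )

# Map each sequence length to (full pairs, windowed pairs).
counts = {
    n: (full_attention_pairs(n), sliding_window_pairs(n, window=256))
    for n in (1_000, 8_000, 64_000)
}
```

For a fixed window the windowed count grows roughly in proportion to n, whereas the full count grows with n², so the gap widens rapidly as the context length increases.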

Legacy

The transformer is arguably the single most influential machine learning architecture of the 2020s. Its combination of parallelisable training, effective scaling behaviour, and adaptability across modalities has made it the default backbone for virtually all frontier AI systems. The paper "Attention Is All You Need" had accumulated over 140,000 citations on Google Scholar by early 2026, making it one of the most cited computer science papers in history.

References

Template:Reflist