Attention (machine learning)

Attention is a machine learning technique that allows a model to dynamically focus on the most relevant parts of its input when producing each part of its output. Since the introduction of the Transformer architecture in 2017, attention — and specifically self-attention — has become the dominant mechanism underlying virtually all modern large language models, image models, and multimodal systems.

Attention computes a weighted sum of values, where the weights are computed from the similarity between a query and a set of keys; the projections that produce queries, keys, and values are learned. This allows the model to route information flexibly across arbitrary positions in a sequence, replacing the strictly local or sequential information flow of earlier architectures such as convolutional neural networks and recurrent neural networks.

History

Alignment models (2014–2015)

The term "attention" in deep learning originates in the 2014 paper Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.^[1] Their model augmented an encoder–decoder recurrent neural network with a soft alignment layer: for each output word, the decoder computed weights over all encoder hidden states and used their weighted sum as a context vector. This removed the fixed-length bottleneck of earlier sequence-to-sequence models and substantially improved translation quality on long sentences.

A related formulation appeared shortly after in Luong et al. (2015), which distinguished "global" from "local" attention and introduced simpler dot-product and general (bilinear) scoring functions.^[2]

The Transformer (2017)

In 2017, researchers at Google Brain and Google Research published Attention Is All You Need, which dispensed with recurrence entirely and built an encoder–decoder model out of stacked self-attention and feed-forward layers.^[3] The paper introduced scaled dot-product attention and multi-head attention, and showed that the resulting Transformer model achieved state-of-the-art translation quality while being vastly more parallelisable on GPUs than recurrent networks.

This work is widely credited as the architectural foundation of the subsequent large language model era, including BERT (2018), the GPT series (2018–), and ChatGPT (2022), and of attention-based models in other domains such as AlphaFold 2 (2020).

Later refinements

Research after 2017 focused largely on scaling attention to long sequences and reducing its quadratic cost. Notable variants include sparse attention (Child et al., 2019), Reformer's LSH-based attention (Kitaev et al., 2020), Longformer (Beltagy et al., 2020), Performer (Choromanski et al., 2020), and FlashAttention (Dao et al., 2022), which reorganises the computation for hardware efficiency without changing the mathematical result.^[4]

Rotary positional embedding (RoPE), introduced by Su et al. in the RoFormer paper (2021), has become a widely used way of injecting position information into attention in modern LLMs, including LLaMA.^[5]

Scaled dot-product attention

The canonical form of attention used throughout modern deep learning is scaled dot-product attention. Given three matrices — a set of queries <math>Q \in \mathbb{R}^{n \times d_k}</math>, keys <math>K \in \mathbb{R}^{m \times d_k}</math>, and values <math>V \in \mathbb{R}^{m \times d_v}</math> — the output is

<math>\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V.</math>

The matrix <math>QK^{\top}</math> contains the pairwise similarity scores between queries and keys. Dividing by <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as <math>d_k</math> increases, which would otherwise push the softmax into regions with extremely small gradients. The softmax converts the scores to a probability distribution over keys, and the output is the corresponding weighted combination of values.
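The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation; the function names, shapes, and random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities, scaled
    weights = softmax(scores, axis=-1)   # each row is a distribution over the m keys
    return weights @ V, weights          # weighted combination of values

# Illustrative sizes and random inputs (assumptions, not from any real model).
rng = np.random.default_rng(0)
n, m, d_k, d_v = 4, 6, 8, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((m, d_k))
V = rng.standard_normal((m, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every output row is a convex combination of the rows of `V`.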

Self-attention

In self-attention, the queries, keys, and values are all linear projections of the same input sequence <math>X</math>:

<math>Q = XW_Q, \quad K = XW_K, \quad V = XW_V,</math>

where <math>W_Q</math>, <math>W_K</math>, <math>W_V</math> are learned parameter matrices. This allows every position in the sequence to attend to every other position, so the mechanism itself has no built-in locality bias.
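A minimal NumPy sketch of self-attention follows, in which queries, keys, and values are all projections of the same input. The dimensions and the random matrices standing in for learned parameters are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d_model). Q, K, and V are all computed from the same X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n, n): every position vs every position
    return softmax(scores, axis=-1) @ V

# Random stand-ins for learned parameters (hypothetical sizes).
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

Y = self_attention(X, W_Q, W_K, W_V)
print(Y.shape)
```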

Cross-attention

In cross-attention, the queries come from one sequence (for example a decoder) and the keys and values come from another (for example the encoder's output). This is the form used in the original Bahdanau et al. alignment model and in the decoder of the original Transformer.
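The difference from self-attention is only in where the inputs come from, as the following NumPy sketch suggests. It uses the scaled dot-product form found in the Transformer decoder, not the additive scoring function of Bahdanau et al.; the names `X_dec` and `X_enc` and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Queries from X_dec (n, d_model); keys and values from X_enc (m, d_model)."""
    Q = X_dec @ W_Q
    K = X_enc @ W_K
    V = X_enc @ W_V
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (n, m)
    return weights @ V, weights

# Hypothetical decoder and encoder states of different lengths.
rng = np.random.default_rng(0)
n, m, d_model, d_k = 3, 7, 16, 8
X_dec = rng.standard_normal((n, d_model))
X_enc = rng.standard_normal((m, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

context, weights = cross_attention(X_dec, X_enc, W_Q, W_K, W_V)
print(context.shape, weights.shape)
```

The attention matrix is rectangular here: one row per decoder position and one column per encoder position.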

Masked (causal) attention

For autoregressive language models, attention must be causally masked so that position <math>i</math> cannot attend to positions <math>j > i</math>. This is typically implemented by adding <math>-\infty</math> to the pre-softmax scores at disallowed positions, so they receive zero weight after the softmax. All GPT-family models and Claude use causal self-attention in their decoder-only stacks.
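The masking step can be sketched in NumPy as follows. This is an illustrative version that builds the mask explicitly; production implementations typically fuse the mask into the attention kernel. Function and variable names are assumptions.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Q, K: (n, d_k), V: (n, d_v). Position i attends only to positions j <= i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True strictly above the diagonal marks the disallowed positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Stable softmax; the diagonal is never masked, so each row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)             # exp(-inf) evaluates to exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Illustrative random inputs.
rng = np.random.default_rng(0)
n, d_k = 5, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))

out, weights = causal_attention(Q, K, V)
print(np.round(weights, 2))              # lower-triangular weight matrix
```

Because masked entries receive exactly zero weight, changing a later token cannot affect the output at any earlier position.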

Multi-head attention

Instead of performing one attention operation with high-dimensional queries and keys, modern models run several attention operations in parallel with lower-dimensional projections, then concatenate the results. Given <math>h</math> heads,

<math>\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O,</math>

where each <math>\text{head}_i = \operatorname{Attention}(QW_Q^{(i)}, KW_K^{(i)}, VW_V^{(i)})</math>.^[3]
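A direct, unoptimised NumPy transcription of these two formulas is sketched below. It loops over heads for readability, whereas practical implementations batch all heads into a single tensor operation. The shapes, the choice of per-head dimension as `d_model // h`, and the random weights are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v); W_O: (h * d_v, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):        # one iteration per head
        q, k, v = Q @ Wq, K @ Wk, V @ Wv         # lower-dimensional projections
        scores = q @ k.T / np.sqrt(q.shape[-1])
        heads.append(softmax(scores, axis=-1) @ v)
    return np.concatenate(heads, axis=-1) @ W_O  # Concat(head_1, ..., head_h) W_O

# Hypothetical sizes; random matrices stand in for learned parameters.
rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
d_k = d_v = d_model // h
X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((h, d_model, d_k))
W_K = rng.standard_normal((h, d_model, d_k))
W_V = rng.standard_normal((h, d_model, d_v))
W_O = rng.standard_normal((h * d_v, d_model))

Y = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)  # self-attention usage
print(Y.shape)
```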

Heads can specialise: different heads have been observed to track syntactic relations, coreference, positional offsets, or repeated tokens — although interpretability research shows that the mapping from heads to human-legible functions is noisier than early work suggested (see mechanistic interpretability).

Grouped and multi-query attention

A practical inefficiency of multi-head attention is that every head has its own keys and values, and the cached key and value tensors dominate memory during autoregressive decoding. Multi-query attention (MQA, Shazeer 2019) shares a single key/value head across all query heads, and grouped-query attention (GQA, Ainslie et al. 2023) interpolates between MQA and full multi-head attention by sharing each key/value head among a group of query heads. GQA is used in the larger LLaMA 2 models and in many later open-weight models.^[6]
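The sharing scheme can be illustrated with a short NumPy sketch operating on already-projected per-head tensors. With `g` key/value heads and `h` query heads, `g = h` recovers ordinary multi-head attention and `g = 1` recovers MQA. The tensor layout, function names, and sizes are assumptions made for this example, not the layout of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (h, n, d); k, v: (g, m, d) with h divisible by g. Returns (h, n, d)."""
    h, g = q.shape[0], k.shape[0]
    assert h % g == 0, "number of query heads must be a multiple of K/V heads"
    group_size = h // g
    # Repeat each K/V head so that query head i uses K/V head i // group_size.
    k_full = np.repeat(k, group_size, axis=0)
    v_full = np.repeat(v, group_size, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (h, n, m)
    return softmax(scores, axis=-1) @ v_full

def kv_cache_elements(g, m, d):
    """Number of cached key and value entries for m positions and g K/V heads."""
    return 2 * g * m * d

# Illustrative sizes: 8 query heads sharing 2 K/V heads.
rng = np.random.default_rng(0)
h, g, n, m, d = 8, 2, 3, 5, 4
q = rng.standard_normal((h, n, d))
k = rng.standard_normal((g, m, d))
v = rng.standard_normal((g, m, d))

out = grouped_query_attention(q, k, v)
print(out.shape)
print(kv_cache_elements(h, m, d), kv_cache_elements(g, m, d), kv_cache_elements(1, m, d))
```

In this sketch only the `g` distinct key/value heads would need to be cached, so the cache shrinks by a factor of `h / g` relative to full multi-head attention.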

Computational properties

Complexity

Self-attention over a sequence of length <math>n</math> has <math>O(n^2 d)</math> time complexity and <math>O(n^2)</math> memory complexity for the attention matrix. This quadratic scaling is the primary obstacle to extending context windows, and motivates both approximate-attention variants and hardware-aware exact algorithms such as FlashAttention.
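To give a sense of scale, the following snippet computes the memory needed to materialise a single <math>n \times n</math> attention matrix. It assumes 2 bytes per element (a 16-bit format) and counts only one head of one sequence; both are simplifying assumptions.

```python
def attention_matrix_bytes(n, bytes_per_element=2):
    """Memory for one n x n attention matrix (one head, one sequence)."""
    return n * n * bytes_per_element

for n in (1_024, 8_192, 65_536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n = {n:>6}: {mib:>8.0f} MiB")
```

Under these assumptions the matrix grows from 2 MiB at <math>n = 1024</math> to 8 GiB at <math>n = 65536</math>: a 64-fold increase in length costs a 4096-fold increase in memory.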

Parallelism

Unlike recurrent neural networks, self-attention has no sequential dependency across positions within a forward pass, so all positions of a training sequence can be processed in parallel on GPUs and TPUs (autoregressive generation still produces one token at a time). This is the principal practical reason Transformers replaced RNNs as the default sequence model: they make better use of modern accelerators.

Position information

Attention itself is permutation-equivariant: if the inputs are shuffled, the outputs are shuffled in the same way. Models therefore inject explicit positional information in one of three ways: as additive sinusoidal or learned position embeddings (original Transformer, BERT), as rotations applied to queries and keys inside the attention computation (RoPE, used in most modern LLMs), or as distance-dependent bias terms added to the attention scores (ALiBi, whose biases are fixed rather than learned).
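The equivariance property can be checked numerically for unmasked self-attention without positional information. The sketch below uses random inputs and random matrices in place of learned weights; all names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Unmasked self-attention with no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 12, 4
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

perm = rng.permutation(n)                         # a random reordering of positions
Y = self_attention(X, W_Q, W_K, W_V)
Y_perm = self_attention(X[perm], W_Q, W_K, W_V)   # shuffle the inputs first

# Shuffling the inputs shuffles the outputs in the same way.
print(np.allclose(Y_perm, Y[perm]))
```

The check succeeds because the same permutation is applied to the rows and columns of the score matrix, which the row-wise softmax preserves. It no longer holds once a causal mask or position-dependent terms are added.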

Applications beyond language

Vision — The Vision Transformer (ViT, Dosovitskiy et al. 2020) applies self-attention to sequences of image patches and matches or exceeds convolutional neural network performance at scale.
Protein structure — DeepMind's AlphaFold 2 uses attention-based "Evoformer" blocks to reason jointly over sequence and pairwise residue representations.
Speech and audio — Whisper (OpenAI, 2022), Conformer, and AudioLM all rely on attention.
Multimodal models — Claude, GPT-4, and Gemini use attention to fuse text, image, and other modalities.

References

[1] Bahdanau, D., Cho, K., & Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473.
[2] Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". Proceedings of EMNLP 2015.
[3] Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS 2017. arXiv:1706.03762.
[4] Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS 2022.
[5] Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
[6] Ainslie, J. et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints". arXiv:2305.13245.
