Backpropagation

Backpropagation (backward propagation of errors, commonly abbreviated as backprop) is the fundamental algorithm for training artificial neural networks. It computes the gradient of a loss function with respect to every weight in the network by applying the chain rule of calculus from the output layer backward through the network. These gradients are then used by an optimization algorithm (typically a variant of stochastic gradient descent) to update the weights and reduce the loss. Backpropagation is essential to virtually all deep learning — without it, training networks with millions or billions of parameters would be computationally infeasible.

Overview

Training a neural network is an optimization problem: given a dataset, find the set of weights that minimizes a loss function (a measure of how wrong the network's predictions are). Backpropagation solves the key computational challenge — efficiently computing the gradient of the loss with respect to each of potentially billions of parameters.

The algorithm works in two phases:

Forward pass: Input data flows through the network layer by layer, producing an output prediction. Intermediate activations at each layer are stored.
Backward pass: The gradient of the loss with respect to the output is computed, then propagated backward through the network. At each layer, the chain rule is applied to compute local gradients, which are combined to produce the gradient with respect to that layer's weights and the gradient to pass to the preceding layer.

The computational cost of the backward pass is approximately twice that of the forward pass, making the total cost of one training step roughly three forward passes.

Mathematical formulation

Consider a network with L layers, where layer l has weights W^(l), biases b^(l), pre-activation z^(l) = W^(l) a^(l-1) + b^(l), and activation a^(l) = f(z^(l)) for some activation function f.

Given a loss function L(a^(L), y) comparing the network output to the target:

Step 1 — Output layer error:

δ^(L) = ∂L/∂z^(L) = ∂L/∂a^(L) ⊙ f′(z^(L))

Step 2 — Propagate error backward:

For l = L−1, L−2, …, 1:

δ^(l) = (W^(l+1))^T δ^(l+1) ⊙ f′(z^(l))

Step 3 — Compute weight gradients:

∂L/∂W^(l) = δ^(l) (a^(l-1))^T

∂L/∂b^(l) = δ^(l)

where ⊙ denotes element-wise multiplication.

History

Independent discoveries

The chain rule application to multi-layer networks was independently discovered multiple times:

1960s–1970s: Several researchers in control theory derived forms of backpropagation, including Henry J. Kelley (1960), Arthur Bryson (1961), and Stuart Dreyfus (1962) in the context of optimal control.
1970: Seppo Linnainmaa described automatic differentiation using the chain rule in his master's thesis, which is the general mathematical procedure that backpropagation implements.
1974: Paul Werbos described applying backpropagation to neural networks in his PhD thesis at Harvard, the first known application to neural networks specifically.
1982: Paul Werbos applied backprop to neural networks in a published paper.

Popularization

1986: David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning representations by back-propagating errors" in Nature. This paper demonstrated that backpropagation could learn useful internal representations in hidden layers and was the breakthrough that made the algorithm widely known. It sparked the "connectionist" revival in AI research.
1989: Yann LeCun applied backpropagation to train convolutional neural networks for handwritten digit recognition, one of the first commercially successful applications.

Modern era

The algorithm itself has not changed since the 1980s. What changed was the scale at which it could be applied:

2012: GPU-accelerated backpropagation enabled training of AlexNet, launching the deep learning revolution.
2015: Automatic differentiation frameworks (Theano, TensorFlow, later PyTorch) eliminated the need to hand-derive gradients, making backpropagation a commodity operation.
2020s: Backpropagation is applied to networks with hundreds of billions of parameters across thousands of GPUs, with gradient accumulation and mixed-precision arithmetic to manage memory and compute.

Challenges

Vanishing gradients

In deep networks, gradients can shrink exponentially as they propagate backward through many layers, effectively preventing early layers from learning. This was the primary obstacle to training deep networks before 2010.

Solutions include:

ReLU activation: Unlike sigmoid or tanh, ReLU has a gradient of 1 for positive inputs, preventing gradient shrinkage.
Residual connections (He et al., 2015): Skip connections provide a direct gradient path through the network, bypassing the multiplicative chain.
Batch normalization: Normalizing layer inputs keeps gradients in a reasonable range.
Careful initialization: Xavier/Glorot initialization (2010) and He initialization (2015) set initial weight scales to maintain gradient variance across layers.

Exploding gradients

The opposite problem — gradients growing exponentially — causes numerical overflow and training instability. This is particularly common in recurrent neural networks.

Solutions include:

Gradient clipping: Scaling the gradient vector if its norm exceeds a threshold (standard practice for RNN and Transformer training).
Weight normalization: Constraining weight magnitudes.
LSTM gates: The LSTM architecture was specifically designed to control gradient flow through time.

Memory requirements

Backpropagation requires storing all intermediate activations from the forward pass, as they are needed to compute gradients. For large models, this dominates GPU memory usage.

Techniques to reduce memory:

Gradient checkpointing (also called activation recomputation): Store only a subset of activations and recompute the rest during the backward pass. Trades ~30% extra compute for ~60–80% memory reduction.
Mixed-precision training: Store activations in 16-bit (float16 or bfloat16) instead of 32-bit floating point, halving memory.

Biological implausibility

Backpropagation requires propagating error signals backward through the exact same weights used in the forward pass (the "weight transport problem"). Biological neurons do not appear to work this way. This has motivated research into biologically plausible alternatives such as:

Feedback alignment (Lillicrap et al., 2016): Using random fixed feedback weights instead of transposing the forward weights.
Predictive coding: Local learning rules based on prediction errors.
Forward-forward algorithm (Hinton, 2022): Training using two forward passes (positive and negative) without a backward pass.

None of these alternatives has yet matched backpropagation's effectiveness at scale.

Automatic differentiation

Modern deep learning frameworks implement backpropagation through automatic differentiation (autodiff), specifically reverse-mode autodiff, which is mathematically equivalent to backpropagation generalized to arbitrary computation graphs.

Key frameworks:

PyTorch (2017, Meta): Uses dynamic computation graphs ("define-by-run"), building the graph during the forward pass. Dominant in research.
TensorFlow (2015, Google): Originally used static graphs; TensorFlow 2.0 adopted eager execution similar to PyTorch.
JAX (2018, Google): Functional autodiff with composable transformations (grad, jit, vmap, pmap).

These frameworks allow researchers to define arbitrarily complex model architectures while the gradient computation is handled automatically and efficiently.

Relationship to gradient descent

Backpropagation computes gradients; it does not by itself update weights. The weight update is performed by an optimizer:

Stochastic gradient descent (SGD): w ← w − η ∇L, where η is the learning rate.
SGD with momentum: Accumulates a running average of past gradients to smooth updates and accelerate convergence.
Adam (Kingma & Ba, 2015): Adapts the learning rate for each parameter based on estimates of first and second moments of the gradients. The default optimizer for most deep learning tasks.
AdamW (Loshchilov & Hutter, 2019): Adam with decoupled weight decay. Standard for Transformer training.

References

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors". Nature, 323(6088), 533–536.
Werbos, P. (1974). "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences". PhD thesis, Harvard University.
Linnainmaa, S. (1970). "The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors". Master's thesis, University of Helsinki.
LeCun, Y. et al. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition". Neural Computation, 1(4), 541–551.
Kingma, D. P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR 2015.
Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization". ICLR 2019.
He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR 2016.
Lillicrap, T. P. et al. (2016). "Random synaptic feedback weights support error backpropagation for deep learning". Nature Communications, 7, 13276.
Hinton, G. (2022). "The Forward-Forward Algorithm: Some Preliminary Investigations". arXiv:2212.13345.