Gradient descent
Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. In machine learning and deep learning, it is the principal method for training models: the algorithm repeatedly adjusts a model's parameters in the direction that most steeply reduces a loss function — a scalar measure of the gap between the model's predictions and the desired outputs. Virtually every modern artificial neural network, from simple logistic regression to billion-parameter large language models, is trained by some variant of gradient descent.
Mathematical formulation
Given a differentiable loss function <math>L(\theta)</math> over parameters <math>\theta \in \mathbb{R}^n</math>, the gradient descent update rule is:
- <math>\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)</math>
where <math>\eta > 0</math> is the learning rate (step size) and <math>\nabla_\theta L</math> is the gradient — the vector of partial derivatives of the loss with respect to each parameter. The negative sign ensures the parameters move downhill on the loss surface.
The algorithm converges to a local minimum for convex functions and, under mild conditions, to a stationary point for non-convex functions. In practice, the loss landscapes of deep neural networks are highly non-convex with many saddle points, but gradient descent (and especially its stochastic variants) empirically finds good solutions.[1]
Variants
Batch gradient descent
Batch (or full-batch) gradient descent computes the gradient over the entire training set before each update. This gives an exact gradient but is computationally prohibitive for large datasets, since every parameter update requires a full pass through the data.
Stochastic gradient descent
Stochastic gradient descent (SGD) estimates the gradient from a single randomly sampled training example (or a very small subset). The update is noisy but much cheaper per step, and the noise can help escape shallow local minima and saddle points.[2]
Mini-batch gradient descent
In practice, nearly all modern training uses mini-batch gradient descent — a compromise in which the gradient is computed over a small batch of <math>B</math> examples (typically 32–8192). Mini-batches exploit GPU parallelism, reduce gradient variance relative to pure SGD, and are the standard in frameworks such as PyTorch and TensorFlow.
Learning rate
The learning rate <math>\eta</math> is the single most important hyperparameter. Too large, and training diverges or oscillates; too small, and convergence is impractically slow.
Common learning rate schedules include:
- Step decay: multiply <math>\eta</math> by a factor (e.g. 0.1) every fixed number of epochs.
- Cosine annealing: smoothly decay <math>\eta</math> following a cosine curve, often with warm restarts.[3]
- Linear warmup: start from a very small <math>\eta</math> and increase linearly over the first few thousand steps, then decay. This is standard for transformer training.
- One-cycle policy: ramp up then ramp down over a single training run; introduced by Leslie Smith (2018).
Momentum and acceleration
Classical momentum
Momentum (Polyak, 1964) augments SGD with an exponentially decaying moving average of past gradients, smoothing oscillations and accelerating convergence along consistent gradient directions:
- <math>v_{t+1} = \mu \, v_t + \nabla_\theta L(\theta_t)</math>
- <math>\theta_{t+1} = \theta_t - \eta \, v_{t+1}</math>
where <math>\mu \in [0,1)</math> is the momentum coefficient, typically 0.9.
Nesterov accelerated gradient
Nesterov momentum (1983) evaluates the gradient at a look-ahead position <math>\theta_t - \eta \mu v_t</math> rather than the current position, yielding faster convergence for convex problems and modestly better results in deep learning.[4]
Adaptive learning rate methods
A family of algorithms that maintain per-parameter learning rates, automatically scaling updates based on the history of gradients.
AdaGrad (2011)
AdaGrad accumulates the sum of squared gradients for each parameter and divides the learning rate by its square root, giving smaller updates to frequently updated parameters. This is effective for sparse data (e.g. NLP, recommender systems) but can prematurely shrink the learning rate to zero.[5]
RMSProp (2012)
RMSProp (Hinton, unpublished lecture notes) addresses AdaGrad's decay problem by replacing the cumulative sum with an exponentially weighted moving average of squared gradients, keeping the effective learning rate bounded.
Adam (2014)
Adam (Adaptive Moment Estimation) combines momentum (first moment) with RMSProp-style second-moment scaling, plus bias correction for the initial steps:[6]
- <math>m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L</math>
- <math>v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla L)^2</math>
- <math>\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)</math>
- <math>\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)</math>
Adam is the default optimiser for most transformer and large language model training runs.
AdamW (2017)
Loshchilov and Hutter showed that Adam's weight decay implementation was incorrect (it applied L2 regularisation to the adaptive gradient rather than the raw parameters) and proposed AdamW, which decouples weight decay from the gradient update.[7] AdamW is the standard for training BERT, GPT-3, GPT-4, and most modern LLMs.
Gradient computation: backpropagation
In neural networks, the gradient <math>\nabla_\theta L</math> is computed efficiently via backpropagation — the chain rule applied layer by layer from the output back to the input. This reduces the cost of computing the gradient from <math>O(n^2)</math> (numerical differentiation) to <math>O(n)</math> (one backward pass through the network). Modern frameworks (PyTorch, JAX, TensorFlow) implement this as automatic differentiation.
Challenges in deep learning
- Vanishing and exploding gradients: In very deep networks, gradients can shrink exponentially (vanish) or grow exponentially (explode) as they propagate through layers. Mitigations include careful initialisation (Xavier, He), residual connections, gradient clipping, and normalisation layers.
- Saddle points: High-dimensional loss surfaces have exponentially more saddle points than local minima. SGD's noise helps escape them, and adaptive methods partially address the issue.
- Sharpness and generalisation: Flatter minima tend to generalise better than sharp ones. Techniques like sharpness-aware minimisation (SAM) explicitly seek flat regions.[8]
- Large-batch training: Training with very large batches (32K+ examples) can degrade generalisation. Techniques like LARS, LAMB, and learning rate scaling rules partially mitigate this.[9]
History
- 1847: Augustin-Louis Cauchy described the method of steepest descent for minimising functions.
- 1951: Herbert Robbins and Sutton Monro introduced stochastic approximation, the theoretical foundation for SGD.[10]
- 1964: Boris Polyak introduced the heavy-ball (momentum) method.
- 1983: Yurii Nesterov proposed accelerated gradient methods with provably faster convergence.
- 1986: Rumelhart, Hinton, and Williams popularised backpropagation for computing gradients in neural networks, making gradient descent practical for multi-layer models.
- 2011–2014: The adaptive methods era: AdaGrad (2011), RMSProp (2012), Adam (2014).
- 2017–present: Large-scale training drives innovations in learning rate scheduling (cosine with warmup), decoupled weight decay (AdamW), distributed optimisation (LARS, LAMB), and sharpness-aware methods (SAM).
See also
References
- ↑ Choromanska, Anna, et al. (2015). "The Loss Surfaces of Multilayer Networks." AISTATS 2015.
- ↑ Bottou, Léon (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." COMPSTAT 2010.
- ↑ Loshchilov, Ilya; Hutter, Frank (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." ICLR 2017.
- ↑ Sutskever, Ilya, et al. (2013). "On the importance of initialization and momentum in deep learning." ICML 2013.
- ↑ Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." JMLR 12: 2121–2159.
- ↑ Kingma, Diederik P.; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980.
- ↑ Loshchilov, Ilya; Hutter, Frank (2019). "Decoupled Weight Decay Regularization." ICLR 2019.
- ↑ Foret, Pierre, et al. (2021). "Sharpness-Aware Minimization for Efficiently Improving Generalization." ICLR 2021.
- ↑ You, Yang, et al. (2020). "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." ICLR 2020.
- ↑ Robbins, Herbert; Monro, Sutton (1951). "A Stochastic Approximation Method." Annals of Mathematical Statistics 22(3): 400–407.