Deep learning

Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn hierarchical representations of data. It is the foundational technology behind modern large language models, computer vision systems, speech recognition, and many other AI applications.

Overview

Deep learning models consist of multiple processing layers that transform input data through successive stages of abstraction. Lower layers typically learn simple features (such as edges in images or phonemes in audio), while higher layers compose these into complex concepts (objects, sentences, melodies). This hierarchical feature learning distinguishes deep learning from earlier machine learning approaches that relied on hand-engineered features.

The term "deep" refers to the number of layers in the network. While a shallow neural network might have one or two hidden layers, deep networks commonly have tens or hundreds of layers, and some experimental architectures have exceeded a thousand.

History

Early foundations (1940s–1990s)

1943: Warren McCulloch and Walter Pitts proposed the first mathematical model of an artificial neuron.
1958: Frank Rosenblatt introduced the perceptron, a single-layer neural network capable of learning linear classifiers.
1986: David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized backpropagation for training multi-layer networks, enabling gradient-based learning in deeper architectures.
1989: Yann LeCun applied convolutional neural networks (CNNs) to handwritten digit recognition, achieving practical success on the USPS zip code dataset.

AI winter and resurgence (1990s–2000s)

Despite early promise, neural networks fell out of favor in the 1990s as support vector machines and other kernel methods came to dominate. Deep networks were also considered difficult to train because of the vanishing gradient problem, in which gradients shrink toward zero as they are propagated backward through many layers.

2006: Geoffrey Hinton and colleagues demonstrated that deep belief networks could be effectively pre-trained layer-by-layer using unsupervised learning, reigniting interest.
2009–2010: Research groups at Stanford and IDSIA showed that GPUs could train deep neural networks tens of times faster than CPUs.

Modern era (2012–present)

2012: AlexNet, a deep CNN trained by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet competition by a large margin, a result widely regarded as the catalyst for the deep learning revolution.
2014: Ian Goodfellow and colleagues introduced generative adversarial networks (GANs).
2015: ResNets (Residual Networks) by Kaiming He et al. enabled training of networks with over 100 layers using skip connections.
2017: The Transformer architecture was introduced in "Attention Is All You Need" by Vaswani et al. at Google, replacing recurrence with self-attention and becoming the basis for modern language models.
2020s: Scaling of Transformer-based models led to GPT-4, Claude, Gemini, and other frontier large language models.

Key architectures

Feedforward neural networks

The simplest deep learning architecture, consisting of an input layer, one or more hidden layers, and an output layer. Information flows in one direction, from input to output, with no cycles. Commonly used for classification and regression on tabular data.
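
The forward pass of such a network can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the ReLU activation, the layer sizes, and the hand-picked weights are all assumptions chosen to keep the arithmetic easy to follow.

```python
def relu(values):
    """Element-wise rectified linear unit (an assumed choice of activation)."""
    return [max(0.0, v) for v in values]


def dense(x, weights, biases):
    """Fully connected layer: each row of `weights` produces one output."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Pass x through each (weights, biases) pair, applying ReLU to all but the last layer."""
    for index, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if index < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # hidden layer
    ([[2.0, 1.0]], [0.5]),                     # output layer
]
output = forward([3.0, 1.0], layers)
print(output)  # hidden activations are [2.0, 1.0], so the output is [5.5]
```

In practice the weights are learned by training rather than set by hand, and the per-layer loops are replaced by matrix multiplications.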

Convolutional neural networks (CNNs)

Designed for grid-structured data such as images. Convolutional layers apply learnable filters that detect local patterns, while pooling layers downsample the resulting feature maps and provide a degree of translation invariance. Key models include LeNet, AlexNet, VGG, ResNet, and EfficientNet.
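
A single convolution followed by pooling might look like the sketch below. The filter here is hand-written to respond to vertical edges, whereas in a real CNN its values would be learned; the toy image, the function names, and the choice of no padding and stride 1 are assumptions made for illustration. Like most deep learning libraries, the sketch computes a cross-correlation (the filter is not flipped).

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; edge cells that do not fill a window are dropped."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


# Toy 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written filter that responds where brightness increases left to right.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)   # 4x4, each row is [0, 2, 0, 0]
pooled = max_pool2d(feature_map)      # 2x2, [[2, 0], [2, 0]]
print(feature_map)
print(pooled)
```

The nonzero column in the feature map marks where the edge lies, and pooling keeps that response while halving the resolution.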

Recurrent neural networks (RNNs)

Designed for sequential data. RNNs maintain a hidden state that is updated at each time step, allowing them to process variable-length sequences. Variants include long short-term memory (LSTM) networks and gated recurrent units (GRUs). RNNs have been largely superseded by Transformers for language tasks.
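
The hidden-state update can be illustrated with a basic (Elman-style) recurrent cell. The sketch assumes a tanh activation and uses made-up weights named W_xh, W_hh, and b_h; LSTM and GRU cells replace this single update with gated ones.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return [math.tanh(sum(w * x for w, x in zip(W_xh[k], x_t))
                      + sum(w * h for w, h in zip(W_hh[k], h_prev))
                      + b_h[k])
            for k in range(len(b_h))]


def rnn_forward(sequence, W_xh, W_hh, b_h):
    """Run the cell over a sequence of any length, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Hypothetical weights: 1 input feature, 2 hidden units.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, -0.1]]
b_h = [0.0, 0.1]

# The same weights handle sequences of different lengths.
short_states = rnn_forward([[1.0], [0.5], [-1.0]], W_xh, W_hh, b_h)
long_states = rnn_forward([[1.0], [0.5], [-1.0], [0.0], [2.0]], W_xh, W_hh, b_h)
print(len(short_states), len(long_states))
```

Because each state depends on the previous one, the loop over time steps cannot be parallelized, which is one reason Transformers train more efficiently on long sequences.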

Transformers

Introduced in 2017, the Transformer uses self-attention mechanisms to process all positions in a sequence simultaneously, enabling massive parallelization and better modeling of long-range dependencies. It is the dominant architecture in natural language processing and increasingly in vision and audio.
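
The core operation, scaled dot-product self-attention, can be sketched as follows. This is a single attention head with no masking, positional encoding, or multi-head projection, and the identity projection matrices and toy inputs are assumptions chosen so the result is easy to inspect.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def self_attention(x, W_q, W_k, W_v):
    """Every position attends to every position: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
    d_k = len(k[0])
    weights = [softmax([sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k)
                        for k_row in k])
               for q_row in q]
    return matmul(weights, v), weights


# Three toy token embeddings of dimension 2; identity projections for readability.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended, attention_weights = self_attention(tokens, identity, identity, identity)
print(attention_weights)
```

Each row of attention_weights sums to 1 and describes how much one position draws on every other position. All rows are computed independently of one another, which is what allows the parallel processing described above.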

Generative adversarial networks (GANs)

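A GAN consists of a generator network that produces synthetic data and a discriminator network that tries to distinguish real samples from generated ones. The two networks are trained adversarially: the discriminator improves at detecting fakes while the generator improves at fooling it. GANs are used for image generation, style transfer, and data augmentation.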
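
The adversarial objective can be made concrete by computing the two losses from discriminator outputs. The sketch below assumes the discriminator returns a probability that a sample is real, and it uses the commonly adopted non-saturating generator loss rather than the original minimax form; the function names and example probabilities are illustrative only.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy with real samples labelled 1 and generated samples labelled 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss: low when the discriminator scores generated samples as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that each sample is real).
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.3])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident_d, generator_loss([0.1, 0.3]))  # discriminator winning
print(fooled_d, generator_loss([0.5, 0.5]))     # discriminator reduced to guessing
```

When the discriminator can do no better than output 0.5 for every sample, its loss is 2 ln 2 (about 1.386), which corresponds to the point at which generated samples are indistinguishable from real ones.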

Autoencoders and variational autoencoders

Unsupervised architectures that learn compressed representations by encoding inputs into a latent space and decoding them back. Variational autoencoders (VAEs) add probabilistic structure to enable generation of new samples.
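
The probabilistic structure of a VAE is usually expressed through two pieces: a sampling step written so that gradients can pass through it (the reparameterization trick) and a penalty that keeps the latent distribution close to a standard normal prior. The sketch below assumes a diagonal Gaussian encoder that outputs a mean and a log-variance per latent dimension; the function names are hypothetical, and the encoder and decoder networks themselves are omitted.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical encoder output for one input: a 2-dimensional latent code.
mu, log_var = [1.0, -1.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
penalty = kl_to_standard_normal(mu, log_var)  # 1.0 for these values
print(z, penalty)

# Generating a new sample: draw z from the prior, then pass it to the decoder.
z_new = [rng.gauss(0.0, 1.0) for _ in range(2)]
```

During training the penalty is added to a reconstruction loss; a plain autoencoder uses only the reconstruction term.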

Training

Deep learning models are trained using gradient-based optimization, typically with variants of stochastic gradient descent (SGD).

Backpropagation

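Backpropagation applies the chain rule of calculus to compute the gradient of the loss function with respect to each parameter in the network, working backward from the output layer so that intermediate results are reused. These gradients indicate how to adjust the weights to reduce the loss.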
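
As an illustration, the sketch below works through the chain rule by hand for a network with one tanh hidden unit and a squared-error loss, then checks the result against finite differences. The network shape, parameter names, and input values are assumptions made for the example; deep learning frameworks perform the same computation automatically.

```python
import math


def forward(params, x):
    """Tiny network: y_hat = w2 * tanh(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    return h, w2 * h + b2


def loss(params, x, y):
    """Squared error: 0.5 * (y_hat - y)^2."""
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backward(params, x, y):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_y_hat = y_hat - y           # dL/dy_hat
    d_w2 = d_y_hat * h            # y_hat = w2 * h + b2
    d_b2 = d_y_hat
    d_h = d_y_hat * w2
    d_z = d_h * (1.0 - h * h)     # derivative of tanh(z) is 1 - tanh(z)^2
    d_w1 = d_z * x                # z = w1 * x + b1
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2.0 * eps))
    return grads


params = [0.5, -0.2, 1.5, 0.1]  # hypothetical w1, b1, w2, b2
analytic = backward(params, x=2.0, y=1.0)
numeric = numerical_gradient(params, x=2.0, y=1.0)
print(analytic)
print(numeric)
```

The two gradient lists should agree to several decimal places, which is a common sanity check when implementing backpropagation by hand.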

Optimization algorithms

Common optimizers include:

SGD with momentum: Adds a fraction of the previous update to the current one, smoothing the optimization trajectory.
Adam: Combines adaptive learning rates with momentum, widely used as a default optimizer.
AdamW: A variant of Adam with decoupled weight decay, standard for training Transformers.
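
The update rules for these optimizers can be sketched for a single scalar parameter. The hyperparameter defaults below are commonly used values rather than anything prescribed here, and the function names are hypothetical; setting weight_decay to zero in adam_step gives plain Adam, while a positive value gives the decoupled decay of AdamW.

```python
import math


def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """New update = a fraction of the previous update minus the scaled gradient."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step (t counts from 1); weight_decay > 0 gives AdamW-style decoupled decay."""
    m = beta1 * m + (1.0 - beta1) * grad           # running mean of gradients (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    # The decay term acts on the weight directly instead of being added to the gradient.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 1.
w_sgd, velocity = 1.0, 0.0
for _ in range(200):
    w_sgd, velocity = sgd_momentum_step(w_sgd, 2.0 * w_sgd, velocity)

w_adam, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w_adam, m, v = adam_step(w_adam, 2.0 * w_adam, m, v, t, lr=0.05)

print(w_sgd, w_adam)
```

On its first step Adam moves the parameter by almost exactly the learning rate regardless of the gradient's magnitude, which illustrates the per-parameter adaptive scaling.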

Regularization

Techniques to prevent overfitting include dropout (randomly zeroing activations during training), weight decay, data augmentation, and early stopping.
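
Two of these techniques are simple enough to sketch directly. The dropout function below uses the "inverted" convention of rescaling the surviving activations during training so that nothing needs to change at inference time; that convention, the patience-based stopping rule, and the example loss values are assumptions for illustration.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p during training; rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training stops: after `patience` epochs without a new best."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, current in enumerate(val_losses):
        if current < best:
            best = current
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_out = dropout([1.0] * 8, p=0.5, training=True, rng=rng)   # each entry is 0.0 or 2.0
eval_out = dropout([1.0] * 8, p=0.5, training=False)            # unchanged at inference

# Hypothetical validation losses: the best value is at epoch 2, so training stops at epoch 4.
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71], patience=2)
print(train_out, eval_out, stop_at)
```

In practice the weights saved at the best validation epoch, not those from the stopping epoch, are the ones kept.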

Hardware

Deep learning's resurgence was enabled by Graphics Processing Units (GPUs), which can perform the matrix operations central to neural networks far faster than CPUs. NVIDIA GPUs dominate the training market. More recently, Google's Tensor Processing Units (TPUs) and custom accelerators from other companies have expanded the hardware landscape.

Training a modern frontier model can require thousands of GPUs running in parallel for weeks or months, at an estimated cost of tens of millions of dollars or more.

Applications

Natural language processing: Machine translation, text generation, summarization, question answering (large language models)
Computer vision: Image classification, object detection, segmentation, generation
Speech: Speech recognition, text-to-speech synthesis
Science: Protein structure prediction (AlphaFold), drug discovery, weather forecasting
Autonomous systems: Self-driving vehicles, robotics
Game playing: AlphaGo, AlphaZero, OpenAI Five

Challenges

Computational cost: Training large models requires enormous compute resources and energy.
Data requirements: Deep learning models typically need large labeled datasets, though self-supervised methods have reduced this dependency.
Interpretability: Deep networks are often treated as "black boxes" — understanding why they make specific predictions remains an active research area (see Mechanistic interpretability).
Bias and fairness: Models can inherit and amplify biases present in training data.

Notable researchers

Geoffrey Hinton — Backpropagation, deep belief networks. 2024 Nobel Prize in Physics (shared).
Yann LeCun — Convolutional neural networks. Chief AI Scientist at Meta.
Yoshua Bengio — Attention mechanisms, neural machine translation foundations.
Ilya Sutskever — AlexNet co-creator, co-founder of OpenAI, later co-founded Safe Superintelligence Inc.

Hinton, LeCun, and Bengio are sometimes called the "godfathers of deep learning" and jointly received the 2018 Turing Award.

References

LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep learning". Nature, 521(7553), 436–444.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS 2017.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS 2012.
He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR 2016.