Artificial neural network

Artificial neural networks (ANNs, also called neural networks or NNs) are a class of machine learning models loosely inspired by the networks of biological neurons in animal brains. An ANN consists of interconnected computational units called artificial neurons, organised into layers, that transform input data into output predictions by passing it through a series of weighted sums and nonlinear functions. The weights of these connections are learned from data, typically using backpropagation and an optimisation algorithm such as stochastic gradient descent. Artificial neural networks are the foundation of modern deep learning and underlie virtually every state-of-the-art system in image recognition, speech recognition, natural language processing, protein structure prediction, and generative modelling.

Overview

The basic computational unit of an ANN is the artificial neuron (also called a node or unit). Each neuron receives one or more numerical inputs, multiplies them by learnable weights, sums the results, adds a learnable bias, and passes the total through a fixed nonlinear activation function to produce its output. Formally, a single neuron computes:

y = f(∑_i w_i x_i + b)

where x_i are the inputs, w_i are the weights, b is the bias, and f is the activation function.
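
The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a library implementation; the function name neuron and the choice of tanh as the default activation are assumptions made for the example.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Compute y = f(sum_i w_i * x_i + b) for one artificial neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Weighted sum of the inputs plus the bias (the pre-activation).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The fixed nonlinearity f produces the neuron's output.
    return activation(z)

# Example values are arbitrary: z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1.
y = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```

Here y equals tanh(0.1); passing a different callable as activation changes the nonlinearity without touching the weighted sum.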

Neurons are organised into layers. A typical feedforward network has an input layer that receives raw data, one or more hidden layers that progressively transform the representation, and an output layer that produces the final prediction. A network is conventionally called deep when it has more than one hidden layer, although there is no universally agreed threshold; deep learning takes its name from this structural depth.

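Part of the theoretical justification for neural networks is the universal approximation theorem: a feedforward network with a single hidden layer and a non-polynomial activation function can approximate any continuous function on a compact domain to arbitrary accuracy, provided the hidden layer contains enough neurons (see Cybenko 1989; Hornik 1991). The theorem guarantees only that suitable weights exist; it says nothing about how many neurons are needed or whether training will find those weights. For some families of functions, deeper networks are known to reach a given accuracy with far fewer parameters than wide shallow networks, which is one theoretical motivation for stacking many layers.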
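
As a rough illustration of the approximation claim (not a proof), the sketch below builds a single-hidden-layer ReLU network by hand so that it linearly interpolates sin(x) on [0, π], and measures how the worst-case error shrinks as hidden units are added. The helper names and the choice of target function are assumptions made for this example, and the weights are constructed directly rather than learned.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, a, b, n_hidden):
    """Return (c, knots, coeffs) for a one-hidden-layer ReLU network
    g(x) = c + sum_k coeffs[k] * relu(x - knots[k])
    that linearly interpolates f at n_hidden + 1 evenly spaced points."""
    step = (b - a) / n_hidden
    points = [a + k * step for k in range(n_hidden + 1)]
    slopes = [(f(points[k + 1]) - f(points[k])) / step for k in range(n_hidden)]
    knots = points[:-1]
    # Each hidden unit contributes the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    return f(a), knots, coeffs

def evaluate(net, x):
    c, knots, coeffs = net
    return c + sum(w * relu(x - t) for w, t in zip(coeffs, knots))

def max_error(f, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced grid of sample points."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - evaluate(net, x)) for x in xs)

errors = {}
for n_hidden in (4, 16, 64):
    net = build_relu_interpolant(math.sin, 0.0, math.pi, n_hidden)
    errors[n_hidden] = max_error(math.sin, net, 0.0, math.pi)
```

The error falls as the hidden layer grows, which matches the statement that accuracy is limited by the number of neurons rather than by the architecture itself.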

Architecture

Feedforward networks

The simplest ANN architecture is the feedforward neural network, whose most common form is the multilayer perceptron (MLP). Information flows strictly in one direction: from the input layer, through the hidden layers, to the output layer, with no cycles or feedback loops. In an MLP, each neuron in layer l is connected to every neuron in layer l+1 (a fully connected or dense layer). Feedforward networks are universal function approximators and remain common as components inside larger architectures, but are rarely state-of-the-art on their own for structured inputs such as images or sequences.
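
A forward pass through fully connected layers can be sketched as follows. The layer sizes, the tanh hidden activation, the linear output layer, and the uniform initialisation are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output unit sees every input."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, layers):
    """Apply the layers in order; there are no cycles or feedback connections."""
    h = x
    for i, (weights, biases) in enumerate(layers):
        is_output = i == len(layers) - 1
        # tanh in hidden layers, identity on the output layer.
        f = (lambda z: z) if is_output else math.tanh
        h = dense(h, weights, biases, f)
    return h

rng = random.Random(0)
sizes = [3, 4, 4, 2]  # input, two hidden layers, output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
output = mlp_forward([0.2, -0.1, 0.7], layers)
```

The output has one value per output neuron; because the weights are random and untrained, the values themselves carry no meaning.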

Specialised architectures

Different problem domains motivated architectural specialisations that impose useful inductive biases:

Convolutional neural networks (CNNs) replace fully connected layers with locally connected, weight-sharing convolutional filters. This makes them naturally suited to grid-structured data such as images, where local correlations matter and a pattern should be detected wherever it appears (translation equivariance). CNNs dominated computer vision from 2012 until vision transformers emerged around 2020, and remain widely used.
Recurrent neural networks (RNNs) add recurrent connections so that the hidden state at time t depends on the hidden state at time t−1, allowing the network to process sequences of arbitrary length. Variants such as long short-term memory (LSTM) and the gated recurrent unit (GRU) introduce gating mechanisms to alleviate the vanishing-gradient problem.
Transformers dispense with recurrence entirely and instead rely on the self-attention mechanism to model dependencies between tokens in a sequence. Introduced in 2017, transformers now underlie nearly all large language models and many state-of-the-art vision, speech, and biological sequence systems.
Graph neural networks generalise the convolution operation to arbitrary graph-structured data, propagating messages along edges.
Autoencoders and generative adversarial networks (GANs) are architectural patterns for unsupervised representation learning and generative modelling.
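
Three of the mechanisms in the list above (weight sharing in convolutions, the recurrent state update, and self-attention) can be sketched in simplified form. These are toy versions under stated assumptions: the convolution is one-dimensional without padding, the recurrent network has a scalar state, and the attention function uses the tokens themselves as queries, keys, and values with no learned projections. All function names are hypothetical.

```python
import math

def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as in most deep learning
    libraries): the same kernel weights are reused at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def rnn_forward(sequence, w_in, w_rec, bias, h0=0.0):
    """Scalar recurrent network: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + b)."""
    h = h0
    states = []
    for x in sequence:  # works for a sequence of any length
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections
    (queries = keys = values = tokens), to keep the sketch short."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)
        ])
    return outputs

edges = conv1d([0, 0, 1, 1, 1, 0, 0], [-1, 1])
shifted = conv1d([0, 0, 0, 1, 1, 1, 0], [-1, 1])
states = rnn_forward([1.0, 0.5, -0.5, 0.0], w_in=0.8, w_rec=0.5, bias=0.0)
mixed = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Shifting the input signal by one position shifts the convolution output by one position, which is the weight-sharing property that makes convolutions suit images. A practical transformer layer would add learned query, key, and value projections, multiple heads, and positional information.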

Neurons and activation functions

The activation function controls the nonlinearity of a neuron and critically affects how easily a network can be trained. Common choices include:

Sigmoid (σ(x) = 1/(1 + e^(−x))): maps inputs to (0, 1), historically popular but suffers from vanishing gradients and is now rarely used in hidden layers.
Hyperbolic tangent (tanh): maps inputs to (−1, 1); a rescaled, zero-centred version of the sigmoid, since tanh(x) = 2σ(2x) − 1.
Rectified linear unit (ReLU, f(x) = max(0, x)): popularised for deep networks around 2010–2011 and still a common default for hidden layers. Its gradient is 1 for positive inputs and 0 otherwise, which mitigates vanishing gradients and is extremely cheap to compute, although units whose input stays negative stop learning (the "dying ReLU" problem).
Leaky ReLU, GELU and SiLU (also known as Swish): leaky or smooth variants of ReLU; GELU and SiLU are common in modern transformer and vision architectures.
Softmax: used in output layers for multi-class classification to produce a probability distribution over classes.
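
The activation functions listed above can be written out directly. The following is a plain-Python sketch for scalar inputs; the leaky-ReLU slope of 0.01 is a commonly used value assumed here, and real frameworks provide vectorised versions of these functions.

```python
import math

def sigmoid(x):
    # Numerically stable form of 1 / (1 + e^(-x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # A small non-zero slope for negative inputs keeps the gradient alive.
    return x if x > 0 else slope * x

def silu(x):
    # SiLU (Swish with beta = 1): x * sigmoid(x).
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    # Subtracting the maximum avoids overflow without changing the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The softmax output is a list of positive values summing to one, ordered in the same way as the input logits.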

A network with purely linear activations collapses to a single linear (affine) map however many layers it has, because a composition of linear maps is itself linear. The nonlinearity is therefore essential to the expressive power of deep networks.
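
The collapse of stacked linear layers can be checked numerically. In the sketch below the matrices W1, W2 and bias vectors b1, b2 are arbitrary values chosen for illustration; the two-layer network with identity activations gives the same output as the single layer with W = W2 W1 and b = W2 b1 + b2.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(u, v):
    return [x + y for x, y in zip(u, v)]

# Two stacked layers with identity activation: y = W2 (W1 x + b1) + b2.
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, 0.0, -1.0]
W2, b2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]], [0.0, 1.0]

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

x = [2.0, -3.0]
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)
one_layer = add(matvec(W, x), b)
```

Both two_layer and one_layer evaluate to the same vector, so the extra layer adds parameters but no expressive power.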

Training

Training an ANN means finding the weights and biases that minimise a loss function measuring the discrepancy between the network's predictions and the desired outputs on a training dataset. The dominant training paradigm is:

Forward pass: run a batch of training examples through the network to compute predictions and the loss.
Backward pass: use backpropagation to compute the gradient of the loss with respect to every weight and bias by applying the chain rule layer by layer.
Parameter update: adjust every weight and bias in the direction that reduces the loss, using an optimiser such as stochastic gradient descent (SGD), SGD with momentum, RMSProp, or Adam.
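
The three steps above can be sketched for a tiny network with one tanh hidden layer and a linear output, trained with mean squared error. Several simplifications are assumed: the gradients are derived by hand for this specific architecture, the whole dataset is used as one batch (plain gradient descent rather than stochastic), and the XOR data, layer size, learning rate, and function names are illustrative choices.

```python
import math
import random

def init_params(n_in, n_hidden, rng):
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(params, x):
    """Forward pass for one example: hidden activations and prediction."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    y_hat = sum(w * hj for w, hj in zip(params["w2"], h)) + params["b2"]
    return h, y_hat

def loss_and_grads(params, batch):
    """Mean squared error over the batch and its gradient via backpropagation."""
    n_hidden, n_in = len(params["W1"]), len(params["W1"][0])
    grads = {"W1": [[0.0] * n_in for _ in range(n_hidden)],
             "b1": [0.0] * n_hidden, "w2": [0.0] * n_hidden, "b2": 0.0}
    loss = 0.0
    for x, y in batch:
        h, y_hat = forward(params, x)
        loss += (y_hat - y) ** 2 / len(batch)
        d_out = 2.0 * (y_hat - y) / len(batch)  # dL/dy_hat
        grads["b2"] += d_out
        for j in range(n_hidden):
            grads["w2"][j] += d_out * h[j]
            # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
            d_pre = d_out * params["w2"][j] * (1.0 - h[j] ** 2)
            grads["b1"][j] += d_pre
            for i in range(n_in):
                grads["W1"][j][i] += d_pre * x[i]
    return loss, grads

def sgd_step(params, grads, lr):
    """Move every weight and bias a small step against its gradient."""
    for j, row in enumerate(params["W1"]):
        for i in range(len(row)):
            row[i] -= lr * grads["W1"][j][i]
        params["b1"][j] -= lr * grads["b1"][j]
        params["w2"][j] -= lr * grads["w2"][j]
    params["b2"] -= lr * grads["b2"]

rng = random.Random(0)
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
params = init_params(n_in=2, n_hidden=3, rng=rng)
initial_loss, _ = loss_and_grads(params, data)
for _ in range(500):
    _, grads = loss_and_grads(params, data)  # forward and backward pass
    sgd_step(params, grads, lr=0.05)         # parameter update
final_loss, _ = loss_and_grads(params, data)
```

In practice the backward pass is produced by automatic differentiation in a framework rather than written by hand, and the batch is a random subset of a much larger dataset.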

Common loss functions include mean squared error for regression, cross-entropy for classification, and a variety of task-specific losses for structured prediction and generative modelling.
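
The two most common losses named above can be sketched as follows. The cross-entropy version takes raw logits and a single correct class index, which is one common convention assumed here; the function names are hypothetical.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared difference, the usual regression loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against one correct class index,
    computed with the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    # -log softmax(logits)[target_index]
    return log_sum_exp - logits[target_index]

mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
ce_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
```

With four equal logits the predicted distribution is uniform, so the cross-entropy equals log 4; the loss falls as the logit of the correct class rises relative to the others.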

Training a large neural network reliably requires a number of additional techniques developed over several decades:

Weight initialisation schemes such as Xavier (Glorot) and Kaiming (He) initialisation, which scale initial weights so activations and gradients remain at a usable magnitude through deep networks.
Regularisation methods such as L2 weight decay, dropout, data augmentation, and early stopping, which reduce overfitting to the training set.
Normalisation layers such as batch normalisation, layer normalisation, and RMSNorm, which stabilise the distribution of activations and accelerate training.
Learning-rate schedules such as cosine decay, warmup, and cyclical schedules, which can affect final performance as much as the choice of optimiser.
Gradient clipping to prevent exploding gradients in recurrent and very deep networks.
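
Several of the techniques in the list above reduce to short formulas. The sketch below shows one common variant of each; the exact forms (uniform Xavier, normal He, inverted dropout, layer normalisation without learnable gain and bias, linear warmup followed by cosine decay, clipping by global L2 norm) and all function names are assumptions chosen for illustration, and frameworks differ in the details.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """Glorot/Xavier uniform: U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]

def he_normal(n_in, n_out, rng):
    """Kaiming/He normal: variance 2 / n_in, intended for ReLU layers."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def layer_norm(x, eps=1e-5):
    """Normalise one activation vector to zero mean and unit variance
    (the learnable gain and bias are omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

The gradient [3, 4] has norm 5, so clipping to a maximum norm of 1 rescales it while preserving its direction.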

The training data is typically divided into a training set, a validation set used to tune hyperparameters and monitor overfitting, and a held-out test set used only for final evaluation.
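
A three-way split of this kind can be sketched as below. The 80/10/10 proportions, the fixed shuffle seed, and the function name are assumptions for illustration; real projects often also stratify by class or split by time or by group to avoid leakage.

```python
import random

def train_val_test_split(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

Every example lands in exactly one of the three sets, so the test set stays untouched until final evaluation.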

History

Early foundations (1943–1969)

The first mathematical model of a neuron was proposed in 1943 by neurophysiologist Warren McCulloch and logician Walter Pitts, who showed that simple threshold units connected in networks could compute any logical function. In 1949 psychologist Donald Hebb formulated the learning rule that now bears his name — neurons that fire together strengthen their connection — which inspired many subsequent learning algorithms.

In 1958 Frank Rosenblatt introduced the perceptron, a single-layer trainable linear classifier, at the Cornell Aeronautical Laboratory. The perceptron generated widespread excitement about machine intelligence but had severe limitations: in 1969, Marvin Minsky and Seymour Papert published Perceptrons, which proved that single-layer perceptrons could not represent simple functions such as XOR. Their analysis is widely credited with contributing to the first AI winter, a period of reduced funding and interest in neural network research that lasted roughly into the mid-1980s.

Backpropagation and the connectionist revival (1970s–1990s)

The backpropagation algorithm, which enables training of multi-layer networks, was developed in various forms by several researchers independently, including Seppo Linnainmaa (1970), Paul Werbos (1974), and David Parker (1985). It entered mainstream machine learning after the 1986 Nature paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams, which demonstrated that backpropagation could train networks to learn useful internal representations.

The late 1980s and 1990s saw the introduction of several architectures that remain in use today. Yann LeCun and collaborators developed LeNet, an early convolutional network for handwritten digit recognition (1989, refined as LeNet-5 in 1998), which was deployed by banks to read cheques. Sepp Hochreiter and Jürgen Schmidhuber introduced long short-term memory in 1997 to mitigate the vanishing-gradient problem in recurrent networks.

Despite these advances, neural networks remained uncompetitive with support vector machines and other methods on most benchmarks throughout the 1990s and early 2000s, partly because of limited computing power and small datasets.

Deep learning era (2006–present)

A conventional marker for the start of the modern deep learning era is a 2006 series of papers by Geoffrey Hinton and collaborators on unsupervised pre-training of deep belief networks. The field's decisive breakthrough came in 2012, when AlexNet, an 8-layer convolutional network developed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton and trained on two consumer GPUs, won the ImageNet Large Scale Visual Recognition Challenge by a wide margin, cutting the best top-5 error from 25.8% the previous year to 16.4%. This result triggered near-universal adoption of deep learning in computer vision.

Subsequent milestones reshaped the field rapidly:

2014–2015: Sequence-to-sequence models and the attention mechanism (Bahdanau et al., 2014) transformed neural machine translation, and residual networks (ResNet; He et al., 2016) enabled training of much deeper networks.
2017: Vaswani et al. introduced the Transformer architecture, which became the dominant model family across nearly every domain.
2018–2020: Large language models — BERT (2018), GPT-2 (2019), and GPT-3 (2020) — demonstrated that scaling transformers with self-supervised learning on large text corpora produced capable general-purpose systems.
2020–present: AlphaFold predicted protein structures with accuracy approaching that of experimental methods; neural networks reached competitive levels in programming contests and on a growing range of knowledge-work tasks; and generative models produced near-photorealistic images, audio, and video.

Geoffrey Hinton, Yann LeCun and Yoshua Bengio received the 2018 A. M. Turing Award "for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing."

Applications

Artificial neural networks are now the dominant technique across a wide range of fields:

Computer vision: image classification, object detection, semantic segmentation, face recognition, medical imaging diagnosis.
Natural language processing: machine translation, question answering, summarisation, code generation, conversational assistants.
Speech: speech recognition, speaker identification, text-to-speech synthesis.
Scientific discovery: protein structure prediction (AlphaFold), drug discovery, weather forecasting (GraphCast), materials science.
Reinforcement learning and sequential decision-making: game playing (AlphaGo, AlphaZero, MuZero), robotics control, and some recommender systems.
Generative modelling: image synthesis (diffusion models), music and video generation, drug and protein design.
Finance and industry: fraud detection, time series forecasting, industrial anomaly detection, predictive maintenance.

Limitations and criticisms

Despite their empirical success, artificial neural networks have well-documented limitations.

Data and compute requirements: modern networks are often trained on anything from millions of labelled examples to trillions of raw tokens, with training runs consuming many thousands of GPU-hours. This creates barriers to entry and concentrates capability in well-funded laboratories.
Opacity: a trained network's weights are typically difficult to interpret. The field of mechanistic interpretability attempts to reverse-engineer what computations networks actually perform, but remains an active research area rather than a solved problem.
Robustness: neural networks are vulnerable to adversarial examples — small, carefully crafted perturbations to inputs that cause confident misclassification — and often fail to generalise to inputs drawn from distributions different from their training data.
Hallucination and factuality: large generative neural networks, particularly language models, can produce confident but factually incorrect outputs.
Sample inefficiency: current neural networks typically require orders of magnitude more data than a human learner to reach comparable performance on many tasks.
Bias and fairness: networks trained on real-world data can reproduce and amplify biases present in that data, with documented effects in hiring, credit scoring, and criminal justice applications.
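
The robustness item above can be illustrated with a toy adversarial example. The sketch uses the fast gradient sign method on a fixed logistic-regression model, which stands in for a trained network; the method, the 1000-dimensional input, the weights, and the perturbation size are all assumptions chosen so that the effect is easy to see, not values from any real system.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a fixed logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fgsm(weights, bias, x, label, epsilon):
    """Fast gradient sign method: move each input feature by epsilon in the
    direction that increases the cross-entropy loss for the true label."""
    p = predict(weights, bias, x)
    # For logistic regression, d(loss)/dx_i = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical model: many small weights, as in a high-dimensional input space.
weights = [0.1] * 1000
bias = 0.0
# Features of magnitude about 1 whose weighted sum is 5 (confidently class 1).
x = [1.0 if i % 2 == 0 else -0.9 for i in range(1000)]

x_adv = fgsm(weights, bias, x, label=1, epsilon=0.1)
clean = predict(weights, bias, x)
adversarial = predict(weights, bias, x_adv)
```

Each feature changes by only 0.1, yet the many small changes add up along the weight vector and flip the prediction from confident class 1 to confident class 0. Attacks on deep networks follow the same idea, with the input gradient obtained by backpropagation.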

Research programmes in AI alignment, mechanistic interpretability, and constitutional AI aim to address some of these limitations, particularly as neural networks are deployed in increasingly high-stakes settings.

Relationship to biological neural networks

The inspiration from biology is largely metaphorical. Real biological neurons operate asynchronously, communicate via spike trains rather than continuous-valued outputs, use neuromodulators and complex dendritic computation, and do not perform anything closely resembling backpropagation. Modern artificial neural networks are best understood as powerful parametric function approximators whose computational structure was inspired by neuroscience but whose design is now driven primarily by engineering considerations and empirical performance. Spiking neural networks and neuromorphic computing pursue more biologically plausible models, but at present are not competitive with standard deep networks on most benchmarks.

References

McCulloch, W. S., and Pitts, W. (1943). "A logical calculus of the ideas immanent in nervous activity." Bulletin of Mathematical Biophysics 5(4): 115–133.
Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain." Psychological Review 65(6): 386–408.
Minsky, M., and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature 323(6088): 533–536.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86(11): 2278–2324.
Hochreiter, S., and Schmidhuber, J. (1997). "Long short-term memory." Neural Computation 9(8): 1735–1780.
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). "A fast learning algorithm for deep belief nets." Neural Computation 18(7): 1527–1554.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 770–778.
Vaswani, A., et al. (2017). "Attention is all you need." Advances in Neural Information Processing Systems 30.
Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2(4): 303–314.
Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." Neural Networks 4(2): 251–257.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.