Machine learning
Machine learning (ML) is a branch of artificial intelligence in which computer systems learn patterns from data and improve their performance on tasks without being explicitly programmed for each case. Instead of following hand-coded rules, a machine learning system builds a mathematical model from training examples and uses that model to make predictions or decisions on new, unseen data. Machine learning underpins most modern AI applications, including large language models, image recognition, recommendation systems, autonomous vehicles, medical diagnostics, and scientific discovery.
Overview
A machine learning system typically follows three steps:
- Training: The model is exposed to a dataset and adjusts its internal parameters to minimise a loss function — a measure of how far the model's predictions deviate from the correct answers.
- Validation: A held-out portion of data is used to tune hyperparameters and check for overfitting (memorising training data rather than learning general patterns).
- Inference: The trained model is deployed to make predictions on new inputs.
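The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended training setup: the toy dataset (`make_data`), the one-feature linear model, the learning-rate candidates, and the 150/50 split are all assumptions chosen for brevity.

```python
import random

def make_data(n, seed):
    """Hypothetical toy data: y = 3x + 1 plus small Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data

def mse(w, b, data):
    """Loss function: mean squared error of the line w*x + b on `data`."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, lr, epochs=200):
    """Training: gradient descent on the MSE loss, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = make_data(200, seed=0)
train_set, val_set = data[:150], data[150:]

# Validation: choose the learning rate (a hyperparameter) with the lowest held-out loss.
candidates = [0.001, 0.01, 0.1]
fits = {lr: train(train_set, lr) for lr in candidates}
best_lr = min(candidates, key=lambda lr: mse(*fits[lr], val_set))
w, b = fits[best_lr]

# Inference: apply the fitted model to a new, unseen input.
prediction = w * 0.5 + b
print(best_lr, round(w, 2), round(b, 2), round(prediction, 2))
```

In practice a separate test set, untouched during both training and validation, would be kept for the final performance estimate.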
The field is conventionally divided into three major paradigms:
- Supervised learning: The training data consists of input–output pairs. The model learns a function that maps inputs to outputs. Examples include classification (spam detection, image labelling) and regression (price prediction, weather forecasting).
- Unsupervised learning: The training data has no labels. The model discovers structure in the data, such as clusters, latent factors, or density estimates. Examples include k-means clustering, principal component analysis (PCA), and generative models.
- Reinforcement learning: An agent interacts with an environment and learns from reward signals rather than labelled examples. The agent seeks to maximise cumulative reward over time.
Additional paradigms include semi-supervised learning (a mix of labelled and unlabelled data), self-supervised learning (the model generates its own labels from the data, as in masked language modelling), and transfer learning (reusing a model trained on one task for a related task).
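To make the reinforcement learning paradigm more concrete, the following sketch runs tabular Q-learning on a hypothetical five-state corridor in which the agent is rewarded only for reaching the rightmost state. The environment, the reward of 1.0, and the values of `alpha`, `gamma`, and `epsilon` are illustrative assumptions rather than anything prescribed by the text.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal and yields reward 1
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right

def step(state, action):
    """Hypothetical environment: move along the corridor, clipped at the ends."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    reward = 1.0 if done else 0.0
    return nxt, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)             # explore
            else:
                best = max(q[state])             # exploit, breaking ties at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q

q = q_learning()
# Greedy policy for the non-terminal states: index of the highest-valued action.
policy = [max(range(2), key=lambda i: q[s][i]) for s in range(N_STATES - 1)]
print(policy)
```

The learned policy moves right in every non-terminal state, since that maximises the discounted cumulative reward in this toy environment.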
History
Early work (1950s–1960s)
- 1950: Alan Turing proposed the idea that machines could learn from experience in his paper "Computing Machinery and Intelligence."
- 1957: Frank Rosenblatt introduced the perceptron, a single-layer neural network that could learn linear classifiers.
- 1959: Arthur Samuel coined the term "machine learning" while developing a checkers-playing program at IBM that improved through self-play.
- 1967: Thomas Cover and Peter Hart published their analysis of the nearest-neighbour rule, one of the simplest classification methods.
Statistical foundations (1970s–1990s)
- 1970s–1986: The backpropagation algorithm for training multi-layer neural networks was derived independently by several researchers (including Seppo Linnainmaa in 1970 and Paul Werbos in 1974), then popularised by Rumelhart, Hinton, and Williams in 1986.
- 1986: Ross Quinlan published the ID3 decision-tree algorithm. Together with its successor C4.5 (1993), it made decision trees widely used for interpretable classification.
- 1988: Judea Pearl's Probabilistic Reasoning in Intelligent Systems formalised Bayesian networks for ML.
- 1995: Vladimir Vapnik and Corinna Cortes published the support vector machine (SVM), which dominated classification tasks through the 2000s.
- 1995–2001: Ensemble methods combined many weak learners into strong models: random decision forests (Tin Kam Ho, 1995), gradient boosting (Friedman, 1999), and random forests (Leo Breiman, 2001).
- 1997: Hochreiter and Schmidhuber published LSTM, a recurrent neural network variant that could learn long-range dependencies.
The deep learning revolution (2006–present)
- 2006: Geoffrey Hinton demonstrated effective training of deep neural networks using unsupervised pre-training, reigniting interest in deep learning.
- 2012: AlexNet — a deep convolutional neural network — won the ImageNet competition by a dramatic margin, establishing deep learning as the dominant approach to computer vision.
- 2014: Ian Goodfellow and colleagues introduced generative adversarial networks (GANs), enabling high-quality image generation.
- 2017: Vaswani et al. published "Attention Is All You Need," introducing the transformer architecture that replaced recurrence with self-attention. This architecture underpins nearly all modern large language models.
- 2018–2020: Pre-trained language models such as BERT (Google) and GPT-2 and GPT-3 (OpenAI) demonstrated that large transformer models trained on vast text corpora could be adapted, by fine-tuning or prompting, to a wide range of NLP tasks.
- 2022–present: ChatGPT and Claude brought LLMs to mainstream use. Training now commonly includes alignment steps such as reinforcement learning from human feedback (RLHF) and, in some systems, constitutional AI.
Key algorithms and methods
| Category | Algorithm | Key idea |
|---|---|---|
| Linear models | Linear/logistic regression | Weighted sum of features; fast, interpretable |
| Tree-based | Decision tree, random forest, gradient boosting (XGBoost, LightGBM) | Recursive splitting on features; bagging reduces variance, boosting reduces bias |
| Kernel methods | Support vector machine | Map data to high-dimensional space via kernel trick; maximise margin |
| Instance-based | k-nearest neighbours | Classify by majority vote of nearest training examples |
| Neural networks | Multi-layer perceptron, CNN, RNN, LSTM, Transformer | Hierarchical feature learning via backpropagation |
| Probabilistic | Naïve Bayes, Gaussian processes, variational autoencoders | Explicit probabilistic modelling of uncertainty |
| Reinforcement | Q-learning, PPO, DQN, AlphaZero | Learn from reward signals via trial and error |
| Dimensionality reduction | PCA, t-SNE, UMAP | Project high-dimensional data to lower dimensions |
| Clustering | k-means, DBSCAN, hierarchical clustering | Group similar data points without labels |
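As an illustration of one row of the table, the k-nearest-neighbours idea ("classify by majority vote of nearest training examples") fits in a few lines. The function name `knn_predict`, the Euclidean distance, and the six training points are assumptions made for this sketch.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to `query`.

    `train` is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups in the plane.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((4.0, 4.0), "b"), ((4.2, 3.9), "b"), ((3.8, 4.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))  # a
print(knn_predict(train, (4.1, 4.0)))  # b
```

There is no training step here beyond storing the examples, which is why the table lists the method as instance-based.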
Fundamental concepts
Bias–variance trade-off
A model with high bias makes strong assumptions and underfits the data (e.g., fitting a straight line to a curved relationship). A model with high variance is overly flexible and overfits (memorises noise). The goal is to find a model complexity that minimises expected total error, which under squared-error loss is the sum of squared bias, variance, and irreducible noise.
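For squared-error loss, the decomposition is usually written as follows. The notation is an assumption introduced here: the data are generated as y = f(x) + ε with zero-mean noise of variance σ², the fitted model is f̂, and the expectation is taken over training sets and noise. Note that the bias term enters squared.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```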
Overfitting and regularisation
Overfitting occurs when a model performs well on training data but poorly on unseen data. Regularisation techniques penalise model complexity to prevent this:
- L1 regularisation (Lasso): drives some parameters to exactly zero, performing feature selection.
- L2 regularisation (Ridge): shrinks parameters toward zero without setting them exactly to zero.
- Dropout: randomly disables neurons during training (used in deep learning).
- Early stopping: halts training when validation performance stops improving.
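The different behaviour of L1 and L2 penalties can be seen in the simplest possible case, a single weight with no intercept, where both have closed-form solutions. The sketch below assumes the objectives stated in the comments (conventions for scaling the penalty vary between libraries), and the data and the penalty strength `lam` are made up for illustration.

```python
import math

def ols_weight(xs, ys):
    """Unregularised least squares for y ≈ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_weight(xs, ys, lam):
    """Minimises 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def lasso_weight(xs, ys, lam):
    """Minimises 0.5 * sum((y - w*x)^2) + lam * |w| via soft-thresholding."""
    z = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(z) - lam, 0.0)          # exactly zero once lam >= |z|
    return math.copysign(shrunk, z) / sum(x * x for x in xs)

# Hypothetical data with a weak relationship between x and y.
xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.3]

w_ols = ols_weight(xs, ys)
w_ridge = ridge_weight(xs, ys, lam=14.0)     # shrunk toward zero, but non-zero
w_lasso = lasso_weight(xs, ys, lam=2.0)      # driven to exactly zero
print(w_ols, w_ridge, w_lasso)
```

With these numbers the ridge weight is roughly half the unregularised one, while the lasso weight is exactly 0.0, which is the feature-selection effect described above.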
Loss functions
The choice of loss function defines what the model optimises:
- Cross-entropy loss: standard for classification; measures the divergence between predicted and true probability distributions.
- Mean squared error: standard for regression.
- Contrastive/triplet loss: used in representation learning and embeddings.
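The first two losses in the list can be written directly from their definitions. This is a minimal sketch for a single classification example and a small batch of regression targets; the function names and sample values are illustrative, and the cross-entropy assumes the true distribution is one-hot (all mass on `true_index`).

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy for one example with a one-hot true label.

    `probs` are the predicted class probabilities (assumed to sum to 1).
    """
    return -math.log(probs[true_index])

def mean_squared_error(preds, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# A confident correct prediction is penalised less than a confident wrong one.
loss_good = cross_entropy([0.7, 0.2, 0.1], true_index=0)
loss_bad = cross_entropy([0.7, 0.2, 0.1], true_index=2)

mse = mean_squared_error([2.5, 0.0], [3.0, -1.0])
print(loss_good, loss_bad, mse)
```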
Evaluation
Models are evaluated on held-out test data using metrics appropriate to the task:
- Classification: accuracy, precision, recall, F1 score, ROC-AUC.
- Regression: MSE, MAE, R².
- Ranking: NDCG, mean reciprocal rank.
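For binary classification, the metrics listed above follow from the four cells of the confusion matrix. The sketch below computes them by hand on made-up labels. The convention of returning 0.0 when a denominator is zero is an assumption, and library implementations may handle that case differently.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical held-out labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here the model finds three of the four positives (recall 0.75) but also raises two false alarms (precision 0.6), which accuracy alone would not reveal.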
Applications
- Natural language processing: Machine translation, text summarisation, question answering, language models.
- Computer vision: Image classification, object detection, medical imaging, autonomous driving.
- Speech and audio: Speech recognition (Whisper, Siri), speaker identification, music generation.
- Recommender systems: Netflix, Spotify, YouTube, Amazon product recommendations.
- Healthcare: Drug discovery, diagnostic imaging, clinical trial optimisation, genomics.
- Finance: Fraud detection, algorithmic trading, credit scoring, risk assessment.
- Science: Protein structure prediction (AlphaFold), weather forecasting, materials discovery, particle physics.
- Robotics: Motion planning, manipulation, sim-to-real transfer.
Challenges and limitations
- Data quality and quantity: ML models are only as good as their training data. Biased, incomplete, or noisy data produces biased or unreliable models.
- Interpretability: Deep learning models are often "black boxes." Techniques like SHAP, LIME, and mechanistic interpretability attempt to explain model decisions.
- Compute costs: Training large models requires substantial computational resources and energy. GPT-4's training reportedly cost over $100 million.
- Fairness and bias: Models can perpetuate or amplify societal biases present in training data.
- Safety and alignment: As models become more capable, ensuring they behave as intended becomes a core challenge — see AI alignment and AI safety.
Relationship to other fields
Machine learning draws on statistics (hypothesis testing, Bayesian inference), optimisation (gradient descent, convex optimisation), information theory (entropy, mutual information), neuroscience (biological inspiration for neural networks), and computer science (algorithms, computational complexity). It is closely related to data science (which applies ML to real-world data) and artificial intelligence (of which ML is the most successful modern sub-field).
Key references
- Mitchell, T. (1997). Machine Learning. McGraw-Hill. — Classic textbook defining the field.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017.