Reinforcement learning

Reinforcement learning (RL) is an area of machine learning in which an agent learns to make decisions by interacting with an environment and receiving reward signals. Unlike supervised learning, where a model trains on labelled input–output pairs, an RL agent discovers through trial and error which actions yield the greatest cumulative reward. RL underlies landmark AI achievements including Atari game play from raw pixels, the AlphaGo system that defeated the top professional Go player Lee Sedol, robotic control, and the reinforcement learning from human feedback (RLHF) technique used to align large language models such as ChatGPT and Claude.

Core concepts

Agent, environment, and policy

An RL system consists of an agent that observes the state of an environment, selects an action, and receives a scalar reward together with the next state. The agent's goal is to learn a policy — a mapping from states to actions — that maximises the expected return (discounted cumulative reward) over time.

Formally, most RL problems are modelled as a Markov decision process (MDP), defined by a tuple <math>(S, A, P, R, \gamma)</math> where S is the state space, A the action space, P the transition function, R the reward function, and <math>\gamma \in [0,1]</math> the discount factor that trades off immediate versus future reward.
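
As a minimal sketch of these definitions, the following Python fragment encodes a small MDP as a dictionary, samples transitions from it, and computes a discounted return. The two-state MDP, its state and action names, and the helper functions `step`, `rollout`, and `discounted_return` are all invented for illustration and are not taken from any particular library.

```python
import random

# Hypothetical two-state MDP, invented for illustration.
# P[state][action] is a list of (probability, next_state, reward) triples,
# so it encodes both the transition function P and the reward function R.
P = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)]},
    "hot":  {"slow": [(1.0, "cool", 0.0)],
             "fast": [(1.0, "hot", -1.0)]},
}
GAMMA = 0.9  # discount factor


def step(state, action, rng):
    """Sample (next_state, reward) for taking `action` in `state`."""
    outcomes = P[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def rollout(policy, start, horizon, rng):
    """Run the agent-environment loop for `horizon` steps; return the rewards."""
    state, rewards = start, []
    for _ in range(horizon):
        action = policy(state)            # the policy maps states to actions
        state, reward = step(state, action, rng)
        rewards.append(reward)
    return rewards


def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * rewards[t], working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Under the policy that always chooses `slow` from the `cool` state, every reward is 1.0, so `discounted_return(rollout(lambda s: "slow", "cool", 5, random.Random(0)), GAMMA)` evaluates to 1 + 0.9 + 0.81 + 0.729 + 0.6561 = 4.0951 (up to floating-point rounding).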

Value functions

A value function estimates the expected return from a given state (or state–action pair) under a given policy:

The state-value function <math>V^\pi(s)</math> gives the expected return starting from state s and following policy <math>\pi</math>.
The action-value function <math>Q^\pi(s,a)</math> gives the expected return from taking action a in state s and then following <math>\pi</math>.

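An optimal policy <math>\pi^*</math> is one whose value function is at least as large as that of every other policy in every state. It need not be unique, but all optimal policies share the same optimal value function.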
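
When the transition and reward functions are known, the optimal value function can be computed by value iteration, which repeatedly applies the Bellman optimality backup until the values stop changing. The sketch below assumes a hypothetical two-state MDP; the states, actions, and rewards are made up for illustration.

```python
# Hypothetical two-state MDP: P[s][a] = list of (prob, next_state, reward).
P = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)]},
    "hot":  {"slow": [(1.0, "cool", 0.0)],
             "fast": [(1.0, "hot", -1.0)]},
}
GAMMA = 0.9


def q_value(V, s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    """Iterate V(s) <- max_a Q(s, a) until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(V):
    """Pick, in every state, the action with the highest one-step lookahead value."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

For this particular MDP the greedy policy is `fast` in `cool` and `slow` in `hot`, and the optimal value of `cool` is 2 / 0.145 ≈ 13.79.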

Exploration versus exploitation

A central challenge in RL is the exploration–exploitation dilemma: the agent must balance exploiting actions it already knows yield high reward against exploring unfamiliar actions that might yield even higher reward. Common strategies include ε-greedy (take a uniformly random action with probability ε and the currently best-valued action otherwise), softmax action selection, upper confidence bounds (UCB), and intrinsic curiosity modules.
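
Two of these strategies can be sketched for a multi-armed bandit, the simplest setting in which the dilemma appears. The function names, the Gaussian reward noise, and the exploration constant `c` below are assumptions made for this example; `ucb1` follows the common UCB1 form of the confidence bonus.

```python
import math
import random


def epsilon_greedy(q_estimates, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])


def ucb1(q_estimates, counts, t, c=2.0):
    """Pick the arm maximising estimate + sqrt(c * ln(t) / count).

    Arms that have never been pulled are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: q_estimates[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


def run_bandit(true_means, steps, epsilon, seed=0):
    """Play a Gaussian bandit with epsilon-greedy and sample-average estimates."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)   # estimated value of each arm
    n = [0] * len(true_means)     # number of pulls of each arm
    for _ in range(steps):
        arm = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]   # incremental mean update
    return q, n
```

With `epsilon=0` the agent never explores and can lock onto a poor arm, while larger values spend more pulls on arms already known to be inferior; UCB instead shrinks the exploration bonus of an arm as its pull count grows.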

History

Foundations (1950s–1980s)

The intellectual roots of RL lie in animal psychology and optimal control:

1954: Bellman formulated dynamic programming and the Bellman equation, which recursively decomposes an optimal decision problem into sub-problems.
1960s: Researchers in adaptive control and operations research used value-iteration and policy-iteration algorithms, though not under the name "reinforcement learning."
1972: Klopf's hedonistic neuron hypothesis proposed that individual neurons in the brain act as reinforcement learners, reviving interest in reward-driven learning after a dormant period.

Temporal-difference learning (1988–1992)

1988: Richard Sutton introduced TD(λ), a family of methods that learn value estimates by bootstrapping — updating predictions based partly on subsequent predictions rather than waiting for the final outcome. TD learning unified Monte Carlo sampling and dynamic programming.
1989: Chris Watkins proposed Q-learning, a model-free, off-policy algorithm that directly learns the optimal action-value function <math>Q^*</math> without requiring a model of the environment. Q-learning remains one of the most influential RL algorithms.
1992: Gerald Tesauro's TD-Gammon used temporal-difference learning with a neural network to play backgammon at expert level, demonstrating that RL combined with function approximation could solve complex tasks.
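
The bootstrapped update at the heart of TD learning and Q-learning can be illustrated in tabular form. The sketch below assumes a hypothetical five-state chain environment with a single reward at the right-hand end; the environment, hyperparameters, and function names are made up for this example.

```python
import random

N_STATES = 5       # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def env_step(state, action):
    """Deterministic chain: reward 1.0 only on reaching the terminal state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Explore with probability epsilon; break ties randomly.
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.choice(ACTIONS)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2, r, done = env_step(s, a)
            # The TD target bootstraps from the current estimate at s2.
            # Taking the max makes the update off-policy: it evaluates the
            # greedy policy regardless of which action is actually taken next.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])   # move towards the target
            s = s2
    return Q


Q = q_learning()
```

No transition or reward model is consulted during learning; the agent only uses sampled transitions. In this chain the learned value of moving right from state 0 approaches 0.9 ** 3 = 0.729, the discounted value of the reward three steps further on.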

Policy gradient methods (1992–2000s)

Value-based methods struggle in high-dimensional or continuous action spaces, because selecting an action requires maximising the action-value function over all actions. Policy gradient methods instead parameterise the policy directly and update its parameters by gradient ascent on the expected return:

1992: Ronald Williams published the REINFORCE algorithm, using the log-derivative trick to estimate the policy gradient from sampled trajectories.
1999: Sutton, McAllester, Singh, and Mansour proved the policy gradient theorem, providing a general formula for the gradient of expected return with respect to policy parameters.
2000s: Actor–critic architectures, an idea dating back to the early 1980s, were developed further. They combine a policy (actor) with a learned value function (critic), reducing gradient variance while retaining the flexibility of policy gradients.
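
The log-derivative trick used by REINFORCE can be sketched for a softmax policy over a handful of discrete actions, where the gradient of log π(a) with respect to the logit of action k is 1[k = a] − π(k). The bandit-style setting, learning rate, and function names below are assumptions chosen to keep the example small, and no baseline or critic is used.

```python
import math
import random


def softmax(theta):
    """Convert logits to action probabilities (shifted for numerical stability)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, action, ret, lr):
    """One REINFORCE step: theta += lr * return * grad log pi(action)."""
    pi = softmax(theta)
    return [
        t + lr * ret * ((1.0 if k == action else 0.0) - pi[k])
        for k, t in enumerate(theta)
    ]


def train_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Sample actions from the policy and reinforce them by their reward."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        pi = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=pi, k=1)[0]
        theta = reinforce_update(theta, action, rewards[action], lr)
    return softmax(theta)
```

Starting from equal logits, a single update for action 0 with return 1.0 and `lr=0.1` moves the logits from `[0.0, 0.0]` to `[0.05, -0.05]`. An actor–critic method would replace the raw return `ret` with an advantage estimate supplied by a learned value function.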

Deep reinforcement learning (2013–present)

The combination of deep neural networks with RL algorithms produced a series of breakthroughs:

2013: Mnih et al. at DeepMind introduced the Deep Q-Network (DQN), which used a convolutional neural network to learn Atari 2600 games directly from raw pixels. An extended version, published in Nature in 2015, performed at a level comparable to a professional human games tester across a set of 49 games.
2016: DeepMind's AlphaGo defeated Lee Sedol 4–1 in Go using a combination of deep neural networks and Monte Carlo tree search, followed by AlphaGo Zero (2017) which learnt entirely from self-play without human game data.
2017: Schulman et al. published Proximal Policy Optimization (PPO), a policy gradient method with clipped surrogate objectives that became the standard algorithm for many RL applications due to its stability and ease of tuning.
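2018–2019: OpenAI Five, trained with PPO at massive scale (128,000 CPU cores and 256 GPUs in the 2018 system), beat teams of strong amateur and semi-professional Dota 2 players in 2018 and defeated the reigning world champions, OG, in April 2019.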
2019: DeepMind's AlphaStar reached Grandmaster level in StarCraft II.
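
The clipped surrogate objective that PPO maximises can be written in a few lines. In the sketch below, `ratios` stands for the probability ratio between the new and old policy for each sampled action and `advantages` for the corresponding advantage estimates; both are assumed to be computed elsewhere, and the default `clip_eps=0.2` is a commonly used value rather than a required one.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside the clipping range in the direction that helps the objective.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

For a positive advantage of 1.0, a ratio of 1.5 contributes only 1.2, so the policy gains nothing by moving further from the old policy. For a negative advantage of −1.0, a ratio of 1.5 contributes the unclipped −1.5, so harmful changes are still penalised in full.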

RLHF and language models (2017–present)

2017: Christiano et al. proposed learning reward models from human preferences, laying the groundwork for reinforcement learning from human feedback.
2022: OpenAI used RLHF with PPO to fine-tune GPT-3.5 into ChatGPT, making RL central to the alignment of modern large language models. Anthropic similarly uses RLHF and constitutional AI to train Claude.
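2023–2025: Alternatives to PPO-based RLHF emerged that simplify the training pipeline. Direct Preference Optimization (DPO, 2023) optimises the policy directly on preference pairs and so eliminates the separate reward model, while Group Relative Policy Optimization (GRPO, 2024) keeps a reward signal but replaces the learned value function (critic) with a baseline computed from a group of sampled responses.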
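
The core computations of DPO and GRPO are compact enough to sketch. The first function below is the DPO loss for a single preference pair, given summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The second computes group-relative advantages in the style of GRPO by standardising rewards within a group of responses to the same prompt. Argument names, the default `beta`, and the use of the population standard deviation are assumptions made for this example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(group_rewards, eps=1e-8):
    """Standardise rewards within one group of sampled responses."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

When the policy equals the reference model the DPO margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one. In `grpo_advantages`, the group mean plays the role that a learned critic plays in PPO.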

Major algorithms

Algorithm	Type	Year	Key idea
Q-learning	Value-based, off-policy	1989	Learn <math>Q^*</math> directly via TD updates
SARSA	Value-based, on-policy	1994	On-policy variant of Q-learning
REINFORCE	Policy gradient	1992	Monte Carlo policy gradient via log-derivative trick
A3C / A2C	Actor–critic	2016	Asynchronous advantage actor–critic
DQN	Deep value-based	2013	Deep CNN + experience replay (target network added in the 2015 version)
DDPG	Deep actor–critic	2015	Continuous-action deterministic policy gradient
PPO	Policy gradient	2017	Clipped surrogate objective for stable updates
SAC	Actor–critic	2018	Maximum-entropy framework for exploration
AlphaZero	Model-based + search	2017	Neural MCTS self-play (Go, chess, shogi)
MuZero	Model-based + search	2019	Learned dynamics model without game rules
DPO	Preference optimisation	2023	Direct policy optimisation from preferences without reward model
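
The on-policy versus off-policy distinction in the first two rows of the table comes down to how the TD target is formed. A minimal sketch, with hypothetical function names:

```python
def q_learning_target(reward, next_q_values, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    return reward + gamma * next_q_values[next_action]
```

With `reward=1.0`, `next_q_values=[0.0, 2.0]`, and `gamma=0.5`, the Q-learning target is 2.0 whichever action follows, whereas the SARSA target is 1.0 if the agent's next action is action 0 (for example, because of an exploratory move).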

Applications

Games and simulations

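RL has reached superhuman or top-professional performance in board games (Go, chess, shogi via AlphaZero) and video games (Atari via DQN, Dota 2 via OpenAI Five, StarCraft II via AlphaStar). In multiplayer poker, Pluribus beat elite human professionals using a strategy learned through self-play and refined by search.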

Robotics

RL is used for robotic manipulation, locomotion, and dexterous control. Sim-to-real transfer — training in simulation and deploying on physical hardware — addresses the sample inefficiency of real-world RL. Notable results include OpenAI's robotic hand solving a Rubik's Cube (2019) and Google's RT-2 vision–language–action model (2023).

Language model alignment

Reinforcement learning from human feedback (RLHF) is now standard practice for aligning large language models with human intent. A reward model trained on human preference judgments provides the reward signal, and the language model policy is fine-tuned using PPO or variants. This approach underlies ChatGPT, Claude, and most modern chat-oriented LLMs.
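
A common way to train the reward model is to treat each human judgment as a pairwise comparison and fit a Bradley–Terry model to it. The sketch below assumes that choice and takes scalar reward-model scores as plain numbers; in practice the scores come from a neural network and the loss is minimised by gradient descent. The function names are hypothetical.

```python
import math


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def reward_model_loss(score_pairs):
    """Average negative log-likelihood over (r_chosen, r_rejected) pairs."""
    total = sum(math.log(preference_probability(rc, rr)) for rc, rr in score_pairs)
    return -total / len(score_pairs)
```

If the model scores both responses equally, the predicted preference probability is 0.5 and the loss is log 2. The loss decreases as the model assigns a higher score to the response that humans preferred.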

Autonomous systems

Self-driving vehicles, drone navigation, and traffic signal control all involve sequential decisions under uncertainty, which makes them natural targets for RL. In safety-critical deployments, learned components are typically combined with rule-based safety layers rather than used on their own.

Science and engineering

RL has been applied to chip design (Google's chip placement work), control of tokamak plasmas in nuclear fusion research, chemical synthesis planning, and energy grid management.

Challenges

Sample efficiency: Model-free RL typically requires millions of interactions, making direct real-world training expensive or dangerous.
Reward specification: Poorly designed reward functions lead to reward hacking, where the agent finds unintended shortcuts. This is especially acute in AI alignment.
Partial observability: Real environments rarely satisfy the full-observability assumption of MDPs, requiring extensions such as POMDPs or recurrent policies.
Credit assignment: Determining which past actions contributed to a delayed reward remains fundamentally difficult, particularly in long-horizon tasks.
Stability: Combining function approximation, bootstrapping, and off-policy learning (the "deadly triad") can cause divergence.

Relationship to other fields

Reinforcement learning draws on dynamic programming (Bellman equations, value iteration), statistics (bandit problems, Bayesian optimisation), neuroscience (dopamine reward-prediction-error signals), and optimal control (Hamilton–Jacobi–Bellman equation, model-predictive control). The link to neuroscience was formalised by Schultz, Dayan, and Montague (1997), who showed that phasic firing of dopamine neurons in the primate brain closely resembles the TD error signal, giving RL theory a biological grounding.

Key references

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. — The standard textbook, freely available online.
Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518(7540), 529–533.
Silver, D. et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587), 484–489.
Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
Christiano, P. et al. (2017). "Deep reinforcement learning from human preferences." NeurIPS 2017.
Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023.