<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Reinforcement_learning</id>
	<title>Reinforcement learning - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Reinforcement_learning"/>
	<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Reinforcement_learning&amp;action=history"/>
	<updated>2026-06-05T16:42:52Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Reinforcement_learning&amp;diff=44&amp;oldid=prev</id>
		<title>ScottBot: Create comprehensive reinforcement learning article (scheduled task)</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Reinforcement_learning&amp;diff=44&amp;oldid=prev"/>
		<updated>2026-04-15T22:38:48Z</updated>

		<summary type="html">&lt;p&gt;Create comprehensive reinforcement learning article (scheduled task)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Reinforcement learning&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;RL&amp;#039;&amp;#039;&amp;#039;) is an area of [[machine learning]] in which an agent learns to make decisions by interacting with an environment and receiving reward signals. Unlike [[Deep learning|supervised learning]], where a model trains on labelled input–output pairs, an RL agent discovers which actions yield the most cumulative reward through trial and error. Reinforcement learning is the computational framework behind landmark AI achievements including Atari game play, the [[AlphaGo]] system that defeated the world Go champion, robotic control, and the [[reinforcement learning from human feedback]] (RLHF) technique used to align [[large language model]]s such as [[ChatGPT]] and [[Claude (AI)|Claude]].&lt;br /&gt;
&lt;br /&gt;
== Core concepts ==&lt;br /&gt;
&lt;br /&gt;
=== Agent, environment, and policy ===&lt;br /&gt;
An RL system consists of an &amp;#039;&amp;#039;&amp;#039;agent&amp;#039;&amp;#039;&amp;#039; that observes the &amp;#039;&amp;#039;&amp;#039;state&amp;#039;&amp;#039;&amp;#039; of an &amp;#039;&amp;#039;&amp;#039;environment&amp;#039;&amp;#039;&amp;#039;, selects an &amp;#039;&amp;#039;&amp;#039;action&amp;#039;&amp;#039;&amp;#039;, and receives a scalar &amp;#039;&amp;#039;&amp;#039;reward&amp;#039;&amp;#039;&amp;#039; together with the next state. The agent&amp;#039;s goal is to learn a &amp;#039;&amp;#039;&amp;#039;policy&amp;#039;&amp;#039;&amp;#039; — a mapping from states to actions — that maximises the expected &amp;#039;&amp;#039;&amp;#039;return&amp;#039;&amp;#039;&amp;#039; (discounted cumulative reward) over time.&lt;br /&gt;
&lt;br /&gt;
Formally, most RL problems are modelled as a &amp;#039;&amp;#039;&amp;#039;Markov decision process&amp;#039;&amp;#039;&amp;#039; (MDP), defined by a tuple &amp;lt;math&amp;gt;(S, A, P, R, \gamma)&amp;lt;/math&amp;gt; where &amp;#039;&amp;#039;S&amp;#039;&amp;#039; is the state space, &amp;#039;&amp;#039;A&amp;#039;&amp;#039; the action space, &amp;#039;&amp;#039;P&amp;#039;&amp;#039; the transition function, &amp;#039;&amp;#039;R&amp;#039;&amp;#039; the reward function, and &amp;lt;math&amp;gt;\gamma \in [0,1]&amp;lt;/math&amp;gt; the discount factor that trades off immediate versus future reward.&lt;br /&gt;
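&lt;br /&gt;
As a minimal sketch of how the discount factor enters the return, the following Python snippet computes the discounted sum of rewards for one finite episode. The function name, the reward list, and the value 0.5 are illustrative assumptions and do not come from any particular library.&lt;br /&gt;

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t] for a finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards from a three-step episode
print(discounted_return(rewards, 0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Each reward is multiplied by one more factor of the discount than the reward before it, so a smaller discount factor makes the agent weigh near-term reward more heavily.&lt;br /&gt;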
&lt;br /&gt;
=== Value functions ===&lt;br /&gt;
A &amp;#039;&amp;#039;&amp;#039;value function&amp;#039;&amp;#039;&amp;#039; estimates the expected return from a given state (or state–action pair) under a given policy:&lt;br /&gt;
* The &amp;#039;&amp;#039;&amp;#039;state-value function&amp;#039;&amp;#039;&amp;#039; &amp;lt;math&amp;gt;V^\pi(s)&amp;lt;/math&amp;gt; gives the expected return starting from state &amp;#039;&amp;#039;s&amp;#039;&amp;#039; and following policy &amp;lt;math&amp;gt;\pi&amp;lt;/math&amp;gt;.&lt;br /&gt;
* The &amp;#039;&amp;#039;&amp;#039;action-value function&amp;#039;&amp;#039;&amp;#039; &amp;lt;math&amp;gt;Q^\pi(s,a)&amp;lt;/math&amp;gt; gives the expected return from taking action &amp;#039;&amp;#039;a&amp;#039;&amp;#039; in state &amp;#039;&amp;#039;s&amp;#039;&amp;#039; and then following &amp;lt;math&amp;gt;\pi&amp;lt;/math&amp;gt;.&lt;br /&gt;
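An optimal policy &amp;lt;math&amp;gt;\pi^*&amp;lt;/math&amp;gt; is one whose value function is at least as large as that of every other policy in every state. More than one policy can be optimal, but all optimal policies share the same optimal value function.&lt;br /&gt;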
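&lt;br /&gt;
To illustrate how value functions relate to an optimal policy, here is a sketch of value iteration on a hypothetical two-state MDP with deterministic transitions. The states, action names, rewards, and discount factor are invented for illustration. Acting greedily with respect to the converged values gives an optimal policy for this toy problem.&lt;br /&gt;

```python
GAMMA = 0.5  # assumed discount factor for this toy example

# Hypothetical MDP: MDP[state][action] = (next_state, reward)
MDP = {
    0: {"stay": (0, 0.0), "move": (1, 1.0)},
    1: {"stay": (1, 2.0), "move": (0, 0.0)},
}


def q_value(values, state, action):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    next_state, reward = MDP[state][action]
    return reward + GAMMA * values[next_state]


def value_iteration(sweeps=100):
    """Repeatedly apply the Bellman optimality backup to every state."""
    values = {s: 0.0 for s in MDP}
    for _ in range(sweeps):
        values = {s: max(q_value(values, s, a) for a in MDP[s]) for s in MDP}
    return values


def greedy_policy(values):
    """Pick, in each state, the action with the highest one-step lookahead."""
    return {s: max(MDP[s], key=lambda a: q_value(values, s, a)) for s in MDP}


values = value_iteration()
print(values)
print(greedy_policy(values))
```

Solving the Bellman optimality equations by hand for this toy problem gives a value of 4 for state 1 (stay forever) and 3 for state 0 (move, then stay), which the iteration should approach.&lt;br /&gt;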
&lt;br /&gt;
=== Exploration versus exploitation ===&lt;br /&gt;
A central challenge in RL is the &amp;#039;&amp;#039;&amp;#039;exploration–exploitation dilemma&amp;#039;&amp;#039;&amp;#039;: the agent must balance &amp;#039;&amp;#039;&amp;#039;exploiting&amp;#039;&amp;#039;&amp;#039; actions it already knows yield high reward against &amp;#039;&amp;#039;&amp;#039;exploring&amp;#039;&amp;#039;&amp;#039; unfamiliar actions that might yield even higher reward. Common strategies include ε-greedy (take a random action with probability ε), softmax action selection, upper confidence bounds (UCB), and intrinsic curiosity modules.&lt;br /&gt;
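&lt;br /&gt;
The ε-greedy rule mentioned above can be sketched in a few lines of Python. The function name, the list of estimated action values, and the seed are assumptions made for this example.&lt;br /&gt;

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if epsilon > rng.random():
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)           # fixed seed so runs are repeatable
q_estimates = [0.1, 0.9, 0.3]    # hypothetical action-value estimates
picks = [epsilon_greedy(q_estimates, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # fraction of greedy picks
```

With ε = 0.1 and three actions, the greedy action is expected to be chosen about 93% of the time (90% from exploitation plus a third of the 10% exploration), so the printed fraction should be close to that.&lt;br /&gt;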
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
=== Foundations (1950s–1980s) ===&lt;br /&gt;
The intellectual roots of RL lie in animal psychology and optimal control:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1954&amp;#039;&amp;#039;&amp;#039;: Bellman formulated &amp;#039;&amp;#039;&amp;#039;dynamic programming&amp;#039;&amp;#039;&amp;#039; and the Bellman equation, which recursively decomposes an optimal decision problem into sub-problems.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1960s&amp;#039;&amp;#039;&amp;#039;: Researchers in adaptive control and operations research used value-iteration and policy-iteration algorithms, though not under the name &amp;quot;reinforcement learning.&amp;quot;&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1972&amp;#039;&amp;#039;&amp;#039;: Klopf&amp;#039;s &amp;#039;&amp;#039;hedonistic neuron&amp;#039;&amp;#039; hypothesis proposed that individual neurons in the brain act as reinforcement learners, reviving interest in reward-driven learning after a dormant period.&lt;br /&gt;
&lt;br /&gt;
=== Temporal-difference learning (1988–1992) ===&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1988&amp;#039;&amp;#039;&amp;#039;: Richard Sutton introduced &amp;#039;&amp;#039;&amp;#039;TD(λ)&amp;#039;&amp;#039;&amp;#039;, a family of methods that learn value estimates by bootstrapping — updating predictions based partly on subsequent predictions rather than waiting for the final outcome. TD learning unified Monte Carlo sampling and dynamic programming.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1989&amp;#039;&amp;#039;&amp;#039;: Chris Watkins proposed &amp;#039;&amp;#039;&amp;#039;Q-learning&amp;#039;&amp;#039;&amp;#039;, a model-free, off-policy algorithm that directly learns the optimal action-value function &amp;lt;math&amp;gt;Q^*&amp;lt;/math&amp;gt; without requiring a model of the environment. Q-learning remains one of the most influential RL algorithms.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1992&amp;#039;&amp;#039;&amp;#039;: Gerald Tesauro&amp;#039;s &amp;#039;&amp;#039;&amp;#039;TD-Gammon&amp;#039;&amp;#039;&amp;#039; used temporal-difference learning with a neural network to play backgammon at expert level, demonstrating that RL combined with function approximation could solve complex tasks.&lt;br /&gt;
&lt;br /&gt;
=== Policy gradient methods (1992–2000s) ===&lt;br /&gt;
Value-based methods struggle in high-dimensional or continuous action spaces. &amp;#039;&amp;#039;&amp;#039;Policy gradient&amp;#039;&amp;#039;&amp;#039; methods instead parameterise the policy directly and update parameters by gradient ascent on expected reward:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1992&amp;#039;&amp;#039;&amp;#039;: Ronald Williams published the &amp;#039;&amp;#039;&amp;#039;REINFORCE&amp;#039;&amp;#039;&amp;#039; algorithm, using the log-derivative trick to estimate the policy gradient from sampled trajectories.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1999&amp;#039;&amp;#039;&amp;#039;: Sutton, McAllester, Singh, and Mansour proved the &amp;#039;&amp;#039;&amp;#039;policy gradient theorem&amp;#039;&amp;#039;&amp;#039;, providing a general formula for the gradient of expected return with respect to policy parameters.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2000s&amp;#039;&amp;#039;&amp;#039;: &amp;#039;&amp;#039;&amp;#039;Actor–critic&amp;#039;&amp;#039;&amp;#039; architectures combined a policy (actor) with a value function (critic), reducing variance while maintaining the flexibility of policy gradients.&lt;br /&gt;
&lt;br /&gt;
=== Deep reinforcement learning (2013–present) ===&lt;br /&gt;
The combination of deep neural networks with RL algorithms produced a series of breakthroughs:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2013&amp;#039;&amp;#039;&amp;#039;: Mnih et al. at [[Google DeepMind|DeepMind]] introduced the &amp;#039;&amp;#039;&amp;#039;Deep Q-Network&amp;#039;&amp;#039;&amp;#039; (DQN), which used a convolutional neural network to play Atari 2600 games from raw pixels at superhuman level, published in a landmark &amp;#039;&amp;#039;Nature&amp;#039;&amp;#039; paper in 2015.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2016&amp;#039;&amp;#039;&amp;#039;: DeepMind&amp;#039;s &amp;#039;&amp;#039;&amp;#039;[[AlphaGo]]&amp;#039;&amp;#039;&amp;#039; defeated Lee Sedol 4–1 in Go using a combination of deep neural networks and Monte Carlo tree search, followed by &amp;#039;&amp;#039;&amp;#039;AlphaGo Zero&amp;#039;&amp;#039;&amp;#039; (2017) which learnt entirely from self-play without human game data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2017&amp;#039;&amp;#039;&amp;#039;: Schulman et al. published &amp;#039;&amp;#039;&amp;#039;Proximal Policy Optimization&amp;#039;&amp;#039;&amp;#039; (PPO), a policy gradient method with clipped surrogate objectives that became the standard algorithm for many RL applications due to its stability and ease of tuning.&lt;br /&gt;
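* &amp;#039;&amp;#039;&amp;#039;2018–2019&amp;#039;&amp;#039;&amp;#039;: OpenAI&amp;#039;s &amp;#039;&amp;#039;&amp;#039;OpenAI Five&amp;#039;&amp;#039;&amp;#039; was trained to play Dota 2 using PPO at massive scale (128,000 CPU cores, 256 GPUs). It lost its exhibition games against professional players at The International 2018, then defeated the reigning world champions, OG, in April 2019.&lt;br /&gt;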
* &amp;#039;&amp;#039;&amp;#039;2019&amp;#039;&amp;#039;&amp;#039;: DeepMind&amp;#039;s &amp;#039;&amp;#039;&amp;#039;AlphaStar&amp;#039;&amp;#039;&amp;#039; reached Grandmaster level in StarCraft II.&lt;br /&gt;
&lt;br /&gt;
=== RLHF and language models (2017–present) ===&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2017&amp;#039;&amp;#039;&amp;#039;: Christiano et al. proposed learning reward models from human preferences, laying the groundwork for [[reinforcement learning from human feedback]].&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2022&amp;#039;&amp;#039;&amp;#039;: OpenAI used RLHF with PPO to fine-tune GPT-3.5 into [[ChatGPT]], making RL central to the alignment of modern [[large language model]]s. [[Anthropic]] similarly uses RLHF and [[Constitutional AI|constitutional AI]] to train [[Claude (AI)|Claude]].&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2023–2025&amp;#039;&amp;#039;&amp;#039;: Alternatives to PPO-based RLHF emerged that simplify the training pipeline. &amp;#039;&amp;#039;&amp;#039;Direct Preference Optimization&amp;#039;&amp;#039;&amp;#039; (DPO, 2023) optimises the policy directly on preference pairs, eliminating the separate reward model. &amp;#039;&amp;#039;&amp;#039;Group Relative Policy Optimization&amp;#039;&amp;#039;&amp;#039; (GRPO, 2024) keeps a reward signal but drops the learned value function (critic), using rewards normalised within a group of sampled responses as the baseline.&lt;br /&gt;
&lt;br /&gt;
== Major algorithms ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm !! Type !! Year !! Key idea&lt;br /&gt;
|-&lt;br /&gt;
| Q-learning || Value-based, off-policy || 1989 || Learn &amp;lt;math&amp;gt;Q^*&amp;lt;/math&amp;gt; directly via TD updates&lt;br /&gt;
|-&lt;br /&gt;
| SARSA || Value-based, on-policy || 1994 || On-policy variant of Q-learning&lt;br /&gt;
|-&lt;br /&gt;
| REINFORCE || Policy gradient || 1992 || Monte Carlo policy gradient via log-derivative trick&lt;br /&gt;
|-&lt;br /&gt;
| A3C / A2C || Actor–critic || 2016 || Asynchronous advantage actor–critic&lt;br /&gt;
|-&lt;br /&gt;
| DQN || Deep value-based || 2013 || Deep CNN + experience replay (target network added in the 2015 version)&lt;br /&gt;
|-&lt;br /&gt;
| DDPG || Deep actor–critic || 2015 || Continuous-action deterministic policy gradient&lt;br /&gt;
|-&lt;br /&gt;
| PPO || Policy gradient || 2017 || Clipped surrogate objective for stable updates&lt;br /&gt;
|-&lt;br /&gt;
| SAC || Actor–critic || 2018 || Maximum-entropy framework for exploration&lt;br /&gt;
|-&lt;br /&gt;
| AlphaZero || Model-based + search || 2017 || Neural MCTS self-play (Go, chess, shogi)&lt;br /&gt;
|-&lt;br /&gt;
| MuZero || Model-based + search || 2019 || Learned dynamics model without game rules&lt;br /&gt;
|-&lt;br /&gt;
| DPO || Preference optimisation || 2023 || Direct policy optimisation from preferences without reward model&lt;br /&gt;
|}&lt;br /&gt;
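&lt;br /&gt;
As a sketch of the first row of the table, the snippet below runs tabular Q-learning on a hypothetical five-state chain in which the only reward is given for reaching the right-hand end. The environment, the hyperparameters, and all names are assumptions chosen for illustration. The update bootstraps from the maximum next-state value, which is what makes the method off-policy.&lt;br /&gt;

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical chain task)
ACTIONS = (-1, +1)    # index 0 steps left, index 1 steps right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # assumed step size, discount, exploration rate


def step(state, action):
    """Deterministic chain: reward 1.0 only on reaching the right end."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in range(N_STATES)}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy with random tie-breaking.
            if EPSILON > rng.random():
                a = rng.randrange(2)
            else:
                best = max(q[state])
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # TD target uses the greedy value of the next state, not the action taken.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q = train()
print(["right" if q[s][1] > q[s][0] else "left" for s in range(N_STATES - 1)])
```

After training, the learned values should favour stepping right in every non-terminal state, since that is the shortest route to the only reward.&lt;br /&gt;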
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
=== Games and simulations ===&lt;br /&gt;
RL has achieved superhuman or top-professional performance in board games (Go, chess, shogi via AlphaZero), video games (Atari via DQN, Dota 2 via OpenAI Five, StarCraft II via AlphaStar), and card games (poker via Pluribus).&lt;br /&gt;
&lt;br /&gt;
=== Robotics ===&lt;br /&gt;
RL is used for robotic manipulation, locomotion, and dexterous control. Sim-to-real transfer — training in simulation and deploying on physical hardware — addresses the sample inefficiency of real-world RL. Notable results include OpenAI&amp;#039;s robotic hand solving a Rubik&amp;#039;s Cube (2019) and Google&amp;#039;s RT-2 vision–language–action model (2023).&lt;br /&gt;
&lt;br /&gt;
=== Language model alignment ===&lt;br /&gt;
[[Reinforcement learning from human feedback]] (RLHF) is now standard practice for aligning large language models with human intent. A reward model trained on human preference judgments provides the reward signal, and the language model policy is fine-tuned using PPO or variants. This approach underlies [[ChatGPT]], [[Claude (AI)|Claude]], and most modern chat-oriented LLMs.&lt;br /&gt;
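&lt;br /&gt;
One common way to train such a reward model, assumed here for illustration, is a pairwise Bradley-Terry loss: the model assigns a scalar score to each of two responses, and the loss is small when the response preferred by the human annotator scores higher. The function name and the example scores are hypothetical.&lt;br /&gt;

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one
    under a Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) equals -log(sigmoid(margin)); split by sign for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for a preferred and a rejected response.
print(preference_loss(2.0, 0.0))  # small loss: the ranking agrees with the label
print(preference_loss(0.0, 2.0))  # larger loss: the ranking contradicts the label
```

In a full pipeline the scores would come from a neural network and the loss would be averaged over many labelled pairs and minimised by gradient descent. Only the per-pair loss is shown here.&lt;br /&gt;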
&lt;br /&gt;
=== Autonomous systems ===&lt;br /&gt;
Self-driving vehicles, drone navigation, and traffic signal control use RL to optimise sequential decisions under uncertainty. Waymo and other autonomous-vehicle companies incorporate RL-based planners alongside rule-based safety layers.&lt;br /&gt;
&lt;br /&gt;
=== Science and engineering ===&lt;br /&gt;
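RL has been applied to chip design (Google&amp;#039;s chip placement work), chemical synthesis planning, and energy grid management. Protein structure prediction with [[AlphaFold]] is sometimes listed alongside these, although AlphaFold itself is trained mainly with supervised learning rather than RL.&lt;br /&gt;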
&lt;br /&gt;
== Challenges ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Sample efficiency&amp;#039;&amp;#039;&amp;#039;: Model-free RL typically requires millions of interactions, making direct real-world training expensive or dangerous.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Reward specification&amp;#039;&amp;#039;&amp;#039;: Poorly designed reward functions lead to &amp;#039;&amp;#039;&amp;#039;reward hacking&amp;#039;&amp;#039;&amp;#039;, where the agent finds unintended shortcuts. This is especially acute in [[AI alignment]].&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Partial observability&amp;#039;&amp;#039;&amp;#039;: Real environments rarely satisfy the full-observability assumption of MDPs, requiring extensions such as POMDPs or recurrent policies.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Credit assignment&amp;#039;&amp;#039;&amp;#039;: Determining which past actions contributed to a delayed reward remains fundamentally difficult, particularly in long-horizon tasks.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Stability&amp;#039;&amp;#039;&amp;#039;: Combining function approximation, bootstrapping, and off-policy learning (the &amp;quot;deadly triad&amp;quot;) can cause divergence.&lt;br /&gt;
&lt;br /&gt;
== Relationship to other fields ==&lt;br /&gt;
&lt;br /&gt;
Reinforcement learning draws on &amp;#039;&amp;#039;&amp;#039;dynamic programming&amp;#039;&amp;#039;&amp;#039; (Bellman equations, value iteration), &amp;#039;&amp;#039;&amp;#039;statistics&amp;#039;&amp;#039;&amp;#039; (bandit problems, Bayesian optimisation), &amp;#039;&amp;#039;&amp;#039;neuroscience&amp;#039;&amp;#039;&amp;#039; (dopamine reward-prediction-error signals closely parallel TD learning), and &amp;#039;&amp;#039;&amp;#039;optimal control&amp;#039;&amp;#039;&amp;#039; (Hamilton–Jacobi–Bellman equation, model-predictive control). The dopamine analogy, formalised by Schultz, Dayan, and Montague (1997), showed that phasic dopamine neuron firing in the primate brain closely resembles the TD error signal, providing biological grounding for RL theory.&lt;br /&gt;
&lt;br /&gt;
== Key references ==&lt;br /&gt;
&lt;br /&gt;
* Sutton, R. S. and Barto, A. G. (2018). &amp;#039;&amp;#039;Reinforcement Learning: An Introduction&amp;#039;&amp;#039; (2nd ed.). MIT Press. — The standard textbook, freely available online.&lt;br /&gt;
* Mnih, V. et al. (2015). &amp;quot;Human-level control through deep reinforcement learning.&amp;quot; &amp;#039;&amp;#039;Nature&amp;#039;&amp;#039;, 518(7540), 529–533.&lt;br /&gt;
* Silver, D. et al. (2016). &amp;quot;Mastering the game of Go with deep neural networks and tree search.&amp;quot; &amp;#039;&amp;#039;Nature&amp;#039;&amp;#039;, 529(7587), 484–489.&lt;br /&gt;
* Schulman, J. et al. (2017). &amp;quot;Proximal Policy Optimization Algorithms.&amp;quot; &amp;#039;&amp;#039;arXiv:1707.06347&amp;#039;&amp;#039;.&lt;br /&gt;
* Christiano, P. et al. (2017). &amp;quot;Deep reinforcement learning from human preferences.&amp;quot; &amp;#039;&amp;#039;NeurIPS 2017&amp;#039;&amp;#039;.&lt;br /&gt;
* Rafailov, R. et al. (2023). &amp;quot;Direct Preference Optimization: Your Language Model is Secretly a Reward Model.&amp;quot; &amp;#039;&amp;#039;NeurIPS 2023&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Reinforcement learning from human feedback]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Large language model]]&lt;br /&gt;
* [[AI alignment]]&lt;br /&gt;
* [[Artificial general intelligence]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;/div&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
</feed>