Reinforcement learning from human feedback

Reinforcement learning from human feedback (RLHF) is a machine learning technique that trains artificial intelligence models to align their outputs with human preferences. It combines reinforcement learning with human evaluative judgments to fine-tune large language models and other AI systems beyond what supervised learning alone can achieve.

Overview

RLHF addresses a fundamental challenge in AI development: humans can often judge whether an output is good far more easily than they can formally define what "good" means. Rather than relying solely on fixed loss functions, RLHF uses human feedback as a training signal, enabling models to learn nuanced preferences about helpfulness, accuracy, and safety.

The technique rose to prominence following the release of ChatGPT in November 2022, which OpenAI trained with RLHF to produce more conversational and helpful responses than its base model.

Process

RLHF typically involves three stages:

1. Supervised fine-tuning

A pre-trained language model is first fine-tuned on a dataset of demonstrations: high-quality examples of desired behavior written by human annotators. This produces an initial policy that can follow instructions.

2. Reward model training

Human evaluators compare pairs of model outputs and indicate which response is better. These preference comparisons are used to train a separate reward model that predicts a scalar score representing how well a given output matches human preferences. The reward model learns to generalize from the comparison data to assign scores to novel outputs.
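As an illustration, a common way to turn pairwise comparisons into a training objective is a Bradley-Terry style loss, as used for example by Ouyang et al. (2022): the loss is the negative log-probability that the preferred response receives the higher score. The sketch below works on plain floats rather than a neural network, and the function names are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it does the opposite.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(score_pairs):
    """Average the pairwise loss over (score_chosen, score_rejected) tuples."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)
```

In a real system the two scores would come from the reward model's forward pass on the same prompt paired with each response, and the loss would be minimized by gradient descent over the reward model's parameters.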

3. Policy optimization

The language model is then optimized using reinforcement learning (typically Proximal Policy Optimization, or PPO) to maximize the reward model's predicted score. A KL-divergence penalty against the supervised fine-tuned model discourages the policy from drifting far from that model's output distribution, which helps keep its outputs fluent and limits how much it can exploit flaws in the reward model.
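The KL penalty can be sketched as a correction to the reward for a single sampled response. The example below is not a PPO implementation; it only shows how the quantity being maximized might be computed, assuming per-token log-probabilities are available from both the current policy and the supervised fine-tuned reference model. The function name and the default value of the coefficient beta are illustrative assumptions.

```python
def kl_penalized_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Return reward_score minus beta times a per-sequence KL estimate.

    policy_logprobs and reference_logprobs hold the log-probability that each
    model assigns to every token of the same sampled response.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    # Summed log-ratio over tokens: a single-sample estimate of the KL divergence
    # between the policy and the reference model for this response.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * log_ratio
```

When the policy assigns its own sample a much higher probability than the reference model does, the log-ratio is large and the effective reward drops, so a high reward model score alone is not enough to justify a large departure from the reference.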

History

The conceptual foundations of learning reward functions from human behavior include work by Andrew Ng and Stuart Russell in 2000 on inverse reinforcement learning, which infers a reward function from observed behavior. Key milestones include:

2017: Christiano et al. at OpenAI and DeepMind published "Deep Reinforcement Learning from Human Preferences", demonstrating the approach on Atari games and simulated robotics tasks.
2020: Stiennon et al. applied RLHF to text summarization, showing significant improvements over supervised-only baselines.
2022: OpenAI released InstructGPT, a family of GPT-3 models fine-tuned with RLHF (Ouyang et al.); labelers preferred outputs from the 1.3-billion-parameter InstructGPT model over those of the 175-billion-parameter GPT-3.
2022: ChatGPT brought RLHF to mainstream attention.
2022: Anthropic published Constitutional AI (Bai et al.), which combines RLHF with AI-generated feedback in addition to human preferences.

Variants and extensions

Constitutional AI (CAI)

Anthropic's Constitutional AI replaces some human feedback with AI self-critique guided by a set of principles (a "constitution"). The model generates responses, critiques them against the principles, and revises them; the revised responses and AI-generated preference labels then serve as training data, reducing reliance on human annotators.
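The generate, critique, and revise cycle described above can be sketched as a short loop. This is a simplified illustration, not Anthropic's implementation: the `generate` callable stands in for any text-generation model, the prompt wording is invented for the example, and the loop applies every principle in turn, which is an assumption made here for simplicity.

```python
from typing import Callable, Sequence


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principles: Sequence[str],
) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```

The final revised response could then be stored alongside the original prompt as a fine-tuning example.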

Direct Preference Optimization (DPO)

Proposed by Rafailov et al. in 2023, DPO reformulates the RLHF objective to eliminate the need for an explicit reward model and RL optimization loop. Instead, it directly optimizes the policy using preference pairs through a classification-style loss, simplifying the training pipeline considerably.
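For a single preference pair, the DPO loss from Rafailov et al. (2023) can be sketched as follows. The inputs are the summed log-probabilities that the policy and a frozen reference model assign to the chosen and rejected responses; the function and parameter names, and the default beta, are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Return -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

The loss falls as the policy raises the probability of the chosen response relative to the reference model more than it raises that of the rejected response, which is why no separate reward model or sampling loop is needed.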

RLAIF

Reinforcement Learning from AI Feedback uses a capable AI model to provide preference judgments instead of human annotators, enabling larger-scale preference data collection.

Challenges

Reward hacking: The policy may learn to exploit weaknesses in the reward model, producing outputs that score highly but are not genuinely preferred by humans.
Scalability: Collecting high-quality human preference data is expensive and slow.
Consistency: Different human annotators often disagree, introducing noise into the training signal.
Specification gaming: Models may learn to be persuasive rather than truthful if evaluators cannot reliably distinguish the two.
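
Annotator disagreement, noted under Consistency above, is often quantified before training. One simple measure is the pairwise agreement rate: across all pairs of annotators and all comparisons, the fraction of cases in which both annotators picked the same response. The sketch below assumes each annotator labeled the same comparisons in the same order; the function name and the "A"/"B" label encoding are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Return the fraction of matching choices across all annotator pairs.

    labels holds one list per annotator, aligned by comparison index;
    each entry names the response that annotator preferred, e.g. "A" or "B".
    """
    agree = 0
    total = 0
    for first, second in combinations(labels, 2):
        for choice_a, choice_b in zip(first, second):
            total += 1
            agree += choice_a == choice_b
    if total == 0:
        raise ValueError("need at least two annotators and one comparison")
    return agree / total
```

A rate near 1.0 indicates a consistent signal, while a rate near 0.5 on binary comparisons means the labels carry little more information than chance.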

Applications

RLHF or closely related preference-based methods have been reported in the training of many commercially deployed large language models, including:

ChatGPT and GPT-4 (OpenAI)
Claude (Anthropic)
Gemini (Google DeepMind)
LLaMA-based chat models (Meta)

References

Christiano, P. et al. (2017). "Deep Reinforcement Learning from Human Preferences". NeurIPS 2017.
Stiennon, N. et al. (2020). "Learning to summarize from human feedback". NeurIPS 2020.
Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback". NeurIPS 2022.
Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS 2023.
Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv:2212.08073.