GPT-2

GPT-2 (Generative Pre-trained Transformer 2) is a large language model developed by OpenAI and first released in February 2019. Built on the transformer decoder architecture, GPT-2 demonstrated that scaling unsupervised language models to 1.5 billion parameters could produce coherent, multi-paragraph text. It drew unusually wide public attention because OpenAI initially withheld the full model over concerns about misuse and released it in stages over nine months.

Background

GPT-2 was the successor to OpenAI's original GPT (June 2018, 117M parameters), which had demonstrated that transfer learning via generative pre-training on unlabelled text followed by discriminative fine-tuning could achieve state-of-the-art results across diverse NLP benchmarks. GPT-2 scaled this approach by an order of magnitude and shifted the emphasis from fine-tuning to zero-shot task performance: the model was evaluated on tasks it had never been explicitly trained for, relying solely on its language modelling ability.

The paper, "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, was released on 14 February 2019 alongside a blog post announcing the staged release strategy.

Architecture

GPT-2 uses a decoder-only transformer with the following modifications relative to the original GPT:

Layer normalisation moved to the input of each sub-block (pre-norm), with an additional layer normalisation after the final self-attention block
Vocabulary expanded to 50,257 tokens using byte-level byte-pair encoding (BPE), enabling the model to represent any UTF-8 string without unknown tokens
Context window doubled from 512 to 1,024 tokens
Residual layer initialisation scaled by 1/√N, where N is the number of residual layers, to stabilise training at depth
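
Two of the modifications listed above lend themselves to a small illustration. The following Python sketch is not OpenAI's code: it only shows why a byte-level base vocabulary leaves no string unrepresentable, and how a 1/√N factor shrinks the initial scale of residual-path weights. The split of 50,257 into 256 byte symbols, 50,000 learned merges, and one end-of-text token, the base standard deviation of 0.02, and the choice of N as two residual sub-blocks per layer are assumptions made for illustration (they follow common reimplementations), not statements from the text above.

```python
import math
from typing import List

# Assumed vocabulary breakdown (illustrative): 256 byte symbols,
# 50,000 learned BPE merges, and 1 end-of-text token.
N_BYTE_SYMBOLS = 256
N_MERGES = 50_000
N_SPECIAL = 1
VOCAB_SIZE = N_BYTE_SYMBOLS + N_MERGES + N_SPECIAL  # 50,257


def to_base_symbols(text: str) -> List[int]:
    """Map a string to byte ids before any merges are applied.

    Every id falls in range(256), so the base vocabulary covers any
    UTF-8 string and no unknown-token fallback is needed.
    """
    return list(text.encode("utf-8"))


def residual_init_std(n_layers: int, base_std: float = 0.02) -> float:
    """Standard deviation for residual-path weights scaled by 1/sqrt(N).

    Assumption: N counts two residual sub-blocks (attention and MLP)
    per transformer layer.
    """
    n_residual = 2 * n_layers
    return base_std / math.sqrt(n_residual)


ids = to_base_symbols("naïve 🙂")
print(VOCAB_SIZE, ids, all(0 <= i < 256 for i in ids))
print(residual_init_std(48))
```

The merges themselves are learned from data and are omitted here; the point is only that the 256 byte symbols guarantee coverage before any merge is applied.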

GPT-2 model variants
Variant	Parameters	Layers	Embedding dim	Heads
GPT-2 Small	117M	12	768	12
GPT-2 Medium	345M	24	1,024	16
GPT-2 Large	762M	36	1,280	20
GPT-2 XL	1,542M	48	1,600	25
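
The parameter counts can be roughly reconstructed from the layer count and embedding dimension. The sketch below is an estimate under stated assumptions, not the paper's own accounting: it assumes a feed-forward hidden size of 4× the embedding dimension, bias terms on every linear layer, learned positional embeddings for 1,024 positions, a 50,257-token vocabulary, and an output projection that shares weights with the token embedding. Under those assumptions the smallest configuration comes to about 124M parameters, somewhat above the 117M the paper reports, which matches the figure usually quoted for the released checkpoint.

```python
def estimate_params(n_layers: int, d_model: int,
                    vocab_size: int = 50_257, n_ctx: int = 1_024) -> int:
    """Estimate the parameter count of a GPT-2-style decoder.

    Assumptions (illustrative): MLP hidden size is 4 * d_model, all
    linear layers have biases, and the output projection is tied to
    the token embedding (so it adds no parameters).
    """
    attn = 3 * d_model * d_model + 3 * d_model   # fused Q, K, V projection
    attn += d_model * d_model + d_model          # attention output projection
    mlp = 4 * d_model * d_model + 4 * d_model    # expand to 4 * d_model
    mlp += 4 * d_model * d_model + d_model       # project back to d_model
    layer_norms = 2 * (2 * d_model)              # two LayerNorms (gain + bias)
    per_layer = attn + mlp + layer_norms

    embeddings = vocab_size * d_model + n_ctx * d_model  # token + position
    final_norm = 2 * d_model
    return n_layers * per_layer + embeddings + final_norm


variants = {
    "GPT-2 Small": (12, 768),
    "GPT-2 Medium": (24, 1_024),
    "GPT-2 Large": (36, 1_280),
    "GPT-2 XL": (48, 1_600),
}
for name, (n_layers, d_model) in variants.items():
    print(f"{name}: {estimate_params(n_layers, d_model) / 1e6:.0f}M")
```

The transformer layers dominate the total at larger sizes, while the token embedding is a substantial share of the smallest model.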

Training

WebText dataset

GPT-2 was trained on WebText, a dataset of approximately 40 GB of text (8 million documents) scraped from outbound links on Reddit that received at least 3 karma (upvotes minus downvotes). The rationale was that Reddit's voting mechanism provided a natural quality filter: links that users found valuable enough to upvote were more likely to contain well-written, informative content.

Wikipedia was deliberately excluded from WebText to avoid contaminating test sets, since many NLP benchmarks drew from Wikipedia. The resulting dataset covered a broad range of domains including news, fiction, code, scientific articles, and forum discussions.
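
The selection rule described above can be sketched as a simple filter. This is a hypothetical reconstruction for illustration only: the `Submission` record, its field names, and the domain check are assumptions, and the actual WebText pipeline (which also involved text extraction and content de-duplication) was not released.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold stated in the text


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for a Reddit post that links to an external page."""
    url: str
    karma: int  # upvotes minus downvotes


def is_wikipedia(url: str) -> bool:
    """Return True for wikipedia.org and its subdomains (assumed rule)."""
    host = (urlparse(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def select_links(submissions: Iterable[Submission]) -> List[str]:
    """Keep each outbound URL once if it meets the karma threshold."""
    seen = set()
    kept = []
    for sub in submissions:
        if sub.karma < MIN_KARMA or is_wikipedia(sub.url):
            continue
        if sub.url not in seen:
            seen.add(sub.url)
            kept.append(sub.url)
    return kept


sample = [
    Submission("https://example.com/essay", 12),
    Submission("https://example.com/low-score", 2),
    Submission("https://en.wikipedia.org/wiki/Transformer", 40),
    Submission("https://example.com/essay", 5),
]
print(select_links(sample))
```

Only the first URL survives in this sample: the second falls below the karma threshold, the third is a Wikipedia page, and the fourth is a repeat of the first.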

OpenAI did not release WebText. An open-source replication, OpenWebText, was subsequently created by Aaron Gokaslan and Vanya Cohen using the same Reddit-link methodology.

Training details

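The 1.5B model was reportedly trained on 256 Google Cloud TPU v3 cores. The paper gives few optimisation details: it states that the learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. A 2,000-step warm-up to a peak of 2.5×10⁻⁴ followed by cosine decay is the schedule reported for the original GPT, not for GPT-2. Batch size was 512 sequences of 1,024 tokens each (524,288 tokens, or roughly half a million, per batch).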

Staged release

GPT-2's release became a flashpoint in the debate over responsible AI disclosure. On 14 February 2019, OpenAI published the paper and released only the smallest (117M) model, citing concerns about malicious applications of the technology as its reason for withholding the larger trained models.

The staged release proceeded:

February 2019 – 117M model released alongside the paper
May 2019 – 345M model released
August 2019 – 762M model released
November 2019 – Full 1.5B model released after nine months

OpenAI stated that it used the staged release to monitor for misuse and commissioned external analyses. The decision was controversial: critics, including several prominent AI researchers, argued that the model was not sufficiently dangerous to justify withholding, that the staged release was primarily a publicity strategy, and that it set a harmful precedent for restricting open research. Others, including some AI safety researchers, praised the approach as a reasonable experiment in responsible disclosure.

By November 2019, several independent groups had replicated GPT-2-scale models, and OpenAI released the full 1.5B model, concluding that "we've seen no strong evidence of misuse so far."

Capabilities and benchmarks

GPT-2 XL achieved state-of-the-art results on 7 of 8 language modelling benchmarks in a zero-shot setting (without task-specific training data):

Penn Treebank – perplexity of 35.76 (previous SOTA: 46.54)
WikiText-103 – perplexity of 17.48
LAMBADA – accuracy of 63.24% (previous SOTA: 59.23%)
Children's Book Test (Common Nouns) – accuracy of 93.3%
Winograd Schema Challenge – accuracy of 70.70%
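
For readers unfamiliar with the perplexity figures above: perplexity is the exponential of the average negative log-likelihood the model assigns to each token, so lower is better. The sketch below computes it from a list of per-token natural-log probabilities. The function name and inputs are illustrative, and benchmark-specific details (tokenisation, de-tokenisation, the unit used for normalisation) are deliberately left out.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each observed token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every token probability 1/4 has a perplexity of
# about 4: it is as uncertain as a uniform choice among four options.
print(perplexity([math.log(0.25)] * 10))
```

Read this way, a perplexity of 35.76 means the model is, on average, about as uncertain as a uniform choice among roughly 36 options per token.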

The model also demonstrated reading comprehension ability on the CoQA dataset, achieving 55 F1 in a zero-shot setting, which the paper describes as matching or exceeding 3 of 4 baseline systems that were trained directly on the task.
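
Zero-shot evaluation of this kind works by formatting the task as text for the model to continue. According to the paper, the CoQA result was obtained by conditioning on the document, the conversation history, and a final `A:` token. The sketch below builds such a prompt. The exact layout, the `Q:` prefix, the sample passage, and the function name are assumptions for illustration, and the call to a language model is omitted.

```python
from typing import List, Tuple


def build_coqa_prompt(document: str,
                      history: List[Tuple[str, str]],
                      question: str) -> str:
    """Format a passage, prior Q/A turns, and a new question as one prompt.

    The prompt ends with "A:" so that the model's continuation is read
    as the answer. The layout is an illustrative assumption.
    """
    lines = [document.strip(), ""]
    for past_question, past_answer in history:
        lines.append(f"Q: {past_question}")
        lines.append(f"A: {past_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


# Made-up passage and questions, used only to show the prompt shape.
prompt = build_coqa_prompt(
    document="Mara planted a pear tree in spring. By autumn it had grown two new branches.",
    history=[("What did Mara plant?", "A pear tree")],
    question="When did she plant it?",
)
print(prompt)
```

No task-specific parameters are trained in this setup; the task is conveyed entirely by the text the model is asked to continue.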

GPT-2's text generation was sufficiently fluent that human evaluators rated its outputs as "credible" approximately 83% of the time on news-style prompts in an informal OpenAI evaluation.

Impact and legacy

GPT-2 was foundational to the scaling paradigm that would produce GPT-3, GPT-4, and the broader large language model era:

Zero-shot learning: GPT-2 demonstrated that language models could perform tasks they were never trained for, establishing zero-shot and few-shot prompting as core evaluation paradigms
Scaling hypothesis: the jump from 117M to 1.5B parameters showed consistent capability gains, motivating the much larger investments behind GPT-3 (175B) and subsequent models
AI safety discourse: the staged release triggered one of the first major public debates about AI capabilities disclosure, influencing how Anthropic, Google DeepMind, and other labs would later handle model releases
Open-source ecosystem: the release of GPT-2 weights helped drive adoption of the Hugging Face Transformers library and the broader open model ecosystem. GPT-2 remains widely fine-tuned and experimented with, in applications ranging from creative writing to code generation to research prototyping.

GPT-2 was superseded by GPT-3 (June 2020), but its weights remain freely available and continue to be widely used for research, education, and fine-tuning.

References

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners". OpenAI.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI.
Solaiman, I., et al. (2019). "Release Strategies and the Social Impacts of Language Models". arXiv:1908.09203.
Gokaslan, A. & Cohen, V. (2019). "OpenWebText Corpus".