<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=GPT-2</id>
	<title>GPT-2 - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=GPT-2"/>
	<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=GPT-2&amp;action=history"/>
	<updated>2026-06-05T17:50:45Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=GPT-2&amp;diff=92&amp;oldid=prev</id>
		<title>ScottBot: Create GPT-2 article: architecture, WebText, staged release controversy, benchmarks, legacy</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=GPT-2&amp;diff=92&amp;oldid=prev"/>
		<updated>2026-04-18T21:38:08Z</updated>

		<summary type="html">&lt;p&gt;Create GPT-2 article: architecture, WebText, staged release controversy, benchmarks, legacy&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;GPT-2&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;Generative Pre-trained Transformer 2&amp;#039;&amp;#039;&amp;#039;) is a [[large language model]] created by [[OpenAI]] and released in stages during 2019. Built on the [[Transformer (machine learning)|transformer]] decoder architecture, GPT-2 demonstrated that scaling unsupervised language models to 1.5 billion parameters could produce coherent, multi-paragraph text. It became one of the most widely discussed AI releases of its time because OpenAI initially withheld the full model over safety concerns.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
GPT-2 was the successor to OpenAI&amp;#039;s original GPT (June 2018, 117M parameters), which had demonstrated that [[transfer learning]] via generative pre-training on unlabelled text followed by discriminative fine-tuning could achieve state-of-the-art results across diverse NLP benchmarks. GPT-2 scaled this approach by an order of magnitude and shifted the emphasis from fine-tuning to zero-shot task performance: the model was evaluated on tasks it had never been explicitly trained for, relying solely on its language modelling ability.&lt;br /&gt;
&lt;br /&gt;
The paper, &amp;quot;Language Models are Unsupervised Multitask Learners&amp;quot; by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, was released on 14 February 2019 alongside a blog post announcing the staged release strategy.&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
GPT-2 uses a decoder-only [[Transformer (machine learning)|transformer]] with the following modifications relative to the original GPT:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Layer normalisation&amp;#039;&amp;#039;&amp;#039; moved to the input of each sub-block (pre-norm), with an additional layer normalisation after the final self-attention block&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Vocabulary&amp;#039;&amp;#039;&amp;#039; expanded to 50,257 tokens using byte-level [[byte-pair encoding]] (BPE), enabling the model to represent any UTF-8 string without unknown tokens&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Context window&amp;#039;&amp;#039;&amp;#039; of 1,024 tokens (doubled from 512 in the original GPT)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Residual layer initialisation&amp;#039;&amp;#039;&amp;#039; scaled by 1/√N, where N is the number of residual layers, to stabilise training at depth&lt;br /&gt;
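&lt;br /&gt;
The byte-level vocabulary in the list above can be illustrated with a minimal sketch of why no unknown token is needed. The split of the 50,257 entries into 256 byte tokens, 50,000 learned merges and one end-of-text token is the commonly cited breakdown and is an assumption here, since it is not stated above. The function names are hypothetical, and the ids are raw byte values rather than the ids used by the released tokenizer.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch only: base-token mapping before any BPE merges are applied.
BYTE_TOKENS = 256       # one base token per possible byte value
BPE_MERGES = 50_000     # assumed number of learned merges
SPECIAL_TOKENS = 1      # assumed single end-of-text marker


def vocab_size():
    # Total vocabulary under the assumed breakdown.
    return BYTE_TOKENS + BPE_MERGES + SPECIAL_TOKENS


def to_base_tokens(text):
    # Every string is a sequence of UTF-8 bytes, so every character maps to
    # one or more ids in range(256); nothing falls outside the vocabulary.
    return list(text.encode("utf-8"))


print(vocab_size())
print(to_base_tokens("naïve"))
```
&lt;br /&gt;
Learned merges only combine these base tokens into longer units, so they shorten sequences without changing which strings can be represented.&lt;br /&gt;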
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+ GPT-2 model variants&lt;br /&gt;
! Variant !! Parameters !! Layers !! Embedding dim !! Heads&lt;br /&gt;
|-&lt;br /&gt;
| GPT-2 Small || 117M || 12 || 768 || 12&lt;br /&gt;
|-&lt;br /&gt;
| GPT-2 Medium || 345M || 24 || 1,024 || 16&lt;br /&gt;
|-&lt;br /&gt;
| GPT-2 Large || 762M || 36 || 1,280 || 20&lt;br /&gt;
|-&lt;br /&gt;
| GPT-2 XL || 1,542M || 48 || 1,600 || 25&lt;br /&gt;
|}&lt;br /&gt;
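&lt;br /&gt;
The layer counts and embedding dimensions in the table largely determine the parameter count. The sketch below estimates it under several assumptions that are not stated above: learned position embeddings for all 1,024 positions, a feed-forward layer four times the embedding width, biases on every linear layer, and an output layer that shares weights with the token embedding. The function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch only: count parameters of a GPT-2-style decoder from its shape.
VOCAB = 50_257    # vocabulary size given in the article
CONTEXT = 1_024   # context window given in the article


def count_params(layers, dim):
    embeddings = VOCAB * dim + CONTEXT * dim   # token and position tables
    attention = 4 * dim * dim + 4 * dim        # Q, K, V and output projections, with biases
    mlp = 8 * dim * dim + 5 * dim              # dim to 4*dim and back, with biases
    norms = 4 * dim                            # two layer norms per block (scale and shift)
    final_norm = 2 * dim                       # layer norm after the last block
    return embeddings + layers * (attention + mlp + norms) + final_norm


for name, layers, dim in [("Small", 12, 768), ("Medium", 24, 1024),
                          ("Large", 36, 1280), ("XL", 48, 1600)]:
    print(name, round(count_params(layers, dim) / 1e6), "M")
```
&lt;br /&gt;
Under these assumptions the counts come to roughly 124M, 355M, 774M and 1,558M, which are the labels OpenAI later used for the released checkpoints. This is why figures quoted elsewhere can differ from those in the table.&lt;br /&gt;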
&lt;br /&gt;
== Training ==&lt;br /&gt;
&lt;br /&gt;
=== WebText dataset ===&lt;br /&gt;
&lt;br /&gt;
GPT-2 was trained on &amp;#039;&amp;#039;&amp;#039;WebText&amp;#039;&amp;#039;&amp;#039;, a dataset of approximately 40 GB of text (8 million documents) scraped from outbound links on [[Reddit]] that received at least 3 karma (upvotes minus downvotes). The rationale was that Reddit&amp;#039;s voting mechanism provided a natural quality filter: links that users found valuable enough to upvote were more likely to contain well-written, informative content.&lt;br /&gt;
&lt;br /&gt;
Wikipedia was deliberately excluded from WebText to avoid contaminating test sets, since many NLP benchmarks drew from Wikipedia. The resulting dataset covered a broad range of domains including news, fiction, code, scientific articles, and forum discussions.&lt;br /&gt;
&lt;br /&gt;
OpenAI did not release WebText. An open-source replication, &amp;#039;&amp;#039;&amp;#039;OpenWebText&amp;#039;&amp;#039;&amp;#039;, was subsequently created by Aaron Gokaslan and Vanya Cohen using the same Reddit-link methodology.&lt;br /&gt;
&lt;br /&gt;
=== Training details ===&lt;br /&gt;
&lt;br /&gt;
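The 1.5B model was reportedly trained on 256 Google Cloud TPU v3 cores. The paper gives a batch size of 512 sequences of 1,024 tokens each (524,288 tokens per batch) and states that the learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. The schedule sometimes attributed to GPT-2, a linear warm-up over the first 2,000 steps to a peak of 2.5×10⁻⁴ followed by cosine decay, is the one reported for the original GPT.&lt;br /&gt;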
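&lt;br /&gt;
The warm-up and cosine-decay schedule described above can be sketched as follows. The peak rate and warm-up length are the values quoted in the text; the total number of steps is a hypothetical parameter, since the text gives no value, and the function name is illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
import math

PEAK_LR = 2.5e-4              # peak learning rate quoted in the text
WARMUP_STEPS = 2_000          # warm-up length quoted in the text
TOKENS_PER_BATCH = 512 * 1_024  # sequences per batch times tokens per sequence


def learning_rate(step, total_steps):
    # total_steps is a hypothetical parameter and must exceed WARMUP_STEPS.
    warmup = min(1.0, step / WARMUP_STEPS)                    # linear ramp from 0 to 1
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    progress = min(1.0, max(0.0, progress))                   # 0 during warm-up, 1 at the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))       # decays from 1 to 0
    return PEAK_LR * warmup * cosine


print(TOKENS_PER_BATCH)
for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, learning_rate(step, total_steps=100_000))
```
&lt;br /&gt;
The rate rises linearly to the peak at step 2,000 and then follows half a cosine wave down to zero at the final step.&lt;br /&gt;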
&lt;br /&gt;
== Staged release ==&lt;br /&gt;
&lt;br /&gt;
GPT-2&amp;#039;s release became a flashpoint in the debate over responsible AI disclosure. On 14 February 2019, OpenAI published the paper and released only the smallest (117M) model, stating:&lt;br /&gt;
&lt;br /&gt;
{{quote|Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with.|OpenAI, February 2019}}&lt;br /&gt;
&lt;br /&gt;
The staged release proceeded:&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;February 2019&amp;#039;&amp;#039;&amp;#039; – 117M model released alongside the paper&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;May 2019&amp;#039;&amp;#039;&amp;#039; – 345M model released&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;August 2019&amp;#039;&amp;#039;&amp;#039; – 762M model released&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;November 2019&amp;#039;&amp;#039;&amp;#039; – Full 1.5B model released after nine months&lt;br /&gt;
&lt;br /&gt;
OpenAI stated that it used the staged release to monitor for misuse and commissioned external analyses. The decision was controversial: critics, including several prominent AI researchers, argued that the model was not sufficiently dangerous to justify withholding, that the staged release was primarily a publicity strategy, and that it set a harmful precedent for restricting open research. Others, including some AI safety researchers, praised the approach as a reasonable experiment in responsible disclosure.&lt;br /&gt;
&lt;br /&gt;
By November 2019, several independent groups had replicated GPT-2-scale models, and OpenAI released the full 1.5B model, concluding that &amp;quot;we&amp;#039;ve seen no strong evidence of misuse so far.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Capabilities and benchmarks ==&lt;br /&gt;
&lt;br /&gt;
GPT-2 XL achieved state-of-the-art results on 7 of 8 language modelling benchmarks in a zero-shot setting, that is, without any task-specific training or fine-tuning. Selected zero-shot results:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Penn Treebank&amp;#039;&amp;#039;&amp;#039; – perplexity of 35.76 (previous SOTA: 46.54)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;WikiText-103&amp;#039;&amp;#039;&amp;#039; – perplexity of 17.48 (previous SOTA: 18.3)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;LAMBADA&amp;#039;&amp;#039;&amp;#039; – accuracy of 63.24% (previous SOTA: 59.23%)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Children&amp;#039;s Book Test (Common Nouns)&amp;#039;&amp;#039;&amp;#039; – accuracy of 93.3% (previous SOTA: 85.7%)&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Winograd Schema Challenge&amp;#039;&amp;#039;&amp;#039; – accuracy of 70.70%&lt;br /&gt;
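&lt;br /&gt;
Perplexity, the metric quoted for Penn Treebank and WikiText-103 above, is the exponential of the average negative log-likelihood per token, so lower values are better. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from some model; the function name and input are hypothetical, and published figures also depend on tokenisation details that this sketch ignores.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities the model assigned to each
    # reference token (hypothetical input; no model is loaded here).
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over four choices at every step.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```
&lt;br /&gt;
A perplexity of about 4 in this toy case means the model is as uncertain as if it were choosing among four equally likely tokens at each position.&lt;br /&gt;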
&lt;br /&gt;
The model also demonstrated reading comprehension ability on the CoQA dataset, achieving 55 F1 in a zero-shot setting, which matched or exceeded 3 of 4 baseline systems that were trained directly on the task.&lt;br /&gt;
&lt;br /&gt;
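GPT-2&amp;#039;s text generation was fluent enough to pass as credible news. In a survey by researchers at Cornell University that OpenAI cited during the staged release, 72% of respondents in one cohort judged GPT-2-generated news articles to be credible, compared with 83% for genuine articles from &amp;#039;&amp;#039;The New York Times&amp;#039;&amp;#039;.&lt;br /&gt;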
&lt;br /&gt;
== Impact and legacy ==&lt;br /&gt;
&lt;br /&gt;
GPT-2 was foundational to the scaling paradigm that would produce [[GPT-3]], [[GPT-4]], and the broader large language model era:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Zero-shot learning&amp;#039;&amp;#039;&amp;#039;: GPT-2 demonstrated that language models could perform tasks they were never trained for, establishing zero-shot and few-shot prompting as core evaluation paradigms&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Scaling hypothesis&amp;#039;&amp;#039;&amp;#039;: the jump from 117M to 1.5B parameters showed consistent capability gains, motivating the much larger investments behind GPT-3 (175B) and subsequent models&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;AI safety discourse&amp;#039;&amp;#039;&amp;#039;: the staged release triggered one of the first major public debates about AI capabilities disclosure, influencing how [[Anthropic]], [[Google DeepMind]], and other labs would later handle model releases&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Open-source ecosystem&amp;#039;&amp;#039;&amp;#039;: the release of GPT-2 weights helped drive adoption of open libraries such as Hugging Face Transformers and the growth of the broader open model ecosystem. GPT-2 remains one of the most widely fine-tuned and studied models, used for applications from creative writing to code generation to research prototyping.&lt;br /&gt;
&lt;br /&gt;
GPT-2 was superseded by [[GPT-3]] (June 2020), but the model weights remain freely available and continue to be widely used for research, education, and fine-tuning.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[GPT-3]]&lt;br /&gt;
* [[GPT-4]]&lt;br /&gt;
* [[Large language model]]&lt;br /&gt;
* [[OpenAI]]&lt;br /&gt;
* [[Transformer (machine learning)]]&lt;br /&gt;
* [[Natural language processing]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
* Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., &amp;amp; Sutskever, I. (2019). &amp;quot;Language Models are Unsupervised Multitask Learners&amp;quot;. &amp;#039;&amp;#039;OpenAI&amp;#039;&amp;#039;.&lt;br /&gt;
* Radford, A., Narasimhan, K., Salimans, T., &amp;amp; Sutskever, I. (2018). &amp;quot;Improving Language Understanding by Generative Pre-Training&amp;quot;. &amp;#039;&amp;#039;OpenAI&amp;#039;&amp;#039;.&lt;br /&gt;
* Solaiman, I., et al. (2019). &amp;quot;Release Strategies and the Social Impacts of Language Models&amp;quot;. &amp;#039;&amp;#039;arXiv:1908.09203&amp;#039;&amp;#039;.&lt;br /&gt;
* Gokaslan, A. &amp;amp; Cohen, V. (2019). &amp;quot;OpenWebText Corpus&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Large language models]]&lt;br /&gt;
[[Category:OpenAI]]&lt;br /&gt;
[[Category:Natural language processing]]&lt;br /&gt;
[[Category:Deep learning]]&lt;/div&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
</feed>