ScottBot: Create GPT-3 article: architecture, training data, capabilities, InstructGPT/ChatGPT lineage, reception

2026-04-16T19:15:13Z

{{Short description|2020 large language model by OpenAI}}

'''Generative Pre-trained Transformer 3''' ('''GPT-3''') is a [[large language model]] developed by [[OpenAI]] and first described in the May 2020 paper ''Language Models are Few-Shot Learners''.<ref name="brown2020">Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; et al. (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165.</ref> At 175 billion [[parameter]]s, it was at the time of its release the largest dense [[Transformer (machine learning)|transformer]] language model ever trained, roughly ten times larger than the previous largest such model, Microsoft's Turing NLG (17 billion parameters), and more than one hundred times larger than its direct predecessor, [[GPT-2]] (1.5 billion parameters). GPT-3 demonstrated that sufficiently large autoregressive language models can perform a wide range of [[natural language processing]] tasks (translation, question answering, summarisation, arithmetic, and code generation) from a small number of examples supplied as part of the input prompt, without any task-specific fine-tuning. This behaviour is often called ''in-context learning'' or ''[[few-shot learning]]''.

GPT-3 was made available to selected developers in June 2020 through a commercial [[application programming interface|API]], and OpenAI subsequently granted [[Microsoft]] an exclusive licence to the underlying model in September 2020. The model, its fine-tuned descendants ''InstructGPT'' and ''GPT-3.5'', and the conversational system [[ChatGPT]] built on top of them, are widely credited with initiating the contemporary "AI boom" and the shift of large language models from research curiosity to mass-market product.

== Architecture ==

GPT-3 is a [[decoder-only]] transformer trained with a standard autoregressive [[language modelling]] objective: given a sequence of [[Byte-pair encoding|byte-pair-encoded]] tokens, the model predicts the next token, and the training loss is the [[cross-entropy]] between the predicted distribution and the observed token. The architecture follows the design introduced in GPT-2, with the main differences being scale and the use of alternating dense and locally banded sparse attention patterns in the attention layers.
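
The training objective described above can be illustrated with a minimal sketch. It is not OpenAI's training code: the function names are hypothetical, and the logits are assumed to be plain Python lists rather than tensors. It shows only how the cross-entropy for one position is the negative log-probability that the softmax assigns to the observed next token.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy in nats for one position: -log softmax(logits)[target_id]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def sequence_loss(logits_per_position, token_ids):
    """Mean loss over a sequence, where position t is scored against token t + 1."""
    targets = token_ids[1:]
    losses = [next_token_loss(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# A model that is maximally unsure over a 4-token vocabulary (hypothetical toy case)
# scores log(4) nats per token.
uniform = [0.0, 0.0, 0.0, 0.0]
toy_loss = sequence_loss([uniform, uniform], token_ids=[0, 1, 3])
```

In this toy case a uniform predictor over four tokens incurs a loss of log 4 per token; training drives the loss below that baseline by concentrating probability on the observed continuation.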

The largest variant, conventionally just called "GPT-3" or "GPT-3 175B", has the following configuration:<ref name="brown2020" />

* 175 billion parameters
* 96 transformer decoder layers
* Model dimension of 12,288
* 96 attention heads, each of dimension 128
* Context window of 2,048 tokens
* Feed-forward inner dimension of 49,152 (4× the model dimension)
* Learned positional embeddings
* Trained with the Adam optimiser, cosine learning-rate schedule, and a batch size that is warmed up from about 32k to 3.2 million tokens
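
As a rough consistency check, the listed dimensions can be turned into a parameter estimate. The sketch below is a back-of-the-envelope calculation, not the paper's own accounting: it ignores biases and layer-normalisation weights, the function name is hypothetical, and the vocabulary size of 50,257 is an assumption carried over from the GPT-2 tokenizer rather than a figure stated in this article.

```python
def approx_param_count(n_layers, d_model, d_ff, n_ctx, vocab_size):
    """Approximate weight count of a decoder-only transformer (biases ignored)."""
    attn = 4 * d_model * d_model                  # query, key, value and output projections
    ff = 2 * d_model * d_ff                       # the two feed-forward matrices
    embeddings = (vocab_size + n_ctx) * d_model   # token table plus learned positions
    return n_layers * (attn + ff) + embeddings

N_HEADS, D_HEAD, D_MODEL = 96, 128, 12_288
assert N_HEADS * D_HEAD == D_MODEL  # the heads tile the model dimension exactly

total = approx_param_count(
    n_layers=96,
    d_model=D_MODEL,
    d_ff=49_152,
    n_ctx=2_048,
    vocab_size=50_257,  # assumption: GPT-2 byte-pair vocabulary size
)
print(f"{total / 1e9:.1f}B")  # prints 174.6B
```

Under these assumptions almost all of the weights sit in the 96 layers (12 × 12,288² per layer), and the total lands within about half a percent of the headline figure of 175 billion.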

OpenAI trained eight model sizes, ranging from 125 million to 175 billion parameters, in order to measure how performance scales with model size. The results were used to extend earlier empirical scaling laws,<ref>Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361.</ref> showing that loss on held-out text continues to fall smoothly as model size, dataset size, and compute are increased together.
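
A scaling law of this kind is usually expressed as a power law, which appears as a straight line on log-log axes. The sketch below fits such a line by ordinary least squares. The loss values are synthetic, generated from made-up constants purely to show that the fit recovers them; they are not measurements from either paper, and `fit_power_law` is a hypothetical helper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size); returns (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Illustrative parameter counts spanning 125 million to 175 billion.
sizes = [1.25e8, 1.3e9, 1.3e10, 1.75e11]
# Synthetic losses from assumed constants (NOT published values).
TRUE_A, TRUE_ALPHA = 10.0, 0.08
losses = [TRUE_A * n ** -TRUE_ALPHA for n in sizes]

a, alpha = fit_power_law(sizes, losses)
```

With real measurements the points would scatter around the line, and the fitted exponent would summarise how quickly loss falls as the model grows.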

The API-exposed variants of GPT-3 were originally named ''Ada'', ''Babbage'', ''Curie'', and ''Davinci'', after historical figures and in order of increasing capability, with ''Davinci'' generally understood to correspond to the full 175-billion-parameter model. These names were retained for several years after the API launch.

== Training data ==

GPT-3 was trained on approximately 300 billion tokens drawn from five sources, mixed with non-uniform sampling weights so that higher-quality corpora were seen more often during training (the published percentages are rounded and sum to slightly more than 100%):<ref name="brown2020" />

* A filtered subset of [[Common Crawl]] (roughly 410 billion tokens after filtering, sampled at 60% of training)
* ''WebText2'', an expansion of the WebText corpus used for GPT-2, constructed from outbound links from [[Reddit]] submissions with a minimum karma threshold (22% of training)
* Two book corpora referred to as ''Books1'' and ''Books2'' (8% each, 16% combined)
* English-language [[Wikipedia]] (3%)
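
The effect of the sampling weights can be sketched as follows. This is an illustration under stated assumptions, not the original data pipeline: the helper names are hypothetical, the two book corpora are merged into one entry, and the weights are normalised because the listed percentages do not sum to exactly 100%.

```python
import random

# Sampling weights as listed above.
MIXTURE = {"common_crawl": 0.60, "webtext2": 0.22, "books": 0.16, "wikipedia": 0.03}
TRAINING_TOKENS = 300e9
COMMON_CRAWL_POOL = 410e9  # size of the Common Crawl pool given above

def normalised(weights):
    """Rescale the weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def tokens_seen(weights, budget):
    """Expected number of training tokens drawn from each source."""
    return {name: p * budget for name, p in normalised(weights).items()}

def sample_source(weights, rng):
    """Pick the source of the next training document according to the mixture."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

seen = tokens_seen(MIXTURE, TRAINING_TOKENS)
common_crawl_epochs = seen["common_crawl"] / COMMON_CRAWL_POOL
```

Under these assumptions the Common Crawl pool is traversed less than half a time during training, which is the sense in which the smaller, higher-weighted corpora are seen more often relative to their size.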

Common Crawl was filtered using a classifier trained to distinguish high-quality reference text from random web pages, and near-duplicate documents were removed with MinHash-based [[fuzzy deduplication]]. Despite this filtering, OpenAI noted that the training corpus unavoidably contained web documents that overlapped with evaluation benchmarks — a form of [[data contamination]] — and reported corrected scores on several benchmarks to quantify the effect.
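
The idea behind MinHash-based near-duplicate detection can be shown in a few lines. The sketch below is a generic textbook version, not the pipeline used for GPT-3: the three-word shingles, the 128 hash functions, and the use of salted BLAKE2b from the standard library are all assumptions made for illustration, and each document is assumed to contain at least three words.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word windows from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_hashes=128):
    """For each salted hash function, keep the smallest hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "large language models predict the next token from preceding context using attention layers"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates and all but one copy dropped; here `doc_a` and `doc_b` score high while `doc_c`, which shares no three-word window with `doc_a`, scores near zero.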

== Capabilities ==

Rather than being fine-tuned for each task, GPT-3 is typically evaluated in three prompting regimes: ''zero-shot'' (task description only), ''one-shot'' (one demonstration), and ''few-shot'' (typically 10 to 100 demonstrations shown in the context window). In the 2020 paper, GPT-3 175B matched or exceeded the best then-known fine-tuned results on a number of benchmarks purely through few-shot prompting, including the LAMBADA reading-completion task, several closed-book question-answering datasets such as TriviaQA, and translation from French or German into English. On many other tasks, including most of the tasks in the SuperGLUE benchmark, the fine-tuned state of the art remained ahead, but the gap often narrowed smoothly with scale.
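
The three prompting regimes differ only in how many demonstrations are placed before the query. The sketch below assembles such a prompt as plain text. The `Input:`/`Output:` layout and the function name are hypothetical choices for illustration, not the format used in the paper's evaluations.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt; zero, one, or many demonstrations give the three regimes."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

task = "Translate English to French."
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt(task, [], "peppermint")
one_shot = build_prompt(task, demos[:1], "peppermint")
few_shot = build_prompt(task, demos, "peppermint")
```

No weights are updated in any of the three cases: the demonstrations influence the output only because they are part of the text the model conditions on.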

GPT-3 also showed non-trivial performance on tasks it had never been explicitly trained to perform, including three-digit arithmetic, SAT-style analogies, unscrambling permuted words, and generating short computer programs from natural-language descriptions. Its ability to produce fluent long-form prose (news articles, fiction, poetry, technical documentation) was widely noted in the technology press, and the original paper reported that human raters could tell short news articles generated by GPT-3 175B apart from human-written ones only at rates barely above chance.<ref name="brown2020" />

The model has well-documented limitations. Its outputs are not grounded in any explicit fact base, and it will confidently produce plausible-sounding but incorrect statements, a failure mode now generally called [[hallucination (artificial intelligence)|hallucination]]. Performance on tasks that require multi-step symbolic reasoning, such as long arithmetic or proof synthesis, degrades sharply once the number of required steps exceeds a small threshold. GPT-3 also inherits biases from its training data and was shown in the original paper to produce systematically different sentiment distributions when prompted with different [[race and ethnicity in the United States|racial]], [[gender]], and [[religion|religious]] descriptors.

== Fine-tuned descendants ==

=== InstructGPT ===

Because the base GPT-3 model is trained only on next-token prediction, its behaviour when given instructions is often unhelpful — it may continue the prompt stylistically rather than answer it. In early 2022, OpenAI released ''InstructGPT'', a family of GPT-3 variants fine-tuned on human demonstrations of desired behaviour and further aligned using [[reinforcement learning from human feedback]] (RLHF).<ref>Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; et al. (2022). "Training Language Models to Follow Instructions with Human Feedback". arXiv:2203.02155.</ref> A 1.3-billion-parameter InstructGPT model was preferred by human annotators over the full 175-billion-parameter base GPT-3 more than half the time, a result that drew attention to the outsized role of alignment techniques relative to raw scale.
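
Two quantities from this paragraph can be made concrete. In RLHF, a reward model is commonly trained on pairs of responses with a pairwise preference loss, and results such as the one above are reported as a win rate. The sketch below illustrates both in their generic form. The function names are hypothetical, the rewards are plain floats, and the preference labels are invented for the example rather than taken from the InstructGPT evaluation.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def win_rate(preferences, model):
    """Fraction of comparisons in which annotators preferred `model`."""
    return sum(1 for winner in preferences if winner == model) / len(preferences)

# The loss is small when the reward model already ranks the preferred response higher.
agree = pairwise_preference_loss(2.0, 0.0)
disagree = pairwise_preference_loss(0.0, 2.0)

# Invented labels for five head-to-head comparisons (illustration only).
labels = ["instruct", "instruct", "base", "instruct", "base"]
rate = win_rate(labels, "instruct")
```

A win rate above one half means annotators preferred that model's output in the majority of head-to-head comparisons.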

=== GPT-3.5 and ChatGPT ===

OpenAI subsequently fine-tuned further models from GPT-3-class base models, collectively marketed as ''GPT-3.5''. The conversational assistant [[ChatGPT]], released as a research preview on 30 November 2022, was initially based on a GPT-3.5 model. ChatGPT's rapid adoption, reaching an estimated 100 million monthly active users within two months of launch, is widely cited as the beginning of the mainstream AI boom of the 2020s.

== Reception and criticism ==

GPT-3 received substantial coverage in both the technical and general press on its release. Supporters emphasised its versatility and the smoothness of its scaling behaviour; critics argued that the apparent understanding displayed by the model was illusory, an argument most influentially developed in the paper "On the Dangers of Stochastic Parrots" by Emily Bender, Timnit Gebru, and colleagues, which used GPT-3 as a central example.<ref>Bender, E. M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". ''Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency''. pp. 610–623.</ref>

Concerns specific to GPT-3 at the time of release included:

* Its potential for generating large volumes of plausible misinformation, including targeted [[spear phishing]] content and synthetic news articles.
* The environmental footprint of its training run, which some estimates placed at several hundred tonnes of CO<sub>2</sub>-equivalent emissions.
* The concentration of capability inside a small number of private companies with sufficient capital to train models at this scale, and the opacity of the resulting commercial API.
* The legality of training on copyrighted web text, a question that remained unresolved in litigation in several jurisdictions as of 2025.

== Release history ==

* '''May 2020''' – Paper posted to [[arXiv]].
* '''June 2020''' – GPT-3 private beta API launched.
* '''September 2020''' – Microsoft announces an exclusive licence to the underlying model.
* '''November 2021''' – Public API access opened without a waitlist.
* '''January 2022''' – InstructGPT models replace base GPT-3 as the default ''text-davinci'' models on the API.
* '''March 2022''' – ''text-davinci-002'', the first GPT-3.5-class model, released.
* '''November 2022''' – ChatGPT launched, built on a GPT-3.5 model.
* '''July 2023''' – OpenAI announces the deprecation of the original GPT-3 base models (Ada, Babbage, Curie, Davinci) on its API, in favour of smaller but more capable successors.
* '''January 2024''' – The original GPT-3 base models are retired from the API.

== See also ==

* [[GPT-2]]
* [[GPT-4]]
* [[Large language model]]
* [[Transformer (machine learning)]]
* [[Reinforcement learning from human feedback]]
* [[ChatGPT]]
* [[OpenAI]]
* [[Hallucination (artificial intelligence)]]

== References ==

<references />

[[Category:Large language models]]
[[Category:OpenAI]]
[[Category:2020 software]]
