GPT-3

Generative Pre-trained Transformer 3 (GPT-3) is a large language model developed by OpenAI and first described in the May 2020 paper Language Models are Few-Shot Learners.^[1] At 175 billion parameters, it was at the time of its release the largest dense transformer language model ever trained, roughly ten times larger than the previous largest such model, Microsoft's Turing NLG (17 billion parameters), and more than one hundred times larger than its own predecessor, GPT-2 (1.5 billion parameters). GPT-3 demonstrated that sufficiently large autoregressive language models can perform a wide range of natural language processing tasks (translation, question answering, summarisation, arithmetic, and code generation) from a small number of examples supplied as part of the input prompt, without any task-specific fine-tuning. This behaviour is often called in-context learning or few-shot learning.

GPT-3 was made available to selected developers in June 2020 through a commercial API, and OpenAI subsequently granted Microsoft an exclusive licence to the underlying model in September 2020. The model, its fine-tuned descendants InstructGPT and GPT-3.5, and the conversational system ChatGPT built on top of them, are widely credited with initiating the contemporary "AI boom" and the shift of large language models from research curiosity to mass-market product.

Architecture

GPT-3 is a decoder-only transformer trained with a standard autoregressive language modelling objective: given a sequence of byte-pair-encoded tokens, the model predicts the next token, and the training loss is the cross-entropy between the predicted distribution and the observed token. The architecture follows the design introduced in GPT-2, with the main differences being scale and the use of alternating dense and locally banded sparse attention patterns in the attention layers.
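The training objective can be illustrated with a minimal Python sketch. It computes the mean next-token cross-entropy for a toy sequence; the function names and the three-token vocabulary are illustrative assumptions, and a real model would produce the logits with a transformer over a much larger byte-pair vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting each token from the ones before it.

    logits_per_position[i] holds the scores over the vocabulary after the
    model has read token_ids[: i + 1]; the target is token_ids[i + 1].
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, sequence [0, 2, 1].
toy_logits = [
    [0.0, 0.0, 0.0],  # after token 0: uniform guess, true next token is 2
    [0.0, 5.0, 0.0],  # after token 2: confident in token 1, true next token is 1
]
loss = next_token_loss(toy_logits, [0, 2, 1])
```

The first position contributes a loss of ln 3 (a uniform guess over three tokens), while the second contributes almost nothing because the model assigns high probability to the observed token.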

The largest variant, conventionally just called "GPT-3" or "GPT-3 175B", has the following configuration:^[1]

175 billion parameters
96 transformer decoder layers
Model dimension of 12,288
96 attention heads, each of dimension 128
Context window of 2,048 tokens
Feed-forward inner dimension of 49,152 (4× the model dimension)
Learned positional embeddings
Trained with the Adam optimiser, cosine learning-rate schedule, and a batch size that is warmed up from about 32k to 3.2 million tokens
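The configuration above can be checked against the headline parameter count with a rough calculation. The sketch below uses the common approximation of 12 × d_model² weights per transformer layer (four attention projections plus a two-matrix feed-forward block) and ignores biases and layer-norm parameters. The vocabulary size is an assumption borrowed from GPT-2's tokenizer and is not stated in the list above.

```python
# Values taken from the configuration list above.
N_LAYERS = 96
D_MODEL = 12_288
N_HEADS = 96
D_FF = 4 * D_MODEL
N_CTX = 2_048

# Assumption: GPT-2's byte-pair vocabulary size; not given in the text above.
VOCAB_SIZE = 50_257

head_dim = D_MODEL // N_HEADS            # per-head dimension

attn_params = 4 * D_MODEL * D_MODEL      # query, key, value and output projections
ffn_params = 2 * D_MODEL * D_FF          # up-projection and down-projection
per_layer = attn_params + ffn_params     # equals 12 * D_MODEL**2

block_params = N_LAYERS * per_layer                 # all transformer layers
embed_params = (VOCAB_SIZE + N_CTX) * D_MODEL       # token + learned positional embeddings
total_params = block_params + embed_params

print(f"head dimension:   {head_dim}")
print(f"layers only:      {block_params / 1e9:.1f}B")
print(f"with embeddings:  {total_params / 1e9:.1f}B")
```

Under these assumptions the estimate lands within about one percent of the quoted 175 billion, which suggests that nearly all of the parameters sit in the attention and feed-forward matrices.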

OpenAI trained eight model sizes, ranging from 125 million to 175 billion parameters, in order to measure how performance scales with model size. The results were used to extend earlier empirical scaling laws,^[2] showing that loss on held-out text continues to fall smoothly, approximately as a power law, as model size, dataset size, and compute are increased together.
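Scaling laws of this kind are usually expressed as power laws, which appear as straight lines on a log-log plot. The sketch below fits loss = a × N^(−alpha) by least squares in log space. The loss values are synthetic, generated from made-up constants purely to show the fitting procedure; they are not measurements from the paper, and the intermediate model sizes are included only as plausible sample points between the two endpoints.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic illustration only: constants chosen arbitrarily, not from the paper.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1.25e8, 3.5e8, 7.6e8, 1.3e9, 2.7e9, 6.7e9, 1.3e10, 1.75e11]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses)

# Extrapolating the fitted line to a hypothetical larger model.
predicted_loss = a * (1e12) ** (-alpha)
```

Because the fit is linear in log space, a small exponent still implies steady gains: each tenfold increase in size multiplies the loss by the same constant factor.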

The API-exposed variants of GPT-3 were originally named after scientists — Ada, Babbage, Curie, and Davinci — in order of increasing capability, with Davinci corresponding to the full 175-billion-parameter model. These names were retained for several years after the API launch.

Training data

GPT-3 was trained on approximately 300 billion tokens drawn from five sources, mixed with non-uniform sampling weights so that higher-quality corpora were seen more often during training:^[1]

A filtered subset of Common Crawl (roughly 410 billion tokens after filtering, sampled at 60% of training)
WebText2, an expansion of the WebText corpus used for GPT-2, constructed from outbound links from Reddit submissions with a minimum karma threshold (22% of training)
Two book corpora referred to as Books1 and Books2 (16% combined)
English-language Wikipedia (3%)
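The effect of non-uniform sampling can be sketched in a few lines of Python. The percentages below are the ones listed above (they sum to 101 because of rounding, so the sketch renormalises them); the function name and the idea of drawing one source per training document are illustrative assumptions rather than a description of OpenAI's actual data loader.

```python
import random

# Mixture percentages as listed above; rounding makes them sum to 101.
MIX_PERCENT = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1 + Books2": 16,
    "Wikipedia": 3,
}

total_percent = sum(MIX_PERCENT.values())
weights = {name: pct / total_percent for name, pct in MIX_PERCENT.items()}

def sample_sources(n, seed=0):
    """Draw n source names according to the normalised mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# How many times is the Common Crawl pool seen during training?
TRAIN_TOKENS = 300e9          # total training tokens, from the text above
COMMON_CRAWL_TOKENS = 410e9   # size of the Common Crawl pool, from the text above
cc_epochs = (MIX_PERCENT["Common Crawl (filtered)"] / 100) * TRAIN_TOKENS / COMMON_CRAWL_TOKENS

print(f"Common Crawl effective epochs: {cc_epochs:.2f}")
```

On these figures, less than half of the Common Crawl pool is seen during training even though it supplies the majority of tokens, which is the sense in which the smaller, higher-quality corpora are upweighted.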

Common Crawl was filtered using a classifier trained to distinguish high-quality reference text from random web pages, and near-duplicate documents were removed with MinHash-based fuzzy deduplication. Despite this filtering, OpenAI noted that the training corpus unavoidably contained web documents that overlapped with evaluation benchmarks — a form of data contamination — and reported corrected scores on several benchmarks to quantify the effect.
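The idea behind MinHash-based fuzzy deduplication can be sketched as follows. This is a generic illustration of the technique, not OpenAI's pipeline: the shingle length, the number of hash functions, and the use of salted BLAKE2 hashes are all assumptions chosen for brevity.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word windows from a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(seed, item):
    """64-bit hash of item, salted with seed to simulate independent hash functions."""
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity of the sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "completely unrelated text about training very large neural language models"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

near_duplicate_score = estimated_jaccard(sig_a, sig_b)
unrelated_score = estimated_jaccard(sig_a, sig_c)
```

A deduplication pass would then drop one document from any pair whose estimated similarity exceeds a chosen threshold. At web scale the signatures are typically bucketed with locality-sensitive hashing so that not every pair needs to be compared.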

Capabilities

Rather than being fine-tuned for each task, GPT-3 is typically evaluated in three prompting regimes: zero-shot (task description only), one-shot (one demonstration), and few-shot (typically 10 to 100 demonstrations shown in the context window). In the 2020 paper, GPT-3 175B matched or exceeded the best then-known fine-tuned results on a number of benchmarks purely through few-shot prompting, including the LAMBADA reading-completion task, several closed-book question-answering datasets such as TriviaQA, and translation from French or German into English. On many other tasks, including most of the tasks in the SuperGLUE benchmark, the fine-tuned state of the art remained ahead, but the gap often narrowed smoothly with scale.
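The three regimes differ only in how many demonstrations are placed in the prompt, which a short Python sketch makes concrete. The `Q:`/`A:` layout and the function name are illustrative assumptions; the paper used a variety of task-specific prompt formats.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt with zero, one, or many worked examples.

    demonstrations is a list of (input, output) pairs; its length decides
    whether the prompt is zero-shot, one-shot, or few-shot.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model is expected to continue from here
    return "\n".join(lines)

examples = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt("Translate English to French.", [], "cheese")
one_shot = build_prompt("Translate English to French.", examples[:1], "cheese")
few_shot = build_prompt("Translate English to French.", examples, "cheese")
```

No model weights change between the three cases; the demonstrations are simply part of the text the model conditions on, so the number that can be used is bounded by the 2,048-token context window.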

GPT-3 also demonstrated non-trivial performance on tasks that had not been deliberately included in its training objective, including three-digit arithmetic, SAT-style analogies, unscrambling permuted words, and generating short computer programs from natural-language descriptions. The ability to produce fluent long-form prose (news articles, fiction, poetry, technical documentation) was widely noted in the technology press, and the original paper reported that human raters could distinguish GPT-3-generated short news articles from human-written ones at rates only slightly better than chance.^[1]

The model has well-documented limitations. Its outputs are not grounded in any explicit fact base, and it will confidently produce plausible-sounding but incorrect statements, a failure mode now generally called hallucination. Performance on tasks that require multi-step symbolic reasoning, such as long arithmetic or proof synthesis, degrades sharply once the number of required steps exceeds a small threshold. GPT-3 also inherits biases from its training data and was shown in the original paper to produce systematically different sentiment distributions when prompted with different racial, gender, and religious descriptors.

Fine-tuned descendants

InstructGPT

Because the base GPT-3 model is trained only on next-token prediction, its behaviour when given instructions is often unhelpful — it may continue the prompt stylistically rather than answer it. In early 2022, OpenAI released InstructGPT, a family of GPT-3 variants fine-tuned on human demonstrations of desired behaviour and further aligned using reinforcement learning from human feedback (RLHF).^[3] A 1.3-billion-parameter InstructGPT model was preferred by human annotators over the full 175-billion-parameter base GPT-3 more than half the time, a result that drew attention to the outsized role of alignment techniques relative to raw scale.
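One ingredient of RLHF as described in the InstructGPT paper^[3] is a reward model trained on pairs of responses ranked by human annotators. A minimal sketch of the pairwise loss commonly used for this purpose is shown below; the function name and the scalar reward inputs are illustrative assumptions, and the sketch omits the subsequent reinforcement-learning stage entirely.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

agree = preference_loss(2.0, -1.0)     # model agrees with the annotator
undecided = preference_loss(0.0, 0.0)  # model has no preference
disagree = preference_loss(-1.0, 2.0)  # model disagrees with the annotator
```

Minimising this loss over many comparisons yields a scalar reward signal, which the policy model is then optimised against.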

GPT-3.5 and ChatGPT

OpenAI subsequently trained further fine-tuned models on GPT-3-class base models, collectively marketed as GPT-3.5. The conversational assistant ChatGPT, released as a research preview on 30 November 2022, was initially based on a GPT-3.5 model. ChatGPT's rapid adoption — reaching an estimated 100 million monthly active users within two months of launch — is widely cited as the beginning of the mainstream AI boom of the 2020s.

Reception and criticism

GPT-3 received substantial coverage in both the technical and general press on its release. Supporters emphasised its versatility and the smoothness of its scaling behaviour; critics argued that the apparent understanding displayed by the model was illusory, an argument most influentially developed in the paper "On the Dangers of Stochastic Parrots" by Emily Bender, Timnit Gebru, and colleagues, which used GPT-3 as a central example.^[4]

Concerns specific to GPT-3 at the time of release included:

Its potential for generating large volumes of plausible misinformation, including targeted spear phishing content and synthetic news articles.
The environmental footprint of its training run, which some estimates placed at several hundred tonnes of CO₂-equivalent emissions.
The concentration of capability inside a small number of private companies with sufficient capital to train models at this scale, and the opacity of the resulting commercial API.
The legality of training on copyrighted web text, a question that remained unresolved in litigation in several jurisdictions as of 2025.

Release history

May 2020 – Paper posted to arXiv.
June 2020 – GPT-3 private beta API launched.
September 2020 – Microsoft announces an exclusive licence to the underlying model.
November 2021 – Public API access opened without a waitlist.
January 2022 – InstructGPT models replace base GPT-3 as the default text-davinci models on the API.
March 2022 – text-davinci-002, the first GPT-3.5-class model, released.
November 2022 – ChatGPT launched, built on a GPT-3.5 model.
January 2024 – OpenAI retires the original GPT-3 base models (Ada, Babbage, Curie, Davinci) from its API, following a deprecation notice issued in July 2023, in favour of smaller but more capable successors.

References

[1] Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; et al. (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165.
[2] Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361.
[3] Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; et al. (2022). "Training Language Models to Follow Instructions with Human Feedback". arXiv:2203.02155.
[4] Bender, E. M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. pp. 610–623.
