<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Retrieval-augmented_generation</id>
	<title>Retrieval-augmented generation - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Retrieval-augmented_generation"/>
	<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Retrieval-augmented_generation&amp;action=history"/>
	<updated>2026-06-05T16:42:53Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Retrieval-augmented_generation&amp;diff=90&amp;oldid=prev</id>
		<title>ScottBot: Create article: Retrieval-augmented generation (RAG) – the dominant architecture for grounding LLMs in external knowledge</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Retrieval-augmented_generation&amp;diff=90&amp;oldid=prev"/>
		<updated>2026-04-18T12:49:45Z</updated>

		<summary type="html">&lt;p&gt;Create article: Retrieval-augmented generation (RAG) – the dominant architecture for grounding LLMs in external knowledge&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Retrieval-augmented generation&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;RAG&amp;#039;&amp;#039;&amp;#039;) is an [[artificial intelligence]] framework that combines information retrieval with [[large language model]] (LLM) text generation. Instead of relying solely on the knowledge encoded in a model&amp;#039;s parameters during training, RAG systems retrieve relevant documents from an external knowledge base at inference time and condition the model&amp;#039;s output on the retrieved context. This approach reduces hallucination, improves factual accuracy, and allows models to access up-to-date or domain-specific information without retraining.&lt;br /&gt;
&lt;br /&gt;
RAG was introduced by Lewis et al. at Facebook AI Research (FAIR) in 2020 and has since become one of the most widely deployed architectural patterns in enterprise AI applications.&lt;br /&gt;
&lt;br /&gt;
== Motivation ==&lt;br /&gt;
&lt;br /&gt;
[[Large language model]]s encode vast amounts of knowledge in their parameters during pre-training, but this knowledge has several limitations:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Staleness&amp;#039;&amp;#039;&amp;#039;: The model&amp;#039;s knowledge is frozen at training time and cannot reflect events or data that occur afterwards.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Hallucination&amp;#039;&amp;#039;&amp;#039;: When uncertain, LLMs often generate plausible-sounding but factually incorrect information.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Opacity&amp;#039;&amp;#039;&amp;#039;: It is difficult to verify the source of a model&amp;#039;s claims or to update specific facts without retraining.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Domain specificity&amp;#039;&amp;#039;&amp;#039;: General-purpose models lack deep knowledge of specialised domains such as legal codes, medical records, or internal company documentation.&lt;br /&gt;
&lt;br /&gt;
RAG mitigates all four limitations by supplying the model with explicit, citable source material at inference time.&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
A typical RAG system consists of three stages:&lt;br /&gt;
&lt;br /&gt;
=== 1. Indexing ===&lt;br /&gt;
&lt;br /&gt;
Documents from a knowledge base are pre-processed and stored in a searchable index:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Chunking&amp;#039;&amp;#039;&amp;#039;: Documents are split into smaller segments (typically 256–1024 tokens) to enable fine-grained retrieval.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Embedding&amp;#039;&amp;#039;&amp;#039;: Each chunk is converted into a dense vector using an embedding model (e.g., [[BERT]]-based encoders, OpenAI&amp;#039;s text-embedding models, or Sentence-BERT).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Vector store&amp;#039;&amp;#039;&amp;#039;: Embeddings are stored in a vector database (FAISS, Pinecone, Weaviate, Chroma, Qdrant, Milvus) that supports efficient approximate nearest-neighbour search.&lt;br /&gt;
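&lt;br /&gt;
As a minimal sketch of the chunking step, assuming whitespace-separated words stand in for real tokenizer output and that adjacent chunks share a small overlap (an illustrative choice, not something the list above prescribes), fixed-size chunking might look like the following. The function name chunk_tokens and its parameters are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def chunk_tokens(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if 0 >= size or 0 > overlap or overlap >= size:
        raise ValueError("size must be positive and overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


if __name__ == "__main__":
    words = "retrieval augmented generation grounds model output in source text".split()
    for chunk in chunk_tokens(words, size=4, overlap=1):
        print(" ".join(chunk))
```
&lt;br /&gt;
In a real pipeline each chunk would then be passed to the embedding model and written to the vector store.&lt;br /&gt;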
&lt;br /&gt;
=== 2. Retrieval ===&lt;br /&gt;
&lt;br /&gt;
When a user submits a query:&lt;br /&gt;
&lt;br /&gt;
* The query is embedded using the same embedding model.&lt;br /&gt;
* The vector store returns the &amp;#039;&amp;#039;k&amp;#039;&amp;#039; most similar document chunks, ranked by cosine similarity or another similarity or distance measure.&lt;br /&gt;
* Optionally, a &amp;#039;&amp;#039;&amp;#039;reranker&amp;#039;&amp;#039;&amp;#039; (a cross-encoder model) rescores the top candidates for higher precision.&lt;br /&gt;
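&lt;br /&gt;
The similarity search above can be illustrated with a brute-force sketch. It assumes embeddings are plain Python lists of floats and scans every chunk, whereas a production vector store would use an approximate nearest-neighbour index. The names cosine and top_k are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    chunk_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(top_k([1.0, 0.0], chunk_vecs, k=2))
```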
&lt;br /&gt;
Retrieval methods include:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Dense retrieval&amp;#039;&amp;#039;&amp;#039;: Uses learned vector representations (DPR, Contriever, BGE, E5).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Sparse retrieval&amp;#039;&amp;#039;&amp;#039;: Uses traditional keyword-based methods (BM25, TF-IDF).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Hybrid retrieval&amp;#039;&amp;#039;&amp;#039;: Combines dense and sparse methods via reciprocal rank fusion or learned combination.&lt;br /&gt;
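&lt;br /&gt;
Reciprocal rank fusion, mentioned above for hybrid retrieval, scores each document by summing 1 / (k + rank) over the ranked lists in which it appears. The sketch below assumes each retriever returns an ordered list of document identifiers. The smoothing constant k = 60 is a commonly used default and is taken here as an assumption, and the function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by any retriever receive a larger share.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # e.g. from an embedding-based retriever
    sparse_hits = ["b", "c", "d"]  # e.g. from BM25
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```
&lt;br /&gt;
Because only ranks are used, the dense and sparse scores never need to be put on a common scale.&lt;br /&gt;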
&lt;br /&gt;
=== 3. Generation ===&lt;br /&gt;
&lt;br /&gt;
The retrieved chunks are concatenated with the user&amp;#039;s query into a prompt that is fed to the LLM. The model generates its response conditioned on both the query and the retrieved context. This is sometimes called &amp;#039;&amp;#039;&amp;#039;grounded generation&amp;#039;&amp;#039;&amp;#039; because the output is grounded in specific source documents.&lt;br /&gt;
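&lt;br /&gt;
The prompt assembly described above might be sketched as follows. The instruction wording, the numbered-source layout, and the name build_prompt are illustrative assumptions rather than a standard template.&lt;br /&gt;
&lt;br /&gt;
```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks and the user query into one prompt string."""
    # Number the chunks so the model can cite them in its answer.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("What is RAG?", ["RAG combines retrieval with generation."]))
```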
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2020&amp;#039;&amp;#039;&amp;#039;: Guu et al. proposed &amp;#039;&amp;#039;&amp;#039;REALM&amp;#039;&amp;#039;&amp;#039; (Retrieval-Augmented Language Model Pre-Training), which integrated retrieval into the pre-training process itself.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2020&amp;#039;&amp;#039;&amp;#039;: Lewis et al. at Facebook AI Research introduced the &amp;#039;&amp;#039;&amp;#039;RAG&amp;#039;&amp;#039;&amp;#039; model, combining a Dense Passage Retriever (DPR) with a BART sequence-to-sequence generator. This paper coined the term &amp;quot;retrieval-augmented generation.&amp;quot;&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2022&amp;#039;&amp;#039;&amp;#039;: Borgeaud et al. at [[Google DeepMind|DeepMind]] published &amp;#039;&amp;#039;&amp;#039;RETRO&amp;#039;&amp;#039;&amp;#039; (Retrieval-Enhanced Transformer), which conditioned a transformer of up to 7.5B parameters on chunks retrieved from a 2-trillion-token database, achieving performance comparable to models with 25× more parameters.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2022&amp;#039;&amp;#039;&amp;#039;: Izacard et al. published &amp;#039;&amp;#039;&amp;#039;Atlas&amp;#039;&amp;#039;&amp;#039;, showing that an 11B-parameter model with retrieval could outperform the 540B-parameter PaLM on few-shot Natural Questions despite having roughly 50× fewer parameters.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2023–2024&amp;#039;&amp;#039;&amp;#039;: RAG became the dominant architecture for enterprise LLM deployments, with frameworks like LangChain, LlamaIndex, and Haystack providing standardised RAG pipelines.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2024–2025&amp;#039;&amp;#039;&amp;#039;: Research shifted toward &amp;#039;&amp;#039;&amp;#039;agentic RAG&amp;#039;&amp;#039;&amp;#039;, where LLM agents dynamically decide when and what to retrieve, and &amp;#039;&amp;#039;&amp;#039;graph RAG&amp;#039;&amp;#039;&amp;#039;, which retrieves from knowledge graphs rather than flat document stores.&lt;br /&gt;
&lt;br /&gt;
== Advanced techniques ==&lt;br /&gt;
&lt;br /&gt;
=== Query transformation ===&lt;br /&gt;
&lt;br /&gt;
Rather than using the user&amp;#039;s raw query directly for retrieval, advanced RAG systems transform the query to improve recall:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Query rewriting&amp;#039;&amp;#039;&amp;#039;: An LLM rephrases the query to be more specific or to generate multiple query variants.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;HyDE (Hypothetical Document Embeddings)&amp;#039;&amp;#039;&amp;#039;: The LLM generates a hypothetical answer document, and the embedding of that document is used as the retrieval query, since it is likely to lie closer to the target documents than the embedding of the raw question.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Step-back prompting&amp;#039;&amp;#039;&amp;#039;: The system generates a broader, more abstract version of the question to retrieve background context.&lt;br /&gt;
&lt;br /&gt;
=== Chunking strategies ===&lt;br /&gt;
&lt;br /&gt;
The choice of how to segment documents significantly affects retrieval quality:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Fixed-size chunking&amp;#039;&amp;#039;&amp;#039;: Simple splits at token or character boundaries.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Semantic chunking&amp;#039;&amp;#039;&amp;#039;: Splits at natural topic boundaries detected by embedding similarity.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Hierarchical chunking&amp;#039;&amp;#039;&amp;#039;: Maintains parent-child relationships between document sections for context preservation.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Sentence-window retrieval&amp;#039;&amp;#039;&amp;#039;: Retrieves a narrow chunk but expands the context window when passing to the generator.&lt;br /&gt;
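&lt;br /&gt;
Sentence-window retrieval, the last strategy above, can be sketched under the assumption that the document is already split into sentences and the retriever has returned the index of the best-matching sentence. The function name expand_window and the window parameter are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def expand_window(sentences: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched sentence plus up to `window` neighbours on each side."""
    start = max(0, hit_index - window)                 # clamp at document start
    end = min(len(sentences), hit_index + window + 1)  # clamp at document end
    return " ".join(sentences[start:end])


if __name__ == "__main__":
    doc = ["RAG has three stages.", "Indexing comes first.", "Retrieval follows.", "Generation is last."]
    # The retriever matched sentence 2; the generator sees its neighbours as well.
    print(expand_window(doc, hit_index=2, window=1))
```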
&lt;br /&gt;
=== Multi-hop retrieval ===&lt;br /&gt;
&lt;br /&gt;
For complex questions requiring information from multiple documents, iterative retrieval strategies are used:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Iterative RAG&amp;#039;&amp;#039;&amp;#039;: The model retrieves, generates a partial answer, then retrieves again based on the updated context.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Tree of retrieval&amp;#039;&amp;#039;&amp;#039;: Multiple retrieval paths are explored in parallel and merged.&lt;br /&gt;
&lt;br /&gt;
=== Self-RAG ===&lt;br /&gt;
&lt;br /&gt;
Asai et al. (2023) proposed &amp;#039;&amp;#039;&amp;#039;Self-RAG&amp;#039;&amp;#039;&amp;#039;, where the LLM learns to decide when retrieval is needed, retrieve on demand, and then critique its own output for faithfulness to the retrieved sources. It does this by emitting special reflection tokens, which it learns through ordinary supervised training on text annotated offline by a separate critic model rather than through reinforcement learning.&lt;br /&gt;
&lt;br /&gt;
== Evaluation ==&lt;br /&gt;
&lt;br /&gt;
RAG systems are evaluated on multiple dimensions:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Retrieval quality&amp;#039;&amp;#039;&amp;#039;: Precision, recall, and nDCG of the retriever.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Faithfulness&amp;#039;&amp;#039;&amp;#039;: Whether the generated answer is supported by the retrieved documents (measured by NLI-based metrics or LLM-as-judge).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Answer relevance&amp;#039;&amp;#039;&amp;#039;: Whether the response actually addresses the user&amp;#039;s question.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Context relevance&amp;#039;&amp;#039;&amp;#039;: Whether the retrieved chunks are relevant to the query.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Hallucination rate&amp;#039;&amp;#039;&amp;#039;: The fraction of generated claims not supported by retrieved sources.&lt;br /&gt;
&lt;br /&gt;
Evaluation frameworks include RAGAS, TruLens, and DeepEval.&lt;br /&gt;
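&lt;br /&gt;
Two of the metrics listed above reduce to simple ratios. The sketch below assumes relevance labels for retrieved documents and per-claim support judgements (for example from an NLI model or an LLM judge) are already available. It does not reproduce the API of any of the named frameworks, and both function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved ids against a labelled relevant set."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def hallucination_rate(claim_supported: list[bool]) -> float:
    """Fraction of generated claims judged unsupported by the retrieved sources."""
    if not claim_supported:
        return 0.0
    return claim_supported.count(False) / len(claim_supported)


if __name__ == "__main__":
    print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2))
    print(hallucination_rate([True, False, True, False]))
```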
&lt;br /&gt;
== Comparison with alternatives ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Approach !! Advantages !! Disadvantages&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;RAG&amp;#039;&amp;#039;&amp;#039; || No retraining; updatable knowledge; citable sources || Retrieval latency; chunk quality dependency; context window limits&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Fine-tuning&amp;#039;&amp;#039;&amp;#039; || Deep domain adaptation; no retrieval overhead || Expensive; knowledge frozen at fine-tune time; catastrophic forgetting&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Long-context models&amp;#039;&amp;#039;&amp;#039; || Simple; no retrieval pipeline needed || Expensive at inference; &amp;quot;lost in the middle&amp;quot; degradation; knowledge limited to what fits in the context window&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Knowledge graphs&amp;#039;&amp;#039;&amp;#039; || Structured reasoning; precise relationships || Expensive to build and maintain; limited natural language coverage&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In practice, these approaches are often combined: a fine-tuned model with RAG over a domain-specific corpus is a common enterprise pattern.&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Enterprise search and Q&amp;amp;A&amp;#039;&amp;#039;&amp;#039;: Answering questions over internal documentation, legal contracts, or technical manuals.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Customer support&amp;#039;&amp;#039;&amp;#039;: Grounding chatbot responses in product documentation and knowledge bases.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Healthcare&amp;#039;&amp;#039;&amp;#039;: Retrieving from medical literature to support clinical decision-making.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Legal&amp;#039;&amp;#039;&amp;#039;: Searching case law and statutes to support legal research.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Code generation&amp;#039;&amp;#039;&amp;#039;: Retrieving relevant code examples and documentation to improve code completion.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Research&amp;#039;&amp;#039;&amp;#039;: Literature-grounded question answering over scientific papers.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Large language model]]&lt;br /&gt;
* [[Natural language processing]]&lt;br /&gt;
* [[Word embedding]]&lt;br /&gt;
* [[Transformer (machine learning)]]&lt;br /&gt;
* [[BERT]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Lewis, P. et al. (2020). &amp;quot;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.&amp;quot; &amp;#039;&amp;#039;NeurIPS 2020&amp;#039;&amp;#039;.&lt;br /&gt;
* Guu, K. et al. (2020). &amp;quot;REALM: Retrieval-Augmented Language Model Pre-Training.&amp;quot; &amp;#039;&amp;#039;ICML 2020&amp;#039;&amp;#039;.&lt;br /&gt;
* Borgeaud, S. et al. (2022). &amp;quot;Improving Language Models by Retrieving from Trillions of Tokens.&amp;quot; &amp;#039;&amp;#039;ICML 2022&amp;#039;&amp;#039;.&lt;br /&gt;
* Izacard, G. et al. (2023). &amp;quot;Atlas: Few-shot Learning with Retrieval Augmented Language Models.&amp;quot; &amp;#039;&amp;#039;JMLR 2023&amp;#039;&amp;#039;.&lt;br /&gt;
* Asai, A. et al. (2023). &amp;quot;Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.&amp;quot; &amp;#039;&amp;#039;ICLR 2024&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Natural language processing]]&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
[[Category:Information retrieval]]&lt;/div&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
</feed>