Retrieval-augmented generation

Retrieval-augmented generation (RAG) is an artificial intelligence framework that combines information retrieval with large language model (LLM) text generation. Instead of relying solely on the knowledge encoded in a model's parameters during training, RAG systems retrieve relevant documents from an external knowledge base at inference time and condition the model's output on the retrieved context. This approach can reduce hallucination, improve factual accuracy, and give models access to up-to-date or domain-specific information without retraining.

RAG was introduced by Lewis et al. at Facebook AI Research (FAIR) in 2020 and has since become one of the most widely deployed architectural patterns in enterprise AI applications.

Motivation

Large language models encode vast amounts of knowledge in their parameters during pre-training, but this knowledge has several limitations:

  • Staleness: The model's knowledge is frozen at training time and cannot reflect events or data that occur afterwards.
  • Hallucination: When uncertain, LLMs often generate plausible-sounding but factually incorrect information.
  • Opacity: It is difficult to verify the source of a model's claims or to update specific facts without retraining.
  • Domain specificity: General-purpose models lack deep knowledge of specialised domains such as legal codes, medical records, or internal company documentation.

RAG mitigates all four limitations by supplying the model with explicit, citable source material at inference time.

Architecture

A typical RAG system consists of three stages:

1. Indexing

Documents from a knowledge base are pre-processed and stored in a searchable index:

  • Chunking: Documents are split into smaller segments (typically 256–1024 tokens) to enable fine-grained retrieval.
  • Embedding: Each chunk is converted into a dense vector using an embedding model (e.g., BERT-based encoders, OpenAI's text-embedding models, or Sentence-BERT).
  • Vector store: Embeddings are stored in a vector index library or database (e.g., FAISS, Pinecone, Weaviate, Chroma, Qdrant, Milvus) that supports efficient approximate nearest-neighbour search.
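
The indexing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the function names are hypothetical, chunks are measured in whitespace-separated words rather than model tokens, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the "vector store" is a plain in-memory list.

```python
import hashlib
import math

def chunk_words(text, size=8, overlap=2):
    """Split text into overlapping windows of `size` whitespace-separated words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def toy_embed(text, dim=64):
    """Hashed bag-of-words vector, L2-normalised (assumption: stands in for a learned model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def build_index(documents, size=8, overlap=2):
    """Return a list of (chunk_text, vector) pairs: a brute-force in-memory index."""
    index = []
    for doc in documents:
        for chunk in chunk_words(doc, size, overlap):
            index.append((chunk, toy_embed(chunk)))
    return index
```

The overlap between consecutive chunks is a common way to avoid cutting a relevant passage in half at a chunk boundary; the sizes used here are deliberately tiny so the behaviour is easy to inspect.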

2. Retrieval

When a user submits a query:

  • The query is embedded using the same embedding model.
  • The vector store returns the k chunks most similar to the query, ranked by cosine similarity or another similarity or distance measure (e.g., inner product or Euclidean distance).
  • Optionally, a reranker (a cross-encoder model) rescores the top candidates for higher precision.
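
As a rough illustration of the top-k step, the sketch below scores every stored chunk against the query vector by cosine similarity and keeps the k best. It performs an exact brute-force search, whereas the vector stores named above typically use approximate nearest-neighbour indexes; the three-dimensional vectors and chunk texts are invented for the example.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec, index, k=2):
    """Score every (text, vector) pair and return the k highest as (score, text)."""
    scored = ((cosine(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Hypothetical toy index: (chunk text, embedding) pairs.
index = [
    ("chunk about cats", [1.0, 0.0, 0.0]),
    ("chunk about dogs", [0.0, 1.0, 0.0]),
    ("chunk about cats and dogs", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A reranker, where one is used, would take the texts returned here and rescore each one jointly with the query before the final cut.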

Retrieval methods include:

  • Dense retrieval: Uses learned vector representations (DPR, Contriever, BGE, E5).
  • Sparse retrieval: Uses traditional keyword-based methods (BM25, TF-IDF).
  • Hybrid retrieval: Combines dense and sparse methods via reciprocal rank fusion or learned combination.
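
Reciprocal rank fusion, mentioned above for hybrid retrieval, can be sketched as follows. Each document receives a score of 1 / (k + rank) from every ranking in which it appears, and the scores are summed. The constant k = 60 is a commonly used default rather than a requirement, and the document identifiers are invented for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first.

    score(d) = sum over rankings of 1 / (k + rank of d in that ranking),
    with ranks starting at 1. Documents missing from a ranking add nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ranking = ["d1", "d2", "d3"]   # e.g. from an embedding-based retriever
sparse_ranking = ["d3", "d1", "d4"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Because the method uses only ranks, it does not need the dense and sparse scores to be on a comparable scale, which is one reason it is popular for combining heterogeneous retrievers.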

3. Generation

The retrieved chunks are concatenated with the user's query into a prompt that is fed to the LLM. The model generates its response conditioned on both the query and the retrieved context. This is sometimes called grounded generation because the output is grounded in specific source documents.
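
One possible way to assemble such a prompt is sketched below. The template wording, the bracketed source numbering, and the `max_chars` budget are assumptions made for illustration; real systems usually budget in tokens and use a template tuned to the target model.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Place numbered retrieved chunks ahead of the question in a single prompt string."""
    context_lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        line = f"[{i}] {chunk}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the rough character budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What does RAG retrieve?",
    ["RAG retrieves document chunks from an external knowledge base."],
)
print(prompt)
```

Numbering the sources gives the model a simple handle for citations, which supports the grounding described above.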

History

  • 2020: Guu et al. proposed REALM (Retrieval-Augmented Language Model Pre-Training), which integrated retrieval into the pre-training process itself.
  • 2020: Lewis et al. at Facebook AI Research introduced the RAG model, combining a Dense Passage Retriever (DPR) with a BART sequence-to-sequence generator. This paper coined the term "retrieval-augmented generation."
  • 2022: Borgeaud et al. at DeepMind published RETRO (Retrieval-Enhanced Transformer), which conditioned a 7.5B-parameter transformer on chunks retrieved from a 2-trillion-token database and reported performance on the Pile comparable to GPT-3 and Jurassic-1 despite using 25× fewer parameters.
  • 2022: Izacard et al. published Atlas, showing that an 11B-parameter model with retrieval could outperform the 540B-parameter PaLM on few-shot Natural Questions (64 training examples) despite having about 50× fewer parameters.
  • 2023–2024: RAG became one of the most widely adopted architectures for enterprise LLM deployments, with frameworks like LangChain, LlamaIndex, and Haystack providing standardised RAG pipelines.
  • 2024–2025: Research shifted toward agentic RAG, where LLM agents dynamically decide when and what to retrieve, and graph RAG, which retrieves from knowledge graphs rather than flat document stores.

Advanced techniques

Query transformation

Rather than using the user's raw query directly for retrieval, advanced RAG systems transform the query to improve recall:

  • Query rewriting: An LLM rephrases the query to be more specific or to generate multiple query variants.
  • HyDE (Hypothetical Document Embeddings): The LLM generates a hypothetical answer, which is then used as the retrieval query, since it is likely to be semantically closer to the target documents.
  • Step-back prompting: The system generates a broader, more abstract version of the question to retrieve background context.
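
The HyDE idea can be illustrated with a small sketch in which the language model, embedding function, and search function are passed in as callables. All names here are hypothetical, and the stand-ins (a canned "LLM" response, a bag-of-words "embedding", and word-overlap search) exist only so the example runs without any model.

```python
def hyde_retrieve(question, generate, embed, search, k=1):
    """HyDE: embed an LLM-written hypothetical answer instead of the raw question."""
    hypothetical = generate(f"Write a short passage answering: {question}")
    return search(embed(hypothetical), k)

# --- Toy stand-ins (assumptions for illustration only) ---
def fake_generate(prompt):
    # A real system would call an LLM here; this canned text is illustrative.
    return "the eiffel tower in paris is about 330 metres tall"

def bag_of_words(text):
    return set(text.lower().split())

CORPUS = [
    "the eiffel tower is 330 metres tall",
    "the louvre is a museum in paris",
]

def overlap_search(query_words, k):
    """Rank corpus entries by the number of words shared with the query."""
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(query_words & bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:k]

hits = hyde_retrieve(
    "How high is the Eiffel Tower?", fake_generate, bag_of_words, overlap_search
)
print(hits)  # ['the eiffel tower is 330 metres tall']
```

The hypothetical passage may contain errors; it is used only to find real documents, and the final answer is generated from what is retrieved.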

Chunking strategies

The choice of how to segment documents significantly affects retrieval quality:

  • Fixed-size chunking: Simple splits at token or character boundaries.
  • Semantic chunking: Splits at natural topic boundaries detected by embedding similarity.
  • Hierarchical chunking: Maintains parent-child relationships between document sections for context preservation.
  • Sentence-window retrieval: Retrieves a narrow chunk but expands the context window when passing to the generator.
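
Sentence-window retrieval, the last strategy above, can be sketched as follows: individual sentences are indexed for matching, and once a sentence is selected its neighbours are included in the context passed to the generator. The regular-expression sentence splitter is a simplification, and the function names and sample text are invented for the example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def expand_window(sentences, hit_index, window=1):
    """Return the matched sentence plus up to `window` neighbours on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])

doc = (
    "RAG retrieves documents. It then builds a prompt. "
    "The LLM answers. Citations are optional."
)
sentences = split_sentences(doc)
# Suppose retrieval matched sentence 2 ("The LLM answers.").
context = expand_window(sentences, hit_index=2, window=1)
print(context)
```

Matching on a single sentence keeps the embedding focused, while the wider window gives the generator enough surrounding text to interpret it.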

Multi-hop retrieval

For complex questions requiring information from multiple documents, iterative retrieval strategies are used:

  • Iterative RAG: The model retrieves, generates a partial answer, then retrieves again based on the updated context.
  • Tree of retrieval: Multiple retrieval paths are explored in parallel and merged.
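
A minimal sketch of the iterative pattern is given below. The control flow (retrieve, generate, decide whether a follow-up query is needed) is the point of the example; the `retrieve` and `generate` callables, the two-fact knowledge base, and the rule-based "generator" are hypothetical stand-ins for a real retriever and LLM.

```python
def iterative_rag(question, retrieve, generate, max_hops=3):
    """Alternate retrieval and generation until an answer needs no follow-up query.

    `generate(question, context)` must return (answer, follow_up_query),
    where follow_up_query is None once the answer is complete.
    """
    context = []
    query = question
    for _ in range(max_hops):
        new_chunks = [c for c in retrieve(query) if c not in context]
        if not new_chunks:
            break  # nothing new was found; stop retrieving
        context.extend(new_chunks)
        answer, follow_up = generate(question, context)
        if follow_up is None:
            return answer, context
        query = follow_up
    answer, _ = generate(question, context)
    return answer, context

# --- Toy stand-ins (assumptions for illustration only) ---
FACTS = {
    "inception": "Inception was directed by Christopher Nolan.",
    "nolan": "Christopher Nolan was born in London.",
}

def toy_retrieve(query):
    return [fact for key, fact in FACTS.items() if key in query.lower()]

def toy_generate(question, context):
    joined = " ".join(context)
    if "born in London" in joined:
        return "London", None
    if "Christopher Nolan" in joined:
        return None, "Where was Christopher Nolan born?"
    return None, None

answer, used = iterative_rag(
    "Where was the director of Inception born?", toy_retrieve, toy_generate
)
print(answer, len(used))  # London 2
```

The `max_hops` limit bounds latency and cost, since each hop adds one retrieval call and one generation call.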

Self-RAG

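Asai et al. (2023) proposed Self-RAG, in which the LLM learns to decide when retrieval is needed, retrieve on demand, and critique its own output for faithfulness to the retrieved sources. These behaviours are expressed through special reflection tokens, which the generator learns to emit through ordinary next-token training on data annotated offline by a separate critic model, rather than through reinforcement learning.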

Evaluation

RAG systems are evaluated on multiple dimensions:

  • Retrieval quality: Precision, recall, and nDCG of the retriever.
  • Faithfulness: Whether the generated answer is supported by the retrieved documents (measured by NLI-based metrics or LLM-as-judge).
  • Answer relevance: Whether the response actually addresses the user's question.
  • Context relevance: Whether the retrieved chunks are relevant to the query.
  • Hallucination rate: The fraction of generated claims not supported by retrieved sources.
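
The retrieval-quality metrics in the list above can be computed as in the following sketch, which assumes binary relevance judgements (a document is either relevant or not). The function names and document identifiers are invented for the example; the generation-side metrics such as faithfulness require an NLI model or LLM judge and are not shown.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(retrieved[:k])
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # retriever output, best first
relevant = {"d1", "d2", "d5"}          # ground-truth relevant documents

print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(recall_at_k(retrieved, relevant, 4))
print(ndcg_at_k(retrieved, relevant, 4))
```

Unlike precision and recall, nDCG rewards placing relevant documents earlier in the ranking, which matters when only the first few chunks fit in the prompt.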

Evaluation frameworks include RAGAS, TruLens, and DeepEval.

Comparison with alternatives

Approach | Advantages | Disadvantages
RAG | No retraining; updatable knowledge; citable sources | Retrieval latency; chunk quality dependency; context window limits
Fine-tuning | Deep domain adaptation; no retrieval overhead | Expensive; knowledge frozen at fine-tune time; catastrophic forgetting
Long-context models | Simple; no retrieval pipeline needed | Expensive at inference; "lost in the middle" degradation; supplied documents must still fit in the context window
Knowledge graphs | Structured reasoning; precise relationships | Expensive to build and maintain; limited natural language coverage

In practice, these approaches are often combined: a fine-tuned model with RAG over a domain-specific corpus is a common enterprise pattern.

Applications

  • Enterprise search and Q&A: Answering questions over internal documentation, legal contracts, or technical manuals.
  • Customer support: Grounding chatbot responses in product documentation and knowledge bases.
  • Healthcare: Retrieving from medical literature to support clinical decision-making.
  • Legal: Searching case law and statutes to support legal research.
  • Code generation: Retrieving relevant code examples and documentation to improve code completion.
  • Research: Literature-grounded question answering over scientific papers.

References

  • Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
  • Guu, K. et al. (2020). "REALM: Retrieval-Augmented Language Model Pre-Training." ICML 2020.
  • Borgeaud, S. et al. (2022). "Improving Language Models by Retrieving from Trillions of Tokens." ICML 2022.
  • Izacard, G. et al. (2023). "Atlas: Few-shot Learning with Retrieval Augmented Language Models." JMLR 2023.
  • Asai, A. et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR 2024.