AI Concepts

What Is RAG

Overview

RAG stands for Retrieval-Augmented Generation. It is an architecture pattern where a language model retrieves relevant context from a trusted knowledge base before generating a response.

RAG is the most practical and widely deployed method for making AI systems factually reliable in production. It addresses the two problems that make raw LLMs unsuitable for most business use cases: knowledge cutoffs and hallucination.

Why RAG Exists

Foundation models are trained on large datasets up to a cutoff date. After that cutoff, they know nothing about what has changed. More critically, they do not know anything that was never public — your internal policies, your product specifications, your pricing, your procedures.

When a foundation model is asked something outside its training data, it does not say "I don't know." It generates a plausible-sounding answer. This is hallucination — confident output that is factually wrong. In a business context, this is a liability.

RAG fixes this by separating the knowledge store from the model. The model reasons and generates language. The knowledge store provides factual grounding. At query time, the system retrieves the most relevant content and injects it into the model's context window before generating a response. The result is an AI system that answers from your approved, current sources — not from the model's frozen training data.

How RAG Works

Step 1: Ingestion

Source documents are collected, cleaned, and prepared. Sources can include internal documentation, product manuals, policy files, FAQ content, CRM data, support transcripts, or any structured or unstructured text. Quality at this step determines quality throughout the system.

Step 2: Chunking

Documents are split into chunks — smaller units that can be independently retrieved. Chunk size is a design decision with real consequences. Chunks that are too small lose context and produce retrieval misses. Chunks that are too large dilute relevance and increase latency and cost.

Typical starting points: 256 to 512 tokens for Q&A and support use cases; 512 to 1,024 tokens for longer-form document retrieval. Overlap between chunks of 10 to 20 percent reduces the risk of splitting important information across boundaries. Test multiple configurations on your actual documents before committing to one.
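As a rough sketch of the chunking step, word counts stand in for tokens below; a real pipeline would count with the embedding model's own tokenizer, and the size and overlap values are just the starting points discussed above:

```python
def chunk_words(text, chunk_size=256, overlap_ratio=0.15):
    """Split text into word-based chunks with fractional overlap.

    Words approximate tokens here; swap in your embedding model's
    tokenizer for exact counts. overlap_ratio keeps the tail of each
    chunk repeated at the head of the next, reducing the risk of
    splitting a fact across a boundary.
    """
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Running this over a 1,000-word document at chunk_size=100 with 10 percent overlap yields chunks whose last 10 words repeat as the next chunk's first 10 — the boundary insurance the overlap is there to provide.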

Step 3: Embedding and Indexing

Each chunk is converted into a vector representation — an embedding — that encodes its semantic meaning. These vectors are stored in a retrieval index. Purpose-built vector databases — Pinecone, Weaviate, Qdrant, Chroma, or pgvector for PostgreSQL — support fast similarity search across millions of vectors.

Embedding model choice affects retrieval quality directly. OpenAI's text-embedding-3-large and Cohere's embed-v3 are strong general-purpose options. Domain-specific embedding models can outperform general ones for specialist content.
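The indexing step can be sketched as a toy in-memory index. In production you would use one of the vector databases named above, and the vectors would come from your chosen embedding model rather than being supplied by hand:

```python
import math

class VectorIndex:
    """Minimal in-memory vector index with cosine-similarity search.

    A stand-in for a real vector database; add() would receive vectors
    produced by your embedding model at ingestion time, and search()
    would receive the query embedded with that same model.
    """
    def __init__(self):
        self.entries = []  # list of (chunk_id, vector)

    def add(self, chunk_id, vector):
        self.entries.append((chunk_id, vector))

    @staticmethod
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query_vec, k=5):
        # Score every stored vector against the query and keep the top k.
        scored = [(cid, self.cosine(query_vec, v)) for cid, v in self.entries]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The brute-force scan here is O(n) per query; the purpose-built databases exist precisely because approximate nearest-neighbour indexes make this fast at millions of vectors.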

Step 4: Retrieval

When a user submits a query, it is embedded using the same model used at indexing time. The system searches the index for the chunks most similar to the query vector.

Dense retrieval. Pure vector similarity search. Strong for semantic questions where keyword matching would fail.

Sparse retrieval (BM25). Traditional keyword-based search. Strong for exact-term matching — product codes, names, specific phrases.

Hybrid retrieval. Combines dense and sparse scores with weighted fusion. Outperforms either alone for most production use cases. Hybrid is the recommended default.
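A minimal sketch of weighted hybrid fusion, assuming each retriever returns a score per document id; the min-max normalisation and the 0.5 dense weight are one reasonable choice, not a standard, and should be tuned on your own queries:

```python
def hybrid_fuse(dense_scores, sparse_scores, alpha=0.5):
    """Weighted fusion of dense (vector) and sparse (BM25) scores.

    Scores are min-max normalised per retriever so the two scales are
    comparable before mixing. alpha is the dense weight: 1.0 is pure
    dense, 0.0 is pure sparse.
    """
    def normalise(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalise(dense_scores), normalise(sparse_scores)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda t: t[1], reverse=True)
```

A document that scores well on both lists (an exact product-code match that is also semantically close) rises above documents strong on only one — which is exactly why hybrid beats either alone.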

Step 5: Reranking

The top-k retrieved chunks are passed to a reranker — a model trained to score relevance more precisely than embedding similarity alone. Rerankers like Cohere Rerank or cross-encoder models read both the query and each candidate chunk together. Reranking adds latency but meaningfully improves precision. The quality gain is almost always worth the cost.
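The reranking step can be sketched with the scoring model abstracted behind a callable; in production score_fn would wrap a cross-encoder model or a rerank API, and the token-overlap scorer in the usage below is only a toy stand-in:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Rerank retrieved chunks with a cross-encoder-style scorer.

    score_fn(query, chunk_text) -> float is a placeholder for a real
    reranker that reads query and chunk together. Keeps the top `keep`
    chunks, implementing the retrieve-wide-then-rerank-narrow pattern.
    """
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [chunk for chunk, _ in scored[:keep]]

# Toy scorer for illustration only: shared-word count between
# query and chunk. A real reranker is a learned model.
def word_overlap(query, chunk):
    return len(set(query.split()) & set(chunk.split()))
```

Usage: `rerank("reset my password", candidates, word_overlap, keep=2)` returns the two candidates sharing the most words with the query.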

Step 6: Augmentation and Generation

The highest-scoring chunks are injected into the model's prompt as context. The model is instructed to answer using only that context and to cite its sources. Citation discipline is not optional for production systems. Without it, users cannot verify answers and the system cannot be audited. Every factual claim in a RAG output should trace to a specific retrieved chunk.
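One way to assemble the grounded prompt; the instruction wording and the `[n]` citation convention are a reasonable template, not a fixed standard, and the chunk field names are illustrative:

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to retrieved context
    and requires numbered citations.

    chunks: list of dicts with illustrative keys 'source' and 'text'.
    """
    context = "\n\n".join(
        f"[{i + 1}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [n] "
        "after each claim. If the context does not contain the "
        "answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks in the prompt is what makes citations auditable: a claim tagged [2] can be checked against exactly one retrieved source.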

Critical Design Decisions

Chunk size and overlap. The single most impactful implementation choice. Wrong chunking creates retrieval failures that no amount of model quality can fix. Test multiple configurations on at least 50 representative queries before committing.

Retrieval depth (top-k). Too few and you miss relevant content. Too many and you dilute the context window. Typical pattern: retrieve top-20, rerank to top-5.

Index freshness. How often must the index be refreshed? For pricing, availability, and regulatory content the answer is usually daily or real-time. Stale indexes are a business risk — a system confidently quoting last quarter's pricing is worse than no AI at all.

Guardrails and fallback. What happens when retrieval returns nothing relevant? The system must have a defined fallback — escalate to human or clearly signal low confidence. A system that generates when retrieval returns nothing is a hallucination machine with a RAG label.
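A minimal sketch of the fallback gate, assuming retrieval returns (chunk, similarity) pairs; the 0.75 threshold and the fallback wording are illustrative and should be calibrated against labelled queries for your own deployment:

```python
FALLBACK_MESSAGE = (
    "I couldn't find this in our documentation. "
    "Routing you to a human specialist."
)

def answer_or_escalate(results, min_similarity=0.75, generate=None):
    """Gate generation on retrieval confidence.

    results: list of (chunk, similarity_score) from retrieval.
    generate: callable producing an answer from confident chunks;
    it is only invoked when at least one chunk clears the threshold.
    """
    confident = [(c, s) for c, s in results if s >= min_similarity]
    if not confident:
        # Nothing relevant retrieved: escalate instead of generating.
        return {"escalate": True, "answer": FALLBACK_MESSAGE}
    return {"escalate": False, "answer": generate(confident)}
```

The key property is that the language model is never called when retrieval comes back empty-handed — the gate closes before generation, not after.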

RAG vs Fine-Tuning

Use RAG when: information changes frequently; answers must trace to a specific source; you need to update knowledge without retraining; the primary problem is factuality, not style.

Use fine-tuning when: style and tone consistency is the primary problem; you are doing a specialised task like legal clause extraction; you have thousands of high-quality labelled examples.

Start with RAG. Add fine-tuning only when behaviour consistency is still a problem after prompt engineering is exhausted. The most common mistake: jumping to fine-tuning when the underlying problem was retrieval quality.

Quality Metrics for Production RAG

Retrieval recall: what percentage of queries retrieve at least one relevant chunk? Target above 90 percent.

Retrieval precision: of the chunks retrieved, what proportion are actually relevant? Low precision increases hallucination risk even when recall is high.

Answer grounded rate: what percentage of answers are fully supported by retrieved chunks? Measure with an LLM-as-judge pipeline or structured human labelling.

Hallucination rate: what percentage of answers contain claims not supported by retrieved context? The metric that matters most for trust and compliance.

Answer latency: end-to-end time from query to answer. P95 latency above 8 to 10 seconds typically causes abandonment in customer-facing applications.

Resolution rate: what percentage of queries are resolved without escalation? This ties retrieval and generation quality to the business outcome.
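The first two metrics above can be computed mechanically from a labelled evaluation set along these lines; the field names are illustrative, and the relevance labels are assumed to come from human judgement or an LLM-as-judge pass:

```python
def retrieval_metrics(evals):
    """Compute retrieval recall and mean precision over labelled queries.

    evals: list of dicts with illustrative keys
      'retrieved' - list of chunk ids the system returned
      'relevant'  - set of chunk ids judged relevant for the query

    Recall here is the fraction of queries retrieving at least one
    relevant chunk (the >90 percent target); precision is averaged
    per query over the retrieved lists.
    """
    hits = 0
    precisions = []
    for e in evals:
        retrieved, relevant = e["retrieved"], set(e["relevant"])
        if relevant & set(retrieved):
            hits += 1
        if retrieved:
            precisions.append(len(relevant & set(retrieved)) / len(retrieved))
    return {
        "recall": hits / len(evals),
        "precision": sum(precisions) / len(precisions),
    }
```

Grounded-answer rate and hallucination rate need a judge (model or human) rather than set arithmetic, which is why they are the more expensive metrics to run continuously.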

Common Failure Modes

Chunking too coarse or too fine. Fix: test configurations on representative queries before production.

Stale index. Fix: define index refresh frequency as a production SLA, not an afterthought.

No reranking. Fix: add a reranker — the latency cost is almost always worth the quality gain.

No citation requirement. Fix: enforce citation in the system prompt and validate in the evaluation pipeline.

No fallback for low-confidence retrieval. Fix: set a minimum similarity threshold below which the system escalates rather than generates.

Treating all documents equally. Fix: build source quality and recency signals into your retrieval scoring.


Talk to an AI Implementation Expert

If you want to deploy a production-grade RAG system, book a working session.

Book a call: https://calendly.com/ai-creation-labs/30-minute-chatgpt-leads-discovery-call

We can cover:

  • RAG architecture and tooling choices for your use case
  • chunking strategy and retrieval configuration
  • evaluation setup and quality benchmarks
  • governance, citation standards, and rollout plan
