Part 5 | RAG — Giving Claude a Memory | Chapter 16

What RAG Is and When You Need It

Everyone says they're doing RAG. Almost nobody is doing it well — because they skipped the part where you decide whether you need it at all.

The fastest way to build a bad AI product is to add RAG to something that didn't need it. I watch teams do this every week. They read a blog post about retrieval-augmented generation, decide their chatbot needs a vector database, spend three sprints wiring up an embedding pipeline, and end up with a system that answers questions worse than Claude does out of the box — because the retrieval step is injecting irrelevant context that confuses the model. They didn't have a retrieval problem. They had a prompting problem. And now they have both.

RAG is genuinely powerful when applied to the right problem. But it has become the "microservices" of the AI world — a pattern that teams adopt because it sounds sophisticated, not because their architecture demanded it. Before I walk you through how to build a RAG system that actually works, I need you to understand what RAG does, what it replaces, and the specific failure modes that make the difference between a system that grounds answers in real data and one that hallucinates with extra steps.

The Core Mechanic

Retrieval-augmented generation is a two-phase pattern. Phase one: given a user's question, search a knowledge base for documents that might contain the answer. Phase two: stuff those retrieved documents into the prompt alongside the question, and ask the language model to generate a response grounded in that context.

That's it. Every RAG system in existence — from a startup's MVP to Google's internal knowledge systems — follows this loop: question in, search happens, context assembled, model answers.

The flow looks like this in practice:

  1. A user asks a question.
  2. The question gets converted into a numerical representation (an embedding) that captures its semantic meaning.
  3. That representation is compared against a database of pre-embedded documents to find the closest matches.
  4. The top matches are pulled out and formatted into a context block.
  5. The context block and the original question are sent to Claude together.
  6. Claude generates an answer that draws on both the retrieved context and its own reasoning.

The critical insight is in step 5: Claude is not searching its own memory. It is reading documents you handed it, the same way a human expert reads a briefing packet before answering questions. The model's job is synthesis and reasoning. The retrieval system's job is making sure the right briefing packet lands on the desk.
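The six steps can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production retriever: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, and `retrieve` and `build_prompt` are illustrative names. The shape of the loop is the point: embed the question, compare it against the documents, keep the closest matches, and assemble the prompt that would be sent to Claude.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Steps 2-4: embed the question, score every document, keep the closest matches."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Step 5: combine the context block and the original question into one prompt."""
    context = "\n\n".join(context_docs)
    return f"<context>\n{context}\n</context>\n\nQuestion: {question}"

docs = [
    "refunds are issued within 14 days of purchase",
    "our office is closed on public holidays",
    "shipping takes three to five business days",
]
question = "how many days do refunds take"
top = retrieve(question, docs, top_k=1)
prompt = build_prompt(question, top)  # step 6 would send this prompt to Claude
```

Swapping `embed` for a real embedding model changes the quality of the matches, not the structure of the loop.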

Framework · The Briefing Packet Rule · BPR

A RAG system is only as good as the briefing packet it assembles. If you hand Claude the wrong documents, it will confidently synthesize the wrong answer. If you hand it no documents, it falls back to general knowledge. The retrieval layer is the editor — the model is the writer.

This separation matters because it tells you where to debug. When a RAG system gives a bad answer, the problem is almost always in retrieval (wrong documents surfaced) or context assembly (right documents surfaced but formatted badly), not in the model itself. I've seen teams burn weeks fine-tuning models when the actual fix was adjusting chunk sizes in their document splitter.

Retrieval vs. Generation: Different Jobs

The distinction between retrieval and generation is not just architectural — it reflects two fundamentally different kinds of intelligence.

Retrieval is about locating information that already exists. It depends on external knowledge stores — your company's documentation, your product's FAQ, your customer's uploaded files. The tools involved are search-oriented: vector databases, keyword indexes, embedding models that convert text into comparable numerical representations. The goal is to introduce accurate, relevant context into the pipeline.

Generation is about producing something new from what was found. It relies on the model's internal understanding of language, logic, and structure. The tools involved are language models — Claude, in our case — that can take a pile of context and a question and produce a coherent, useful answer. The goal is to transform raw context into a clear response.

✕ Retrieval alone
  • Returns raw document fragments
  • No synthesis or reasoning
  • User has to read and interpret themselves
  • Fast but unhelpful for complex questions
✓ Retrieval + Generation (RAG)
  • Returns a composed answer
  • Synthesizes across multiple sources
  • User gets a direct, grounded response
  • Slower but dramatically more useful

Neither phase works well without the other in the scenarios where RAG matters. Retrieval without generation gives you a search engine — useful, but not what users expect when they ask an AI a question. Generation without retrieval gives you a model that can only draw on its training data, which is months or years stale, and can never access your proprietary documents at all.

When You Actually Need RAG

RAG earns its complexity in four specific scenarios. If your use case doesn't fit one of these, you probably don't need it — and you should resist the urge to add it anyway.

Domain-specific question answering. Your users ask questions about content that Claude has never seen: internal company policies, proprietary product specifications, customer account details, or legal documents specific to your jurisdiction. Claude's training data cannot contain your company's Q3 earnings report or your internal engineering runbook. Retrieval bridges that gap by injecting the relevant document sections at query time.

Support automation with authoritative sources. You need answers that are grounded in your actual documentation, not in Claude's general understanding of a topic. When a customer asks "What is your refund policy?", the answer must come from your policy document, not from Claude's inference about what refund policies typically look like. RAG puts your source of truth in front of the model at query time, so the answer can quote and cite it.

Long document analysis. The source material is too large to fit in a single prompt — or there's too much of it to send everything every time. A 500-page technical manual, a codebase with thousands of files, a corpus of research papers. Retrieval selects the relevant sections so the model works with focused context instead of drowning in noise.

Freshness-sensitive applications. The information changes frequently. Stock prices, product inventory, news events, regulatory updates. Claude's training data has a cutoff. RAG lets you pull from a live data source that reflects the current state of the world.

Key takeaway

If your data is public, static, and small enough to fit in Claude's context window, you do not need RAG. Use a well-structured prompt with the data pasted directly in. RAG is for when the knowledge is private, voluminous, or changes faster than model retraining cycles.

A Minimal RAG Pipeline in Python

The simplest possible RAG system has three functions: load documents, search them, and ask Claude to answer using what the search found. Here's a version stripped to its essentials:

import os
import string
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

STOP_WORDS = {"what", "is", "the", "a", "an", "of", "to", "in", "for",
              "and", "or", "how", "does", "do", "can", "are", "this"}

def load_document(path: str) -> list[str]:
    """Load a text file and return it as a list of lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def clean(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

def search(question: str, lines: list[str]) -> list[str]:
    """Keyword search: return lines containing any meaningful word from the question."""
    keywords = {w for w in clean(question).split() if w not in STOP_WORDS and len(w) > 2}
    # Tokenize each line once, then check for any shared word with a set intersection.
    return [line for line in lines if keywords & set(clean(line).split())]

def ask_claude(question: str, context_lines: list[str]) -> str:
    """Send the question and retrieved context to Claude for a grounded answer."""
    context = "\n".join(context_lines)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Answer the following question using ONLY the context provided.\n\n"
                f"<context>\n{context}\n</context>\n\n"
                f"Question: {question}"
            ),
        }],
    )
    return response.content[0].text

# Usage
knowledge = load_document("knowledge.txt")
question = "What is prompt engineering?"
matches = search(question, knowledge)

if matches:
    answer = ask_claude(question, matches)
    print(answer)
else:
    print("No relevant information found in the knowledge base.")

This works. It's ugly, it uses keyword matching instead of semantic search, and it won't scale past a few hundred lines of text. But it demonstrates the complete RAG loop: load, retrieve, generate. Every production system I've built started from something this simple before adding embeddings, ranking, and caching.

Why keyword search first?

Starting with keyword matching instead of embeddings forces you to understand what retrieval is doing before you abstract it away. If your keyword search returns garbage, your problem is in document preparation — not in your embedding model. Fix the foundation before you add layers.

The Four Failure Modes

Every RAG system I've debugged has failed in one of four ways. Know these before you build, and you'll save yourself weeks.

Wrong documents retrieved. The search returns documents that match keywords but miss the semantic intent. The user asks about "Python decorators" and gets back a document about snake habitats because it mentions "python." This is a retrieval precision problem — your search isn't smart enough to distinguish meaning from surface patterns.

Right documents, wrong section. The correct document is retrieved, but the chunk that lands in context is the introduction or the table of contents instead of the paragraph that actually answers the question. This is a chunking problem — your document splitting strategy created chunks that don't align with meaningful content boundaries.

Context overflow. Too many documents are retrieved and stuffed into the prompt, diluting the signal. Claude spends tokens processing irrelevant context and either misses the answer buried in the noise or synthesizes a mushy average of everything it read. This is a ranking problem — you need to be more selective about what makes the cut.

Hallucination despite context. Claude has the right documents in context but generates an answer that goes beyond what the documents say, blending retrieved facts with its own training data. This is a prompt engineering problem — your instructions to Claude aren't strict enough about staying within the provided context.

I've seen a team spend three weeks tuning their embedding model to improve retrieval, only to discover that the problem was a missing instruction in the system prompt. They had told Claude to "answer using the context below" but never told it to refuse when the context was insufficient. Claude, being helpful by default, filled the gap with training data. Adding one sentence ("If the context does not contain the answer, say so explicitly") fixed 60% of their hallucination issues overnight. No embedding changes. No reranking. Just a better prompt.

The context-only instruction

Never send context to Claude without an explicit grounding instruction. The prompt must tell Claude: use only these sources, cite which source you drew from, and say when the sources are insufficient. Without this, you get a system that confidently blends retrieved facts with general knowledge — and you cannot tell which parts are grounded and which are hallucinated.
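One way to make the grounding instruction hard to forget is to build it into the function that assembles the prompt. The sketch below is illustrative: `build_grounded_prompt` and the `<source id="...">` tag layout are assumptions of this example, not a required format. It encodes the three rules above: use only these sources, cite the source, and say when the sources are insufficient.

```python
def build_grounded_prompt(question: str, sources: dict[str, str]) -> str:
    """Assemble a prompt with labeled sources and explicit grounding rules.

    `sources` maps a source id (hypothetical naming, e.g. a file name) to its text.
    """
    source_blocks = "\n".join(
        f'<source id="{source_id}">\n{text}\n</source>'
        for source_id, text in sources.items()
    )
    rules = (
        "Answer using ONLY the sources below.\n"
        "After each claim, cite the id of the source it came from.\n"
        "If the context does not contain the answer, say so explicitly."
    )
    return f"{rules}\n\n{source_blocks}\n\nQuestion: {question}"

grounded = build_grounded_prompt(
    "What is the refund window?",
    {"refund-policy.md": "Refunds are issued within 14 days of purchase."},
)
```

Because every source carries an id, a reviewer can check each citation in the answer against the text that was actually sent.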

The RAG Decision Tree

Before you commit to building a RAG system, run your use case through this decision sequence. Each question eliminates a class of unnecessary complexity.

Is the data small enough to paste into the prompt? If your entire knowledge base is under 50,000 tokens (roughly 37,000 English words, or about 75 pages at 500 words per page), you don't need retrieval at all. Send the whole thing as context. Claude's context window is large enough for this, and you eliminate the entire retrieval pipeline's failure modes. This approach works for most single-document use cases: one policy manual, one product spec, one API reference.

Is the data static? If your knowledge base changes less than once per month and fits within context, consider a cached approach where you pre-compute the prompt with the full document and reuse it across queries. No vector database, no embedding costs, no chunking strategy to tune.
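One possible implementation of the cached approach is Anthropic's prompt caching, where a `cache_control` marker on a system block asks the API to reuse that prefix across requests. The chapter does not say which caching mechanism it has in mind, so treat this mapping as an assumption. The helper name `build_cached_request` is hypothetical. It only builds the request arguments, so you can inspect them without making a network call.

```python
def build_cached_request(document: str, question: str) -> dict:
    """Build keyword arguments for client.messages.create with the document marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "Answer using ONLY the document below."},
            {
                "type": "text",
                "text": document,
                # Marks the prompt prefix up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Only the question changes between requests; the cached prefix stays identical.
        "messages": [{"role": "user", "content": question}],
    }

request = build_cached_request("Refunds are issued within 14 days.", "What is the refund window?")

# Usage (requires the anthropic package and an API key):
# response = client.messages.create(**request)
```

The document must be byte-for-byte identical across requests for the cache to hit, and very short prompts may fall below the minimum cacheable length, so check the current documentation before relying on it.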

Do you need sourced answers? If your users need to verify answers against original documents — legal, medical, financial, or compliance use cases — RAG's source attribution capability becomes essential. The alternative, Claude answering from training data, provides no audit trail and no way for the user to confirm accuracy.

Is latency critical? RAG adds latency to every query for the embedding API call, the retrieval step, and the ranking step, often on the order of 200-500 ms in total depending on your stack. If your application requires sub-second responses, pre-compute retrievals or cache frequent queries; otherwise, accept the latency as part of the UX.
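The first two questions in the sequence are mechanical enough to encode. The sketch below is a rough aid, not a rule: `estimate_tokens` assumes about four characters per token for English text (a heuristic, not a tokenizer), and `recommend_approach` and its return strings are hypothetical names. The sourcing and latency questions stay as judgment calls.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token for English text."""
    return len(text) // 4

def recommend_approach(corpus: str, changes_per_month: float,
                       inline_limit: int = 50_000) -> str:
    """Apply the size and staticness questions from the decision sequence."""
    if estimate_tokens(corpus) >= inline_limit:
        return "rag"                          # too large to send with every query
    if changes_per_month < 1:
        return "cached full-document prompt"  # small and static
    return "paste full document"              # small but changing

small_static = recommend_approach("policy text " * 100, changes_per_month=0.2)
small_fresh = recommend_approach("policy text " * 100, changes_per_month=8)
large = recommend_approach("x" * 400_000, changes_per_month=0.2)
```

For a real decision, replace the character heuristic with an actual token count for your model.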

From Toy to Production: The Gap

The pipeline above is a sketch. Production RAG systems close four gaps that the sketch ignores:

Semantic search replaces keyword matching. Instead of checking whether the word "prompt" appears in a line, you convert both the question and every document chunk into high-dimensional vectors and find the closest matches by mathematical similarity. This catches paraphrases, synonyms, and conceptual relationships that keywords miss entirely.

Chunking replaces line splitting. Real documents aren't organized one-fact-per-line. You need strategies for splitting PDFs, Markdown files, and HTML pages into chunks that preserve meaning — and a way to decide how big those chunks should be.
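As a starting point, a chunker can group whole paragraphs until a size limit is reached, so no chunk cuts a paragraph in half. This is one simple strategy among many, shown under assumptions: paragraphs are separated by blank lines, size is measured in characters rather than tokens, and `chunk_paragraphs` is an illustrative name.

```python
def chunk_paragraphs(text: str, max_chars: int = 800) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate      # still fits, or first paragraph of a new chunk
        else:
            chunks.append(current)   # close the full chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_paragraphs("aaa\n\nbbb\n\nccc", max_chars=8)
```

With `max_chars=8`, the first two paragraphs fit together and the third starts a new chunk.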

Ranking replaces "take the first N matches." A retriever casts a wide net. A ranker narrows it down to the documents that are genuinely most useful for answering this specific question. Two-stage retrieval — broad retrieval followed by precise reranking — is the standard pattern in production systems.
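The two-stage pattern can be sketched without any external services. In this toy version, stage one keeps every chunk that shares a word with the question, and stage two scores the candidates and keeps the best few. The Jaccard word-overlap score is a stand-in chosen for this example; a production system would typically use a trained reranking model there. All function names are hypothetical.

```python
def tokenize(text: str) -> set[str]:
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def broad_retrieve(question: str, chunks: list[str]) -> list[str]:
    """Stage 1: wide net. Keep any chunk sharing at least one word with the question."""
    q = tokenize(question)
    return [c for c in chunks if q & tokenize(c)]

def rerank(question: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Stage 2: score each candidate by Jaccard word overlap and keep the best top_n."""
    q = tokenize(question)

    def score(chunk: str) -> float:
        words = tokenize(chunk)
        union = q | words
        return len(q & words) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:top_n]

chunks = [
    "python decorators wrap functions",
    "python snakes live in warm habitats",
    "java generics",
]
candidates = broad_retrieve("python decorators", chunks)
best = rerank("python decorators", candidates, top_n=1)
```

Stage one lets the snake document through because it mentions "python"; stage two drops it because it overlaps less with the question.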

Evaluation replaces eyeballing. You need metrics: retrieval precision, answer faithfulness, context relevance. Without them, you're guessing whether your changes are improvements or regressions.
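Retrieval precision is the easiest of these metrics to compute yourself. The sketch below assumes you have labeled, for each test question, which document ids are relevant. It also includes recall, which the list above does not mention but which is usually tracked alongside precision. Function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

retrieved = ["a", "b", "c", "d"]   # ids in the order the retriever returned them
relevant = {"a", "c", "e"}         # hand-labeled ground truth for one question
p2 = precision_at_k(retrieved, relevant, k=2)
r4 = recall_at_k(retrieved, relevant, k=4)
```

Answer faithfulness and context relevance need a judgment about meaning, so they are usually scored by a human reviewer or a second model call rather than by set arithmetic.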

The next two chapters walk through each of these gaps in detail — chunking and embeddings in Chapter 17, then a full production pipeline with retrieval, ranking, and MCP integration in Chapter 18.

What to Do Monday Morning

Audit your current AI integration for RAG necessity

Before adding retrieval, answer this: is Claude giving bad answers because it lacks access to specific documents, or because your prompts are vague? Test by pasting the relevant document directly into your prompt. If that fixes the answer quality, you have a retrieval problem and RAG is warranted. If it doesn't, you have a prompting problem and RAG will only add complexity.

Build the minimal keyword pipeline

Take the code from this chapter, point it at a real knowledge base file (your company FAQ, product docs, or internal wiki export), and run ten questions through it. Record which questions get good answers and which don't. This gives you a baseline before you add any sophisticated retrieval.

Categorize your failures

For every bad answer from your baseline pipeline, label the failure mode: wrong documents, wrong section, context overflow, or hallucination despite context. Tally them. The category with the most failures tells you what to fix first: retrieval, chunking, ranking, or prompt engineering, respectively.
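The tally itself is a few lines with `collections.Counter`. The label strings and the `FIX_FOR` mapping below follow the four failure modes from this chapter; the function name `first_fix` is hypothetical.

```python
from collections import Counter

# Failure mode -> the layer to fix, as described in "The Four Failure Modes".
FIX_FOR = {
    "wrong documents": "retrieval",
    "wrong section": "chunking",
    "context overflow": "ranking",
    "hallucination despite context": "prompt engineering",
}

def first_fix(labels: list[str]) -> tuple[str, str, int]:
    """Return (most common failure mode, layer to fix first, count)."""
    if not labels:
        raise ValueError("no failures to tally")
    mode, count = Counter(labels).most_common(1)[0]
    return mode, FIX_FOR[mode], count

labels = [
    "wrong section", "wrong documents", "wrong section",
    "hallucination despite context", "wrong section",
]
result = first_fix(labels)
```

With only ten questions the counts are small, so treat a narrow lead as a hint rather than a verdict.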

Set up a test harness with ten golden questions

Write ten questions where you know the correct answer from your knowledge base. Store the question, the expected answer, and the source document. Run your pipeline against this set after every change. This is the minimum viable evaluation that keeps you honest.
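A minimal harness along these lines might look like the sketch below. It makes assumptions: the pipeline is any function that takes a question and returns the answer plus the names of the sources it retrieved, and an answer counts as correct if it contains an expected phrase, which is a crude check. `GoldenCase`, `Pipeline`, and `run_golden_set` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    question: str
    expected_answer: str  # a phrase the answer must contain
    source: str           # the document the answer should come from

# A pipeline takes a question and returns (answer, names of retrieved sources).
Pipeline = Callable[[str], tuple[str, list[str]]]

def run_golden_set(pipeline: Pipeline, cases: list[GoldenCase]) -> dict:
    """Run every golden question and record which ones fail, and why."""
    failures = []
    for case in cases:
        answer, sources = pipeline(case.question)
        retrieved_ok = case.source in sources
        answer_ok = case.expected_answer.lower() in answer.lower()
        if not (retrieved_ok and answer_ok):
            failures.append((case.question, retrieved_ok, answer_ok))
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}

def fake_pipeline(question: str) -> tuple[str, list[str]]:
    """Stand-in for a real RAG pipeline so the harness runs without an API call."""
    if "refund" in question:
        return "Refunds are issued within 14 days.", ["refund-policy.md"]
    return "I don't know.", []

cases = [
    GoldenCase("What is the refund window?", "14 days", "refund-policy.md"),
    GoldenCase("When is the office closed?", "public holidays", "hours.md"),
]
report = run_golden_set(fake_pipeline, cases)
```

Recording the retrieval check and the answer check separately tells you whether a failure started in retrieval or in generation.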

RAG is not a feature you add. It is an architecture you commit to — with retrieval quality, chunking strategy, and ranking precision as ongoing engineering obligations.