Chapter 7 — Advanced Topics
Part 1: RAG — Retrieval-Augmented Generation
RAG is one of the most widely deployed LLM techniques in production. When the answer isn't in the model's weights, RAG fetches it from external knowledge. This part builds the full picture, from chunking strategy to production gotchas.
Why LLMs Hallucinate and Why RAG Helps
LLMs encode world knowledge in their weights during training. But this creates two problems:
- Stale knowledge: Training data has a cutoff (GPT-4's is in 2023, depending on the version). Ask it about more recent events → hallucination or refusal.
- Domain knowledge gaps: Your company's internal documentation, your medical records, your codebase — none of these were in training data. The model can't know them.
RAG solves this by giving the model access to a searchable knowledge base at inference time:
User question
↓
[Retrieve relevant documents from knowledge base]
↓
Inject documents into the prompt as context
↓
LLM generates answer based on retrieved context
The LLM now reasons over retrieved facts rather than its memorized knowledge: hallucination drops substantially, and answers stay as current as the knowledge base.
The RAG Pipeline: Every Component
The RAG pipeline has two phases: an offline indexing phase (done once) and an online retrieval phase (done per query).
flowchart TD
subgraph Offline["Offline — Build the Index (run once)"]
D[Raw Documents] --> CH[Chunk into passages]
CH --> EM[Embed with embedding model]
EM --> VS[(Vector Store)]
end
subgraph Online["Online — Answer a Query (per request)"]
Q([User Query]) --> QE[Embed query]
QE --> RET[Retrieve top-k by similarity]
VS --> RET
RET --> CTX[Augment prompt with chunks]
CTX --> LLM[LLM generates grounded answer]
LLM --> ANS([Response])
end
style Offline fill:#fafaf9,stroke:#e7e5e4
style Online fill:#fafaf9,stroke:#e7e5e4
style VS fill:#f5f5f4,stroke:#a8a29e
Step 1: Document Processing (Offline)
from langchain.text_splitter import RecursiveCharacterTextSplitter
def process_documents(documents):
"""
Split documents into chunks for embedding.
Chunking strategy matters enormously:
- Too small: each chunk lacks context
- Too large: relevant info is diluted; hits context limits
- Overlap: ensures information at chunk boundaries isn't missed
"""
splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # Characters per chunk (experiment with this!)
chunk_overlap=64, # Characters of overlap between adjacent chunks
separators=["\n\n", "\n", ". ", " ", ""], # Split order of preference
)
return splitter.split_documents(documents)
Chunking strategies:
- Fixed size: Simple, predictable. E.g., chunk_size=512 tokens with overlap.
- Sentence-based: Split on sentence boundaries. More semantically coherent (see the sketch after this list).
- Semantic chunking: Use embeddings to find natural topic boundaries. Best quality, slowest.
- Hierarchical: Store both sentence-level and paragraph-level chunks. Retrieve fine, return coarse.
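Fixed-size chunking is what the RecursiveCharacterTextSplitter above implements. As a rough, self-contained sketch of the sentence-based strategy (the regex splitter and the max_chars budget are simplifications, not production-grade sentence detection):

import re

def sentence_chunks(text, max_chars=512):
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split on ., !, ? followed by whitespace (illustrative only)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks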
Step 2: Embedding (Offline)
Convert each chunk to a dense vector using an embedding model:
from sentence_transformers import SentenceTransformer
import numpy as np
# Load an embedding model
# Strong open-source options: bge-large-en-v1.5, e5-large-v2, nomic-embed-text
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
def embed_chunks(chunks):
texts = [chunk.page_content for chunk in chunks]
# Output: (n_chunks, embedding_dim) — e.g., (1000, 1024)
embeddings = embedder.encode(texts, normalize_embeddings=True)
return embeddings
Key embedding models:
| Model | Dim | Context | Quality |
|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 512 tokens | Excellent |
| text-embedding-3-small (OpenAI) | 1536 | 8191 tokens | Very good, paid |
| nomic-embed-text | 768 | 8192 tokens | Good, open |
| e5-large-v2 | 1024 | 512 tokens | Very good |
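Because the chunks are encoded with normalize_embeddings=True, cosine similarity between a query and a chunk reduces to a plain dot product, which is exactly what the inner-product FAISS index in Step 3 computes. A quick sanity check, reusing the embedder loaded above (the example sentences are arbitrary; bge models also support an optional query instruction prefix, see the model card):

query_emb = embedder.encode(["How do I reset my password?"], normalize_embeddings=True)
chunk_embs = embedder.encode(
    ["Navigate to Settings > Security to change your login credentials.",
     "Our office is closed on public holidays."],
    normalize_embeddings=True,
)
# With unit-length vectors, dot product == cosine similarity
similarities = chunk_embs @ query_emb[0]  # shape: (n_chunks,)
print(similarities)  # the security chunk should score higher than the holidays one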
Step 3: Vector Store (Offline)
Store embeddings in a vector database that supports similarity search:
import faiss
import numpy as np
class SimpleVectorStore:
def __init__(self, dimension):
# FAISS: Facebook AI Similarity Search
# IndexFlatIP: exact inner product search (for normalized vectors = cosine similarity)
self.index = faiss.IndexFlatIP(dimension)
self.chunks = []
def add(self, chunks, embeddings):
"""Add chunks and their embeddings to the store."""
self.index.add(embeddings.astype(np.float32))
self.chunks.extend(chunks)
def search(self, query_embedding, top_k=5):
"""Find top_k most similar chunks to the query."""
scores, indices = self.index.search(
query_embedding.reshape(1, -1).astype(np.float32),
top_k
)
results = [(self.chunks[i], scores[0][j])
for j, i in enumerate(indices[0]) if i >= 0]
return results
Vector databases for production:
| Database | Hosted | Open Source | Best For |
|---|---|---|---|
| Chroma | Local | ✅ | Dev/prototyping |
| FAISS | Local | ✅ | In-memory similarity search (library, not a full DB) |
| Pinecone | Cloud | ❌ | Managed, scalable |
| Weaviate | Both | ✅ | GraphQL, filters |
| Qdrant | Both | ✅ | Fast, Rust-based |
| pgvector | Self-host | ✅ | Postgres extension, familiar |
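For comparison with the hand-rolled FAISS store above, here is roughly the same index in Chroma. This is a sketch: the collection name and ids are made up, and texts, embeddings, and query_embedding are assumed to come from the earlier steps.

import chromadb

client = chromadb.Client()  # in-memory; chromadb.PersistentClient(path=...) persists to disk
collection = client.create_collection("docs")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(texts))],
    documents=texts,                 # raw chunk texts
    embeddings=embeddings.tolist(),  # precomputed embeddings from Step 2
)

results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=5)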
Step 4: Query & Retrieve (Online)
def retrieve(query, vector_store, embedder, top_k=5):
"""Embed query and retrieve most relevant chunks."""
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
results = vector_store.search(query_embedding, top_k=top_k)
return results
Step 5: Generate with Context (Online)
def rag_generate(query, retrieved_chunks, llm):
"""Build prompt with retrieved context and generate answer."""
# Format retrieved chunks as context
context = "\n\n".join([
f"[Document {i+1}]\n{chunk.page_content}"
for i, (chunk, score) in enumerate(retrieved_chunks)
])
# Prompt engineering matters here
prompt = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have that information."
Context:
{context}
Question: {query}
Answer:"""
return llm.generate(prompt)
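Putting the pieces together, the whole pipeline is only a handful of calls. Here documents is whatever corpus you loaded, and llm is any client exposing the generate(prompt) method assumed above:

# Offline: build the index once
chunks = process_documents(documents)
embeddings = embed_chunks(chunks)
store = SimpleVectorStore(dimension=embeddings.shape[1])
store.add(chunks, embeddings)

# Online: answer a query
query = "How do I reset my password?"
retrieved = retrieve(query, store, embedder, top_k=5)
answer = rag_generate(query, retrieved, llm)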
Advanced RAG: Beyond the Basics
Problem 1: The Query-Document Mismatch
The embedding model embeds the question: "How do I reset my password?" The document chunk: "To change your login credentials, navigate to Settings > Security."
These may not be similar enough despite being semantically related. The question talks about "reset password" and the document talks about "login credentials."
Solution: HyDE (Hypothetical Document Embeddings)
def hyde_retrieve(query, llm, vector_store, embedder):
"""
Instead of embedding the query directly,
generate a hypothetical answer and embed that.
"""
hypothetical_answer = llm.generate(
f"Write a short passage that answers: {query}"
)
    # Embed the generated answer (closer to document style); normalize to match the index
    query_embedding = embedder.encode([hypothetical_answer], normalize_embeddings=True)[0]
return vector_store.search(query_embedding)
Problem 2: Chunking Loses Context
A chunk might reference "the algorithm described above" but the chunk doesn't include what was described above.
Solution: Parent Document Retrieval
- Store small chunks for precise retrieval (embedding)
- But return the larger parent document as context
- This way retrieval is precise but context is complete
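A minimal sketch of the pattern, assuming each small chunk carries a parent_id in its metadata and parent_docs maps those ids to the full parent texts (both are illustrative, not a fixed API):

def retrieve_with_parents(query, vector_store, embedder, parent_docs, top_k=5):
    """Search over small chunks, but return full parent documents as context."""
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
    results = vector_store.search(query_embedding, top_k=top_k)
    # Deduplicate: several small chunks may point to the same parent document
    parent_ids = []
    for chunk, _score in results:
        pid = chunk.metadata["parent_id"]
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parent_docs[pid] for pid in parent_ids]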
Problem 3: First Retrieved ≠ Best
A dense vector search retrieves by semantic similarity. But sometimes lexically similar documents (keyword matching) are more relevant. "GPT-4" matches "GPT-4 architecture" perfectly but might not score highest on semantic embedding.
Solution: Hybrid Search + Reranking
from sentence_transformers import CrossEncoder

def hybrid_search(query, vector_store, bm25_index, corpus, embedder, top_k=20, final_k=5):
    """
    Combine dense (semantic) and sparse (keyword) retrieval with
    Reciprocal Rank Fusion (RRF), then rerank with a cross-encoder.
    Assumes corpus is the list of chunk texts and bm25_index is a
    rank_bm25.BM25Okapi built over the tokenized corpus.
    """
    # Dense retrieval (embed the query as in Step 4; convert results to plain text)
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
    dense_results = [chunk.page_content for chunk, _ in
                     vector_store.search(query_embedding, top_k=top_k)]
    # Sparse retrieval (BM25 keyword matching)
    bm25_results = bm25_index.get_top_n(query.split(), corpus, n=top_k)
    # Combine with Reciprocal Rank Fusion: score(doc) = sum of 1 / (60 + rank) over both lists
    rrf_scores = {}
    for results in (dense_results, bm25_results):
        for rank, doc in enumerate(results):
            rrf_scores[doc] = rrf_scores.get(doc, 0.0) + 1.0 / (60 + rank + 1)
    combined_results = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
    # Rerank with a cross-encoder (much more accurate, but slower)
    reranker = CrossEncoder("BAAI/bge-reranker-large")
    scores = reranker.predict([(query, doc) for doc in combined_results])
    # Return top final_k by reranking score
    reranked = sorted(zip(combined_results, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:final_k]]
Cross-encoder vs. bi-encoder:
- Bi-encoder (standard embedding): embeds query and documents separately, uses cosine similarity. Fast at retrieval, less accurate.
- Cross-encoder (reranker): takes (query, document) pair and predicts relevance score jointly. Slow (O(n) per query), but much more accurate. Use for reranking top-k.
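The difference is easiest to see side by side. A small sketch with the models named above (the candidate passages are arbitrary examples):

from sentence_transformers import SentenceTransformer, CrossEncoder

query = "How do I reset my password?"
passages = [
    "Navigate to Settings > Security to change your login credentials.",
    "Our office is closed on public holidays.",
]

# Bi-encoder: embed query and passages independently, compare with a dot product
bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
q_emb = bi_encoder.encode([query], normalize_embeddings=True)[0]
p_embs = bi_encoder.encode(passages, normalize_embeddings=True)
bi_scores = p_embs @ q_emb  # fast: passage embeddings can be precomputed offline

# Cross-encoder: score each (query, passage) pair jointly in one forward pass
cross_encoder = CrossEncoder("BAAI/bge-reranker-large")
cross_scores = cross_encoder.predict([(query, p) for p in passages])  # slower, more accurate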
Problem 4: Multi-Hop Reasoning
Question: "What is the capital of the country where Einstein was born?"
Step 1: Find where Einstein was born → Germany
Step 2: Find the capital of Germany → Berlin
Single-step RAG can't handle this. Solutions:
- Iterative RAG: Run retrieval, look at the results, and decide whether more retrieval is needed.
- Decomposition: The LLM decomposes the question into sub-questions and retrieves for each (sketched below).
- Graph RAG (Microsoft): Build a knowledge graph from the documents; multi-hop reasoning becomes graph traversal.
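A rough sketch of the decomposition approach, reusing the retrieve() helper from Step 4 and the same llm client as before (the prompt wording is illustrative):

def decompose_and_retrieve(question, llm, vector_store, embedder, top_k=3):
    """Break a multi-hop question into sub-questions and retrieve evidence for each."""
    sub_questions = llm.generate(
        f"Break this question into simpler sub-questions, one per line:\n{question}"
    ).splitlines()
    evidence = []
    for sub_q in sub_questions:
        if sub_q.strip():
            evidence.extend(retrieve(sub_q, vector_store, embedder, top_k=top_k))
    return evidence  # pass to rag_generate() together with the original question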
RAG Evaluation
How do you know if your RAG system is working?
RAGAS (RAG Assessment):
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
scores = evaluate(
dataset, # questions, contexts, answers, ground_truths
metrics=[
faithfulness, # Is the answer grounded in the retrieved context?
answer_relevancy, # Is the answer relevant to the question?
context_precision, # Are retrieved docs actually relevant?
]
)
Key metrics:
- Faithfulness: Does the answer only use information from retrieved context? (Hallucination check)
- Answer Relevancy: Does the answer actually address the question?
- Context Recall: Was the relevant information retrieved at all?
- Context Precision: What fraction of retrieved content is actually relevant?
Interview Corner Cases — RAG 🎯
- "When should you use RAG vs. fine-tuning?" → RAG for: dynamic information (changes frequently), very specific private knowledge, needing to cite sources, reducing hallucination on factual queries. Fine-tuning for: changing the model's behavior/style/format, adapting to a specific writing style or response format, domain jargon that appears everywhere in outputs.
- "What is the lost-in-the-middle problem?" → LLMs tend to pay more attention to information at the beginning and end of the context, and less to information in the middle. For RAG with many retrieved documents, critical information in the middle may be ignored. Mitigation: fewer documents, better ordering (most relevant first and last), or fine-tune on "middle-reading" tasks.
- "How do you handle very large documents in RAG?" → Options: (1) Chunk them (standard approach). (2) Use a long-context model (128K context) and put the full document in context — expensive but avoids chunking artifacts. (3) Hierarchical summarization: summarize sections, then summarize summaries.
- "What is RAG hallucination, and is it different from standard LLM hallucination?" → RAG can reduce hallucination but not eliminate it. RAG-specific hallucinations: (1) Retrieved context is retrieved but wrong information is used. (2) Context says X but model generates Y (faithfulness failure). (3) Ambiguous context → model chooses one interpretation. Standard evaluation with faithfulness metrics catches these.
- "What is the difference between RAG and a 'long context' approach?" → Long context: stuff everything into the prompt (requires huge context window, expensive). RAG: selectively retrieve the most relevant information (cheaper but may miss something). Empirically: for very long docs, RAG + long context works better than either alone (retrieve the right sections, put in full context).