Chapter 7 — Advanced Topics
Part 1: RAG — Retrieval-Augmented Generation
RAG is one of the most widely deployed LLM techniques in production. When the answer isn't in the model's weights, RAG fetches it from external knowledge. This part builds the full picture, from chunking strategy to production gotchas.
Why LLMs Hallucinate and Why RAG Helps
LLMs encode world knowledge in their weights during training. But this creates two problems:
- Stale knowledge: Training data has a cutoff (GPT-4's is in 2023, depending on the version). Ask it about more recent events → hallucination or refusal.
- Domain knowledge gaps: Your company's internal documentation, your medical records, your codebase — none of these were in training data. The model can't know them.
RAG solves this by giving the model access to a searchable knowledge base at inference time:
User question
↓
[Retrieve relevant documents from knowledge base]
↓
Inject documents into the prompt as context
↓
LLM generates answer based on retrieved context
The LLM now reasons over retrieved facts rather than its memorized knowledge: hallucination drops substantially, and answers stay as current as the knowledge base.
The RAG Pipeline: Every Component
The RAG pipeline has two phases: an offline indexing phase (done once) and an online retrieval phase (done per query).
flowchart TD
subgraph Offline["Offline — Build the Index (run once)"]
D[Raw Documents] --> CH[Chunk into passages]
CH --> EM[Embed with embedding model]
EM --> VS[(Vector Store)]
end
subgraph Online["Online — Answer a Query (per request)"]
Q([User Query]) --> QE[Embed query]
QE --> RET[Retrieve top-k by similarity]
VS --> RET
RET --> CTX[Augment prompt with chunks]
CTX --> LLM[LLM generates grounded answer]
LLM --> ANS([Response])
end
style Offline fill:#fafaf9,stroke:#e7e5e4
style Online fill:#fafaf9,stroke:#e7e5e4
style VS fill:#f5f5f4,stroke:#a8a29e
Step 1: Document Processing (Offline)
from langchain.text_splitter import RecursiveCharacterTextSplitter
def process_documents(documents):
"""
Split documents into chunks for embedding.
Chunking strategy matters enormously:
- Too small: each chunk lacks context
- Too large: relevant info is diluted; hits context limits
- Overlap: ensures information at chunk boundaries isn't missed
"""
splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # Characters per chunk (experiment with this!)
chunk_overlap=64, # Characters of overlap between adjacent chunks
separators=["\n\n", "\n", ". ", " ", ""], # Split order of preference
)
return splitter.split_documents(documents)
Chunking strategies:
- Fixed size: Simple, predictable. E.g., chunk_size=512 tokens with overlap.
- Sentence-based: Split on sentence boundaries. More semantically coherent (see the sketch after this list).
- Semantic chunking: Use embeddings to find natural topic boundaries. Best quality, slowest.
- Hierarchical: Store both sentence-level and paragraph-level chunks. Retrieve fine, return coarse.
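Fixed-size chunking is what the RecursiveCharacterTextSplitter above implements. As a rough, self-contained sketch of the sentence-based strategy (the regex splitter and the max_chars budget are simplifications, not production-grade sentence detection):

import re

def sentence_chunks(text, max_chars=512):
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split on ., !, ? followed by whitespace (illustrative only)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks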
Step 2: Embedding (Offline)
Convert each chunk to a dense vector using an embedding model:
from sentence_transformers import SentenceTransformer
import numpy as np
# Load an embedding model
# Strong open-source options: bge-large-en-v1.5, e5-large-v2, nomic-embed-text
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
def embed_chunks(chunks):
texts = [chunk.page_content for chunk in chunks]
# Output: (n_chunks, embedding_dim) — e.g., (1000, 1024)
embeddings = embedder.encode(texts, normalize_embeddings=True)
return embeddings
Key embedding models:
| Model | Dim | Context | Quality |
|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 512 tokens | Excellent |
| text-embedding-3-small (OpenAI) | 1536 | 8191 tokens | Very good, paid |
| nomic-embed-text | 768 | 8192 tokens | Good, open |
| e5-large-v2 | 1024 | 512 tokens | Very good |
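Because the chunks are encoded with normalize_embeddings=True, cosine similarity between a query and a chunk reduces to a plain dot product, which is exactly what the inner-product FAISS index in Step 3 computes. A quick sanity check, reusing the embedder loaded above (the example sentences are arbitrary; bge models also support an optional query instruction prefix, see the model card):

query_emb = embedder.encode(["How do I reset my password?"], normalize_embeddings=True)
chunk_embs = embedder.encode(
    ["Navigate to Settings > Security to change your login credentials.",
     "Our office is closed on public holidays."],
    normalize_embeddings=True,
)
# With unit-length vectors, dot product == cosine similarity
similarities = chunk_embs @ query_emb[0]  # shape: (n_chunks,)
print(similarities)  # the security chunk should score higher than the holidays one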
Step 3: Vector Store (Offline)
Store embeddings in a vector database that supports similarity search:
import faiss
import numpy as np
class SimpleVectorStore:
def __init__(self, dimension):
# FAISS: Facebook AI Similarity Search
# IndexFlatIP: exact inner product search (for normalized vectors = cosine similarity)
self.index = faiss.IndexFlatIP(dimension)
self.chunks = []
def add(self, chunks, embeddings):
"""Add chunks and their embeddings to the store."""
self.index.add(embeddings.astype(np.float32))
self.chunks.extend(chunks)
def search(self, query_embedding, top_k=5):
"""Find top_k most similar chunks to the query."""
scores, indices = self.index.search(
query_embedding.reshape(1, -1).astype(np.float32),
top_k
)
results = [(self.chunks[i], scores[0][j])
for j, i in enumerate(indices[0]) if i >= 0]
return results
Vector databases for production:
| Database | Hosted | Open Source | Best For |
|---|---|---|---|
| Chroma | Local | ✅ | Dev/prototyping |
| FAISS | Local | ✅ | In-memory similarity search (library, not a full DB) |
| Pinecone | Cloud | ❌ | Managed, scalable |
| Weaviate | Both | ✅ | GraphQL, filters |
| Qdrant | Both | ✅ | Fast, Rust-based |
| pgvector | Self-host | ✅ | Postgres extension, familiar |
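For comparison with the hand-rolled FAISS store above, here is roughly the same index in Chroma. This is a sketch: the collection name and ids are made up, and texts, embeddings, and query_embedding are assumed to come from the earlier steps.

import chromadb

client = chromadb.Client()  # in-memory; chromadb.PersistentClient(path=...) persists to disk
collection = client.create_collection("docs")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(texts))],
    documents=texts,                 # raw chunk texts
    embeddings=embeddings.tolist(),  # precomputed embeddings from Step 2
)

results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=5)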
Step 4: Query & Retrieve (Online)
def retrieve(query, vector_store, embedder, top_k=5):
"""Embed query and retrieve most relevant chunks."""
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
results = vector_store.search(query_embedding, top_k=top_k)
return results
Step 5: Generate with Context (Online)
def rag_generate(query, retrieved_chunks, llm):
"""Build prompt with retrieved context and generate answer."""
# Format retrieved chunks as context
context = "\n\n".join([
f"[Document {i+1}]\n{chunk.page_content}"
for i, (chunk, score) in enumerate(retrieved_chunks)
])
# Prompt engineering matters here
prompt = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have that information."
Context:
{context}
Question: {query}
Answer:"""
return llm.generate(prompt)
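Putting the pieces together, the whole pipeline is only a handful of calls. Here documents is whatever corpus you loaded, and llm is any client exposing the generate(prompt) method assumed above:

# Offline: build the index once
chunks = process_documents(documents)
embeddings = embed_chunks(chunks)
store = SimpleVectorStore(dimension=embeddings.shape[1])
store.add(chunks, embeddings)

# Online: answer a query
query = "How do I reset my password?"
retrieved = retrieve(query, store, embedder, top_k=5)
answer = rag_generate(query, retrieved, llm)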
Advanced RAG: Beyond the Basics
Problem 1: The Query-Document Mismatch
The embedding model embeds the question: "How do I reset my password?" The document chunk: "To change your login credentials, navigate to Settings > Security."
These may not be similar enough despite being semantically related. The question talks about "reset password" and the document talks about "login credentials."
Solution: HyDE (Hypothetical Document Embeddings)
def hyde_retrieve(query, llm, vector_store, embedder):
"""
Instead of embedding the query directly,
generate a hypothetical answer and embed that.
"""
hypothetical_answer = llm.generate(
f"Write a short passage that answers: {query}"
)
    # Embed the generated answer (closer to document style); normalize to match the index
    query_embedding = embedder.encode([hypothetical_answer], normalize_embeddings=True)[0]
return vector_store.search(query_embedding)
Problem 2: Chunking Loses Context
A chunk might reference "the algorithm described above" but the chunk doesn't include what was described above.
Solution: Parent Document Retrieval
- Store small chunks for precise retrieval (embedding)
- But return the larger parent document as context
- This way retrieval is precise but context is complete
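A minimal sketch of the pattern, assuming each small chunk carries a parent_id in its metadata and parent_docs maps those ids to the full parent texts (both are illustrative, not a fixed API):

def retrieve_with_parents(query, vector_store, embedder, parent_docs, top_k=5):
    """Search over small chunks, but return full parent documents as context."""
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
    results = vector_store.search(query_embedding, top_k=top_k)
    # Deduplicate: several small chunks may point to the same parent document
    parent_ids = []
    for chunk, _score in results:
        pid = chunk.metadata["parent_id"]
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parent_docs[pid] for pid in parent_ids]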
Problem 3: First Retrieved ≠ Best
A dense vector search retrieves by semantic similarity. But sometimes lexically similar documents (keyword matching) are more relevant. "GPT-4" matches "GPT-4 architecture" perfectly but might not score highest on semantic embedding.
Solution: Hybrid Search + Reranking
from sentence_transformers import CrossEncoder

def hybrid_search(query, vector_store, bm25_index, corpus, embedder, top_k=20, final_k=5):
    """
    Combine dense (semantic) and sparse (keyword) retrieval with
    Reciprocal Rank Fusion (RRF), then rerank with a cross-encoder.
    Assumes corpus is the list of chunk texts and bm25_index is a
    rank_bm25.BM25Okapi built over the tokenized corpus.
    """
    # Dense retrieval (embed the query as in Step 4; convert results to plain text)
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
    dense_results = [chunk.page_content for chunk, _ in
                     vector_store.search(query_embedding, top_k=top_k)]
    # Sparse retrieval (BM25 keyword matching)
    bm25_results = bm25_index.get_top_n(query.split(), corpus, n=top_k)
    # Combine with Reciprocal Rank Fusion: score(doc) = sum of 1 / (60 + rank) over both lists
    rrf_scores = {}
    for results in (dense_results, bm25_results):
        for rank, doc in enumerate(results):
            rrf_scores[doc] = rrf_scores.get(doc, 0.0) + 1.0 / (60 + rank + 1)
    combined_results = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
    # Rerank with a cross-encoder (much more accurate, but slower)
    reranker = CrossEncoder("BAAI/bge-reranker-large")
    scores = reranker.predict([(query, doc) for doc in combined_results])
    # Return top final_k by reranking score
    reranked = sorted(zip(combined_results, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:final_k]]
Cross-encoder vs. bi-encoder:
- Bi-encoder (standard embedding): embeds query and documents separately, uses cosine similarity. Fast at retrieval, less accurate.
- Cross-encoder (reranker): takes (query, document) pair and predicts relevance score jointly. Slow (O(n) per query), but much more accurate. Use for reranking top-k.
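The difference is easiest to see side by side. A small sketch with the models named above (the candidate passages are arbitrary examples):

from sentence_transformers import SentenceTransformer, CrossEncoder

query = "How do I reset my password?"
passages = [
    "Navigate to Settings > Security to change your login credentials.",
    "Our office is closed on public holidays.",
]

# Bi-encoder: embed query and passages independently, compare with a dot product
bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
q_emb = bi_encoder.encode([query], normalize_embeddings=True)[0]
p_embs = bi_encoder.encode(passages, normalize_embeddings=True)
bi_scores = p_embs @ q_emb  # fast: passage embeddings can be precomputed offline

# Cross-encoder: score each (query, passage) pair jointly in one forward pass
cross_encoder = CrossEncoder("BAAI/bge-reranker-large")
cross_scores = cross_encoder.predict([(query, p) for p in passages])  # slower, more accurate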
Problem 4: Multi-Hop Reasoning
Question: "What is the capital of the country where Einstein was born?"
Step 1: Find where Einstein was born → Germany
Step 2: Find the capital of Germany → Berlin
Single-step RAG can't handle this. Solutions:
- Iterative RAG: Run retrieval, look at the results, and decide whether more retrieval is needed.
- Decomposition: The LLM decomposes the question into sub-questions and retrieves for each (sketched below).
- Graph RAG (Microsoft): Build a knowledge graph from the documents; multi-hop reasoning becomes graph traversal.
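A rough sketch of the decomposition approach, reusing the retrieve() helper from Step 4 and the same llm client as before (the prompt wording is illustrative):

def decompose_and_retrieve(question, llm, vector_store, embedder, top_k=3):
    """Break a multi-hop question into sub-questions and retrieve evidence for each."""
    sub_questions = llm.generate(
        f"Break this question into simpler sub-questions, one per line:\n{question}"
    ).splitlines()
    evidence = []
    for sub_q in sub_questions:
        if sub_q.strip():
            evidence.extend(retrieve(sub_q, vector_store, embedder, top_k=top_k))
    return evidence  # pass to rag_generate() together with the original question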
RAG Evaluation
How do you know if your RAG system is working?
RAGAS (RAG Assessment):
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
scores = evaluate(
dataset, # questions, contexts, answers, ground_truths
metrics=[
faithfulness, # Is the answer grounded in the retrieved context?
answer_relevancy, # Is the answer relevant to the question?
context_precision, # Are retrieved docs actually relevant?
]
)
Key metrics:
- Faithfulness: Does the answer only use information from retrieved context? (Hallucination check)
- Answer Relevancy: Does the answer actually address the question?
- Context Recall: Was the relevant information retrieved at all?
- Context Precision: What fraction of retrieved content is actually relevant?
Interview Corner Cases — RAG 🎯
- "When should you use RAG vs. fine-tuning?" → RAG for: dynamic information (changes frequently), very specific private knowledge, needing to cite sources, reducing hallucination on factual queries. Fine-tuning for: changing the model's behavior/style/format, adapting to a specific writing style or response format, domain jargon that appears everywhere in outputs.
- "What is the lost-in-the-middle problem?" → LLMs tend to pay more attention to information at the beginning and end of the context, and less to information in the middle. For RAG with many retrieved documents, critical information in the middle may be ignored. Mitigation: fewer documents, better ordering (most relevant first and last), or fine-tune on "middle-reading" tasks.
- "How do you handle very large documents in RAG?" → Options: (1) Chunk them (standard approach). (2) Use a long-context model (128K context) and put the full document in context — expensive but avoids chunking artifacts. (3) Hierarchical summarization: summarize sections, then summarize summaries.
- "What is RAG hallucination, and is it different from standard LLM hallucination?" → RAG can reduce hallucination but not eliminate it. RAG-specific hallucinations: (1) Retrieved context is retrieved but wrong information is used. (2) Context says X but model generates Y (faithfulness failure). (3) Ambiguous context → model chooses one interpretation. Standard evaluation with faithfulness metrics catches these.
- "What is the difference between RAG and a 'long context' approach?" → Long context: stuff everything into the prompt (requires huge context window, expensive). RAG: selectively retrieve the most relevant information (cheaper but may miss something). Empirically: for very long docs, RAG + long context works better than either alone (retrieve the right sections, put in full context).