Chapter 4 — Pre-training at Scale
Part 1: Data — The Most Underrated Part of LLMs
"Data is the most important thing in training an LLM. Everyone obsesses over architecture, but the real secret sauce is data curation." — Widely agreed upon in the field after Chinchilla and Phi showed quality > quantity
The Data Pipeline: Soup to Nuts
Pre-training data doesn't start as clean text. It starts as the raw internet. Here's what it takes to turn web crawl data into a training corpus:
Raw Web Crawl (petabytes)
↓
1. Language Identification
↓
2. URL/Domain Filtering
↓
3. Text Extraction (remove HTML, boilerplate)
↓
4. Quality Filtering
↓
5. Deduplication
↓
6. Tokenization + Packing
↓
Training-Ready Dataset (~1-15T tokens)
Step 1: Common Crawl — The Raw Material
Most LLMs start from Common Crawl — a nonprofit that crawls the web every month and makes the data publicly available. Since 2008, they've accumulated petabytes of web data.
A monthly CC crawl contains ~3 billion pages and ~100TB of uncompressed text.
Raw CC text is terrible quality: broken HTML, machine-generated spam, porn, malware, repetitive forum threads, scraped book OCR errors, etc. You cannot train a good LLM on raw CC. Every major model does extensive filtering.
Step 2: Language Identification
Use a classifier (typically fastText langid — a tiny, fast language detector) to identify the language of each document.
# fastText language identification
import fasttext

model = fasttext.load_model('lid.176.bin')
labels, probs = model.predict("This is an English sentence.")
# labels → ('__label__en',), probs → array([0.9997])
For English-focused models, you might keep documents with:
- English confidence > 0.65 (the threshold FineWeb uses; since much of the web is not English, this discards a large share of the crawl)
- Or keep all languages and rely on later filtering
Interview corner case 🎯: "If you only keep English text, what's the downside?" — Models train on language patterns. If you want good multilingual performance, you need multilingual pre-training data. But there's a tradeoff: adding more languages with the same compute budget means the model sees less of each language. The "English-first" approach maximizes English performance at the cost of other languages.
Step 3: Quality Filtering
This is the most important step, and where different labs diverge most.
Heuristic Filters (Rule-Based)
def passes_heuristic_filter(doc):
    """
    Apply fast, interpretable quality filters.
    Returns True if the document should be kept.
    """
    words = doc.split()
    chars = list(doc)
    lines = doc.split('\n')
    # 1. Length filter: too short = likely low quality
    if len(words) < 50:
        return False
    # 2. Symbol ratio: too many special chars = spam or code dump
    symbol_chars = sum(1 for c in chars if c in '#@{}[]<>|\\')
    if symbol_chars / max(len(chars), 1) > 0.1:
        return False
    # 3. Digit ratio: too many numbers = tables, boilerplate
    digit_ratio = sum(1 for c in chars if c.isdigit()) / max(len(chars), 1)
    if digit_ratio > 0.15:
        return False
    # 4. Bullet-point ratio: too many bullets = low-prose content
    bullet_lines = sum(1 for line in lines if line.strip().startswith(('•', '-', '*')))
    if bullet_lines / max(len(lines), 1) > 0.9:
        return False
    # 5. Unique word ratio: very low = repetitive spam
    unique_ratio = len(set(words)) / max(len(words), 1)
    if unique_ratio < 0.1:
        return False
    # 6. Perplexity filter (the CCNet pipeline used by LLaMA scores documents
    #    with a KenLM model trained on Wikipedia): keep documents with
    #    perplexity in a band like [60, 10000], excluding both too-simple and
    #    garbled text. Requires a trained reference model; see the sketch below.
    return True
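Filter #6 needs that reference model. A minimal sketch of a perplexity filter using a KenLM n-gram model, assuming you have trained one on a clean corpus such as Wikipedia ('wiki.arpa' is a placeholder path, and the [60, 10000] band is illustrative):

import kenlm

# Assumes a KenLM model trained on a clean reference corpus (e.g., Wikipedia).
# 'wiki.arpa' is a placeholder path, not a distributed artifact.
model = kenlm.Model('wiki.arpa')

def passes_perplexity_filter(doc, lo=60, hi=10_000):
    """Keep documents in a 'normal' perplexity band under the reference model."""
    ppl = model.perplexity(doc)
    # Too low = trivially simple/repetitive text; too high = garbled text
    return lo <= ppl <= hi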
Classifier-Based Quality Filtering (GPT-3, LLaMA, etc.)
GPT-3's approach: Train a classifier with high-quality corpora (WebText, Wikipedia, books) as the positive set and random Common Crawl as the negative set, then keep documents the classifier scores as high quality. (C4, often mentioned here, actually relied on heuristics such as bad-word lists and boilerplate rules rather than a learned classifier.)
LLaMA 1's approach: Train a linear classifier using fastText, with pages referenced by Wikipedia articles as the positive set and random CC as the negative set. Keep documents with score > threshold.
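A minimal sketch of this kind of classifier with fastText, assuming you have written positive and negative examples into quality_train.txt in fastText's __label__ format (the file name and label names are made up for illustration):

import fasttext

# quality_train.txt lines look like:
#   __label__hq <text drawn from the positive set, e.g., Wikipedia references>
#   __label__lq <text drawn from random Common Crawl>
model = fasttext.train_supervised(input='quality_train.txt')

def passes_quality_classifier(doc, threshold=0.9):
    # fastText predicts on a single line, so strip newlines first
    labels, probs = model.predict(doc.replace('\n', ' '))
    return labels[0] == '__label__hq' and probs[0] > threshold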
Gopher/Chinchilla filtering rules (DeepMind):
- Document must have between 50 and 100,000 words
- Mean word length between 3-10 characters
- <30% of lines ending with ellipsis
- At least 80% of words containing at least one alphabetic character
Step 4: Deduplication — The Hidden Performance Killer
Training on duplicated data is worse than training on a smaller, deduplicated corpus. Duplicates cause:
- Memorization over generalization: The model learns to reproduce duplicates perfectly instead of generalizing
- Evaluation contamination: If your evaluation data appears in training data (exact matches), benchmark scores are inflated
- Wasted compute: Why train twice on the same document?
MinHash LSH Deduplication
Used by LLaMA, ROOTS, RedPajama.
# Fuzzy deduplication: find near-duplicate documents
from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    """Create a MinHash signature for a document."""
    m = MinHash(num_perm=num_perm)
    # Hash all word 5-grams from the document
    words = text.lower().split()
    for i in range(len(words) - 5 + 1):
        ngram = ' '.join(words[i:i+5])
        m.update(ngram.encode('utf-8'))
    return m

# Build the LSH index
lsh = MinHashLSH(threshold=0.7, num_perm=128)  # estimated Jaccard ≥ 0.7 → duplicate

# Process all documents; ids that land in `seen` are near-duplicates to drop
seen = set()
for doc_id, doc in enumerate(documents):
    m = get_minhash(doc)
    if lsh.query(m):  # near-duplicate of an already-indexed document
        seen.add(doc_id)
    else:
        lsh.insert(f"doc_{doc_id}", m)
Exact deduplication: Use MD5/SHA hashes of document (or paragraph) content. Cheaper than MinHash but misses near-duplicates.
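A minimal exact-dedup sketch (MD5 over lightly normalized text; any stable hash works):

import hashlib

def exact_dedup(documents):
    """Drop byte-identical documents after light normalization."""
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.md5(doc.strip().lower().encode('utf-8')).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept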
The scale: LLaMA 1 started with 5T tokens of raw CC and ended with ~1T after filtering and deduplication — 80% was removed!
Step 5: Dataset Composition — Mixing Sources
Pre-training data is not just web text. Most modern LLMs train on a mixture; the shares below are approximate, and a sampling sketch follows the table:
| Source | LLaMA 1 | Falcon | RedPajama |
|---|---|---|---|
| Common Crawl (web) | 67% | 80% | 67% |
| GitHub (code) | 8% | 5% | 4% |
| Wikipedia | 4% | 3% | 4% |
| Books | 4% | - | 5% |
| ArXiv | 2.5% | - | 5% |
| StackExchange | 2% | 2% | 2% |
| Other | 12.5% | 10% | 13% |
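In code, a mixture like this usually becomes weighted sampling over per-source document streams. A minimal sketch (the weights are illustrative, not any lab's published mixture):

import random

# Illustrative mixture weights; real pipelines also control epochs per source
# (e.g., Wikipedia is often repeated, raw web data usually is not).
MIXTURE = {
    'common_crawl': 0.67,
    'github': 0.08,
    'wikipedia': 0.04,
    'books': 0.04,
}

def sample_source(mixture=MIXTURE):
    """Pick which source the next training document comes from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]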
Why code? Training on code dramatically improves reasoning and structured output generation, even for non-coding tasks. The structured, logical nature of code seems to improve the model's ability to reason step-by-step.
Why Wikipedia? High-quality, fact-dense, encyclopedic text. Even though it's a tiny fraction of the web, it's heavily upsampled.
Interview corner case 🎯: "Why does training on code improve math and reasoning, even for a model that's never asked about code?" — Code is fundamentally about explicit step-by-step reasoning. "If X then Y, else Z" is a formal reasoning pattern. Following code execution requires tracking state. Mathematical proofs look like code. The model learns structured logical reasoning from code that transfers to math and analytical tasks.
Step 6: Tokenization and Packing
Once you have clean text, you tokenize it and pack tokens into fixed-length sequences for efficient training.
def pack_sequences(tokenized_docs, seq_len=2048, sep_token=2):
    """
    Pack multiple tokenized documents into fixed-length sequences.
    Adds an EOS token between documents.
    No padding: sequences are packed densely.
    """
    buffer = []
    packed_sequences = []
    for doc_tokens in tokenized_docs:
        # Append the document plus an end-of-sequence token
        buffer.extend(doc_tokens + [sep_token])
        # Emit full sequences
        while len(buffer) >= seq_len:
            packed_sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed_sequences
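A toy trace of the function above:

# Toy usage: two short "documents" packed into length-8 sequences
docs = [[5, 6, 7], [8, 9, 10, 11, 12]]
print(pack_sequences(docs, seq_len=8))
# → [[5, 6, 7, 2, 8, 9, 10, 11]]  (the leftover [12, 2] stays in the buffer)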
Why pack instead of pad? Padding wastes compute — you're running attention over PAD tokens that contribute nothing. Dense packing means 100% of compute is on real tokens.
The EOS token: Separates documents in the packed sequence. The model learns that an EOS token means "the previous document ended, the next document starts". Note that with naive packing, attention can still flow across document boundaries; the EOS token is only a learned signal. Some training setups therefore add a document-level attention mask so tokens in document B cannot attend to document A at all (a sketch follows).
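A minimal sketch of such a document-level mask in NumPy, assuming EOS id 2 as in the packing code above (frameworks implement this in various ways; this only shows the rule):

import numpy as np

def document_causal_mask(tokens, eos_id=2):
    """True where attention is allowed: causal AND within the same document."""
    n = len(tokens)
    # Document id increments at each position after an EOS token
    doc_ids = np.cumsum([0] + [1 if t == eos_id else 0 for t in tokens[:-1]])
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    return same_doc & causal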
Open Datasets You Can Use Right Now
| Dataset | Size | Quality | Use Case |
|---|---|---|---|
| The Pile (EleutherAI) | 825GB | Good | General pretraining |
| RedPajama | 1.2T tokens | Good | LLaMA recreation |
| ROOTS | 1.6TB | Good | Multilingual |
| Dolma (Allen AI) | 3T tokens | Very good | Best open option |
| FineWeb | 15T tokens | Excellent | Best for English |
| StarCoder | 250B tokens | Excellent | Code training |
| TinyStories | 2GB | Great for learning | Small model training |
For your experiments, start with TinyStories: 2M+ synthetic children's stories (~2GB of raw text), perfect for training a 10M-100M parameter model on a single GPU in a few hours.
# Download TinyStories
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
Data Flywheel and Synthetic Data
A growing trend (2023–2025): using strong models to generate synthetic training data for weaker models.
Alpaca: Used text-davinci-003 to generate 52,000 instruction-following examples, bootstrapped from 175 human-written seed tasks. Cost: ~$500. Quality: surprisingly good.
Phi-1: Used GPT-4 to generate Python programming exercises with explanations ("textbook quality" code). A 1.3B model trained on this outperformed 7B models trained on raw web code.
WizardLM: Uses "Evol-Instruct" — iteratively asks ChatGPT to make instructions more complex. Each step: "Make this instruction more complex by adding constraints / deepen the difficulty level / add a twist."
Risks of synthetic data:
- Model collapse: training on model-generated data → model that generates more of the same → capability degradation over generations
- Hallucination propagation: if the teacher model hallucinates facts, the student model learns them
- Distribution narrowing: the student learns only what the teacher knows well
Interview corner case 🎯: "What is model collapse and why is it a concern?" — If you train generation N+1 on outputs from generation N, quality degrades over generations. The distribution narrows — the model memorizes what's common in the synthetic data and loses diversity. This is analogous to photocopying a photocopy repeatedly. The field is studying this problem for the "open internet data is running out" scenario.