Chapter 4 — Pre-training at Scale
Part 1: Data — The Most Underrated Part of LLMs
"Data is the most important thing in training an LLM. Everyone obsesses over architecture, but the real secret sauce is data curation." — Widely agreed upon in the field after Chinchilla and Phi showed quality > quantity
The Data Pipeline: Soup to Nuts
Pre-training data doesn't start as clean text. It starts as the raw internet. Here's what it takes to turn web crawl data into a training corpus:
Raw Web Crawl (petabytes)
↓
1. Language Identification
↓
2. URL/Domain Filtering
↓
3. Text Extraction (remove HTML, boilerplate)
↓
4. Quality Filtering
↓
5. Deduplication
↓
6. Tokenization + Packing
↓
Training-Ready Dataset (~1-15T tokens)
Step 1: Common Crawl — The Raw Material
Most LLMs start from Common Crawl — a nonprofit that crawls the web every month and makes the data publicly available. Since 2008, they've accumulated petabytes of web data.
A monthly CC crawl contains ~3 billion pages and ~100TB of uncompressed text.
Raw CC text is terrible quality: broken HTML, machine-generated spam, porn, malware, repetitive forum threads, scraped book OCR errors, etc. You cannot train a good LLM on raw CC. Every major model does extensive filtering.
Step 2: Language Identification
Use a classifier (typically fastText langid — a tiny, fast language detector) to identify the language of each document.
# fastText language identification
import fasttext

model = fasttext.load_model('lid.176.bin')
labels, probs = model.predict("This is an English sentence.")
# labels → ('__label__en',), probs → array([0.9997])
For English-focused models, you might keep documents with:
- English confidence > 0.65 (the threshold FineWeb uses; since much of the web is not English, this discards a large share of the crawl)
- Or keep all languages and rely on later filtering
Interview corner case 🎯: "If you only keep English text, what's the downside?" — Models train on language patterns. If you want good multilingual performance, you need multilingual pre-training data. But there's a tradeoff: adding more languages with the same compute budget means the model sees less of each language. The "English-first" approach maximizes English performance at the cost of other languages.
Step 3: Quality Filtering
This is the most important step, and where different labs diverge most.
Heuristic Filters (Rule-Based)
def passes_heuristic_filter(doc):
    """
    Apply fast, interpretable quality filters.
    Returns True if the document should be kept.
    """
    words = doc.split()
    chars = list(doc)
    lines = doc.split('\n')
    # 1. Length filter: too short = likely low quality
    if len(words) < 50:
        return False
    # 2. Symbol ratio: too many special chars = spam or code dump
    symbol_chars = sum(1 for c in chars if c in '#@{}[]<>|\\')
    if symbol_chars / max(len(chars), 1) > 0.1:
        return False
    # 3. Digit ratio: too many numbers = tables, boilerplate
    digit_ratio = sum(1 for c in chars if c.isdigit()) / max(len(chars), 1)
    if digit_ratio > 0.15:
        return False
    # 4. Bullet-point ratio: too many bullets = low-prose content
    bullet_lines = sum(1 for line in lines if line.strip().startswith(('•', '-', '*')))
    if bullet_lines / max(len(lines), 1) > 0.9:
        return False
    # 5. Unique word ratio: very low = repetitive spam
    unique_ratio = len(set(words)) / max(len(words), 1)
    if unique_ratio < 0.1:
        return False
    # 6. Perplexity filter (the CCNet pipeline used by LLaMA scores documents
    #    with a KenLM model trained on Wikipedia): keep documents with
    #    perplexity in a band like [60, 10000], excluding both too-simple and
    #    garbled text. Requires a trained reference model; see the sketch below.
    return True
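Filter #6 needs that reference model. A minimal sketch of a perplexity filter using a KenLM n-gram model, assuming you have trained one on a clean corpus such as Wikipedia ('wiki.arpa' is a placeholder path, and the [60, 10000] band is illustrative):

import kenlm

# Assumes a KenLM model trained on a clean reference corpus (e.g., Wikipedia).
# 'wiki.arpa' is a placeholder path, not a distributed artifact.
model = kenlm.Model('wiki.arpa')

def passes_perplexity_filter(doc, lo=60, hi=10_000):
    """Keep documents in a 'normal' perplexity band under the reference model."""
    ppl = model.perplexity(doc)
    # Too low = trivially simple/repetitive text; too high = garbled text
    return lo <= ppl <= hi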
Classifier-Based Quality Filtering (GPT-3, LLaMA, etc.)
GPT-3's approach: Train a classifier with high-quality corpora (WebText, Wikipedia, books) as the positive set and random Common Crawl as the negative set, then keep documents the classifier scores as high quality. (C4, often mentioned here, actually relied on heuristics such as bad-word lists and boilerplate rules rather than a learned classifier.)
LLaMA 1's approach: Train a linear classifier using fastText, with pages referenced by Wikipedia articles as the positive set and random CC as the negative set. Keep documents with score > threshold.
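A minimal sketch of this kind of classifier with fastText, assuming you have written positive and negative examples into quality_train.txt in fastText's __label__ format (the file name and label names are made up for illustration):

import fasttext

# quality_train.txt lines look like:
#   __label__hq <text drawn from the positive set, e.g., Wikipedia references>
#   __label__lq <text drawn from random Common Crawl>
model = fasttext.train_supervised(input='quality_train.txt')

def passes_quality_classifier(doc, threshold=0.9):
    # fastText predicts on a single line, so strip newlines first
    labels, probs = model.predict(doc.replace('\n', ' '))
    return labels[0] == '__label__hq' and probs[0] > threshold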
Gopher/Chinchilla filtering rules (DeepMind):
- Document must have between 50 and 100,000 words
- Mean word length between 3-10 characters
- <30% of lines ending with ellipsis
- At least 80% of words containing at least one alphabetic character
Step 4: Deduplication — The Hidden Performance Killer
Training on duplicated data is worse than training on a smaller, deduplicated corpus. Duplicates cause:
- Memorization over generalization: The model learns to reproduce duplicates perfectly instead of generalizing
- Evaluation contamination: If your evaluation data appears in training data (exact matches), benchmark scores are inflated
- Wasted compute: Why train twice on the same document?
MinHash LSH Deduplication
Used by LLaMA, ROOTS, RedPajama.
# Fuzzy deduplication: find near-duplicate documents
from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    """Create a MinHash signature for a document."""
    m = MinHash(num_perm=num_perm)
    # Hash all word 5-grams from the document
    words = text.lower().split()
    for i in range(len(words) - 5 + 1):
        ngram = ' '.join(words[i:i+5])
        m.update(ngram.encode('utf-8'))
    return m

# Build the LSH index
lsh = MinHashLSH(threshold=0.7, num_perm=128)  # estimated Jaccard ≥ 0.7 → duplicate

# Process all documents; ids that land in `seen` are near-duplicates to drop
seen = set()
for doc_id, doc in enumerate(documents):
    m = get_minhash(doc)
    if lsh.query(m):  # near-duplicate of an already-indexed document
        seen.add(doc_id)
    else:
        lsh.insert(f"doc_{doc_id}", m)
Exact deduplication: Use MD5/SHA hashes of document (or paragraph) content. Cheaper than MinHash but misses near-duplicates.
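A minimal exact-dedup sketch (MD5 over lightly normalized text; any stable hash works):

import hashlib

def exact_dedup(documents):
    """Drop byte-identical documents after light normalization."""
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.md5(doc.strip().lower().encode('utf-8')).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept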
The scale: LLaMA 1 started with 5T tokens of raw CC and ended with ~1T after filtering and deduplication — 80% was removed!
Step 5: Dataset Composition — Mixing Sources
Pre-training data is not just web text. Most modern LLMs train on a mixture; the shares below are approximate, and a sampling sketch follows the table:
| Source | LLaMA 1 | Falcon | RedPajama |
|---|---|---|---|
| Common Crawl (web) | 67% | 80% | 67% |
| GitHub (code) | 8% | 5% | 4% |
| Wikipedia | 4% | 3% | 4% |
| Books | 4% | - | 5% |
| ArXiv | 2.5% | - | 5% |
| StackExchange | 2% | 2% | 2% |
| Other | 12.5% | 10% | 13% |
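In code, a mixture like this usually becomes weighted sampling over per-source document streams. A minimal sketch (the weights are illustrative, not any lab's published mixture):

import random

# Illustrative mixture weights; real pipelines also control epochs per source
# (e.g., Wikipedia is often repeated, raw web data usually is not).
MIXTURE = {
    'common_crawl': 0.67,
    'github': 0.08,
    'wikipedia': 0.04,
    'books': 0.04,
}

def sample_source(mixture=MIXTURE):
    """Pick which source the next training document comes from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]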
Why code? Training on code dramatically improves reasoning and structured output generation, even for non-coding tasks. The structured, logical nature of code seems to improve the model's ability to reason step-by-step.
Why Wikipedia? High-quality, fact-dense, encyclopedic text. Even though it's a tiny fraction of the web, it's heavily upsampled.
Interview corner case 🎯: "Why does training on code improve math and reasoning, even for a model that's never asked about code?" — Code is fundamentally about explicit step-by-step reasoning. "If X then Y, else Z" is a formal reasoning pattern. Following code execution requires tracking state. Mathematical proofs look like code. The model learns structured logical reasoning from code that transfers to math and analytical tasks.
Step 6: Tokenization and Packing
Once you have clean text, you tokenize it and pack tokens into fixed-length sequences for efficient training.
def pack_sequences(tokenized_docs, seq_len=2048, sep_token=2):
    """
    Pack multiple tokenized documents into fixed-length sequences.
    Adds an EOS token between documents.
    No padding: sequences are packed densely.
    """
    buffer = []
    packed_sequences = []
    for doc_tokens in tokenized_docs:
        # Append the document plus an end-of-sequence token
        buffer.extend(doc_tokens + [sep_token])
        # Emit full sequences
        while len(buffer) >= seq_len:
            packed_sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed_sequences
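A toy trace of the function above:

# Toy usage: two short "documents" packed into length-8 sequences
docs = [[5, 6, 7], [8, 9, 10, 11, 12]]
print(pack_sequences(docs, seq_len=8))
# → [[5, 6, 7, 2, 8, 9, 10, 11]]  (the leftover [12, 2] stays in the buffer)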
Why pack instead of pad? Padding wastes compute — you're running attention over PAD tokens that contribute nothing. Dense packing means 100% of compute is on real tokens.
The EOS token: Separates documents in the packed sequence. The model learns that an EOS token means "the previous document ended, the next document starts". Note that with naive packing, attention can still flow across document boundaries; the EOS token is only a learned signal. Some training setups therefore add a document-level attention mask so tokens in document B cannot attend to document A at all (a sketch follows).
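A minimal sketch of such a document-level mask in NumPy, assuming EOS id 2 as in the packing code above (frameworks implement this in various ways; this only shows the rule):

import numpy as np

def document_causal_mask(tokens, eos_id=2):
    """True where attention is allowed: causal AND within the same document."""
    n = len(tokens)
    # Document id increments at each position after an EOS token
    doc_ids = np.cumsum([0] + [1 if t == eos_id else 0 for t in tokens[:-1]])
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    return same_doc & causal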
Open Datasets You Can Use Right Now
| Dataset | Size | Quality | Use Case |
|---|---|---|---|
| The Pile (EleutherAI) | 825GB | Good | General pretraining |
| RedPajama | 1.2T tokens | Good | LLaMA recreation |
| ROOTS | 1.6TB | Good | Multilingual |
| Dolma (Allen AI) | 3T tokens | Very good | Best open option |
| FineWeb | 15T tokens | Excellent | Best for English |
| StarCoder | 250B tokens | Excellent | Code training |
| TinyStories | 2GB | Great for learning | Small model training |
For your experiments, start with TinyStories: 2M+ synthetic children's stories (~2GB of raw text), perfect for training a 10M-100M parameter model on a single GPU in a few hours.
# Download TinyStories
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
Data Flywheel and Synthetic Data
A growing trend (2023–2025): using strong models to generate synthetic training data for weaker models.
Alpaca: Used text-davinci-003 to generate 52,000 instruction-following examples, bootstrapped from 175 human-written seed tasks. Cost: ~$500. Quality: surprisingly good.
Phi-1: Used GPT-4 to generate Python programming exercises with explanations ("textbook quality" code). A 1.3B model trained on this outperformed 7B models trained on raw web code.
WizardLM: Uses "Evol-Instruct" — iteratively asks ChatGPT to make instructions more complex. Each step: "Make this instruction more complex by adding constraints / deepen the difficulty level / add a twist."
Risks of synthetic data:
- Model collapse: training on model-generated data → model that generates more of the same → capability degradation over generations
- Hallucination propagation: if the teacher model hallucinates facts, the student model learns them
- Distribution narrowing: the student learns only what the teacher knows well
Interview corner case 🎯: "What is model collapse and why is it a concern?" — If you train generation N+1 on outputs from generation N, quality degrades over generations. The distribution narrows — the model memorizes what's common in the synthetic data and loses diversity. This is analogous to photocopying a photocopy repeatedly. The field is studying this problem for the "open internet data is running out" scenario.