Chapter 0 — A Brief History of Language Models

From Counting Words to Models That Write Like Humans

Read time: ~45 minutes. Purpose: build the "why" before the "how". Every architectural choice in modern LLMs makes sense when you understand what problem it was solving.

The Central Problem: What Comes Next?

Every language model — from the most primitive to GPT-4 — is trying to answer the same question:

Given what I've seen so far, what word (or token) comes next?

That's it. Everything else is engineering.

If you can predict the next word well enough, across enough contexts, you've implicitly learned grammar, facts, reasoning, style, and meaning. The magic of LLMs is that this single objective, applied at enormous scale, produces something that looks like understanding.

Let's trace how we got here.


Act 1: The Counting Era — N-gram Models (1948–2013)

The Idea

Claude Shannon, in his 1948 paper "A Mathematical Theory of Communication", asked: how much information does language contain? In a follow-up study he ran a beautiful experiment: he showed people partial sentences and asked them to guess the next letter. People were surprisingly good at it — which told him that language is highly predictable and structured.

The simplest computational version of this insight is the n-gram model:

The probability of the next word depends only on the last n-1 words.

For a bigram (n=2) model:

$$P(\text{"world"} \mid \text{"hello"}) = \frac{\text{count("hello world")}}{\text{count("hello")}}$$

For a trigram (n=3):

$$P(\text{"world"} \mid \text{"hello beautiful"}) = \frac{\text{count("hello beautiful world")}}{\text{count("hello beautiful")}}$$

You just count how often word sequences appear in a training corpus, then divide to get probabilities.
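Here's the whole algorithm as a minimal Python sketch (the toy corpus and the `bigram_prob` helper are illustrative, not from any real system):

```python
from collections import Counter

# Toy corpus -- a real n-gram model counts over billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))  # 0.5  (saw "the cat" 2 times, "the" 4 times)
print(bigram_prob("the", "dog"))  # 0.0  -- the sparsity problem in action
```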

The Intuition

Imagine you're in a library, and your job is to predict what word comes next in any book. Your strategy: keep a massive dictionary of every 2-word and 3-word phrase you've seen in every book, with their frequencies.

"The cat sat on the ___" → you've seen "the mat" many times → predict "mat".

This actually works decently well. N-grams powered early Google autocomplete, speech recognition, and machine translation for over a decade.

The Fatal Flaw

N-grams have two killer problems:

1. The exponential vocabulary explosion. If your vocabulary has V=100,000 words, then:

  • Bigram table: V² = 10 billion entries
  • Trigram table: V³ = 1 quadrillion entries

You can't store that. In practice, you use at most 5-grams, meaning the model has no memory beyond the previous four words.

2. The sparsity problem. Most reasonable 5-word sequences never appear in your training data. "The purple hippopotamus ate calculus" is a perfectly grammatical sentence, but you've never seen it. Your model assigns it probability zero — it has only memorized, never generalized.

You can use "smoothing" (add small counts to unseen sequences) but it's a hack, and it doesn't scale.

The core insight n-grams miss: words have meaning, and meaning generalizes. "I ate the apple" and "I consumed the apple" should have similar predictions for what comes next, because "ate" and "consumed" mean the same thing. N-grams treat every word as a completely independent symbol — they have no concept of similarity.

Interview Corner Case 🎯: "What is the curse of dimensionality in NLP?" — The exponential growth of possible n-gram sequences with vocabulary size. This is why n-grams don't scale to long contexts.


Act 2: Learning Meaning — Word2Vec and Neural Embeddings (2013)

The Breakthrough

In 2013, Tomas Mikolov at Google published Word2Vec, and it changed everything. The insight was simple but profound:

Instead of treating words as discrete symbols, represent them as continuous vectors (points in high-dimensional space).

The model learned these vectors by training on a simple task: predict the surrounding words given a center word (or vice versa), on billions of examples from Wikipedia and Google News.

What emerged was astonishing: the learned vectors captured meaning.

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$
$$\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Germany}} \approx \vec{\text{Berlin}}$$
$$\vec{\text{walking}} - \vec{\text{walk}} \approx \vec{\text{swimming}} - \vec{\text{swim}}$$

This is called the word analogy property, and it emerged without anyone programming it. The model discovered these geometric relationships purely from co-occurrence patterns in text.
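You can check the arithmetic mechanically. Below is a toy sketch with invented 3-dimensional vectors (the numbers are made up for illustration; real Word2Vec embeddings are ~300-dimensional and learned from data, not hand-written):

```python
import numpy as np

# Invented toy vectors, purely for illustration -- real embeddings are learned.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.8, 0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman -> which remaining word is closest?
target = vecs["king"] - vecs["man"] + vecs["woman"]
candidates = [w for w in vecs if w not in ("king", "man", "woman")]
print(max(candidates, key=lambda w: cosine(vecs[w], target)))  # queen
```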

The Intuition: Why Do Vectors Capture Meaning?

Think of it this way. If you have a good model of "what words appear near 'king'", you'll see: throne, crown, castle, reign, queen, prince... If you have a model of "what words appear near 'queen'", you'll see: throne, crown, castle, reign, king, princess...

The overlap is almost identical! So the vectors for "king" and "queen" end up pointing in similar directions in the 300-dimensional space — with the "gender axis" being the main difference.

The Distributional Hypothesis: "You shall know a word by the company it keeps." — J.R. Firth, 1957. This is the foundation of all word embeddings.

What Word2Vec Could Not Do

Word2Vec gave you fixed, static embeddings. "Apple" always maps to the same vector, whether you're talking about the fruit or the company. It captures average meaning but not context-dependent meaning.

And it still doesn't solve the language modeling problem — telling you what word comes next, given a sequence of words.

Interview Corner Case 🎯: "What is the difference between word2vec and contextual embeddings?" — Word2vec: one vector per word regardless of context. Contextual embeddings (BERT, GPT): the vector for "bank" changes depending on whether nearby words include "money" or "river".


Act 3: Memory — RNNs and LSTMs (2014–2017)

Recurrent Neural Networks: Teaching Models to Read Left-to-Right

A Recurrent Neural Network (RNN) processes a sequence one element at a time, maintaining a "hidden state" — a fixed-size vector that summarizes everything seen so far:

$$h_t = f(h_{t-1},\; x_t)$$

Where $h_t$ is the hidden state at step $t$, $x_t$ is the current input, and $f$ is a learned function.

The beautiful thing: this gives the model memory. It can theoretically use the entire past to predict the next word.
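A minimal numpy sketch of that recurrence (random weights and inputs, just to show the shape of the computation; a real RNN learns W_h and W_x by backprop):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_input, T = 8, 4, 5

# Random weights -- in a real RNN these are learned by backprop through time.
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.5, size=(d_hidden, d_input))

h = np.zeros(d_hidden)                  # h_0: empty memory
for t in range(T):
    x_t = rng.normal(size=d_input)      # stand-in for an embedded input token
    h = np.tanh(W_h @ h + W_x @ x_t)    # h_t = f(h_{t-1}, x_t)

print(h.shape)  # (8,) -- the entire past compressed into one fixed-size vector
```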

RNNs trained on Shakespeare generated surprisingly poetic text (Karpathy's famous 2015 blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" — worth reading). They also powered early seq2seq machine translation.

The Vanishing Gradient Problem

But RNNs had a critical flaw: vanishing gradients.

When you train a neural network by backpropagation, you compute how the loss changes with respect to each weight. In an RNN, you have to backpropagate through time — unrolling the recurrence:

$$\frac{\partial \text{loss}}{\partial h_0} = \frac{\partial \text{loss}}{\partial h_T} \times \frac{\partial h_T}{\partial h_{T-1}} \times \cdots \times \frac{\partial h_1}{\partial h_0}$$

This is a product of T Jacobian matrices. If their largest singular values are below 1 (the typical case after tanh squashing), the product shrinks to zero exponentially fast. Gradients vanish. The model can't learn to use information from more than ~10-20 steps ago.

Intuition: Imagine passing a message through a telephone chain of 100 people. By the end, the message is garbled. Early information doesn't reach the loss signal to influence learning.
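You can watch this happen numerically (the matrix scale here is arbitrary; any recurrent Jacobian with largest singular value below 1 behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 32, 100

# A random Jacobian with largest singular value well below 1 (typical after tanh).
J = rng.normal(scale=0.05, size=(d, d))

grad = np.eye(d)
for _ in range(T):          # the product of T Jacobians from backprop through time
    grad = grad @ J

print(np.linalg.norm(grad))  # astronomically small: the gradient has vanished
```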

LSTMs: Teaching Networks to Remember and Forget

The Long Short-Term Memory (LSTM), invented by Hochreiter and Schmidhuber in 1997 (but popularized 2014–2017), added "gates" — mechanisms to selectively remember and forget:

- Forget gate: decides what to throw away from memory
- Input gate: decides what new info to add to memory
- Output gate: decides what to read from memory
- Cell state: a "highway" that lets information flow unchanged across many steps

The cell state acts like a conveyor belt — information can flow across the entire sequence with only small linear interactions (additions, multiplications), preventing gradients from vanishing.
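A minimal sketch of one LSTM step in numpy (the random weights and the `lstm_step` helper are illustrative; real implementations fuse these operations for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to all four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget / input / output gates
    g = np.tanh(g)                                # candidate new memory
    c = f * c_prev + i * g    # cell state: forget some old, add some new
    h = o * np.tanh(c)        # what the cell exposes at this step
    return h, c

rng = np.random.default_rng(0)
d_h, d_x = 8, 4
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)

h = c = np.zeros(d_h)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=d_x), h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```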

LSTMs were the state of the art for machine translation, text classification, speech recognition from ~2015 to 2017. Models like Google's Neural Machine Translation (GNMT) used stacked LSTMs with attention.

The LSTM's Ceiling

LSTMs were dramatically better than RNNs, but they still had fundamental limitations:

1. Sequential computation. You can't parallelize an RNN/LSTM across time — step T must wait for step T-1. This squanders the massive parallelism that modern GPU clusters are built for.

2. Fixed-size bottleneck. The hidden state is a fixed-size vector (say 512 or 1024 dimensions). All information about the entire past must be compressed into this one vector. For long sequences, this is an absurd bottleneck.

3. Still limited memory. Even with gates, LSTMs struggled with dependencies spanning hundreds or thousands of tokens.

4. Slow training on long texts. You process one token at a time and must backpropagate through the full sequence, so wall-clock training time grows with sequence length and none of that work can be parallelized.

The community needed something fundamentally different.

Interview Corner Case 🎯: "Why did the field move from LSTMs to Transformers?" Three reasons: Transformers parallelize perfectly (each token computed independently given its inputs), attention directly connects any two positions in O(1) steps (not O(T) like LSTM), and they scale much better with data and compute.


Act 4: The Revolution — "Attention Is All You Need" (2017)

The Paper

In June 2017, researchers at Google published a paper with a confident title: "Attention Is All You Need" (Vaswani et al.).

They proposed the Transformer — an architecture that dispensed with recurrence entirely. No hidden states passed through time. No sequential processing. Just one key mechanism: attention.

The core idea: instead of trying to compress the entire past into a fixed vector, let every token directly attend to every other token. Give every word a direct line to every other word.

The Attention Mechanism: The Right Intuition

The key intuition for attention is this:

You have a library of information (the sequence). At each step, you ask a query: "Which pieces of this library are most relevant to what I'm currently thinking about?" The answer lets you build a weighted average of all the information — putting the most weight on the most relevant parts.

In the transformer:

  • Query (Q): What is this token looking for?
  • Key (K): What does each token contain / advertise?
  • Value (V): What information does each token actually provide?

The dot product Q·K tells you how relevant each token is to your query. Softmax turns those scores into weights. Then you take a weighted average of the Values.
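In code, a single attention head is remarkably small. A minimal numpy sketch (the inputs are random placeholders; in a real transformer, Q, K, and V come from learned projections of the token embeddings):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # relevance of every key to every query
    weights = softmax(scores)       # scores -> attention weights (rows sum to 1)
    return weights @ V              # weighted average of the values

rng = np.random.default_rng(0)
n, d = 6, 16                        # 6 tokens, 16-dimensional head
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(attention(Q, K, V).shape)     # (6, 16): one context-mixed vector per token
```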

The critical properties:

  1. Direct connections: Any two tokens can interact in a single step, regardless of distance
  2. Parallelizable: All tokens are processed simultaneously — perfect for GPU
  3. Data-driven: The model learns which tokens to attend to from training data

Why Transformers Dominated

Within a year, transformers had replaced LSTMs in virtually every NLP task. The reasons compound:

They parallelize perfectly. A GPU with 10,000 cores can compute attention for all tokens simultaneously. An LSTM fundamentally cannot parallelize across the time dimension.

They scale. More data, bigger model, more compute → consistently better results. LSTMs seemed to plateau.

Attention is interpretable. You can visualize which tokens each token attends to — giving some insight into what the model is "thinking".

Interview Corner Case 🎯: "What's the attention mechanism's time complexity, and why is that a problem?" — O(n²·d) in sequence length n. At n=100K tokens, you have 10 billion attention pairs. This is the fundamental bottleneck that Flash Attention, Sliding Window Attention, and Linear Attention all try to solve.


Act 5: The Pre-training Revolution — BERT and GPT (2018)

The Key Idea: Pre-train, Then Fine-tune

Before 2018, NLP models were largely trained from scratch on each task (sentiment analysis, translation, question answering separately). You needed labeled data for each task.

2018 brought a new paradigm: pre-train a large model on a massive unlabeled text corpus, then fine-tune it on your specific task with a small labeled dataset.

This worked shockingly well. The unlabeled data (Wikipedia, books, the internet) is essentially unlimited. The labeled task-specific data can be tiny.

Two landmark models:

GPT (June 2018) — OpenAI

Generative Pre-Training. A decoder-only transformer trained with next-token prediction (causal language modeling) on BooksCorpus (~7,000 books).

Architecture: 12 transformer layers, 117M parameters, trained to predict the next word given all previous words.

Key insight: Left-to-right prediction is a great pre-training objective because it uses every token as both a supervised signal and a training example.
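A tiny illustration of why (the token list is a made-up example):

```python
# Next-token prediction turns raw text into training examples "for free":
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# One example per position: (everything so far) -> (the next token).
for t in range(1, len(tokens)):
    print(tokens[:t], "->", tokens[t])
```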

Fine-tuning GPT on various NLP tasks achieved state-of-the-art across 9 out of 12 tasks. The era of "just fine-tune GPT" began.

BERT (October 2018) — Google

Bidirectional Encoder Representations from Transformers. An encoder-only transformer trained with two objectives:

  1. Masked Language Modeling (MLM): mask 15% of tokens, predict them. Unlike GPT, this lets BERT see both left AND right context.
  2. Next Sentence Prediction (NSP): given two sentences, predict if they're consecutive.
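A toy sketch of the masking in objective 1 above (simplified: real BERT replaces 80% of the chosen tokens with [MASK], 10% with a random token, and leaves 10% unchanged):

```python
import random

random.seed(0)
tokens = "the cat sat on the mat and it was very happy there".split()

# Replace ~15% of tokens with [MASK]; the model must reconstruct the originals.
masked = [("[MASK]" if random.random() < 0.15 else tok) for tok in tokens]
print(masked)
```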

BERT shattered benchmarks across virtually all NLP tasks — question answering, NLI, named entity recognition. Its bidirectional nature made it better at understanding tasks.

The Key Insight for You

GPT and BERT represent two fundamental approaches:

| | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Training objective | Predict next token | Predict masked tokens |
| Attends to | Left context only (causal) | Full context (bidirectional) |
| Best for | Generation, completion | Understanding, classification |
| Successors | GPT-2, GPT-3, GPT-4, LLaMA | RoBERTa, DeBERTa, BERT-large |

The reason modern LLMs (ChatGPT, Claude, LLaMA) are all decoder-only: generation requires causal attention. And it turns out that next-token prediction at scale is sufficient to develop deep understanding, even without the bidirectional objective.

Interview Corner Case 🎯: "Why don't we use BERT for generation?" — BERT uses bidirectional attention: it sees all tokens when predicting masked ones. At generation time, you're predicting tokens autoregressively, which requires you to only look left. You'd need to mask the future — at which point you've essentially turned it into a GPT. Also, NSP and MLM training objectives don't directly optimize for coherent long-form text generation.


Act 6: Scale Changes Everything — GPT-2, GPT-3, the Scaling Laws (2019–2020)

GPT-2: "Too Dangerous to Release" (2019)

OpenAI scaled GPT to 1.5 billion parameters, trained on 40GB of internet text (WebText: web pages linked from Reddit posts with at least 3 karma). The result was text generation so convincing that OpenAI initially refused to release the full model, citing misuse concerns.

The generated text was coherent over hundreds of words; it could continue a story, write news articles, and even write Python code. This was new.

But more importantly, GPT-2 showed something crucial: language modeling at scale was a general-purpose intelligence task. You didn't need task-specific fine-tuning for many things — the model had simply learned enough about the world.

GPT-3: The Emergent Abilities Shock (2020)

175 billion parameters. 300 billion training tokens. An estimated $4.6 million to train.

GPT-3 introduced the world to few-shot learning: instead of fine-tuning the model on task-specific data, you just described the task in the prompt (a few examples), and the model performed it. This was "in-context learning" — the model learned from the prompt at inference time without any gradient updates.

Prompt:
"Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>"

GPT-3: "fromage"

No fine-tuning. No training examples for this specific translation task. Just pattern recognition in the prompt.

Emergent abilities appeared at scale that weren't predictable from smaller models:

  • Multi-step arithmetic
  • Simple coding
  • Analogy completion
  • Story generation with consistent characters

The Scaling Laws paper (Kaplan et al., 2020) quantified this. Loss decreases predictably as a power law with compute, parameters, and data. You could predict the performance of a 100B model by training smaller ones.

Chinchilla: Efficient Scale (2022)

"Training Compute-Optimal Large Language Models" — DeepMind.

The previous consensus: train the biggest model you can on as much data as you can.

Chinchilla's finding: GPT-3 was massively undertrained. The optimal strategy is to train a smaller model on more data. Specifically, for every doubling of model parameters, you should double the training tokens.

Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) despite being 4× smaller.

This insight drove LLaMA: train a 7B model on 1T tokens instead of GPT-3-style (large model, less data).

Interview Corner Case 🎯: "What are the Chinchilla scaling laws?" — For a compute budget $C$, the optimal model size $N \propto C^{0.5}$ and optimal tokens $D \propto C^{0.5}$. This means $N \propto D$ — equal scaling of parameters and data. Roughly: $N_{\text{optimal}} \approx D_{\text{optimal}} / 20$ (train for 20 tokens per parameter).


Act 7: Open Source and Democratization — LLaMA, Mistral, the LLM Cambrian Explosion (2023–2024)

Instruction Tuning: Teaching Models to Chat

GPT-3 was powerful but raw — it predicted text; it didn't follow instructions. If you asked it a question, it might generate more questions. If you asked it to summarize, it might continue the document.

InstructGPT / RLHF (2022) changed this. The process:

  1. Fine-tune GPT-3 on human-written examples of good instruction-following (SFT)
  2. Train a reward model on human preference data (which response is better?)
  3. Use reinforcement learning (PPO) to optimize GPT-3 against the reward model
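A sketch of the pairwise objective behind step 2, the Bradley-Terry loss used in InstructGPT-style reward models (the scalar rewards here are made-up numbers):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(reward_model_loss(2.0, -1.0))  # ~0.05: rewards already agree with the human
print(reward_model_loss(-1.0, 2.0))  # ~3.05: preference violated, large loss
```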

The result: ChatGPT. A model that's actually helpful, harmless, and honest — or at least much more so than raw GPT-3. This launched the era of conversational AI.

LLaMA: Open Source Changes the Game (February 2023)

Meta's LLaMA (Large Language Model Meta AI) released weights for 7B, 13B, 33B, and 65B parameter models trained on 1T+ tokens. This applied the Chinchilla lesson (train smaller models on far more tokens) to openly released models.

The impact was seismic. Within weeks:

  • Alpaca (Stanford): Fine-tuned LLaMA on 52K instruction pairs generated by OpenAI's text-davinci-003, for ~$600
  • Vicuna: Fine-tuned on ShareGPT conversations — performed near GPT-3.5 on many tasks
  • WizardLM, Koala, OpenAssistant: More community fine-tunes

The lesson: the community could build capable assistants by fine-tuning open models on small, high-quality datasets. The "secret" to ChatGPT-style behavior was instruction tuning, not massive pretraining.

LLaMA 2 (July 2023): Improved pretraining (2T tokens), RLHF fine-tuned chat variants, commercially usable license.

Mistral 7B: Efficiency Revolution (September 2023)

Mistral AI released a 7B model that outperformed LLaMA 2 13B on most benchmarks. Two architectural innovations:

  1. Grouped Query Attention (GQA): Instead of one Key/Value head per Query head, use fewer KV heads shared across multiple Q heads. Reduces KV-cache memory ~4–8×.
  2. Sliding Window Attention (SWA): Instead of attending to all previous tokens, each token attends to only a sliding window of W previous tokens. O(n·W) instead of O(n²).
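Here's a minimal sketch of the sliding-window mask from item 2 above (the `sliding_window_mask` helper is illustrative, not Mistral's actual implementation):

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where attention is allowed: causal, and at most `window` tokens back."""
    i = np.arange(n)[:, None]   # query positions (rows)
    j = np.arange(n)[None, :]   # key positions (columns)
    return (j <= i) & (i - j < window)

print(sliding_window_mask(6, 3).astype(int))
# Each row attends to at most 3 positions: O(n·W) work instead of O(n²).
```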

Mixtral 8×7B (December 2023): A Mixture-of-Experts version — 8 expert FFNs per layer, only 2 routed to per token. ~47B total parameters but ~13B active per token. Near GPT-3.5 performance.
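A toy sketch of that routing idea (the linear "experts" and the `moe_layer` helper are stand-ins for illustration, not Mixtral's real FFN experts):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

# Toy "experts": plain linear maps standing in for full FFN blocks.
experts = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
W_router = rng.normal(scale=0.1, size=(n_experts, d))

def moe_layer(x):
    logits = W_router @ x
    top = np.argsort(logits)[-top_k:]   # route to the 2 highest-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                # renormalize the chosen gates
    # Only top_k of the n_experts ever run for this token -- sparse computation.
    return sum(g * (experts[e] @ x) for g, e in zip(gates, top))

print(moe_layer(rng.normal(size=d)).shape)  # (16,)
```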

The Modern Zoo (2024–2025)

By 2025, the landscape was:

| Model | Params | Key Innovation |
|---|---|---|
| LLaMA 3 (Meta) | 8B, 70B, 405B | Grouped Query Attention, 128K context |
| Mistral / Mixtral | 7B–8×22B | GQA, SWA, MoE |
| Gemma (Google) | 2B, 7B, 27B | Efficient, open-weights |
| Phi-3 (Microsoft) | 3.8B, 7B | "Textbooks are all you need" — high-quality data |
| Qwen (Alibaba) | 0.5B–72B | Strong multilingual |
| DeepSeek | 7B–671B | MoE, RL training (DeepSeek-R1) |
| GPT-4o (OpenAI) | undisclosed | Multimodal, fast |
| Claude 3/4 (Anthropic) | undisclosed | Constitutional AI, long context |

The trend: smaller models getting dramatically better through better data and training techniques, not just scale.


The Through-Line: What Every Era Got Right and Wrong

| Era | What was right | What was missing |
|---|---|---|
| N-grams | Language is predictable; statistics work | No generalization; no memory beyond n words |
| Word2Vec | Words have geometry; meaning is learnable | Static embeddings; no sequence modeling |
| LSTM | Sequential memory; gating helps | Sequential = slow; fixed bottleneck; limited range |
| Transformer | Direct attention; parallelizable; scalable | O(n²) cost; no recurrence = must re-read context |
| BERT | Bidirectional; great for understanding | Can't generate; not instruction-tuned |
| GPT-3 | Scale + next-token prediction = general intelligence | Expensive; not instruction-following |
| ChatGPT/LLaMA | Instruction tuning; open weights | Still O(n²); hallucination; finite context |
| MoE/Mamba | Sparse computation; O(n) alternatives | Still being refined |

The Intuition You Must Carry Forward

If you remember nothing else from this chapter:

  1. Language modeling is next-token prediction. Everything else emerges from doing this well at scale.
  2. The transformer's superpower is attention — direct, learned, parallel connections between all tokens.
  3. Pre-train on massive unlabeled data → fine-tune on small task-specific data. This is the paradigm.
  4. Scale is not magic. It's just that these simple objectives (predict next token) have almost unlimited room to improve as you give them more compute, data, and parameters.
  5. The field moves fast. Architecture improvements (RoPE, GQA, SwiGLU, Flash Attention) compound — a 7B model in 2024 beats a 65B model from 2023.

Next: Tokens, Embeddings, and Attention — The fundamental building blocks of every LLM.

Where we go from this history into the actual math and code.