Chapter 8 – Frontier & Future
Part 3: What's Next – Test-Time Compute, Reasoning, and the Road Ahead
The New Paradigm: Test-Time Compute
In 2024–2025, the field discovered something important: you can make models smarter by spending more compute at inference time, not just at training time.
The classic paradigm: big model trained well → fast, greedy inference. The new paradigm: trained model → think longer before answering → better answers.
This is the core idea behind OpenAI o1/o3, DeepSeek-R1, and QwQ.
Chain-of-Thought: The Gateway Drug
The first hint of test-time compute: Chain-of-Thought prompting (Wei et al., 2022).
Adding "Let's think step by step" to math problems dramatically improved accuracy:
Without CoT:
"If a bat and ball together cost $1.10, and the bat costs $1 more than the ball, how much does the ball cost?"
LLM: "$0.10" → Wrong! (the classic cognitive reflection test)
With CoT:
LLM: "Let me think through this step by step.
Let x = cost of ball.
Then bat = x + 1.00
Total: x + (x + 1.00) = 1.10
2x = 0.10
x = 0.05
The ball costs $0.05." → Correct!
The model "thinks" before answering. More thinking tokens = better answers on reasoning tasks.
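The prompting pattern above can be sketched in a few lines. This is a toy illustration: the completion string stands in for a real model call, and the regex-based answer extraction assumes the model states its final answer as the last dollar amount, as in the example above.

```python
import re

def build_cot_prompt(question: str) -> str:
    """Append the classic chain-of-thought trigger to a question."""
    return f"{question}\nLet's think step by step."

def extract_dollar_answer(completion: str) -> str:
    """Toy extraction: take the last $X.XX amount in the reasoning trace."""
    amounts = re.findall(r"\$\d+\.\d{2}", completion)
    return amounts[-1] if amounts else ""

# A hypothetical CoT completion for the bat-and-ball problem:
completion = ("Let x = cost of ball. Then bat = x + 1.00. "
              "Total: 2x + 1.00 = 1.10, so 2x = 0.10 and x = 0.05. "
              "The ball costs $0.05.")
print(extract_dollar_answer(completion))  # $0.05
```

In production you would pass `build_cot_prompt(question)` to your model client and run the extractor over the returned text.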
OpenAI o1/o3: Trained to Think
o1 takes this further: instead of prompting for CoT, they trained the model to naturally generate long reasoning traces before answering. The training uses RL where the model gets rewards for eventually reaching correct answers.
The "thinking budget": You can tell o1 to think for 1,000 tokens or 10,000 tokens. More budget → better answers, but more expensive.
The scaling curve shift: For GPT-4 class models, spending 10× more compute at training time gives ~10% improvement. Spending 10× more compute at inference time (more thinking) gives 2–3× improvement on hard reasoning. Inference-time scaling might be more efficient than training-time scaling for hard tasks.
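One concrete way to spend more inference compute is self-consistency: sample several reasoning paths at temperature > 0, extract each path's final answer, and take a majority vote. A minimal sketch, with canned strings standing in for real sampled completions:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency aggregation: the most common final answer wins."""
    return Counter(answers).most_common(1)[0][0]

# Stand-ins for five sampled reasoning paths; in practice each string is the
# extracted final answer from a separate temperature>0 completion.
sampled_answers = ["0.05", "0.10", "0.05", "0.05", "0.10"]
print(majority_vote(sampled_answers))  # 0.05
```

The intuition: individual reasoning paths are noisy, but correct paths tend to converge on the same answer while wrong paths scatter, so more samples buy more accuracy.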
DeepSeek-R1: Open-Source Reasoning
DeepSeek-R1 is a fully open model that matches o1 on many benchmarks. Trained with:
- GRPO (Group Relative Policy Optimization): A simplified RL approach that doesn't need a separate value function
- Rule-based outcome rewards: Accuracy and format rewards computed on the final answer (the R1 report notes that process reward models, which score each step, proved hard to make work at scale)
- Cold-start data: Small dataset of CoT examples to jumpstart the RL training
The model learns to write reasoning in <think> tags before giving its final answer.
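On the client side you usually want to separate that trace from the answer. A sketch of how that split might look, assuming a single well-formed <think>…</think> block (which real outputs do not always guarantee):

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning trace, final answer)."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if m is None:                      # no think block: everything is the answer
        return "", output.strip()
    return m.group(1).strip(), m.group(2).strip()

raw = "<think>2x = 0.10, so x = 0.05.</think>The ball costs $0.05."
reasoning, answer = split_reasoning(raw)
print(answer)  # The ball costs $0.05.
```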
Multimodality: LLMs That See, Hear, and Generate
Vision Language Models (VLMs)
Models that understand both text and images:
GPT-4V / GPT-4o: Image + text → text
Claude 3 / 3.5: Image + text → text
LLaMA 3.2: Image + text → text (open weights!)
Gemini: Image + text + audio + video → text
The architecture is usually:
[Image Encoder (e.g., ViT or CLIP)] → image embeddings
[Tokenizer + embedding layer] → text embeddings
[Projection layer] → align both into the same space
[Language Model (transformer)] → process the combined sequence → text output
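Numerically, the projection layer is often just a learned linear map (or small MLP) from the vision encoder's output width to the LLM's embedding width. An illustrative NumPy sketch with made-up dimensions (not any specific model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_model = 1024, 4096      # illustrative widths, chosen for this sketch
n_patches, n_text = 256, 32

image_emb = rng.normal(size=(n_patches, d_vision))  # from a frozen ViT/CLIP encoder
text_emb = rng.normal(size=(n_text, d_model))       # from the LLM's token embeddings

# The trainable projection maps vision features into the LLM's embedding space.
W_proj = rng.normal(size=(d_vision, d_model)) * 0.02
projected = image_emb @ W_proj                      # (256, 4096)

# The LLM then processes one combined sequence: image tokens, then text tokens.
combined = np.concatenate([projected, text_emb], axis=0)
print(combined.shape)  # (288, 4096)
```

This is why LLaVA-style training is so cheap: with both encoders frozen, only `W_proj` (here 1024 × 4096 weights) needs gradients.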
LLaVA approach:
- Freeze CLIP (vision) and freeze LLM
- Train only the projection layer on (image, description) pairs
- Then SFT on instruction-following with images
- Total training cost: ~$100!
Interview corner case 🎯: "What is visual instruction tuning, and why is it surprising that it works with so little data?" → You can train a model to understand and describe images by fine-tuning only the projection layer between a visual encoder and a language model. LLaVA used 150K image-text pairs and cost <$100 to train. The visual encoder (CLIP) and LLM had already learned rich representations – alignment training just maps between them.
Speech and Audio
Whisper (OpenAI): End-to-end speech recognition. Transcribes audio to text extremely well across 99 languages. Open-source.
GPT-4o "audio mode": Direct audio-to-audio without text as intermediate. Lower latency, can respond to tone/emotion.
Video Understanding
Gemini 1.5 Pro: 1M token context allows analyzing hour-long videos. Current challenge: video is enormous data (30fps × 1080p × many seconds), and processing it at high quality is expensive.
Long Context: 1M+ Tokens
Why long context matters:
- Entire codebases in one prompt
- Full scientific papers
- Book-length documents
- Multi-turn conversations that don't need to be summarized
Gemini 1.5 Pro: 1M token context (roughly 700,000 words). Uses "multi-scale attention" (like hierarchical attention).
How to evaluate long-context: The "needle in a haystack" test โ hide a sentence in a 100K token document and ask the model to find it. Most models "forget" information in the middle.
The lost-in-the-middle problem revisited: Even with 1M context, models pay more attention to the beginning and end. Retrieval within long context is still an open problem.
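Building a needle-in-a-haystack probe is mechanical: insert a known fact at a chosen relative depth of a long filler document, then ask the model to retrieve it, sweeping depth from 0 to 1 to expose lost-in-the-middle effects. A toy sketch (the tiny filler stands in for ~100K tokens of distractor text):

```python
def build_needle_prompt(filler: str, needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of `filler`,
    then append a retrieval question."""
    words = filler.split()
    pos = int(len(words) * depth)
    doc = " ".join(words[:pos] + [needle] + words[pos:])
    return f"{doc}\n\nQuestion: What is the secret code mentioned above?"

filler = "The sky is blue. " * 50       # stand-in for a very long document
needle = "The secret code is 7421."
prompt = build_needle_prompt(filler, needle, depth=0.5)
print(needle in prompt)  # True
```

A full evaluation runs this at many (document length, depth) pairs and plots retrieval accuracy as a heatmap; the mid-depth, long-document cells are where most models degrade.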
The Data Wall and Synthetic Data
The crisis: High-quality human text is running out. Models trained on the internet may be approaching the limit of what's available (~70T tokens). Training future models will require:
- Synthetic data (GPT-4 or specialized models generating training data)
- Multi-epoch training (training on the same data multiple times – a single epoch is currently the standard)
- Process data (not just what people write but how they think – reasoning traces, problem-solving processes)
Model collapse risk: Training models on their own outputs over generations degrades quality. Need to maintain connection to ground-truth human data.
Agents: LLMs That Act
The most transformative near-term application: LLMs that don't just answer questions but take actions.
The Agent Loop:
Observe (what's in front of me?) →
Think (what should I do?) →
Act (call a tool, write code, browse the web) →
Observe result →
Think → Act → ...
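The loop above fits in a dozen lines of Python. In this sketch `llm` and the tool implementations are scripted stand-ins for illustration; a real agent would call a model API at the "think" step and real tools at the "act" step:

```python
def run_agent(task, llm, tools, max_steps=5):
    """Minimal observe-think-act loop. `llm(observation)` returns either
    ("tool", name, args) or ("answer", text); `tools` maps names to functions."""
    observation = task
    for _ in range(max_steps):
        decision = llm(observation)            # think
        if decision[0] == "answer":
            return decision[1]
        _, name, args = decision               # act: call the chosen tool
        observation = tools[name](**args)      # observe the tool's result
    return "gave up: step budget exhausted"

# A scripted stand-in for a real model: one tool call, then a final answer.
script = iter([("tool", "search_web", {"query": "ball cost riddle"}),
               ("answer", "$0.05")])
tools = {"search_web": lambda query: f"results for {query!r}"}
print(run_agent("solve the riddle", lambda obs: next(script), tools))  # $0.05
```

Note the `max_steps` budget: without it, an agent that keeps choosing tools loops forever, which is exactly the cost problem listed under "Key challenges" below.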
Tool calling (function calling):
{
  "tools": [
    {"name": "search_web", "description": "Search the internet", "parameters": {"query": "string"}},
    {"name": "run_code", "description": "Execute Python code", "parameters": {"code": "string"}},
    {"name": "read_file", "description": "Read a file", "parameters": {"path": "string"}}
  ],
  "response": {
    "message": null,
    "tool_call": {
      "name": "search_web",
      "arguments": {"query": "latest AI research papers 2025"}
    }
  }
}
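On the client side, executing a response like that is a dictionary dispatch: look up the named tool, call it with the model's arguments, and feed the result back. A sketch, where `registry` holds hypothetical local implementations of the three advertised tools:

```python
import json

response = json.loads('''{
  "message": null,
  "tool_call": {
    "name": "search_web",
    "arguments": {"query": "latest AI research papers 2025"}
  }
}''')

# Hypothetical local implementations, keyed by the advertised tool names.
registry = {
    "search_web": lambda query: f"top hits for {query!r}",
    "run_code":   lambda code: "executed",
    "read_file":  lambda path: "file contents",
}

call = response["tool_call"]
if call is not None:                  # the model chose a tool instead of answering
    result = registry[call["name"]](**call["arguments"])
    print(result)  # top hits for 'latest AI research papers 2025'
```

A production dispatcher also validates `call["name"]` against the registry and validates arguments against the tool's schema, since hallucinated tool calls (see below) are a known failure mode.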
Examples of deployed agents:
- GitHub Copilot: edits code across multiple files
- Devin (Cognition): autonomous software engineer
- Claude for computer use: clicks, types, browses
- AutoGPT / LangGraph: chain of tool calls with memory
Key challenges:
- Hallucinated tool calls
- Error recovery (what to do when a tool fails)
- Long-horizon planning
- Cost (many LLM calls add up)
What's Next: The Open Questions
1. Does scaling still work? The hypothesis: keep scaling data, parameters, and compute → keep getting smarter. Empirical evidence: yes, but with diminishing returns and increasing cost.
2. Are reasoning models different architecturally? o1 and R1 show RL training on reasoning can dramatically improve performance. Is this a phase change? Are we just hitting a floor of training data and RL is the next step?
3. Will transformers be replaced? Mamba and hybrids show promise for very long sequences. But transformers have years of optimization (Flash Attention, hardware, distributed training). The replacement, if it comes, will be gradual.
4. What is the emergent capabilities ceiling? Do capabilities keep emerging at scale? Or is there a ceiling? The o1 results suggest that for reasoning, the ceiling is higher than previously thought.
5. Can we align superintelligent models? As models get more capable, alignment becomes harder. The RLHF/DPO approaches may not scale to models that are dramatically smarter than the humans providing feedback.
Interview Corner Cases โ Frontier Topics ๐ฏ
- "What is 'chain of thought' and does it always help?" → CoT helps on tasks that benefit from explicit intermediate reasoning: math, logic, coding. It hurts on simple factual retrieval (adding thinking steps for "What is 2+2?" doesn't help). The rule: use CoT when the task has multiple reasoning steps.
- "What is the difference between o1 and GPT-4?" → GPT-4 is a pretrained + instruction-tuned model. o1 is additionally trained with RL to generate reasoning traces before answering. o1 is slower and more expensive per query but dramatically better on complex reasoning, math, and coding. Use GPT-4 for simple tasks, o1 for hard problems.
- "What is the Achilles' heel of current LLM agents?" → Reliability and error recovery. A 99% accurate agent on a 100-step task has only a (0.99)^100 ≈ 37% chance of completing successfully. Even 99.9% accuracy per step gives only ≈90% success. For complex agentic tasks requiring hundreds of steps, current models are not reliable enough for production deployment without human oversight.
- "What is 'model distillation' and why is DeepSeek-R1 interesting from this perspective?" → Distillation: train a smaller "student" model to mimic a larger "teacher" model's outputs. DeepSeek-R1 generated synthetic reasoning chains using its own RL-trained model and used them to train smaller models (1.5B–70B). The distilled models achieved >80% of the RL-trained model's reasoning capability at a fraction of the cost.
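The compounding-error arithmetic from the agent-reliability answer, checked directly:

```python
def task_success_rate(step_accuracy: float, n_steps: int) -> float:
    """Chance an agent completes n independent steps with zero failures."""
    return step_accuracy ** n_steps

print(round(task_success_rate(0.99, 100), 2))   # 0.37
print(round(task_success_rate(0.999, 100), 2))  # 0.9
```

The independence assumption is charitable: in practice one bad step often derails subsequent steps, so real success rates can be worse than this formula suggests.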
What's Happening in 2025–2026
The field is moving faster than ever. Here's the practitioner's map of what just changed and what it means for you.
Claude 3.7 / GPT-4.1 / Gemini 2.5 Pro – The Reasoning Generation
2025 saw every major lab release "reasoning models" – models that spend compute thinking before answering. The architecture varies: OpenAI uses RL on verifiable tasks, Anthropic uses extended thinking, Google uses a similar approach in Gemini 2.5 Pro. The common thread: test-time compute scaling is now standard practice.
For practitioners: if your task has objectively verifiable answers (math, code, logic), reasoning models are dramatically better. If your task is creative writing or summarization, they're often overkill and 5–10× more expensive.
Multimodal Is Table Stakes
In 2023, GPT-4V's image understanding was a novelty. In 2025, every frontier model is multimodal: text + images + audio + video input. Claude Sonnet 4 processes images natively. Gemini 1.5 Pro processes video. GPT-4o does real-time audio.
What this enables that wasn't possible 18 months ago:
- Agents that browse the web by looking at screenshots
- Automated document understanding (invoices, contracts, medical records)
- Code generation from UI mockups or whiteboard diagrams
Long Context Is Now the Default
GPT-4 launched with 8K context. By 2025: GPT-4o has 128K, Claude has 200K, Gemini 1.5 Pro has 1M. The practical effect: RAG is becoming less necessary for single-document workflows. You can just stuff the whole document in context.
But long context is not free: attention cost grows quadratically with sequence length, so very long prompts are dramatically more expensive per query. The smart approach in 2025 is hybrid: use RAG for the 80% of queries that need only a few chunks, and long context for the 20% that need deep reasoning over a whole document.
Small Models Got Good
2024โ2025 was the year small models caught up: Phi-3-mini (3.8B) beats GPT-3.5 on many benchmarks. Llama 3.1 8B runs on a MacBook. Gemma 2 9B fits in a browser with WebGPU.
The economics flip: for inference-heavy production systems, a small model fine-tuned on your domain often beats a large general model at 1/50th the cost. The question has shifted from "which big model?" to "can we fine-tune a small model well enough?"
What to Learn Next
If you've completed this course, here's the practitioner's roadmap:
- Build something with agents – wire up Claude with tools using the Anthropic SDK. The agents course on this site covers the patterns you'll need.
- Run something in production – the LLMOps course covers monitoring, cost control, and prompt versioning for real systems.
- Fine-tune a small model – use unsloth + trl to LoRA-fine-tune Llama 3.1 8B on a domain-specific dataset. You can do this in Colab for free.
- Read the papers – the most important 2025 papers: DeepSeek-R1 (reasoning via RL), the Llama 3 technical report (data quality > model size), and the Chinchilla follow-up work on inference-optimal training.